GTC ON-DEMAND
Abstract:
Learn how to accelerate many small-sized linear algebra problems, from kernels to large-scale solvers. We describe techniques targeting parallelization, vectorization, and communication, which have become extremely challenging on many-core architectures and GPUs. Standard interfaces, called batched APIs, are proposed for inclusion in highly optimized libraries like MAGMA, which provides the most extensive set of batched BLAS and LAPACK functionality to date. We'll describe these developments and their use to accelerate applications ranging from big-data analytics to high-order FEM tensor computations and low-rank approximations for solvers and preconditioners. We'll also cover the GPU acceleration of a large-scale distributed-memory solver that uses a hierarchically compressed coefficient matrix.
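A batched API exposes many small, independent operations as a single call (for example, cublasDgemmBatched in cuBLAS or magma_dgemm_batched in MAGMA). As a CPU-side illustration of the semantics only (not the library interface), NumPy's stacked matmul computes the same thing; the function name and shapes below are our own:

```python
import numpy as np

def gemm_batched(alpha, A, B, beta, C):
    """Batched GEMM semantics: C[i] = alpha*A[i]@B[i] + beta*C[i] for all i.

    A: (batch, m, k), B: (batch, k, n), C: (batch, m, n).
    NumPy's stacked matmul applies the product to every slice in one call,
    mirroring what a GPU batched routine does for the whole batch at once.
    """
    return alpha * (A @ B) + beta * C

rng = np.random.default_rng(0)
batch, m, n, k = 1000, 8, 8, 8          # many tiny problems, one call
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))
C = np.zeros((batch, m, n))
C = gemm_batched(1.0, A, B, 0.0, C)

# Each slice matches the corresponding one-at-a-time GEMM.
assert np.allclose(C[42], A[42] @ B[42])
```

On a GPU, issuing the batch as one call is what amortizes kernel-launch and scheduling overheads across the thousands of tiny problems.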
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8475
 
Abstract:
Learn techniques for efficient batched computations on GPUs, where small, independent computations are grouped and executed together to obtain high performance. Such problems occur frequently in scientific applications including machine learning, data mining, dense and sparse solvers, high-order FEM, and astrophysics. We'll cover the development of batched computations for these applications, stressing innovative GPU techniques and algorithms for uniform as well as variable-size batches, tensor contractions, batched BLAS, and more. Batched computations fill the GPU with work and remove scheduling overheads and costly CPU-GPU communication, often accelerating the computation by an order of magnitude compared to non-batched approaches.
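One common strategy for variable-size batches, sketched here under the assumption that the problem sizes cluster into a few groups, is to bucket problems by shape and issue one fixed-size batched call per bucket. The helper below is an illustrative CPU stand-in, not a MAGMA routine:

```python
import numpy as np
from collections import defaultdict

def gemm_vbatched(As, Bs):
    """Multiply pairs of small matrices of varying sizes.

    Problems with identical shapes are grouped, and each group is
    executed as one fixed-size batched GEMM (here, a stacked matmul).
    On a GPU this keeps the device saturated even when sizes differ.
    """
    groups = defaultdict(list)
    for i, (A, B) in enumerate(zip(As, Bs)):
        groups[(A.shape, B.shape)].append(i)
    out = [None] * len(As)
    for idx in groups.values():
        A = np.stack([As[i] for i in idx])   # (group, m, k)
        B = np.stack([Bs[i] for i in idx])   # (group, k, n)
        C = A @ B                            # one batched call per group
        for j, i in enumerate(idx):
            out[i] = C[j]
    return out
```

A usage example: `gemm_vbatched([np.ones((2, 3)), np.ones((4, 5))], [np.ones((3, 2)), np.ones((5, 4))])` issues two grouped calls and returns the two products in the original order.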
 
Topics:
Algorithms & Numerical Techniques, Tools & Libraries, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6509
 
Abstract:
Here you will learn techniques for small matrix computations on GPUs and their use in energy-efficient, high-performance solvers. Working on small problems delivers high performance through improved data reuse, and many numerical libraries and applications need this functionality further developed. We describe the main factorizations (LU, QR, and Cholesky) for sets of small dense matrices computed in parallel, achieving significant acceleration and reduced energy consumption compared to other solutions. Our techniques are of interest to GPU application developers in general. We'll show extensions to large, entirely GPU-based solvers, review and compare against the hybrid CPU-GPU algorithms in MAGMA, and analyze the pros and cons of hybrid versus GPU-only approaches on high-end systems and low-end embedded devices.
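As a CPU-side illustration of batched-factorization semantics (not MAGMA's actual kernels), NumPy's Cholesky routine already factors a whole stack of small matrices in one call:

```python
import numpy as np

rng = np.random.default_rng(1)
batch, n = 500, 6

# Build a batch of small symmetric positive-definite matrices:
# X@X^T is positive semidefinite, and adding n*I makes it definite.
X = rng.standard_normal((batch, n, n))
A = X @ X.transpose(0, 2, 1) + n * np.eye(n)

# One call factors the entire stack, mirroring the semantics of a
# batched Cholesky (e.g. A[i] = L[i] @ L[i]^T for every i).
L = np.linalg.cholesky(A)

assert np.allclose(L @ L.transpose(0, 2, 1), A)
assert np.allclose(L[0], np.tril(L[0]))   # each factor is lower triangular
```

On a GPU, the analogous batched routine gains its speed and energy efficiency from exactly this structure: the factorizations are independent, so one kernel can process the whole set with high data reuse.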
 
Topics:
Developer - Algorithms, Tools & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2015
Session ID:
S5476
 
Abstract:

See the newest features integrated in MAGMA (Matrix Algebra on GPU and Multicore Architectures) for tackling multi-GPU systems in numerical linear algebra. In this talk, we describe how we leveraged MAGMA to solve existing and new challenging numerical problems on multiple hardware accelerators. Using a hybridization methodology, the new multi-GPU-enabled MAGMA represents linear algebra algorithms as directed acyclic graphs, where nodes correspond to tasks and edges to data dependencies among them, and uses the dynamic runtime system StarPU to schedule the computational kernels over hybrid architectures of GPUs and homogeneous multicores.
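The DAG-plus-runtime idea can be sketched in a few lines: tasks become nodes, data dependencies become edges, and a scheduler runs whatever is ready. The toy scheduler below runs tasks sequentially via Kahn's algorithm, standing in for what StarPU does dynamically across CPUs and GPUs; the task names are illustrative:

```python
from collections import defaultdict, deque

def run_dag(tasks, deps):
    """Execute tasks in an order consistent with a dependency DAG.

    tasks: name -> callable; deps: name -> list of prerequisite names.
    A task becomes ready once all of its prerequisites have completed.
    """
    indeg = {t: 0 for t in tasks}
    children = defaultdict(list)
    for t, prereqs in deps.items():
        for p in prereqs:
            indeg[t] += 1
            children[p].append(t)
    ready = deque(t for t, d in indeg.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                     # a real runtime dispatches to CPU/GPU
        order.append(t)
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    assert len(order) == len(tasks), "cycle in task graph"
    return order

# A tiny chain reminiscent of a tile factorization: the panel
# factorization must precede the triangular solve and the update.
order = run_dag(
    {"potrf": lambda: None, "trsm": lambda: None, "syrk": lambda: None},
    {"trsm": ["potrf"], "syrk": ["trsm"]},
)
```

The point of the DAG representation is that any order respecting the edges is valid, which is what lets a runtime overlap independent tasks on different devices.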

 
Topics:
Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2012
Session ID:
S2042
 
Abstract:

Learn about the excitement and challenges of optimizing CUDA kernels for the last two generations of NVIDIA GPGPUs. Autotuning, although crucially important, is no silver bullet for porting code from one generation of GPU to another. The process requires many steps: (a) architecture-specific algorithms, (b) tuning algorithms, (c) finding innovative tricks to handle generic cases, (d) tweaking the GPU's internal scheduling to handle partition camping, and (e) above all, the dedication of many enthusiastic programmers. We'll share our experiences and discoveries from the development of MAGMABLAS, a subset of CUDA BLAS highly optimized for NVIDIA GPGPUs.
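The autotuning step can be illustrated with a minimal empirical search: generate parameterized variants of a kernel, time each, and keep the fastest. The tile-size sweep below over a blocked CPU matmul is a sketch of the idea only, not the MAGMA autotuner:

```python
import time
import numpy as np

def blocked_matmul(A, B, tile):
    """Square matmul computed tile by tile; `tile` is the tuning knob."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

def autotune(n=256, candidates=(16, 32, 64, 128)):
    """Time each candidate tile size on a sample problem and keep the winner.

    A real GPU autotuner searches a much larger space (thread-block shapes,
    register blocking, etc.), but the generate-time-select loop is the same.
    """
    rng = np.random.default_rng(2)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    timings = {}
    for t in candidates:
        start = time.perf_counter()
        blocked_matmul(A, B, t)
        timings[t] = time.perf_counter() - start
    return min(timings, key=timings.get), timings
```

As the abstract stresses, this kind of search is only one step: the candidate variants themselves must come from architecture-specific algorithms and hand-found tricks.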

 
Topics:
Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2012
Session ID:
S2248
 
 