GTC ON-DEMAND
Abstract:
Learn how to develop fast and energy-efficient linear solvers using GPUs. Hybrid CPU-GPU techniques achieve high performance at the cost of extra power consumption. New advancements in GPU architectures enable full-GPU solutions that are high performance, energy efficient, and CPU-independent. In addition, new technologies such as half-precision arithmetic (FP16) enable the design of new solvers that are significantly faster and even more energy efficient. While FP16 arithmetic has been a powerful tool for deep learning applications, our designs show that it is also very useful for boosting the performance and energy efficiency of linear solvers. The new developments complement the hybrid algorithms in the MAGMA library and provide users with a wide variety of designs that fit different requirements of performance, energy efficiency, and numerical accuracy.
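The abstract does not spell out the algorithm, but a standard way to exploit FP16 in a dense solver is mixed-precision iterative refinement: solve (or factorize) in low precision, then correct the solution using residuals computed in high precision. The NumPy sketch below illustrates only that general idea; it is not the MAGMA implementation, and the diagonally dominant test matrix, tolerances, and function names are illustrative assumptions.

```python
import numpy as np

def solve_mixed_precision(A, b, tol=1e-10, max_iter=50):
    """Toy mixed-precision iterative refinement (illustrative, not the MAGMA code).

    The matrix is rounded to float16 to mimic an FP16 factorization/solve,
    while residuals and the solution update are kept in float64.
    """
    A16 = A.astype(np.float16)  # low-precision operator (stands in for an FP16 factorization)
    x = np.linalg.solve(A16.astype(np.float32), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x  # residual in float64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # correction solved with the low-precision operator
        d = np.linalg.solve(A16.astype(np.float32), r.astype(np.float32))
        x += d.astype(np.float64)
    return x

# Usage: a well-conditioned (diagonally dominant) random system.
rng = np.random.default_rng(0)
n = 64
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = solve_mixed_precision(A, b)
print(np.linalg.norm(A @ x - b))
```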
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8478
 
Abstract:
This work presents a high-performance solution for Cholesky factorization on batches of relatively small matrices. We discuss both fixed-size and variable-size batched problems. To handle the irregularity associated with this type of workload, we present new optimization techniques that maintain relatively high performance at such small matrix sizes. The proposed solution outperforms most existing state-of-the-art techniques for batched problems.
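The poster is about optimized GPU kernels; as a plain reference for the operation they compute, here is a minimal NumPy sketch of an unblocked Cholesky factorization applied to a variable-size batch of small SPD matrices. The looped CPU code and helper names are illustrative only and are not the proposed GPU solution.

```python
import numpy as np

def cholesky_unblocked(A):
    """Unblocked, right-looking Cholesky of one small SPD matrix; returns lower-triangular L."""
    n = A.shape[0]
    L = A.astype(np.float64)
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        L[k+1:, k] /= L[k, k]
        # trailing-submatrix update (right-looking)
        L[k+1:, k+1:] -= np.outer(L[k+1:, k], L[k+1:, k])
    return np.tril(L)

def batched_cholesky(batch):
    """Variable-size batch: 'batch' is a list of SPD matrices of possibly different sizes."""
    return [cholesky_unblocked(A) for A in batch]

# Usage: build a variable-size batch of random SPD matrices and verify one factor.
rng = np.random.default_rng(1)
batch = []
for n in (8, 13, 32):
    M = rng.standard_normal((n, n))
    batch.append(M @ M.T + n * np.eye(n))  # SPD by construction
factors = batched_cholesky(batch)
print(np.allclose(factors[0] @ factors[0].T, batch[0]))
```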
 
Topics:
Performance Optimization, Algorithms & Numerical Techniques, HPC and Supercomputing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6340
 
Abstract:
KBLAS is a library that provides optimized kernels for critical numerical linear algebra operations. It currently provides a subset of the standard BLAS kernels and extends them to work on multi-GPU systems. KBLAS performance is at least as good as that of state-of-the-art libraries, including CUBLAS, MAGMA, and CULA, and some KBLAS kernels achieve speedups ranging from 20% to 90%.
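The abstract does not describe the multi-GPU scheme; one common way to extend a BLAS kernel such as GEMV across devices is to partition the matrix by rows, compute each slice on its own GPU, and concatenate the partial results. The NumPy sketch below only emulates that row-partitioning idea on the CPU; the splitting strategy and function name are assumptions for illustration, not the KBLAS design.

```python
import numpy as np

def gemv_row_partitioned(A, x, num_devices=2):
    """Emulate a multi-device GEMV: split A by rows, compute each slice
    'on its own device' (here just a NumPy call), then concatenate.
    The row-splitting strategy is illustrative, not taken from KBLAS."""
    row_blocks = np.array_split(A, num_devices, axis=0)
    partial = [block @ x for block in row_blocks]  # one slice per device
    return np.concatenate(partial)

# Usage: the partitioned result matches a single-device GEMV.
rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 300))
x = rng.standard_normal(300)
print(np.allclose(gemv_row_partitioned(A, x, num_devices=4), A @ x))
```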
 
Topics:
Performance Optimization
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4168
 
Abstract:

Reservoir simulations involve sparse iterative solvers for linear systems that arise from implicit discretizations of coupled PDEs in high-fidelity reservoir simulators. One of the major bottlenecks in these solvers is the sparse matrix-vector product (SpMV). Sparse matrices are usually stored in a compressed format (e.g., CSR, ELL) before being processed. In this talk, we focus on the low-level design of an SpMV kernel on GPUs. Most of the relevant contributions focus on introducing new formats that suit the GPU architecture, such as the diagonal format for diagonal matrices and the blocked-ELL format for sparse matrices with small dense blocks. We, however, target both generic and domain-specific implementations. Generic implementations target the CSR and ELL formats and are intended to become part of the KAUST-BLAS library. Further optimization opportunities arise when the matrix has a specific structure. We will present the major design challenges, an outline of our approach, and preliminary results, with the primary focus on the CSR format. The other bottleneck of reservoir simulations is the preconditioning in the sparse solver; we investigate a Fast Multipole Method-based technique on GPUs as a compute-bound preconditioner.

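As background for the format discussion, the sketch below is a plain reference implementation of the CSR sparse matrix-vector product: one dot product per row over that row's stored nonzeros. It shows what the kernel must compute, not how the talk's optimized GPU kernel is written; the small example matrix is illustrative.

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    """Reference CSR SpMV: y[i] = sum of vals[j] * x[col_idx[j]]
    for j in the i-th row's slice [row_ptr[i], row_ptr[i+1])."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=np.float64)
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(vals[start:end], x[col_idx[start:end]])
    return y

# Usage: the 3x4 matrix [[10, 0, 0, 2],
#                        [ 0, 3, 0, 0],
#                        [ 0, 0, 5, 7]] in CSR form.
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 3, 1, 2, 3])
vals    = np.array([10.0, 2.0, 3.0, 5.0, 7.0])
x = np.array([1.0, 1.0, 1.0, 1.0])
print(spmv_csr(row_ptr, col_idx, vals, x))  # expected: [12. 3. 12.]
```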
 
Topics:
Developer - Algorithms, Seismic & Geosciences
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S3449
 
 