GTC ON-DEMAND
Abstract:
Come and learn how a task-based programming model improves the performance of Reverse Time Migration (RTM) on large GPU clusters. By relying on a dynamic runtime system to schedule the various tasks of the RTM (e.g., the stencil computation kernel, Perfectly Matched Layer computations, I/O operations, and imaging-condition calculations), the overall application translates into an out-of-order execution. This opens up new opportunities to overlap expensive, non-critical operations, such as I/O, with tasks on the critical path, such as the high-performance GPU stencil kernel computations during forward/backward modeling. Idle time is reduced, while load balancing is achieved through work stealing on each node. To further reduce the overhead of the I/O operations, numerical compression algorithms are investigated, in addition to asynchronous execution, to avoid running in an out-of-core mode of operation and to maximize occupancy of GPU memory.
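The overlap of non-critical I/O with critical-path compute that this abstract describes can be illustrated with a minimal CPU-side sketch: a thread pool stands in for the dynamic runtime, and sleeps stand in for the GPU kernels and disk writes (all names here are illustrative, not from the actual RTM code):

```python
import concurrent.futures
import time

def stencil_kernel(step):
    # Critical-path work: stands in for the GPU stencil computation.
    time.sleep(0.01)
    return f"wavefield_{step}"

def write_snapshot(wavefield):
    # Non-critical I/O: stands in for writing a snapshot to disk.
    time.sleep(0.01)
    return f"saved:{wavefield}"

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    io_futures = []
    for step in range(4):
        wf = stencil_kernel(step)                            # stay on the critical path
        io_futures.append(pool.submit(write_snapshot, wf))   # overlap the I/O
    results = [f.result() for f in io_futures]

print(results)
```

The snapshots are written while later stencil steps run, which is the same idle-time reduction the runtime achieves at scale.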
 
Topics:
Performance Optimization, Seismic & Geosciences
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8235
 
Abstract:
Have you heard about the world's biggest eye ever built? Are you interested in scientific simulations running on NVIDIA DGX-1? Come and learn how combining these powerful computing devices gives the computational astronomy community a dramatic leap forward in designing major, multimillion-dollar optical instruments for the European Extremely Large Telescope. Starting from the mathematical model up to the high-performance implementation on DGX-1, we'll explain how the resulting matrix computations, associated with an efficient task-based programming model, help design the next generation of telescope instruments and, eventually, demonstrate a pathfinder for the discovery of new galaxies.
 
Topics:
Algorithms & Numerical Techniques, Astronomy & Astrophysics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8231
 
Abstract:

Come and learn about new fast low-rank matrix computations on GPUs! By exploiting the low-rank off-diagonal block structure, we design and implement fast linear algebra operations on massively parallel hardware architectures. The main idea is to refactor the numerical algorithms and the corresponding implementations by aggregating similar numerical operations into highly optimized batched kernels. Applications in weather prediction, seismic imaging, and material science are used to assess the trade-off between numerical accuracy and parallel performance of these fast matrix computations compared to more traditional approaches.
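The low-rank compression idea can be sketched in NumPy: a numerically low-rank off-diagonal block is replaced by truncated-SVD factors, cutting both storage and matvec cost while keeping accuracy (a toy illustration, not the library's actual algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
# A numerically low-rank off-diagonal block: outer product plus tiny noise.
U_true = rng.standard_normal((64, 3))
V_true = rng.standard_normal((3, 64))
block = U_true @ V_true + 1e-9 * rng.standard_normal((64, 64))

# Compress with a truncated SVD at rank k.
k = 3
U, s, Vt = np.linalg.svd(block, full_matrices=False)
U_k = U[:, :k] * s[:k]   # 64 x k
V_k = Vt[:k, :]          # k x 64

# Storage drops from 64*64 to 2*64*k entries; the matvec uses the factors.
x = rng.standard_normal(64)
y_full = block @ x
y_lowrank = U_k @ (V_k @ x)
err = np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full)
print(err)
```

The relative error stays near the noise level, which is the accuracy/performance trade-off the abstract refers to.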
 
Topics:
Algorithms & Numerical Techniques, Tools & Libraries, HPC and AI
Type:
Talk
Event:
GTC Europe
Year:
2017
Session ID:
23367
 
Abstract:
Learn how statistical modeling is revolutionizing weather/climate prediction applications. Such models offer high fidelity in theory and are increasingly viewed as potential replacements for actual simulations. Their main drawbacks are the high flop count and the memory footprint of the computations on the large dense covariance matrix, which make them unrealistic in practice. By exploiting the low-rank structure of the matrix and redesigning the underlying linear algebra in terms of batch operations, the fidelity of the model is maintained while the corresponding performance achieved on GPUs is unprecedented. Low-rank matrix computations on GPUs boost existing machine learning algorithms for weather prediction applications and open new research directions.
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7413
 
Abstract:

Have you heard about the largest ground-based telescope ever built? Are you interested in the newest NVIDIA DGX-1 hardware accelerator? Come and learn how the DGX-1 architecture gives the computational astronomy community a dramatic leap forward in designing major, multimillion-dollar optical instruments for the European Extremely Large Telescope. Starting from the mathematical model up to the high-performance implementation on distributed-memory systems with hardware accelerators, we'll explain how the resulting matrix computations, associated with an efficient task-based programming model, help design the next generation of telescope instruments.

 
Topics:
Astronomy & Astrophysics, Tools & Libraries, Federal
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7153
 
Abstract:

Learn how GPUs help design major, multimillion-dollar optical instruments for the European Extremely Large Telescope. From the mathematical model up to the high-performance implementation on distributed-memory systems with hardware accelerators, this talk will explain how the resulting dense linear algebra operations associated with an efficient task-based programming model help design the next generation of telescope instruments. 

 
Topics:
HPC and Supercomputing, Algorithms & Numerical Techniques, Astronomy & Astrophysics
Type:
Talk
Event:
GTC Europe
Year:
2016
Session ID:
SEU6173
 
Abstract:
Have you heard about the world's biggest eye? Learn how GPUs help design major, multimillion-dollar optical instruments for the European Extremely Large Telescope. Starting from the mathematical model up to the high-performance implementation on distributed-memory systems with hardware accelerators, we'll explain how the resulting dense linear algebra operations associated with an efficient task-based programming model help design the next generation of telescope instruments.
 
Topics:
Astronomy & Astrophysics, Algorithms & Numerical Techniques, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6229
 
Abstract:
Learn about a new hierarchical matrix structure for fast linear algebra computations on GPUs! Recursivity, tree traversal, hierarchical data layout, and batched kernel executions are some of the ingredients of a new HPC recipe for computing challenging linear algebra operations and solving large scientific problems (e.g., spatial statistics) on GPUs. By exploiting the low-rank matrix representations, the original dense matrix of the problem can be approximated, which results in saving the memory footprint and reducing the algorithmic complexity, while still maintaining an adequate solution accuracy. In addition, the talk showcases a new high-performance hierarchical symmetric eigensolver and SVD, juicing the horsepower out of multiple GPUs to the fullest.
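The batched-kernel ingredient, aggregating many similar small operations into one call, can be illustrated with NumPy's batched matrix multiply (a CPU analogy of a single batched GPU kernel launch; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# 1024 independent small (16x16) multiplications: the shape of work that
# batched GPU kernels aggregate into one launch instead of 1024 launches.
A = rng.standard_normal((1024, 16, 16))
B = rng.standard_normal((1024, 16, 16))

C_batched = np.matmul(A, B)   # one "batched" call over all blocks

# Reference: looping over the batch one block at a time.
C_loop = np.stack([A[i] @ B[i] for i in range(1024)])
print(np.allclose(C_batched, C_loop))
```

On a GPU the batched form amortizes launch overhead and keeps the device saturated with many tiny, independent problems.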
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6230
 
Abstract:
We present a high performance hierarchical matrix vector multiplication using hardware accelerators. By properly mapping the tree structures to the GPU and overlapping the phases of the computation using streams, we greatly outperform the CPU implementations and achieve up to 80% of the sustained bandwidth of the GPU.
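A two-level toy version of such a hierarchical matvec, with dense diagonal blocks and rank-k factored off-diagonal blocks, might look like this (illustrative only; the actual implementation traverses a deeper tree on the GPU and overlaps its phases with streams):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 128, 4
# Dense diagonal blocks, rank-k factored off-diagonal blocks.
D1 = rng.standard_normal((n, n))
D2 = rng.standard_normal((n, n))
U12, V12 = rng.standard_normal((n, k)), rng.standard_normal((k, n))
U21, V21 = rng.standard_normal((n, k)), rng.standard_normal((k, n))

def hmatvec(x):
    x1, x2 = x[:n], x[n:]
    # Each low-rank block is applied as two thin products, O(nk) not O(n^2).
    y1 = D1 @ x1 + U12 @ (V12 @ x2)
    y2 = U21 @ (V21 @ x1) + D2 @ x2
    return np.concatenate([y1, y2])

# Check against the explicitly assembled dense matrix.
A = np.block([[D1, U12 @ V12], [U21 @ V21, D2]])
x = rng.standard_normal(2 * n)
print(np.allclose(hmatvec(x), A @ x))
```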
 
Topics:
Algorithms & Numerical Techniques, HPC and Supercomputing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6140
 
Abstract:
Come and learn how GPUs can help discover the most distant galaxies by performing close-to-real-time simulations, at an unprecedented scale, of the multi-object adaptive optics (MOAO) technique. The European Southern Observatory (ESO) is leading the construction of the European Extremely Large Telescope (E-ELT), a 39m-diameter telescope, to provide Europe with the biggest eye on the universe ever built. MOAO is the most complex adaptive optics concept proposed for the E-ELT, and simulating the instrument at full scale is extremely compute-intensive. The tomographic reconstructor (TR) is one of the core components of both the design simulations and, eventually, system operations, and it requires the inversion of a large dense covariance matrix.
 
Topics:
Astronomy & Astrophysics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2015
Session ID:
S5122
 
Abstract:
Learn how to leverage current numerical algorithms for solving challenging reservoir and seismic simulation problems on GPUs using: 1) a novel preconditioner technique based on massively parallel, compute-intensive fast N-body methods; 2) an optimized implementation of the sparse matrix-vector multiplication used during the iterative solver phase, which exploits the existing structure of the sparse matrix; and 3) a synchronization-reducing algorithm for stencil-based computation during explicit time integration.
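The stencil-based explicit time integration mentioned in point 3 can be illustrated by a minimal 1-D second-order wave-equation step (a toy sketch with fixed boundaries; the production kernels are 3-D GPU stencils):

```python
import numpy as np

def step_wave(u_prev, u_curr, c2dt2_dx2):
    # One explicit second-order time step of the 1-D wave equation:
    # u_next = 2*u_curr - u_prev + (c*dt/dx)^2 * laplacian(u_curr)
    lap = np.zeros_like(u_curr)
    lap[1:-1] = u_curr[2:] - 2.0 * u_curr[1:-1] + u_curr[:-2]
    return 2.0 * u_curr - u_prev + c2dt2_dx2 * lap

n = 100
u_prev = np.zeros(n)
u_curr = np.zeros(n)
u_curr[n // 2] = 1.0                       # initial pulse in the middle
for _ in range(10):                        # leapfrog in time
    u_prev, u_curr = u_curr, step_wave(u_prev, u_curr, 0.25)
print(u_curr.shape)
```

Each grid point reads only its neighbors, which is what makes the kernel amenable to synchronization-reducing, tile-based GPU execution.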
 
Topics:
Seismic & Geosciences, Numerical Algorithms & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4287
 
Abstract:

Learn about the newest developments in high-performance numerical linear algebra for heterogeneous GPU-based systems. A number of novel algorithms and the methodology used for their implementation on multi-GPU platforms will be shown. The implementations are open source, available through the MAGMA library, a next generation of LAPACK for heterogeneous architectures. Included are both linear system and eigenproblem solvers, for both dense and sparse computations. The developments incorporate advances made through the CUDA Center of Excellence (CCOE) at the University of Tennessee, the CCOE at King Abdullah University of Science and Technology, Saudi Arabia, and at INRIA, France, through the StarPU and MORSE projects.
 
Topics:
Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S3281
 
Abstract:

Reservoir simulations involve sparse iterative solvers for linear systems that arise from implicit discretizations of coupled PDEs in high-fidelity reservoir simulators. One of the major bottlenecks in these solvers is the sparse matrix-vector product. Sparse matrices are usually compressed in some format (e.g., CSR, ELL) before being processed. In this talk, we focus on the low-level design of a sparse matrix-vector (SpMV) kernel on GPUs. Most relevant contributions introduce new formats that suit the GPU architecture, such as the diagonal format for diagonal matrices and the blocked-ELL format for sparse matrices with small dense blocks. We, however, target both generic and domain-specific implementations. The generic implementations target the CSR and ELL formats, to become part of the KAUST-BLAS library; more opportunities for optimization appear when the matrix has specific structure. We will present the major design challenges and outlines, with preliminary results focused on the CSR format. The other bottleneck of reservoir simulations is the preconditioning in the sparse solver; we investigate a Fast Multipole Method-based technique on GPUs as a compute-bound preconditioner.
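A reference (unoptimized) CSR SpMV, the kernel this talk targets, can be written in a few lines; the GPU versions parallelize over rows and carefully coalesce these memory accesses (the 3x3 matrix below is a made-up example):

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    # y[i] = dot product of the nonzeros of row i with the matching x entries.
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = vals[start:end] @ x[col_idx[start:end]]
    return y

# Example matrix:  [[4, 0, 1],
#                   [0, 2, 0],
#                   [3, 0, 5]]
row_ptr = np.array([0, 2, 3, 5])   # row i's nonzeros live in [row_ptr[i], row_ptr[i+1])
col_idx = np.array([0, 2, 1, 0, 2])
vals    = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(row_ptr, col_idx, vals, x))  # [5. 2. 8.]
```

The indirection through `col_idx` is exactly the irregular access pattern that makes SpMV memory-bound and format-sensitive on GPUs.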
 
Topics:
Developer - Algorithms, Seismic & Geosciences
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S3449
 
Abstract:

See the newest features integrated in MAGMA (Matrix Algebra on GPU and Multicore Architectures) to tackle multiple-GPU systems for numerical linear algebra. In this talk, we describe how we leveraged MAGMA to solve existing and new challenging numerical problems on multiple hardware accelerators. Using a hybridization methodology, the new multi-GPU-enabled MAGMA is characterized by a representation of linear algebra algorithms as directed acyclic graphs, where nodes correspond to tasks and edges to data dependencies among them, and by the dynamic runtime system StarPU, used to schedule the various computational kernels over hybrid architectures of GPUs and homogeneous multicores.
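The DAG-of-tasks representation can be sketched with Python's standard-library topological sorter; a runtime such as StarPU additionally dispatches ready tasks to CPUs or GPUs concurrently (the task names below are illustrative, not MAGMA's actual kernels):

```python
from graphlib import TopologicalSorter

# Nodes are tasks, edges are data dependencies; any task whose
# predecessors have finished may be dispatched to a free device.
deps = {
    "factorize_panel": set(),
    "update_trailing": {"factorize_panel"},
    "solve_triangular": {"factorize_panel"},
    "assemble_result": {"update_trailing", "solve_triangular"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)
```

Note that `update_trailing` and `solve_triangular` have no edge between them, so a runtime is free to execute them in parallel on different devices.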
 
Topics:
Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2012
Session ID:
S2042
 
Abstract:
In this work, we describe optimized numerical kernels computing the symmetric matrix-vector product (Level 2 BLAS) on the latest NVIDIA Tesla GPU family, codenamed Fermi (C2070). Due to its inherently memory-bound nature, this kernel represents one of the most critical operations in computing the tridiagonal form of a symmetric dense matrix, which is the preprocessing step toward calculating the eigenpairs. Using a novel design that addresses the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show up to 3.5-fold speedups over existing numerical libraries.
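The symmetric matvec (SYMV) exploits the fact that only one triangle of the matrix needs to be stored and read, with each off-diagonal entry contributing to two output elements. A naive sketch follows (the GPU kernel instead blocks the matrix and stages exactly these irregular accesses through shared memory):

```python
import numpy as np

def symv_lower(A_lower, x):
    # y = A x for symmetric A, touching only the stored lower triangle:
    # each a_ij (i > j) contributes to both y[i] and y[j].
    n = len(x)
    y = np.zeros(n)
    for i in range(n):
        y[i] += A_lower[i, i] * x[i]
        for j in range(i):
            y[i] += A_lower[i, j] * x[j]
            y[j] += A_lower[i, j] * x[i]
    return y

rng = np.random.default_rng(3)
L = np.tril(rng.standard_normal((6, 6)))
A = L + np.tril(L, -1).T   # full symmetric matrix, for reference only
x = rng.standard_normal(6)
print(np.allclose(symv_lower(L, x), A @ x))
```

Halving the matrix reads is exactly what matters for a memory-bound Level 2 BLAS kernel.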
 
Topics:
Developer - Algorithms
Type:
Poster
Event:
GTC Silicon Valley
Year:
2012
Session ID:
P2401
 
Speakers:
Hatem Ltaief, Stan Tomov
- University of Tennessee
Abstract:
Learn how to develop faster, cheaper and better linear algebra software for GPUs through a hybridization methodology that is built on (1) Representing linear algebra algorithms as directed acyclic graphs where nodes correspond to tasks and edges to dependencies among them, and (2) Scheduling the execution of the tasks over hybrid architectures of GPUs and multicore. Examples will be given using MAGMA, a new generation of linear algebra libraries that extends the sequential LAPACK-style algorithms to the highly parallel GPU and multicore heterogeneous architectures.
 
Topics:
HPC and AI, Tools & Libraries, Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2010
Session ID:
S102138
 
 