GTC ON-DEMAND

 
Abstract:

Learn about the latest developments in the MVAPICH2 library, which simplifies the task of porting Message Passing Interface (MPI) applications to supercomputing clusters with NVIDIA GPUs. MVAPICH2 supports MPI communication directly from GPU device memory and optimizes it using various features offered by the CUDA toolkit. These optimizations are integrated transparently under the standard MPI API for better programmability. Recent advances in MVAPICH2 include designs for MPI-3 RMA using GPUDirect RDMA, use of the fast GDRCOPY library, a framework for MPI datatype processing using CUDA kernels, and more. Performance results with micro-benchmarks and applications will be presented using MPI and CUDA/OpenACC. The impact of processor affinity to the GPU and the network on performance will also be presented.
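
The usage pattern the abstract describes, passing GPU device pointers straight to MPI calls, can be sketched as follows. This is a minimal illustration assuming a CUDA-aware MPI library such as an MVAPICH2 build with GPU support; the buffer size and the ping-pong exchange are illustrative and not taken from the talk.

/* Minimal sketch: CUDA-aware MPI ping-pong using device buffers.
 * Assumes an MPI library built with CUDA support (e.g. MVAPICH2 with
 * GPU support enabled); otherwise device pointers would have to be
 * staged through host memory. Compile with mpicc and link the CUDA
 * runtime (-lcudart). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                 /* 1 Mi doubles, illustrative */
    double *d_buf = NULL;
    cudaMalloc((void **)&d_buf, n * sizeof(double));
    cudaMemset(d_buf, 0, n * sizeof(double));

    /* The device pointer is handed directly to MPI; a CUDA-aware
     * runtime moves the data (e.g. via GPUDirect RDMA) transparently. */
    if (rank == 0) {
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

In MVAPICH2 releases of this era, device-pointer support was typically switched on at run time (for example via the MV2_USE_CUDA environment variable); consult the user guide of the version in use for the exact setting.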

Topics: HPC and Supercomputing, Tools & Libraries, Data Center & Cloud Infrastructure
Type: Talk
Event: GTC Silicon Valley
Year: 2015
Session ID: S5461
 
Abstract:

Learn about extensions that enable efficient use of Partitioned Global Address Space (PGAS) models like OpenSHMEM and UPC on supercomputing clusters with NVIDIA GPUs. PGAS models are gaining attention for providing shared-memory abstractions that make it easy to develop applications with dynamic and irregular communication patterns. However, the existing UPC and OpenSHMEM standards do not allow communication calls to be made directly on GPU device memory. This talk discusses simple extensions to the OpenSHMEM and UPC models that address this issue. Runtimes that support these extensions, optimize data movement using features like CUDA IPC and GPUDirect RDMA, and exploit overlap are presented. We demonstrate the use of the extensions and the performance impact of the runtime designs.
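
For context on why such extensions are needed, the sketch below shows the staging that the standard interfaces require when the data of interest lives in GPU memory: the buffer must first be copied to a host-side symmetric buffer before shmem_putmem can be used. This is a baseline illustration only, assuming OpenSHMEM 1.2-style names (shmem_malloc, shmem_finalize); the GPU-aware calls proposed in the talk are not shown, since their interface is specific to the authors' runtime.

/* Baseline without the proposed extensions: GPU data must be staged
 * through a host-side symmetric buffer before an OpenSHMEM put.
 * Illustrative only; assumes OpenSHMEM 1.2-style names. */
#include <shmem.h>
#include <cuda_runtime.h>

#define N (1 << 20)

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();
    int peer = (me + 1) % npes;

    /* Symmetric heap buffer on the host (required by the standard). */
    double *host_buf = (double *)shmem_malloc(N * sizeof(double));

    /* Application data actually lives on the GPU. */
    double *d_buf = NULL;
    cudaMalloc((void **)&d_buf, N * sizeof(double));
    cudaMemset(d_buf, 0, N * sizeof(double));

    /* Extra hop: device-to-host copy before the PGAS call. The
     * extensions discussed in the talk aim to eliminate this step. */
    cudaMemcpy(host_buf, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);

    /* Push the staged data into the same symmetric buffer on the peer. */
    shmem_putmem(host_buf, host_buf, N * sizeof(double), peer);
    shmem_barrier_all();

    cudaFree(d_buf);
    shmem_free(host_buf);
    shmem_finalize();
    return 0;
}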

Topics: HPC and Supercomputing, Tools & Libraries, Data Center & Cloud Infrastructure
Type: Talk
Event: GTC Silicon Valley
Year: 2015
Session ID: S5470
 
Abstract:

Learn about the latest developments in middleware design that boost the performance of GPGPU-based streaming applications. Several middleware libraries already support communication directly from GPU device memory and optimize it using various features offered by the CUDA toolkit. Some also take advantage of novel features, such as the hardware-based multicast offered by high-performance networks like InfiniBand, to boost broadcast performance. This talk will focus on the challenges of combining and fully utilizing GPUDirect RDMA and hardware multicast in tandem to design support for a high-performance broadcast operation for streaming applications. Performance results will be presented to demonstrate the efficacy of the proposed designs.
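
From the application's point of view, the pattern such a design accelerates is simply a repeated broadcast of a GPU-resident frame; the GPUDirect RDMA and InfiniBand hardware-multicast machinery discussed in the talk sits inside the MPI library, not in user code. The sketch below assumes a CUDA-aware MPI and a hypothetical 100-frame streaming loop.

/* Sketch of the application-side pattern such a runtime would
 * accelerate: a root node repeatedly broadcasts a data "frame" that
 * resides in GPU memory to all consumer nodes. Assumes a CUDA-aware
 * MPI; the broadcast-specific optimizations live inside the library. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int frame_elems = 1 << 18;        /* illustrative frame size */
    float *d_frame = NULL;
    cudaMalloc((void **)&d_frame, frame_elems * sizeof(float));

    for (int frame = 0; frame < 100; ++frame) {
        if (rank == 0) {
            /* Producer: fill the frame on the GPU (real kernel omitted). */
            cudaMemset(d_frame, 0, frame_elems * sizeof(float));
        }
        /* One-to-all broadcast straight from/to device memory. */
        MPI_Bcast(d_frame, frame_elems, MPI_FLOAT, 0, MPI_COMM_WORLD);
        /* Consumers: process the frame on the GPU (kernel omitted). */
    }

    cudaFree(d_frame);
    MPI_Finalize();
    return 0;
}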

Topics: Data Center & Cloud Infrastructure, Tools & Libraries, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2015
Session ID: S5507
 
Abstract:
Learn about the latest developments in the MVAPICH2 library, which simplifies the task of porting Message Passing Interface (MPI) applications to supercomputing clusters with NVIDIA GPUs. MVAPICH2 supports MPI communication directly from GPU device memory and optimizes it using various features offered by the CUDA toolkit, providing optimized performance on different GPU node configurations. These optimizations are integrated transparently under the standard MPI API for better programmability. Recent advances in MVAPICH2 include designs for MPI-3 RMA using GPUDirect RDMA, a framework for MPI datatype processing using CUDA kernels, support for heterogeneous clusters with GPU and non-GPU nodes, and more. We use the popular OSU micro-benchmark suite and example applications to demonstrate how developers can effectively take advantage of MVAPICH2 in applications using MPI and CUDA/OpenACC. We provide guidance on issues like processor affinity to the GPU and the network that can significantly affect the performance of MPI applications that use MVAPICH2.
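
One application-side counterpart to the affinity guidance mentioned here is to bind each MPI rank to a GPU according to its node-local rank, so that ranks sharing a node spread across that node's GPUs. The sketch below is a minimal illustration of that idea using MPI-3's MPI_Comm_split_type; the MVAPICH2-specific affinity and CUDA settings discussed in the talk are run-time configuration rather than application code.

/* Minimal sketch: pick a GPU per MPI rank based on the node-local
 * rank. Uses MPI-3's MPI_Comm_split_type to discover which ranks
 * share a node; library-side affinity settings are configured at
 * run time, not here. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    /* Round-robin assignment of the ranks on this node to its GPUs. */
    int device = (num_devices > 0) ? local_rank % num_devices : 0;
    cudaSetDevice(device);
    printf("local rank %d -> GPU %d\n", local_rank, device);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}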
 
Topics: HPC and Supercomputing, Performance Optimization, Programming Languages
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4517
 
Abstract:
Learn about extensions that enable efficient use of Partitioned Global Address Space (PGAS) models like OpenSHMEM and UPC on supercomputing clusters with NVIDIA GPUs. PGAS models are gaining attention for providing shared-memory abstractions that make it easy to develop applications with dynamic communication patterns. However, the existing UPC and OpenSHMEM standards do not allow communication calls to be made directly on GPU device memory: data has to be moved to the CPU before PGAS models can be used for communication. This talk discusses simple extensions to the OpenSHMEM and UPC models that address this issue. They allow direct communication from GPU memory and enable runtimes to optimize data movement using features like CUDA IPC and GPUDirect RDMA, in a way that is transparent to the application developer. We present designs that focus on performance and truly one-sided communication. We use application kernels to demonstrate the use of the extensions and the performance impact of the runtime designs on clusters with GPUs.
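
The intra-node optimization mentioned above builds on CUDA IPC, which lets one process map another process's device allocation and copy from it directly. The sketch below shows that underlying mechanism in isolation, with MPI used only to ship the opaque handle between two ranks assumed to be on the same node; it is illustrative and is not the authors' runtime.

/* Sketch of the CUDA IPC mechanism such runtimes build on: rank 0
 * exports a handle to its device buffer, rank 1 (same node) maps it
 * and performs a direct device-to-device copy. MPI carries only the
 * opaque handle. Illustrative, not the runtime from the talk. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* ranks 0 and 1 assumed on one node */

    const size_t bytes = (size_t)(1 << 20) * sizeof(float);
    float *d_src = NULL;                     /* exporter's buffer (rank 0) */

    if (rank == 0) {
        cudaMalloc((void **)&d_src, bytes);
        cudaMemset(d_src, 0, bytes);

        /* Export an IPC handle for the allocation and ship it to rank 1. */
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_src);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        /* Map rank 0's device buffer and copy from it device-to-device. */
        void *d_peer = NULL;
        cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);

        float *d_dst = NULL;
        cudaMalloc((void **)&d_dst, bytes);
        cudaMemcpy(d_dst, d_peer, bytes, cudaMemcpyDeviceToDevice);

        cudaIpcCloseMemHandle(d_peer);
        cudaFree(d_dst);
    }

    /* Keep the exporter's buffer alive until the importer is done. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        cudaFree(d_src);

    MPI_Finalize();
    return 0;
}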
 
Topics: Programming Languages, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4528
 
Abstract:
Learn about the design and use of a hybrid High-Performance Linpack (HPL) benchmark to measure the peak performance of heterogeneous clusters with GPU and non-GPU nodes. HPL continues to be used as the yardstick for ranking supercomputers around the world. Many clusters, at different scales, are being deployed with only a subset of nodes equipped with NVIDIA GPU accelerators. Their true peak performance is not reported due to the lack of a version of HPL that can take advantage of all the CPU and GPU resources available. We discuss a simple yet elegant approach: a fine-grained, weighted MPI process distribution that balances the load between CPU and GPU nodes. We use techniques like process reordering to minimize communication overheads. We evaluate our approach on a real-world cluster, Oakley at the Ohio Supercomputer Center. On a heterogeneous configuration with 32 GPU and 192 non-GPU nodes, we achieve up to 50% of the combined theoretical peak and up to 80% of the combined actual peak performance of the GPU and non-GPU nodes.
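
One way to read the weighted distribution idea is as simple arithmetic: give each node a number of MPI processes proportional to its estimated throughput, so GPU nodes carry proportionally more of the HPL workload. The sketch below illustrates that arithmetic using the 32 GPU / 192 non-GPU node counts from the abstract; the helper function and the per-node GFLOP/s figures are hypothetical and are not taken from the talk.

/* Illustrative arithmetic only: weight the number of MPI processes
 * placed on each node type by that type's estimated throughput.
 * The helper name and the GFLOP/s estimates are hypothetical. */
#include <stdio.h>

/* Round a throughput share to a whole number of processes, never
 * returning fewer than one process per node. */
static int processes_for_node(double node_gflops, double gflops_per_process)
{
    int p = (int)(node_gflops / gflops_per_process + 0.5);
    return p > 0 ? p : 1;
}

int main(void)
{
    /* Node counts from the abstract; per-node estimates are assumed. */
    const int    cpu_nodes       = 192;
    const int    gpu_nodes       = 32;
    const double cpu_node_gflops = 140.0;   /* assumed estimate */
    const double gpu_node_gflops = 700.0;   /* assumed estimate */

    /* Anchor the weighting at, say, two processes per CPU node. */
    const double gflops_per_process = cpu_node_gflops / 2.0;

    int p_cpu = processes_for_node(cpu_node_gflops, gflops_per_process);
    int p_gpu = processes_for_node(gpu_node_gflops, gflops_per_process);

    printf("processes per CPU node: %d\n", p_cpu);
    printf("processes per GPU node: %d\n", p_gpu);
    printf("total MPI processes:    %d\n",
           p_cpu * cpu_nodes + p_gpu * gpu_nodes);
    return 0;
}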
 
Topics: HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4535
 
Speakers: Dhabaleswar K. (DK) Panda, Ohio State University
 
Topics: Tools & Libraries
Type: Talk
Event: Supercomputing
Year: 2011
Session ID: SC120