Learn about the latest developments in the MVAPICH2 library that simplify the task of porting Message Passing Interface (MPI) applications to supercomputing clusters with NVIDIA GPUs. MVAPICH2 supports MPI communication directly from GPU device memory and optimizes it using various features offered by the CUDA toolkit. These optimizations are integrated transparently under the standard MPI API for better programmability. Recent advances in MVAPICH2 include designs for MPI-3 RMA using GPUDirect RDMA, use of the fast GDRCOPY library, a framework for MPI datatype processing using CUDA kernels, and more. Performance results with micro-benchmarks and applications will be presented using MPI and CUDA/OpenACC. The impact of processor affinity to the GPU and the network on performance will also be presented.
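As a concrete illustration of this programming model, the sketch below shows a CUDA-aware point-to-point transfer in which device pointers are passed straight to MPI calls and the library handles staging or GPUDirect RDMA internally. It is a minimal sketch, not taken from the talk: the run-time setting mentioned in the comment (e.g. MV2_USE_CUDA=1) and the buffer size are illustrative assumptions.

    /* Minimal sketch of CUDA-aware MPI point-to-point transfer, assuming an
     * MVAPICH2 build with GPU support (e.g. MV2_USE_CUDA=1 at run time).
     * Device pointers are passed directly to MPI; no explicit host staging
     * copy is required in the application. Size and tag are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        const size_t n = 1 << 20;          /* 1M floats, illustrative */
        float *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, n * sizeof(float));

        if (rank == 0) {
            /* Send directly from GPU device memory. */
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive directly into GPU device memory. */
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

Because the device pointer is accepted by the standard MPI API, the same application code runs unchanged whether the library stages through host memory, uses GDRCOPY, or uses GPUDirect RDMA.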
Learn about extensions that enable efficient use of Partitioned Global Address Space (PGAS) models such as OpenSHMEM and UPC on supercomputing clusters with NVIDIA GPUs. PGAS models are gaining attention for providing shared-memory abstractions that make it easy to develop applications with dynamic and irregular communication patterns. However, the existing UPC and OpenSHMEM standards do not allow communication calls to be made directly on GPU device memory. This talk discusses simple extensions to the OpenSHMEM and UPC models that address this issue. Runtimes that support these extensions, optimize data movement using features like CUDA IPC and GPUDirect RDMA, and exploit overlap are presented. We demonstrate the use of the extensions and the performance impact of the runtime designs.
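The sketch below illustrates the kind of extension being discussed for OpenSHMEM: letting one-sided communication calls operate on buffers that reside in GPU device memory. The shmem_* calls are standard OpenSHMEM; the commented-out device-heap hint is hypothetical and stands in for whatever mechanism the proposed extensions define.

    /* Illustrative sketch: one-sided OpenSHMEM put on a symmetric buffer.
     * Standard OpenSHMEM requires this buffer to be in host memory; the
     * hypothetical hint below (not part of any standard) sketches how an
     * extension could place the symmetric heap in GPU device memory. */
    #include <shmem.h>

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        const size_t nbytes = 4096;        /* illustrative message size */

        /* Hypothetical extension: allocate the symmetric heap on the GPU,
         * so the pointer returned below is a device pointer on every PE. */
        /* shmemx_heap_kind(SHMEM_HEAP_ON_DEVICE); */
        char *buf = (char *) shmem_malloc(nbytes);

        /* One-sided put to the right neighbour; with the extension, source
         * and destination could both reside in GPU memory. */
        shmem_putmem(buf, buf, nbytes, (me + 1) % npes);

        shmem_barrier_all();
        shmem_free(buf);
        shmem_finalize();
        return 0;
    }

The application-level code stays one-sided and simple; the runtime decides whether to move the data with CUDA IPC (intra-node) or GPUDirect RDMA (inter-node).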
Learn about the latest developments in middleware design that boost the performance of GPGPU-based streaming applications. Several middleware libraries already support communication directly from GPU device memory and optimize it using various features offered by the CUDA toolkit. Some of them also take advantage of novel capabilities, such as the hardware-based multicast offered by high-performance networks like InfiniBand, to speed up broadcast operations. This talk will focus on the challenges of combining and fully utilizing GPUDirect RDMA and hardware multicast in tandem to design a high-performance broadcast operation for streaming applications. Performance results will be presented to demonstrate the efficacy of the proposed designs.
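The communication pattern targeted here is sketched below: one source rank repeatedly broadcasts GPU-resident data blocks to all consumer ranks. Whether the library maps this onto InfiniBand hardware multicast combined with GPUDirect RDMA depends on the implementation and its configuration; the frame count and buffer size are illustrative assumptions.

    /* Minimal sketch of the broadcast pattern in a streaming pipeline:
     * rank 0 is the stream source and repeatedly broadcasts frames that
     * live in GPU device memory to all other ranks. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int nframes = 100;           /* illustrative stream length */
        const size_t n = 1 << 22;          /* 4M floats per frame */
        float *d_frame;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_frame, n * sizeof(float));

        for (int i = 0; i < nframes; i++) {
            /* All ranks receive the frame directly into GPU memory and
             * would launch their processing kernels on it here. */
            MPI_Bcast(d_frame, n, MPI_FLOAT, 0, MPI_COMM_WORLD);
        }

        cudaFree(d_frame);
        MPI_Finalize();
        return 0;
    }

Keeping the application at the level of a standard broadcast call is what allows the middleware to substitute a multicast-plus-GPUDirect design underneath without any source changes.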