Learn about a case study comparing OpenACC and OpenMP 4.5 in the context of stellar explosions. Modeling supernovae requires multi-physics simulation codes that capture hydrodynamics, nuclear burning, gravitational forces, and more. As a nuclear detonation burns through the stellar material, it also raises the temperature. An equation of state (EOS) is then required to determine, say, the new pressure associated with this temperature increase. In fact, an EOS is needed whenever the thermodynamic conditions are changed by any physics routine, so it is called many times throughout a simulation, making a fast EOS implementation essential. Fortunately, these calculations can be performed independently during each time step, so the work can be offloaded to GPUs. On the IBM/NVIDIA early test system (a precursor to the upcoming Summit supercomputer) at Oak Ridge National Laboratory, we use a hybrid MPI+OpenMP (traditional CPU threads) driver program to offload work to GPUs. We'll compare the performance results as well as some of the currently available features of OpenACC and OpenMP 4.5.
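As a rough illustration (not the actual driver code), the two directive models express such a loop of independent per-zone EOS evaluations in closely parallel ways; the eos() routine and the array names below are hypothetical stand-ins:

/* Stand-in for the real equation-of-state kernel; marked for device compilation. */
#pragma acc routine seq
#pragma omp declare target
double eos(double temp, double dens);
#pragma omp end declare target

void eos_batch_acc(int n, const double *temp, const double *dens, double *pres)
{
    /* OpenACC: each zone's EOS evaluation is independent, so offload the loop. */
    #pragma acc parallel loop copyin(temp[0:n], dens[0:n]) copyout(pres[0:n])
    for (int i = 0; i < n; i++)
        pres[i] = eos(temp[i], dens[i]);
}

void eos_batch_omp45(int n, const double *temp, const double *dens, double *pres)
{
    /* OpenMP 4.5: the same loop expressed with target/teams/distribute. */
    #pragma omp target teams distribute parallel for \
            map(to: temp[0:n], dens[0:n]) map(from: pres[0:n])
    for (int i = 0; i < n; i++)
        pres[i] = eos(temp[i], dens[i]);
}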
The MPS (moving particle semi-implicit) method is a particle method (not a stencil computation) used for computational fluid dynamics. Neighbor-particle search is its main bottleneck. We show our porting effort and three OpenACC optimizations of the neighbor-particle search, and evaluate our implementations on Tesla K20c, GeForce GTX 1080, and Tesla P100 GPUs, achieving speedups of 45.7x, 96.8x, and 126.1x, respectively, over a single-threaded Ivy Bridge CPU.
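For illustration only, a minimal (unoptimized) OpenACC offload of a brute-force neighbor-particle search might look like the following; the array names, cutoff radius re, and O(N^2) structure are assumptions, not the optimized versions presented in the talk:

/* Count neighbors within radius re for each of n particles (naive all-pairs search). */
void count_neighbors(int n, const double *x, const double *y, const double *z,
                     double re, int *num_nb)
{
    #pragma acc parallel loop copyin(x[0:n], y[0:n], z[0:n]) copyout(num_nb[0:n])
    for (int i = 0; i < n; i++) {
        int cnt = 0;
        #pragma acc loop seq
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            if (dx*dx + dy*dy + dz*dz < re*re) cnt++;
        }
        num_nb[i] = cnt;
    }
}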
Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with Unified Memory, the multi-process service (MPS, aka Hyper-Q for MPI), and MPI support in NVIDIA performance analysis tools.
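As a small sketch of one of these ideas, CUDA-aware MPI lets OpenACC device buffers be handed directly to MPI via host_data use_device, avoiding explicit staging through host memory; the halo-exchange pattern and buffer names below are illustrative assumptions:

#include <mpi.h>

/* Exchange n doubles with a neighboring rank using device-resident buffers. */
void exchange_halo(double *sendbuf, double *recvbuf, int n, int peer)
{
    #pragma acc data present(sendbuf[0:n], recvbuf[0:n])
    {
        /* host_data exposes the device addresses to the CUDA-aware MPI library. */
        #pragma acc host_data use_device(sendbuf, recvbuf)
        MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, peer, 0,
                     recvbuf, n, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}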
We'll present the strategy and results for porting an atmospheric fluids code, HiGrad, to the GPU. HiGrad is a cross-compiled, mixed-language code that includes C, C++, and Fortran, and is used for atmospheric modeling. Deep subroutine calls necessitate detailed control of the GPU data layout with CUDA Fortran. We'll present initial kernel accelerations with OpenACC, then discuss tuning with OpenACC and a comparison with specially curated CUDA kernels. We'll demonstrate the performance improvements and the techniques used to port this code to GPUs: a mixed CUDA Fortran and OpenACC implementation for single-node performance, plus MPI scaling studies conducted on local supercomputers and Oak Ridge National Laboratory's Titan supercomputer, across architectures including the Tesla K40 and Tesla P100.
We'll showcase recent successes in the use of GPUs to accelerate challenging molecular simulation analysis tasks on the latest NVIDIA Tesla P100 GPUs on both Intel and IBM/OpenPOWER hardware platforms, and large-scale runs on petascale computers such as Titan and Blue Waters. We'll highlight the performance benefits obtained from die-stacked memory on the Tesla P100, the NVIDIA NVLink interconnect on the IBM "Minsky" platform, and the use of NVIDIA CUDA just-in-time compilation to increase the performance of data-driven algorithms. We will present results obtained with OpenACC parallel programming directives, current challenges, and future opportunities. Finally, we'll describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations.
Emerging heterogeneous systems are opening up a wealth of programming opportunities. This panel will discuss the latest developments in accelerator programming, where programmers can choose among OpenMP, OpenACC, CUDA, and Kokkos for GPU programming. The panel will shed light on the primary considerations when choosing a model: availability across multiple platforms, richness of the feature set, suitability for a certain type of scientific code, compiler stability, or other factors. This will be an interactive Q&A session where participants can discuss their experiences with programming-model experts and developers.
OpenACC is a directive-based programming model that provides a simple interface for exploiting GPU computing. Because the GPU has a deep memory hierarchy, appropriate management of memory resources is crucial for performance. The OpenACC programming model offers the cache directive to use on-chip hardware caches (the read-only data cache) or software-managed caches (shared memory) to improve memory access efficiency. We have implemented several strategies in our PGI compiler suite to promote shared memory utilization. We'll briefly discuss our investigation of cases that can potentially be optimized by the cache directive and then dive into the underlying implementation. We evaluate our compiler with self-written micro-benchmarks as well as some real-world applications.
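A minimal, hypothetical use of the cache directive (the stencil and array names are ours, not from the talk) hints to the compiler that a small, reused window of the input array be staged in on-chip memory:

/* 3-point stencil; the cache directive requests that in[i-1..i+1] be kept on chip. */
void stencil3(int n, const double *restrict in, double *restrict out)
{
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[1:n-2])
    for (int i = 1; i < n - 1; i++) {
        #pragma acc cache(in[i-1:3])
        out[i] = 0.25 * in[i-1] + 0.5 * in[i] + 0.25 * in[i+1];
    }
}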
We'll discuss techniques for using more than one GPU in an OpenACC program. We'll demonstrate how to address multiple devices directly, how to mix OpenACC and OpenMP to manage multiple devices, and how to utilize multiple devices with OpenACC and MPI.
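A brief sketch of these three approaches, with illustrative names and assuming one GPU per host thread or MPI rank, might look like this:

#include <openacc.h>
#include <omp.h>
#include <mpi.h>

void select_devices(void)
{
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    int rank;

    /* (1) Address a specific device explicitly from the host. */
    acc_set_device_num(0, acc_device_nvidia);

    /* (2) Mix OpenACC and OpenMP: one host thread drives each GPU. */
    #pragma omp parallel num_threads(ngpus)
    acc_set_device_num(omp_get_thread_num(), acc_device_nvidia);

    /* (3) OpenACC plus MPI: one rank drives each GPU (assumes MPI_Init was already called). */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    acc_set_device_num(rank % ngpus, acc_device_nvidia);
}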