Learn a simple four-step strategy for optimizing application runtime. The strategy is illustrated on a two-dimensional Discontinuous Galerkin solver for computational fluid dynamics on structured meshes. Starting from a sequential CPU code, we guide the audience through the steps that allowed us to make the code run about 149 times faster on a GPU (performance evaluated on a K20Xm). The same optimization strategy, applied to the CPU code, makes it run about 35 times faster than the original (performance evaluated on an E5-1650v3 processor). Based on this methodology, we end up with an optimized, unified version of the code that can run simultaneously on both GPU and CPU architectures.
Learn about a case study comparing OpenACC and OpenMP 4.5 in the context of stellar explosions. Modeling supernovae requires multi-physics simulation codes to capture hydrodynamics, nuclear burning, gravitational forces, etc. As a nuclear detonation burns through the stellar material, it also increases the temperature. An equation of state (EOS) is then required to determine, say, the new pressure associated with this temperature increase. In fact, an EOS is needed whenever a physics routine changes the thermodynamic conditions. It is therefore called many times throughout a simulation, which calls for a fast EOS implementation. Fortunately, these calculations can be performed independently during each time step, so the work can be offloaded to GPUs. Using the IBM/NVIDIA early test system (precursor to the upcoming Summit supercomputer) at Oak Ridge National Laboratory, we use a hybrid MPI+OpenMP (traditional CPU threads) driver program to offload work to GPUs. We'll compare the performance results as well as some of the currently available features of OpenACC and OpenMP 4.5.
The MPS (Moving Particle Semi-implicit) method is a particle method (not a stencil computation) used for computational fluid dynamics. The neighbor-particle search is the main bottleneck of MPS. We describe our porting effort and three optimizations of the neighbor-particle search using OpenACC. We evaluate our implementations on Tesla K20c, GeForce GTX 1080, and Tesla P100 GPUs, where they achieve speedups of 45.7x, 96.8x, and 126.1x, respectively, over a single-threaded Ivy Bridge CPU.
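To make the neighbor-particle search concrete, here is a minimal sketch of one common way to avoid comparing every pair of particles: binning particles into a uniform grid of cells (a cell list). The abstract does not say which three optimizations the authors used, so this is an illustration of the general problem, not their method. The function names, the 2-D setting, and the choice of a cell side equal to the search radius are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import product
import math


def cell_of(x, y, radius):
    """Integer grid cell containing (x, y); cell side equals the search radius."""
    return (math.floor(x / radius), math.floor(y / radius))


def neighbors_cell_list(points, radius):
    """For each particle, list the indices of particles within `radius`.

    Only the 3x3 block of cells around a particle is inspected, so the cost
    grows with the local particle density rather than with the total count.
    """
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[cell_of(x, y, radius)].append(idx)

    r2 = radius * radius
    result = []
    for i, (x, y) in enumerate(points):
        cx, cy = cell_of(x, y, radius)
        found = []
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in cells.get((cx + dx, cy + dy), ()):
                px, py = points[j]
                if j != i and (px - x) ** 2 + (py - y) ** 2 <= r2:
                    found.append(j)
        result.append(sorted(found))
    return result


def neighbors_brute_force(points, radius):
    """Reference all-pairs search, used here only to check the cell list."""
    r2 = radius * radius
    return [
        [j for j, (px, py) in enumerate(points)
         if j != i and (px - x) ** 2 + (py - y) ** 2 <= r2]
        for i, (x, y) in enumerate(points)
    ]
```

The outer loop over particles carries no dependency between iterations, which is the property a directive-based port would typically exploit.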
To prepare the scientific communities, GENCI and its partners have set up a technology watch group and lead collaborations with vendors, relying on HPC experts and on HPC solutions adopted early. The two main objectives are to provide guidance and to prepare the scientific communities for the challenges of exascale architectures. The talk will present the OpenPOWER platform bought by GENCI and provided to the scientific community. It will then present the first results obtained on the platform for a set of about 15 applications using all the solutions provided to the users (CUDA, OpenACC, OpenMP, ...). Finally, it will cover one specific application in detail: its porting effort and the techniques used to target GPUs with both OpenACC and OpenMP.
In this talk we present a portable matrix assembly strategy used in solving PDEs, suited for co-execution on both CPUs and accelerators. In addition, we consider a dynamic load balancing strategy to balance the workload among the different CPUs and GPUs available on the cluster. Numerical methods for solving partial differential equations (PDEs) involve two main steps: the assembly of an algebraic system of the form Ax=b and its solution with direct or iterative solvers. The assembly step consists of a loop over elements, faces, and nodes in the case of the finite element, finite volume, and finite difference methods, respectively. It is computationally intensive and does not involve communication. It is therefore well suited for accelerators.
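The assembly step described above can be sketched on a toy problem. The example below assembles Ax=b for linear finite elements on a uniform 1-D mesh of [0, 1], for the model problem -u'' = f with constant f and no boundary conditions applied. The model problem, the function names, and the dense list-of-lists storage are assumptions chosen for brevity; this is not the talk's code.

```python
def element_contribution(h, f):
    """Local stiffness matrix and load vector of one linear 1-D element.

    Depends only on the element's own data (size h, source f), so every
    element can be processed independently of the others.
    """
    k = 1.0 / h
    local_a = [[k, -k], [-k, k]]
    local_b = [f * h / 2.0, f * h / 2.0]
    return local_a, local_b


def assemble(n_elements, f=1.0):
    """Assemble the global matrix A and right-hand side b by looping over elements."""
    h = 1.0 / n_elements
    n_nodes = n_elements + 1
    a = [[0.0] * n_nodes for _ in range(n_nodes)]
    b = [0.0] * n_nodes

    # Step 1: compute every element contribution (no communication needed).
    contributions = [element_contribution(h, f) for _ in range(n_elements)]

    # Step 2: scatter-add each local contribution into the global system.
    for e, (local_a, local_b) in enumerate(contributions):
        nodes = (e, e + 1)  # global node numbers of element e
        for r, global_row in enumerate(nodes):
            b[global_row] += local_b[r]
            for c, global_col in enumerate(nodes):
                a[global_row][global_col] += local_a[r][c]
    return a, b
```

In this sketch the first step is trivially parallel. The scatter-add in the second step touches shared nodes, so a parallel version would typically need atomic updates or a coloring of the elements.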
This talk provides an overview of the key strategies used to design and implement OpenStaPLE, an application for Lattice QCD (LQCD) Monte Carlo simulations. LQCD is an example of an HPC grand challenge application, where the accuracy of results strongly depends on the available computing resources. OpenStaPLE has been developed on top of the MPI and OpenACC frameworks. MPI manages the parallelism across multiple computing nodes and devices, while OpenACC exploits the high-level parallelism available on modern processors and accelerators, enabling a good level of portability across different architectures. After an initial overview, we also present performance and portability results on different architectures, highlighting key hardware and software improvements that may lead this class of applications to perform better.
VASP is a software package for atomic-scale materials modeling. It's one of the most widely used codes for electronic-structure calculations and first-principles molecular dynamics. We'll give an overview of the status of porting VASP to GPUs with OpenACC. Parts of VASP were previously ported to CUDA C with good speed-ups on GPUs, but also with an increase in the maintenance workload, because VASP is otherwise written wholly in Fortran. We'll discuss OpenACC performance relative to CUDA, the impact of OpenACC on VASP code maintenance, and challenges encountered in the port related to management of aggregate data structures. Finally, we'll discuss possible future solutions for data management that would simplify both new development and the maintenance of VASP and similar large production applications on GPUs.
In 2014, GENCI set up a French technology watch group that targets the provisioning of test systems, selected as part of the prospective approach among partners from GENCI. This was done in order to prepare scientific communities and users of GENCI's computing resources for the arrival of the next "Exascale" technologies. The talk will present results obtained on the OpenPOWER platform bought by GENCI and open to the scientific community. We will present the first results obtained for a set of scientific applications using the available environments (CUDA, OpenACC, OpenMP, ...), along with results obtained for AI applications using IBM's software distribution PowerAI.
Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with Unified Memory, the multi-process service (MPS, aka Hyper-Q for MPI), and MPI support in NVIDIA performance analysis tools.
We'll present the strategy and results for porting an atmospheric fluids code, HiGrad, to the GPU. HiGrad is a cross-compiled, mixed-language code that includes C, C++, and Fortran, and is used for atmospheric modeling. Deep subroutine calls necessitate detailed control of the GPU data layout with CUDA-Fortran. We'll present initial kernel accelerations with OpenACC, then discuss tuning with OpenACC and a comparison with specially curated CUDA kernels. We'll demonstrate the performance improvement and the different techniques used for porting this code to GPUs, using a mixed CUDA-Fortran and OpenACC implementation for single-node performance. We'll also present scaling studies conducted with MPI on local supercomputers and on Oak Ridge National Laboratory's Titan supercomputer, on different architectures including the Tesla K40 and Tesla P100.
We'll showcase recent successes in the use of GPUs to accelerate challenging molecular simulation analysis tasks on the latest NVIDIA Tesla P100 GPUs on both Intel and IBM/OpenPOWER hardware platforms, and large-scale runs on petascale computers such as Titan and Blue Waters. We'll highlight the performance benefits obtained from die-stacked memory on the Tesla P100, the NVIDIA NVLink interconnect on the IBM "Minsky" platform, and the use of NVIDIA CUDA just-in-time compilation to increase the performance of data-driven algorithms. We will present results obtained with OpenACC parallel programming directives, current challenges, and future opportunities. Finally, we'll describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations.
Emerging heterogeneous systems are opening up many programming opportunities. This panel will discuss the latest developments in accelerator programming, where programmers have a choice among OpenMP, OpenACC, CUDA, and Kokkos for GPU programming. The panel will shed light on the primary criteria for choosing a model: availability across multiple platforms, richness of the feature set, applicability to a certain type of scientific code, compiler stability, or other factors. This will be an interactive Q&A session where participants can discuss their experiences with programming model experts and developers.
OpenACC is a directive-based programming model that provides a simple interface to exploit GPU computing. Because the GPU employs a deep memory hierarchy, appropriate management of memory resources is crucial to performance. The OpenACC programming model offers the cache directive, which uses on-chip hardware-managed (read-only data cache) or software-managed (shared memory) caches to improve memory access efficiency. We have implemented several strategies to promote shared memory utilization in our PGI compiler suite. We'll briefly discuss our investigation of cases that can potentially be optimized by the cache directive and then dive into the underlying implementation. We evaluate our compiler with self-written micro-benchmarks as well as some real-world applications.
We'll discuss techniques for using more than one GPU in an OpenACC program. We'll demonstrate how to address multiple devices, how to mix OpenACC and OpenMP to manage them, and how to use multiple devices with OpenACC and MPI.