Learn a simple strategy guideline to optimize application runtime. The strategy is based on four steps and is illustrated on a two-dimensional Discontinuous Galerkin solver for computational fluid dynamics on structured meshes. Starting from a sequential CPU code, we guide the audience through the steps that allowed us to speed up the code by roughly 149x on a GPU (performance evaluated on a K20Xm). The same optimization strategy applied to the CPU code yields a roughly 35x speedup over the original runtime (performance evaluated on an E5-1650v3 processor). Based on this methodology, we finally end up with an optimized, unified version of the code that can run simultaneously on both GPU and CPU architectures.
NAMD and VMD provide state-of-the-art molecular simulation, analysis, and visualization tools that leverage a panoply of GPU acceleration technologies to achieve performance levels that enable scientists to routinely apply research methods that were formerly too computationally demanding to be practical. To make state-of-the-art MD simulation and computational microscopy workflows available to a broader range of molecular scientists including non-traditional users of HPC systems, our center has begun producing pre-configured container images and Amazon EC2 AMIs that streamline deployment, particularly for specialized occasional-use workflows, e.g., for refinement of atomic structures obtained through cryo-electron microscopy. This talk will describe the latest technological advances in NAMD and VMD, using CUDA, OpenACC, and OptiX, including early results on ORNL Summit, state-of-the-art RTX hardware ray tracing on Turing GPUs, and easy deployment using containers and cloud computing infrastructure.
Learn about a case study comparing OpenACC and OpenMP 4.5 in the context of stellar explosions. Modeling supernovae requires multi-physics simulation codes to capture hydrodynamics, nuclear burning, gravitational forces, etc. As a nuclear detonation burns through the stellar material, it also increases the temperature. An equation of state (EOS) is then required to determine, say, the new pressure associated with this temperature increase. In fact, an EOS is needed after the thermodynamic conditions are changed by any physics routine, so it is called many times throughout a simulation, making a fast EOS implementation essential. Fortunately, these calculations can be performed independently during each time step, so the work can be offloaded to GPUs. Using the IBM/NVIDIA early test system (precursor to the upcoming Summit supercomputer) at Oak Ridge National Laboratory, we use a hybrid MPI+OpenMP (traditional CPU threads) driver program to offload work to GPUs. We'll compare the performance results as well as some of the currently available features of OpenACC and OpenMP 4.5.
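As a hedged sketch of this offload pattern (not the talk's actual driver code), the example below evaluates independent per-zone EOS calls with both OpenACC and OpenMP 4.5 target directives; the eos() routine, the array names, and the toy ideal-gas closure are assumptions made only for this illustration.

    // Toy ideal-gas closure standing in for the real tabulated EOS.
    #pragma omp declare target
    #pragma acc routine seq
    static double eos(double rho, double T)
    {
        const double R = 8.314;
        return rho * R * T;
    }
    #pragma omp end declare target

    // OpenACC version: one independent EOS call per zone.
    void update_pressure_acc(int n, const double *rho, const double *T, double *p)
    {
        #pragma acc parallel loop copyin(rho[0:n], T[0:n]) copyout(p[0:n])
        for (int i = 0; i < n; ++i)
            p[i] = eos(rho[i], T[i]);
    }

    // OpenMP 4.5 version of the same offload.
    void update_pressure_omp(int n, const double *rho, const double *T, double *p)
    {
        #pragma omp target teams distribute parallel for \
                map(to: rho[0:n], T[0:n]) map(from: p[0:n])
        for (int i = 0; i < n; ++i)
            p[i] = eos(rho[i], T[i]);
    }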
The MPS (moving particle semi-implicit) method is a particle method (not a stencil computation) used for computational fluid dynamics. Neighbor-particle search is the main bottleneck of MPS. We show our porting effort and three OpenACC optimizations of the neighbor-particle search. We evaluate our implementations on Tesla K20c, GeForce GTX 1080, and Tesla P100 GPUs, achieving 45.7x, 96.8x, and 126.1x speedups, respectively, compared with a single-threaded Ivy Bridge CPU.
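For readers unfamiliar with the pattern, the following sketch (not the authors' implementation) shows a cell-list neighbor search parallelized over particles with OpenACC; the array layout, the 27-cell neighborhood, and all names are assumptions made for this example.

    // Count the neighbors of each particle within cutoff radius re, assuming
    // particles are sorted by cell so cell_start/cell_end index them directly.
    void count_neighbors(int n, int ncells, double re,
                         const double *x, const double *y, const double *z,
                         const int *cell_of,        /* cell index of each particle */
                         const int *cell_start,     /* first particle in each cell */
                         const int *cell_end,       /* one past last particle      */
                         const int *neighbor_cells, /* 27 neighbor cells per cell  */
                         int *ncount)               /* neighbor count per particle */
    {
        #pragma acc parallel loop copyin(x[0:n], y[0:n], z[0:n], cell_of[0:n], \
                cell_start[0:ncells], cell_end[0:ncells], \
                neighbor_cells[0:ncells*27]) copyout(ncount[0:n])
        for (int i = 0; i < n; ++i) {
            int count = 0;
            const int c = cell_of[i];
            // Visit the 27 cells surrounding the cell of particle i.
            for (int k = 0; k < 27; ++k) {
                const int cc = neighbor_cells[c * 27 + k];
                for (int j = cell_start[cc]; j < cell_end[cc]; ++j) {
                    const double dx = x[j] - x[i];
                    const double dy = y[j] - y[i];
                    const double dz = z[j] - z[i];
                    if (j != i && dx * dx + dy * dy + dz * dz < re * re)
                        ++count;
                }
            }
            ncount[i] = count;
        }
    }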
In order to prepare the scientific communities, GENCI and its partners have set up a technology watch group and lead collaborations with vendors, relying on HPC experts and early access to HPC solutions. The two main objectives are to provide guidance and to prepare the scientific communities for the challenges of exascale architectures. The talk will present the OpenPOWER platform bought by GENCI and provided to the scientific community. It will then present the first results obtained on the platform for a set of about 15 applications using all the solutions provided to the users (CUDA, OpenACC, OpenMP, ...). Finally, one specific application will be presented in more detail, covering its porting effort and the techniques used for GPUs with both OpenACC and OpenMP.
In this talk, we present a portable matrix assembly strategy for solving PDEs, suited for co-execution on both CPUs and accelerators. In addition, a dynamic load-balancing strategy is considered to balance the workload among the different CPUs and GPUs available on the cluster. Numerical methods for solving partial differential equations (PDEs) involve two main steps: the assembly of an algebraic system of the form Ax=b and its solution with direct or iterative solvers. The assembly step consists of a loop over elements, faces, or nodes in the case of the finite element, finite volume, and finite difference methods, respectively. It is computationally intensive and does not involve communication, and is therefore well-suited for accelerators.
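A minimal sketch of the element-loop pattern, assuming a precomputed element contribution and a simple element-to-node connectivity; only the diagonal of A is assembled to keep the example short, and the co-execution and load-balancing layers discussed in the talk are not shown.

    // Atomic updates resolve the race when two elements share a node.
    void assemble_diagonal(int nelem, int nnode, int npoin,
                           const int *lm,     /* [nelem][nnode] connectivity      */
                           const double *Ae,  /* [nelem][nnode] element diagonals */
                           double *Adiag)     /* [npoin] global diagonal of A     */
    {
        #pragma acc parallel loop copyin(lm[0:nelem*nnode], Ae[0:nelem*nnode]) \
                                  copy(Adiag[0:npoin])
        for (int e = 0; e < nelem; ++e) {
            for (int a = 0; a < nnode; ++a) {
                const int i = lm[e * nnode + a];
                #pragma acc atomic update
                Adiag[i] += Ae[e * nnode + a];
            }
        }
    }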
This talk provides an overview of the key strategies used to design and implement OpenStaPLE, an application for Lattice QCD (LQCD) Monte Carlo simulations. LQCD is an example of an HPC grand-challenge application, where the accuracy of results strongly depends on the available computing resources. OpenStaPLE has been developed on top of the MPI and OpenACC frameworks: MPI manages the parallelism across multiple computing nodes and devices, while OpenACC exploits the high level of parallelism available on modern processors and accelerators, enabling a good level of portability across different architectures. After an initial overview, we also present performance and portability results on different architectures, highlighting key hardware and software improvements that may lead this class of applications to better performance.
VASP is a software package for atomic-scale materials modeling. It's one of the most widely used codes for electronic-structure calculations and first-principles molecular dynamics. We'll give an overview of the status of porting VASP to GPUs with OpenACC. Parts of VASP were previously ported to CUDA C with good speed-ups on GPUs, but also with an increase in the maintenance workload, because VASP is otherwise written wholly in Fortran. We'll discuss OpenACC performance relative to CUDA, the impact of OpenACC on VASP code maintenance, and challenges encountered in the port related to the management of aggregate data structures. Finally, we'll discuss possible future solutions for data management that would simplify both new development and maintenance of VASP and similar large production applications on GPUs.
In 2014, GENCI set up a French technology watch group that targets the provisioning of test systems, selected as part of a prospective approach among GENCI's partners. This was done in order to prepare scientific communities and users of GENCI's computing resources for the arrival of the next "Exascale" technologies. The talk will present results obtained on the OpenPOWER platform bought by GENCI and open to the scientific community. We will present the first results obtained for a set of scientific applications using the available environments (CUDA, OpenACC, OpenMP, ...), along with results obtained for AI applications using IBM's PowerAI software distribution.
We'll guide you step by step through porting and optimizing an oil-and-gas mini-application to efficiently leverage the computing power of NVIDIA GPUs. While OpenACC focuses on coding productivity and portability, CUDA enables extracting the maximum performance from NVIDIA GPUs. OmpSs, on the other hand, is a GPU-aware task-based programming model that may be combined with CUDA, and recently with OpenACC as well. Using OpenACC, we'll start benefiting from GPU computing, obtaining great coding productivity and a nice performance improvement. We can then fine-tune the critical parts of the application by developing hand-optimized CUDA kernels. OmpSs combined with either OpenACC or CUDA enables seamless task parallelism that leverages all the devices in the system.
Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with Unified Memory, the multi-process service (MPS, aka Hyper-Q for MPI), and MPI support in NVIDIA performance analysis tools.
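As a minimal sketch of the CUDA-aware MPI pattern covered in the session, assuming the halo buffers already live on the device inside an OpenACC data region, host_data exposes their device addresses to MPI; the buffer names, neighbor ranks, and tags are illustrative, not taken from the talk.

    #include <mpi.h>

    // Halo exchange from OpenACC device data using a CUDA-aware MPI library.
    void exchange_halo(double *sendbuf, double *recvbuf, int count,
                       int left, int right, MPI_Comm comm)
    {
        #pragma acc data present(sendbuf[0:count], recvbuf[0:count])
        {
            #pragma acc host_data use_device(sendbuf, recvbuf)
            {
                // With a CUDA-aware MPI, these pointers are device pointers.
                MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, right, 0,
                             recvbuf, count, MPI_DOUBLE, left,  0,
                             comm, MPI_STATUS_IGNORE);
            }
        }
    }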
We'll present the strategy and results for porting an atmospheric fluids code, HiGrad, to the GPU. HiGrad is a cross-compiled, mixed-language code that includes C, C++, and Fortran, and is used for atmospheric modeling. Deep subroutine calls necessitate detailed control of the GPU data layout with CUDA Fortran. We'll present initial kernel accelerations with OpenACC, then discuss tuning with OpenACC and a comparison with specially curated CUDA kernels. We'll demonstrate the performance improvements and the different techniques used to port this code to GPUs, using a mixed CUDA Fortran and OpenACC implementation for single-node performance, along with scaling studies conducted with MPI on local supercomputers and Oak Ridge National Laboratory's Titan supercomputer, on architectures including the Tesla K40 and Tesla P100.
We'll showcase recent successes in the use of GPUs to accelerate challenging molecular simulation analysis tasks on the latest NVIDIA Tesla P100 GPUs on both Intel and IBM/OpenPOWER hardware platforms, and large-scale runs on petascale computers such as Titan and Blue Waters. We'll highlight the performance benefits obtained from die-stacked memory on the Tesla P100, the NVIDIA NVLink interconnect on the IBM "Minsky" platform, and the use of NVIDIA CUDA just-in-time compilation to increase the performance of data-driven algorithms. We will present results obtained with OpenACC parallel programming directives, current challenges, and future opportunities. Finally, we'll describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations.
Emerging heterogeneous systems are opening up a wealth of programming opportunities. This panel will discuss the latest developments in accelerator programming, where programmers can choose among OpenMP, OpenACC, CUDA, and Kokkos for GPU programming. The panel will shed light on the primary considerations in choosing a model: availability across multiple platforms, richness of the feature set, applicability to a certain type of scientific code, compiler stability, or other factors. This will be an interactive Q&A session where participants can discuss their experiences with programming-model experts and developers.
OpenACC is a directive-based programming model that provides a simple interface to exploit GPU computing. Because the GPU employs a deep memory hierarchy, appropriate management of memory resources is crucial for performance. The OpenACC programming model offers the cache directive to use on-chip hardware caches (the read-only data cache) or software-managed caches (shared memory) to improve memory access efficiency. We have implemented several strategies in our PGI compiler suite to promote the use of shared memory. We'll briefly discuss our investigation of cases that can potentially be optimized by the cache directive and then dive into the underlying implementation. We evaluate our compiler with hand-written micro-benchmarks as well as some real-world applications.
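A small illustrative kernel (not one of the talk's benchmarks) showing how the directive is written: a 3-point smoothing stencil where the cache clause hints that the reused window of the input array should be staged in shared memory or the read-only data cache.

    void smooth(int n, const double *in, double *out)
    {
        #pragma acc parallel loop copyin(in[0:n]) copyout(out[1:n-2])
        for (int i = 1; i < n - 1; ++i) {
            // Hint: stage the three reused input elements on chip.
            #pragma acc cache(in[i-1:3])
            out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
        }
    }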
Come learn why the authors of VASP, Fluent, Gaussian, Synopsys and numerous other science and engineering applications are using OpenACC. OpenACC supports and promotes scalable parallel programming on both multicore CPUs and GPU-accelerated systems, enabling large production applications to port effectively to the newest generation of supercomputers. It has very well-supported interoperability with CUDA C++, CUDA Fortran, MPI and OpenMP, allowing you to optimize each aspect of your application with the appropriate tools. OpenACC has proven to be the ideal on-ramp to parallel and GPU computing, even for those who need to tune their most important kernels using libraries or CUDA. Come see how you can try OpenACC with the free PGI Community Edition compiler suite.
Architectures are becoming increasingly heterogeneous, offering developers a rich variety of computing resources. While these architectures benefit from customized optimization strategies, scientific developers tend to prefer a 'write-once' approach that yields code that is portable, performance-efficient, and able to migrate to rapidly changing hardware. This talk will present stories of porting scientific applications with OpenACC to state-of-the-art heterogeneous computing systems. Applications will span the molecular dynamics, nuclear physics, neutrino experiment, and climate domains.
The C++17 and Fortran 2018 language standards include parallel programming constructs well-suited for GPU computing. The C++17 parallel STL (pSTL) was designed with the intent of supporting GPU parallel programming. The Fortran 2018 do concurrent construct, with its shared and local variable clauses, can be used to express loop-level parallelism across multiple array index ranges. We will share our experiences and results implementing support for these constructs in the PGI C++ and Fortran compilers for NVIDIA GPUs, and explain the capabilities and limitations they offer HPC programmers. You will learn how to use OpenACC as a bridge to GPU and parallel programming with standard C++ and Fortran, and we will present additional features we hope and expect will become part of those standards.
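As a small illustration of the C++17 side, assuming a compiler that maps parallel algorithms onto the GPU, a SAXPY written with std::transform and an execution policy might look like the sketch below; the function and variable names are ours, not the talk's.

    #include <algorithm>
    #include <execution>
    #include <vector>

    void saxpy(float a, const std::vector<float>& x, std::vector<float>& y)
    {
        // Each element update is independent, so the iterations may run in
        // parallel (and potentially on the GPU) under std::execution::par.
        std::transform(std::execution::par, x.begin(), x.end(), y.begin(),
                       y.begin(),
                       [a](float xi, float yi) { return a * xi + yi; });
    }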
We'll discuss techniques for using more than one GPU in an OpenACC program. We'll demonstrate how to address multiple devices, how to mix OpenACC and OpenMP to manage multiple devices, and how to use multiple devices with OpenACC and MPI.
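One common pattern, sketched below under assumed names and a simple block decomposition, pairs one OpenMP host thread with each GPU via acc_set_device_num(); this is an illustrative sketch, not code from the session.

    #include <openacc.h>
    #include <omp.h>

    void scale_multi_gpu(int n, double *x, double alpha)
    {
        const int ngpus = acc_get_num_devices(acc_device_nvidia);
        if (ngpus == 0) return;   // no GPU available

        #pragma omp parallel num_threads(ngpus)
        {
            // Bind this host thread to one GPU.
            const int dev = omp_get_thread_num();
            acc_set_device_num(dev, acc_device_nvidia);

            // Each thread/GPU owns one contiguous chunk of x.
            const int chunk = (n + ngpus - 1) / ngpus;
            const int lo = dev * chunk;
            const int hi = (lo + chunk < n) ? lo + chunk : n;

            if (lo < hi) {
                #pragma acc parallel loop copy(x[lo:hi-lo])
                for (int i = lo; i < hi; ++i)
                    x[i] *= alpha;
            }
        }
    }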