GTC On-Demand

Algorithms and Numerical Techniques
Presentation
Media
Abstract:
The present study deals with porting the scalable parallel CFD application HiFUN to NVIDIA GPUs using an off-load strategy. The strategy focuses on improving the single-node performance of the HiFUN solver with the help of GPUs. This work clearly brings out the efficacy of the off-load strategy using OpenACC directives on GPUs, and may be considered an attractive model for porting legacy CFD codes to GPU-based HPC and supercomputing platforms.
 
Topics:
Algorithms and Numerical Techniques, Computational Fluid Dynamics, Computer Aided Engineering
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8799
Streaming:
Download:
Share:
 
Abstract:
We'll present the use of state-of-the-art computational fluid dynamics algorithms and their performance on NVIDIA GPUs, including the new DGX-1 Station using multiple Tesla V100 GPU accelerators. A novel mapped-grid approach to implementing high-order stencil-based finite-difference and finite-volume methods is the highlight, but we'll also feature the use of flux reconstruction on GPUs using OpenACC.
 
Topics:
Algorithms and Numerical Techniques, Computational Fluid Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8800
Streaming:
Download:
Share:
 
Abstract:
The goal of this session is to report the knowledge acquired at the Oak Ridge GPU Hackathon, which took place on October 9-13, 2017, through the acceleration of a CFD (computational fluid dynamics) solver. We'll focus on the approach used to make the application suitable for the GPU, the acceleration obtained, and the overall experience at the hackathon. OpenACC was used to implement GPU directives in this work. We'll detail the different OpenACC directives used, their advantages and disadvantages, as well as the particularities of CFD applications.
 
Topics:
Algorithms and Numerical Techniques, Computational Fluid Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8291
Streaming:
Download:
Share:
 
Abstract:
Learn a simple strategy guideline to optimize application runtime. The strategy is based on four steps and is illustrated on a two-dimensional discontinuous Galerkin solver for computational fluid dynamics on structured meshes. Starting from a sequential CPU code, we guide the audience through the different steps that allowed us to increase performance on a GPU to around 149 times the original runtime of the code (performance evaluated on a K20Xm). The same optimization strategy applied to the CPU code increases performance to around 35 times the original runtime (performance evaluated on an E5-1650v3 processor). Based on this methodology, we finally end up with an optimized, unified version of the code that can run simultaneously on both GPU and CPU architectures.
 
Topics:
Algorithms and Numerical Techniques, Computational Fluid Dynamics, HPC and AI
Type:
Talk
Event:
GTC Europe
Year:
2017
Session ID:
23191
Download:
Share:
Astronomy and Astrophysics
Abstract:
We'll describe a real-world example of adding OpenACC to a legacy MPI FORTRAN Preconditioned Conjugate Gradient code, and show timing results for multi-node, multi-GPU runs. The code's application is obtaining 3D spherical potential field (PF) solutions of the solar corona using observational boundary conditions. PF solutions yield approximations of the coronal magnetic field structure and can be used as initial/boundary conditions for MHD simulations with applications to space weather prediction. We highlight key tips and strategies used when converting the MPI code to MPI+OpenACC, including linking Fortran code to the cuSparse library, using CUDA-aware MPI, maintaining performance portability, and dealing with multi-node, multi-GPU run-time environments. We'll show timing results for three increasing-sized problems for running the code with MPI-only (up to 1728 CPU cores), and with MPI+GPU (up to 60 GPUs) using NVIDIA K80 and P100 GPUs.
 
Topics:
Astronomy and Astrophysics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7535
Download:
Share:
 
Abstract:
Learn about a case study comparing OpenACC and OpenMP 4.5 in the context of stellar explosions. Modeling supernovae requires multi-physics simulation codes to capture hydrodynamics, nuclear burning, gravitational forces, etc. As a nuclear detonation burns through the stellar material, it also increases the temperature. An equation of state (EOS) is then required to determine, say, the new pressure associated with this temperature increase. In fact, an EOS is needed after the thermodynamic conditions are changed by any physics routine. This means it is called many times throughout a simulation, requiring a fast EOS implementation. Fortunately, these calculations can be performed independently during each time step, so the work can be offloaded to GPUs. Using the IBM/NVIDIA early test system (precursor to the upcoming Summit supercomputer) at Oak Ridge National Laboratory, we use a hybrid MPI+OpenMP (traditional CPU threads) driver program to offload work to GPUs. We'll compare the performance results as well as some of the currently available features of OpenACC and OpenMP 4.5.
 
Topics:
Astronomy and Astrophysics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7635
Download:
Share:
Climate, Weather, Ocean Modeling
Abstract:
We'll discuss the Max Planck/University of Chicago Radiative MHD code (MURaM), the primary model for simulating the sun's upper convection zone, its surface, and the corona. Accelerating MURaM allows physicists to interpret high-resolution solar observations. We'll describe the programmatic challenges and optimization techniques we employed while using the OpenACC programming model to accelerate MURaM on GPUs and multicore architectures. We will also examine what we learned and how it could be broadly applied to atmospheric applications that use radiation-transport methods.
 
Topics:
Climate, Weather, Ocean Modeling, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9288
Streaming:
Download:
Share:
 
Abstract:
We'll detail the inherent challenges in porting a GPU-accelerated community code to a newer major version, integrating the community non-GPU changes with OpenACC directives from the earlier version. This is a non-trivial exercise - this particular version upgrade contained 143,000 modified lines of code which required reintegration into our accelerator directives. This work is important in providing support for newer features whilst still providing GPU support for the users. We'll also look at efforts to improve the maintainability of GPU accelerated community codes.
 
Topics:
Climate, Weather, Ocean Modeling, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8241
Streaming:
Download:
Share:
 
Abstract:
We'll take you on a journey through enabling applications for GPUs; interoperability of different languages (including Fortran, OpenACC, C, and CUDA); CUDA library interfacing; data management, movement, and layout tuning; kernel optimization; tool usage; multi-GPU data transfer; and performance modeling. We'll show how careful optimizations can have a dramatic effect and push application performance towards the maximum possible on the hardware. We'll describe tuning of multi-GPU communications, including efficient exploitation of high-bandwidth NVLink hardware. The applications used in this study are from the domain of numerical weather prediction, and also feature in the ESCAPE European collaborative project, but we'll present widely relevant techniques in a generic and easily transferable way.
 
Topics:
Climate, Weather, Ocean Modeling, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8190
Streaming:
Share:
 
Abstract:
We'll give a high-level overview of the results of these efforts, and how we built a cross-organizational partnership to achieve them. Ours is a directive-based approach using OpenMP and OpenACC to achieve portability. We have focused on achieving good performance on the three main architectural branches available to us, namely: traditional multi-core processors (e.g., Intel Xeons), many-core processors such as the Intel Xeon Phi, and, of course, NVIDIA GPUs. Our focus has been on creating tools for accelerating the optimization process, techniques for effective cross-platform optimization, and methodologies for characterizing and understanding performance. The results are encouraging, suggesting a path forward based on standard directives for responding to the pressures of future architectures.
 
Topics:
Climate, Weather, Ocean Modeling, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8811
Streaming:
Download:
Share:
Computational Biology and Chemistry
Abstract:
The chemical shift of a protein structure offers a lot of information about the physical properties of the protein. Being able to accurately predict this shift is essential in drug discovery and in some other areas of molecular dynamics research. But because chemical shift prediction algorithms are so computationally intensive, no application can predict the chemical shift of large protein structures in a realistic amount of time. We explored this problem by porting an algorithm called PPM_One to NVIDIA V100 GPUs using the directive-based programming model OpenACC. When testing several protein structure datasets ranging from 1M to 11M atoms, we observed an average speedup of ~45X across the datasets and a maximum speedup of 61X. We'll discuss techniques to overcome programmatic challenges and highlight the scientific advances enabled by OpenACC.
 
Topics:
Computational Biology and Chemistry
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9277
Streaming:
Download:
Share:
 
Abstract:
We will demonstrate the features and capabilities of OpenACC for porting and optimizing the ParDOCK docking module of the Sanjeevini suite for computer-aided drug discovery, developed at the HPC and Supercomputing Facility for Bioinformatics and Computational Biology at the Indian Institute of Technology Delhi. We have used OpenACC to efficiently port the existing C++ code of ParDOCK, with minimal modifications, to run on the latest NVIDIA P100 GPU card. These code modifications and tuning resulted in an average six-fold speedup in turnaround time. With OpenACC, the code is now able to sample ten times more ligand conformations, leading to an increase in accuracy. The ported ParDOCK code now predicts a correct pose of a protein-ligand interaction 96.8 percent of the time, compared to 94.3 percent earlier (for poses under 1 A), and 89.9 percent of the time, compared to 86.7 percent earlier (for poses under 0.5 A).
 
Topics:
Computational Biology and Chemistry, Performance Optimization, Bioinformatics & Genomics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8188
Download:
Share:
 
Abstract:
VASP is a software package for atomic-scale materials modeling. It's one of the most widely used codes for electronic-structure calculations and first-principles molecular dynamics. We'll give an overview and status of porting VASP to GPUs with OpenACC. Parts of VASP were previously ported to CUDA C with good speed-ups on GPUs, but also with an increase in the maintenance workload, as VASP is otherwise written wholly in Fortran. We'll discuss OpenACC performance relative to CUDA, the impact of OpenACC on VASP code maintenance, and challenges encountered in the port related to management of aggregate data structures. Finally, we'll discuss possible future solutions for data management that would simplify both new development and maintenance of VASP and similar large production applications on GPUs.
 
Topics:
Computational Biology and Chemistry
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8750
Streaming:
Download:
Share:
 
Abstract:
We'll showcase recent successes in the use of GPUs to accelerate challenging molecular simulation analysis tasks on the latest Volta-based Tesla V100 GPUs on both Intel and IBM/OpenPOWER hardware platforms, and with large scale runs on petascale computers such as ORNL Summit. We'll highlight the performance benefits obtained from die-stacked memory on Tesla V100, the NVLink interconnect on the IBM OpenPOWER platforms, and the use of advanced features of CUDA, Volta's new Tensor units, and just-in-time compilation to increase the performance of key analysis algorithms. We'll present results obtained with OpenACC parallel programming directives, current challenges, and future opportunities. Finally, we'll describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations.
 
Topics:
Computational Biology and Chemistry, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8709
Streaming:
Share:
 
Abstract:
Happy with your code, but re-writing it every time the hardware platform changes? Know NVIDIA CUDA, but want to use a higher-level programming model? OpenACC is a directive-based technique that enables more science and less programming. The model facilitates reusing a code base on more than one platform. This session will help you: (1) learn how to incrementally improve a bioinformatics code base using OpenACC without losing performance; (2) explore how to apply optimization techniques and the challenges encountered in the process. We'll share our experience using OpenACC for DNA next-generation sequencing techniques.
 
Topics:
Computational Biology and Chemistry, Programming Languages, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7341
Download:
Share:
Computational Fluid Dynamics
Abstract:
The MPS method is a particle method (not a stencil computation) used for computational fluid dynamics. The search for neighbor particles is the main bottleneck of MPS. We show our porting efforts and three optimizations of the neighbor-particle search using OpenACC. We evaluate our implementations on Tesla K20c, GeForce GTX 1080, and Tesla P100 GPUs, achieving 45.7x, 96.8x, and 126.1x speedups, respectively, compared with a single-threaded Ivy Bridge CPU.
 
Topics:
Computational Fluid Dynamics, Computer Aided Engineering
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7558
Download:
Share:
 
Abstract:
We'll demonstrate the maturity and capabilities of OpenACC and the PGI compiler suite in a professional C++ programming environment. We'll explore in detail the adaptation of the general-purpose NUMECA FINE/Open CFD solver for heterogeneous CPU+GPU execution. We'll give extra attention to OpenACC tips and tricks used to efficiently port the existing C++ programming model with minimal code modifications. Sample code blocks will be used to demonstrate the implementation principles in a clear and concise manner. Finally, we'll present simulations completed in partnership with Dresser-Rand on the OLCF Titan supercomputer, showcasing the scientific capabilities of FINE/Open and the improvements in simulation turnaround time made possible through the use of OpenACC.
 
Topics:
Computational Fluid Dynamics, Tools and Libraries, Computer Aided Engineering
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7672
Download:
Share:
Computational Physics
Abstract:
We explore using OpenACC to migrate applications required for modeling solar storms from CPU HPC clusters to an "in-house" multi-GPU system. We describe the software pipeline and the utilization of OpenACC in the computationally heavy codes. A major step forward is the initial implementation of OpenACC in our magnetohydrodynamics code MAS. Strategies for overcoming some of the difficulties encountered are discussed, including handling Fortran derived types, array reductions, and performance tuning. Production-level time-to-solution results will be shown for multi-CPU and multi-GPU systems of various sizes. The timings show that it is possible to achieve acceptable times-to-solution on a single multi-GPU server/workstation for problems that previously required multiple HPC CPU nodes.
 
Topics:
Computational Physics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8847
Streaming:
Download:
Share:
 
Abstract:
We'll describe our experience with using OpenACC to port a C++ library to run on GPUs, focusing in particular on the issue of deep copy. The C++ library, Grid, is developed for numerical lattice quantum chromodynamics (LQCD) simulations, and is highly optimized for Intel x86 and many-core architectures. Our goal is to port it to run on NVIDIA GPUs using OpenACC so that its main code structure can be preserved and minimal code changes are required. We'll describe the challenges encountered and share the lessons learned during the porting process. In particular, due to the heavy use of templated abstractions, it is challenging to use OpenACC to deal with the data movement between the CPU and the GPU due to the deep-copy issue. We'll demonstrate that NVIDIA's unified virtual memory provides essential support for our porting effort. We'll also present initial performance results on Kepler and Pascal GPUs.
 
Topics:
Computational Physics, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7640
Download:
Share:
HPC and AI
Abstract:
Learn about advanced features in MVAPICH2 that accelerate HPC and AI on modern dense GPU systems. We'll talk about how MVAPICH2 supports MPI communication from GPU memory and improves it using the CUDA toolkit for optimized performance on different GPU configurations. We'll examine recent advances in MVAPICH2 that support large message collective operations and heterogeneous clusters with GPU and non-GPU nodes. We'll explain how we use the popular OSU micro-benchmark suite, and we'll provide examples from HPC and AI to demonstrate how developers can take advantage of MVAPICH2 in applications using MPI and CUDA/OpenACC. We'll also provide guidance on issues like processor affinity to GPUs and networks that can significantly affect the performance of MPI applications using MVAPICH2.
 
Topics:
HPC and AI, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9476
Streaming:
Download:
Share:
 
Abstract:
Multiphysics and multiscale simulations are found in a variety of computational science subfields, but their disparate computational characteristics can make GPU implementations complex and often difficult. Simulations of supernovae are ideal examples of this complexity. We use the scalable FLASH code to model these astrophysical cataclysms, incorporating hydrodynamics, thermonuclear kinetics, and self-gravity across considerable spans in space and time. Using OpenACC and GPU-enabled libraries coupled to new NVIDIA GPU hardware capabilities, we have improved the physical fidelity of these simulations by increasing the number of evolved nuclear species by more than an order of magnitude. I will discuss these and other performance improvements to the FLASH code on the Summit supercomputer at Oak Ridge National Laboratory.
 
Topics:
HPC and AI, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8926
Streaming:
Share:
 
Abstract:
In order to prepare the scientific communities, GENCI and its partners have set up a technology watch group and lead collaborations with vendors, relying on HPC experts and early-adopted HPC solutions. The two main objectives are to provide guidance and to prepare the scientific communities for the challenges of exascale architectures. The talk will present the OpenPOWER platform bought by GENCI and provided to the scientific community. It will then present the first results obtained on the platform for a set of about 15 applications using all the solutions provided to the users (CUDA, OpenACC, OpenMP, ...). Finally, one specific application will be presented, covering its porting effort and the techniques used for GPUs with both OpenACC and OpenMP.
 
Topics:
HPC and AI, Performance Optimization, Programming Languages
Type:
Talk
Event:
GTC Europe
Year:
2017
Session ID:
23183
Download:
Share:
HPC and Supercomputing
Abstract:
Come hear the latest PGI news and learn about what we'll develop in the year ahead. We'll talk about the latest PGI OpenACC Fortran/C++ and CUDA Fortran compilers and tools, which are supported on x64 and OpenPOWER systems with NVIDIA GPUs. We'll discuss new CUDA Fortran features, including Tensor Core support and cooperative groups, and we'll cover our current work on half precision. We'll explain new OpenACC 2.7 features, along with beta true deep-copy directives and support for OpenACC programs on unified memory systems. The PGI compiler-assisted software testing feature helps determine where differences arise between CPU and GPU versions of a program, or when porting to a new system. Learn about upcoming projects, which include a high-performance PGI subset of OpenMP for NVIDIA GPUs, support for GPU programming with standard C++17 parallel STL and Fortran, and incorporating GPU-accelerated math libraries to support porting and optimization of HPC applications on NVIDIA GPUs.
 
Topics:
HPC and Supercomputing, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9289
Streaming:
Download:
Share:
 
Abstract:
We'll showcase the latest successes with GPU acceleration of challenging molecular simulation analysis tasks on the latest Volta and Turing GPUs paired with both Intel and IBM/OpenPOWER CPUs on petascale computers such as ORNL Summit. This presentation will highlight the performance benefits obtained from die-stacked memory, NVLink interconnects, and the use of advanced features of CUDA such as just-in-time compilation to increase the performance of key analysis algorithms. We will present results obtained with OpenACC parallel programming directives, as well as discuss current challenges and future opportunities. We'll also describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations. To make our tools easy to deploy for non-traditional users of HPC, we publish GPU-accelerated container images in NGC, and Amazon EC2 AMIs for GPU instance types.
 
Topics:
HPC and Supercomputing, In-Situ and Scientific Visualization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9594
Streaming:
Download:
Share:
 
Abstract:
In this talk, attendees will learn how key algorithms for numerical weather prediction were ported to the latest GPU technology, and the substantial benefits gained from doing so. We will showcase the power of individual Voltas and the impressive performance of the cutting-edge DGX-2 server, with multiple GPUs connected by a high-speed interconnect.
 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
GTC Europe
Year:
2018
Session ID:
E8195
Streaming:
Download:
Share:
 
Abstract:
In this session, we will describe how we successfully extended a large legacy Fortran code to GPUs using OpenACC. Based on AVBP (http://www.cerfacs.fr/avbp7x/), a state-of-the-art code for combustion simulation, our objective is to keep the code as simple as possible for the AVBP community while taking advantage of high-end computing resources such as GPUs; OpenACC offers the flexibility to conduct the extension with respect to these constraints. This session will present the various strategies we tried during the refactoring of the application, including the limitations of the directive-only approach, which can severely impair performance in particular parts of the code. The lessons learned are applicable to a wide range of codes in the research community.
 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
GTC Europe
Year:
2018
Session ID:
E8217
Streaming:
Download:
Share:
 
Abstract:
We present in this talk a portable matrix assembly strategy used in solving PDEs, suited for co-execution on both the CPUs and accelerators. In addition, a dynamic load balancing strategy is considered to balance the workload among the different ...Read More
Abstract:

We present in this talk a portable matrix assembly strategy used in solving PDEs, suited for co-execution on both the CPUs and accelerators. In addition, a dynamic load balancing strategy is considered to balance the workload among the different CPUs and GPUs available on the cluster. Numerical methods for solving partial differential equations (PDEs) involve two main steps: the assembly of an algebraic system of the form Ax=b and the solution of it with direct or iterative solvers. The assembly step consists of a loop over elements, faces and nodes in the case of the finite element, finite volume, and finite difference methods, respectively. It is computationally intensive and does not involve communication. It is therefore well-suited for accelerators.

  Back
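The assembly pattern described above can be sketched in a few lines. This is a hypothetical example, not code from the talk: a 1D finite element right-hand-side assembly in which the element loop is fully parallel and only the scatter to shared nodes needs atomic protection (the function name, element count, and contribution values are all illustrative).

```c
#include <stdio.h>

#define NELEM 8
#define NNODE (NELEM + 1)

/* Hedged sketch: element-by-element assembly of a right-hand-side
 * vector b for a 1D linear finite element mesh. Each element adds a
 * contribution to its two end nodes; atomics resolve write conflicts
 * when elements sharing a node are assembled concurrently. */
void assemble_rhs(double h, double *b) {
    for (int i = 0; i < NNODE; ++i) b[i] = 0.0;

    /* The element loop itself is independent per element, which is
     * what makes assembly well suited for accelerators. */
    #pragma acc parallel loop copy(b[0:NNODE])
    for (int e = 0; e < NELEM; ++e) {
        double contrib = 0.5 * h;   /* integral of a unit load over the element */
        #pragma acc atomic update
        b[e]     += contrib;
        #pragma acc atomic update
        b[e + 1] += contrib;
    }
}
```

Interior nodes receive two half-contributions and boundary nodes one, so with element size h = 0.125 the assembled vector is 0.0625 at the ends and 0.125 inside. The same shape generalizes to face and node loops for finite volume and finite difference methods.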
 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
GTC Europe
Year:
2018
Session ID:
E8292
Streaming:
Download:
Share:
 
Abstract:

This talk provides an overview of the key strategies used to design and implement OpenStaPLE, an application for Lattice QCD (LQCD) Monte Carlo simulations. LQCD is an example of an HPC grand-challenge application, where the accuracy of results strongly depends on the available computing resources. OpenStaPLE has been developed on top of the MPI and OpenACC frameworks: MPI manages the parallelism across multiple computing nodes and devices, while OpenACC exploits the high-level parallelism available on modern processors and accelerators, enabling a good level of portability across different architectures. After an initial overview, we also present performance and portability results on different architectures, highlighting key hardware and software improvements that may lead this class of applications to better performance.

  Back
 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
GTC Europe
Year:
2018
Session ID:
E8317
Streaming:
Download:
Share:
 
Abstract:

VASP is a software package for atomic-scale materials modeling. It's one of the most widely used codes for electronic-structure calculations and first-principles molecular dynamics. We'll give an overview of the status of porting VASP to GPUs with OpenACC. Parts of VASP were previously ported to CUDA C with good speed-ups on GPUs, but also with an increase in the maintenance workload, because VASP is otherwise written wholly in Fortran. We'll discuss OpenACC performance relative to CUDA, the impact of OpenACC on VASP code maintenance, and challenges encountered in the port related to the management of aggregate data structures. Finally, we'll discuss possible future solutions for data management that would simplify both new development and the maintenance of VASP and similar large production applications on GPUs.

  Back
 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
GTC Europe
Year:
2018
Session ID:
E8367
Streaming:
Download:
Share:
 
Abstract:

In 2014, GENCI set up a French technology watch group that targets the provisioning of test systems, selected as part of a prospective approach among GENCI's partners. This was done in order to prepare scientific communities and users of GENCI's computing resources for the arrival of the next "Exascale" technologies. The talk will present results obtained on the OpenPOWER platform bought by GENCI and open to the scientific community. We will present the first results obtained for a set of scientific applications using the available environments (CUDA, OpenACC, OpenMP, etc.), along with results obtained for AI applications using IBM's software distribution PowerAI.

  Back
 
Topics:
HPC and Supercomputing, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Europe
Year:
2018
Session ID:
E8288
Streaming:
Download:
Share:
 
Abstract:
Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with unified memory, the multi-process service (MPS, aka Hyper-Q for MPI), and MPI support in NVIDIA performance analysis tools.  Back
 
Topics:
HPC and Supercomputing, Tools and Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8314
Streaming:
Download:
Share:
 
Abstract:
Learn about the latest developments in the high-performance message passing interface (MPI) over InfiniBand, iWARP, and RoCE (MVAPICH2) library that simplify the task of porting MPI applications to HPC and supercomputing clusters with NVIDIA GPUs. MVAPICH2 supports MPI communication directly from GPU device memory and optimizes it using various features offered by the CUDA toolkit, providing optimized performance on different GPU node configurations. These optimizations are integrated transparently under the standard MPI API, for better programmability. Recent advances in MVAPICH2 include designs for MPI-3 RMA using the GPUDirect RDMA framework, MPI datatype processing using CUDA kernels, support for GPUDirect Async, support for heterogeneous clusters with GPU and non-GPU nodes, and more. We use the popular Ohio State University micro-benchmark suite and example applications to demonstrate how developers can effectively take advantage of MVAPICH2 in applications using MPI and CUDA/OpenACC. We provide guidance on issues like processor affinity to GPU and network that can significantly affect the performance of MPI applications that use MVAPICH2.  Back
 
Topics:
HPC and Supercomputing, Tools and Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8373
Streaming:
Download:
Share:
 
Abstract:

Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with Unified Memory, the multi-process service (MPS, aka Hyper-Q for MPI), and MPI support in NVIDIA performance analysis tools.

  Back
 
Topics:
HPC and Supercomputing, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7133
Download:
Share:
 
Abstract:
The Gyrokinetic Toroidal Code developed in Princeton (GTC-P) delivers highly scalable plasma turbulence simulations at extreme scales on world-leading supercomputers such as Tianhe-2 and Titan. The aim of this work is to achieve portable performance from a single source code for GTC-P. We developed the first OpenACC implementation for GPU, CPU, and Sunway processors. The results show the OpenACC version achieved nearly 90% of the performance of the NVIDIA CUDA version on GPU and of the OpenMP version on CPU; the Sunway OpenACC version achieved a 2.5X speedup on the entire code. Our work demonstrates that OpenACC can deliver portable performance to complex real-science codes like GTC-P. In addition, we request adding thread-id support to the OpenACC standard to avoid expensive atomic operations for reductions.  Back
 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7193
Download:
Share:
 
Abstract:
We'll show a method that decreases random memory accesses on GPUs by splitting up calculations properly. The target application is unstructured low-order finite element analysis, the core application for manufacturing analyses. To reduce the memory access cost, we apply the element-by-element method for matrix-vector multiplication in the analysis. This method conducts a local matrix-vector computation for each element in parallel. Atomic and cache hardware in GPUs have improved, and we can exploit the data locality in the element-node connectivity by using atomic functions to add the local results. We port the code to GPUs using OpenACC directives and attain high performance with low development costs. We'll also describe the performance on NVIDIA DGX-1, which contains eight Pascal GPUs.  Back
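The element-by-element idea can be illustrated with a toy case. This is a hypothetical sketch, not the talk's code: a 1D chain of two-node elements with local matrix [[1,-1],[-1,1]], where each element computes its local matrix-vector product independently and atomic adds merge the results into the global vector.

```c
/* Hedged sketch of the element-by-element (EBE) matrix-vector product
 * y = A x for a 1D chain of two-node elements with local matrix
 * [[1,-1],[-1,1]]. The per-element work is embarrassingly parallel;
 * only the scatter into shared global nodes needs atomics. */
void ebe_matvec(int nelem, const double *x, double *y) {
    int nnode = nelem + 1;
    for (int i = 0; i < nnode; ++i) y[i] = 0.0;

    #pragma acc parallel loop copyin(x[0:nnode]) copy(y[0:nnode])
    for (int e = 0; e < nelem; ++e) {
        /* local matrix-vector computation for element e */
        double r0 =  x[e] - x[e + 1];
        double r1 = -x[e] + x[e + 1];
        /* atomic addition of local results into the global vector */
        #pragma acc atomic update
        y[e]     += r0;
        #pragma acc atomic update
        y[e + 1] += r1;
    }
}
```

Because neighboring elements share a node, two iterations may update the same `y` entry; the atomic updates make the parallel scatter safe while preserving the locality of the element-node connectivity.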
 
Topics:
HPC and Supercomputing, Computational Fluid Dynamics, Computational Physics, Computer Aided Engineering, Manufacturing Industries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7527
Download:
Share:
 
Abstract:
Optimizing data movement between host and device memories is an important step when porting applications to GPUs. This is true for any programming model (CUDA, OpenACC, OpenMP 4+, etc.), and becomes even more challenging with complex aggregate data structures (arrays of structs with dynamically allocated array members). The CUDA and OpenACC APIs expose the separate host and device memories, requiring the programmer or compiler to explicitly manage the data allocation and coherence. The OpenACC committee is designing directives to extend this explicit data management for aggregate data structures. CUDA C++ has managed memory allocation routines and CUDA Fortran has the managed attribute for allocatable arrays, allowing the CUDA driver to manage data movement and coherence. Future NVIDIA GPUs will support true unified memory, with operating system and driver support for sharing the entire address space between the host and the GPU. We'll compare and contrast the current and future explicit memory movement with driver- and system-managed memory, and discuss how future developments will affect application development and performance.  Back
 
Topics:
HPC and Supercomputing, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7628
Download:
Share:
 
Abstract:

We'll present the strategy and results for porting an atmospheric fluids code, HiGrad, to the GPU. HiGrad is a cross-compiled, mixed-language code that includes C, C++, and Fortran, and is used for atmospheric modeling. Deep subroutine calls necessitate detailed control of the GPU data layout with CUDA Fortran. We'll present initial kernel accelerations with OpenACC, then discuss tuning with OpenACC and a comparison with specially curated CUDA kernels. We'll demonstrate the performance improvements and the different techniques used for porting this code to GPUs, using a mixed CUDA Fortran and OpenACC implementation for single-node performance, and scaling studies conducted with MPI on local supercomputers and Oak Ridge National Laboratory's Titan supercomputer, on architectures including the Tesla K40 and Tesla P100.

  Back
 
Topics:
HPC and Supercomputing, Computational Fluid Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7735
Download:
Share:
 
Abstract:

We'll showcase recent successes in the use of GPUs to accelerate challenging molecular simulation analysis tasks on the latest NVIDIA Tesla P100 GPUs on both Intel and IBM/OpenPOWER hardware platforms, and large-scale runs on petascale computers such as Titan and Blue Waters. We'll highlight the performance benefits obtained from die-stacked memory on the Tesla P100, the NVIDIA NVLink interconnect on the IBM "Minsky" platform, and the use of NVIDIA CUDA just-in-time compilation to increase the performance of data-driven algorithms. We will present results obtained with OpenACC parallel programming directives, current challenges, and future opportunities. Finally, we'll describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations.

  Back
 
Topics:
HPC and Supercomputing, Accelerated Analytics, Computational Biology and Chemistry
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7382
Download:
Share:
 
Abstract:

Emerging heterogeneous systems are opening up a wealth of programming opportunities. This panel will discuss the latest developments in accelerator programming, where programmers have a choice among OpenMP, OpenACC, CUDA, and Kokkos for GPU programming. The panel will shed light on the primary objectives behind the choice of a model: availability across multiple platforms, a rich feature set, applicability to a certain type of scientific code, compiler stability, or other factors. This will be an interactive Q&A session where participants can discuss their experiences with programming-model experts and developers.

  Back
 
Topics:
HPC and Supercomputing, Programming Languages
Type:
Panel
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7564
Download:
Share:
Performance Optimization
Presentation
Media
Abstract:
Learn how to optimize large complex-number reductions in the materials science code BerkeleyGW on NVIDIA GPUs. Our talk will showcase two BerkeleyGW kernels implemented with four frameworks: CUDA, OpenACC, OpenMP 4.5, and Kokkos. We'll share the optimization techniques used to achieve decent performance across all four implementations. We'll also report on the status of OpenACC and OpenMP 4.5 compilers and compare the performance-portability capabilities of OpenACC, OpenMP 4.5, and Kokkos.  Back
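A complex-number reduction of the kind discussed here is often expressed as two scalar reductions, since directive reduction clauses operate on scalar arithmetic types. The sketch below is illustrative only (the function name and the dot-product kernel are assumptions, not BerkeleyGW code):

```c
#include <complex.h>

/* Hedged sketch: a complex reduction sum_i a[i] * conj(b[i]) split
 * into separate real and imaginary scalar accumulators, so a plain
 * reduction(+:...) clause applies. */
double _Complex dotc(int n, const double _Complex *a,
                     const double _Complex *b) {
    double re = 0.0, im = 0.0;
    #pragma acc parallel loop reduction(+:re, im) copyin(a[0:n], b[0:n])
    for (int i = 0; i < n; ++i) {
        double _Complex t = a[i] * conj(b[i]);
        re += creal(t);    /* scalar reduction on the real part */
        im += cimag(t);    /* scalar reduction on the imaginary part */
    }
    return re + im * I;
}
```

The same two-accumulator shape translates almost mechanically to OpenMP 4.5, Kokkos reducers, or a CUDA block reduction, which is one reason the kernel makes a good cross-framework comparison case.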
 
Topics:
Performance Optimization, Tools and Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9626
Streaming:
Download:
Share:
 
Abstract:
Learn how OpenACC, a widely popular high-level, directive-based programming model, can help port radiation-transport scientific codes to large-scale heterogeneous systems built from state-of-the-art accelerators such as GPUs. Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages, and programming models, among other components, in order to expose enough parallelism to migrate large-scale applications to these massively powerful platforms. This talk will discuss the programming challenges, and their corresponding solutions, for porting a wavefront-based mini-application for Denovo, a production code for nuclear reactor modeling, using OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU achieves an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation.  Back
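The wavefront pattern behind this kind of sweep can be shown on a toy 2D recurrence. This is a generic illustration, not the Denovo mini-app: each cell depends on its west and north neighbors, so cells on the same anti-diagonal are independent and that inner loop can be parallelized with a directive.

```c
#define N 6

/* Hedged sketch of a wavefront sweep: u[i][j] depends on u[i-1][j]
 * and u[i][j-1]. Cells with the same anti-diagonal index d = i + j
 * are mutually independent, so the loop over a diagonal can run in
 * parallel while the outer loop over diagonals stays sequential. */
void sweep(double u[N][N]) {
    for (int d = 2; d <= 2 * (N - 1); ++d) {
        int ilo = (d - (N - 1) > 1) ? d - (N - 1) : 1;
        int ihi = (d - 1 < N - 1) ? d - 1 : N - 1;
        /* every cell on this diagonal reads only completed diagonals */
        #pragma acc parallel loop
        for (int i = ilo; i <= ihi; ++i) {
            int j = d - i;
            u[i][j] = u[i - 1][j] + u[i][j - 1];
        }
    }
}
```

With all boundary values set to 1 this recurrence produces binomial coefficients, u[i][j] = C(i+j, i), which makes correctness easy to check. The limitation the abstract alludes to is visible here: available parallelism grows and shrinks with the diagonal length.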
 
Topics:
Performance Optimization, Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8848
Streaming:
Download:
Share:
 
Abstract:
It is extremely challenging to move data between host and device memories when deeply nested, complex aggregate data structures are commonly used in an application. This talk dives into VASP, ICON, and other real-world applications to show how the deep-copy issue is solved in them with the PGI compiler and OpenACC APIs. The OpenACC 2.6 specification includes directives and rules that enable programmer-controlled manual deep copy, albeit in a form that can be intrusive in terms of the number of directives required. The OpenACC committee is designing new directives to extend explicit data management to aggregate data structures in a form that is more elegant and concise. The talk will also compare unified memory, manual deep copy, full deep copy, and true deep copy.  Back
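Manual deep copy in the OpenACC 2.6 style can be sketched on a minimal aggregate type. This is an illustrative example with assumed names (`Field`, `scale_field`), not code from VASP or ICON: the struct shell is copied first, then the dynamically allocated member gets its own data clause, which is exactly the per-member verbosity the abstract calls intrusive.

```c
/* Hedged sketch of OpenACC 2.6-style manual deep copy: the struct
 * shell is created on the device first, then the pointer member is
 * copied and attached with a separate clause. */
typedef struct {
    int n;
    double *vals;   /* dynamically allocated array member */
} Field;

void scale_field(Field *f, double s) {
    /* copy the shell, then the member -- order matters for attach */
    #pragma acc enter data copyin(f[0:1])
    #pragma acc enter data copyin(f->vals[0:f->n])

    #pragma acc parallel loop present(f)
    for (int i = 0; i < f->n; ++i)
        f->vals[i] *= s;

    /* reverse order on the way out: member first, then the shell */
    #pragma acc exit data copyout(f->vals[0:f->n])
    #pragma acc exit data delete(f[0:1])
}
```

A struct with several pointer members needs one such clause pair per member, which is the motivation for the more concise full/true deep-copy directives the committee is designing.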
 
Topics:
Performance Optimization, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8805
Streaming:
Download:
Share:
 
Abstract:
Learn a simple strategy guideline to optimize application runtimes. The strategy is based on four steps and illustrated on a two-dimensional Discontinuous Galerkin solver for computational fluid dynamics on structured meshes. Starting from a sequential CPU code, we guide the audience through the different steps that allowed us to speed up the code on a GPU to around 149 times its original runtime (evaluated on a K20Xm). The same optimization strategy applied to the CPU code improves performance to around 35 times the original runtime (evaluated on an E5-1650v3 processor). Finally, different hardware architectures (Xeon CPUs, GPUs, KNL) are benchmarked with the native CUDA implementation and one based on OpenACC.  Back
 
Topics:
Performance Optimization, Algorithms and Numerical Techniques, Computational Fluid Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7626
Download:
Share:
 
Abstract:

OpenACC is a directive-based programming model that provides a simple interface to exploit GPU computing. As the GPU employs deep memory hierarchy, appropriate management of memory resources becomes crucial to ensure performance. The OpenACC programming model offers the cache directive to use on-chip hardware (read-only data cache) or software-managed (shared memory) caches to improve memory access efficiency. We have implemented several strategies to promote the shared memory utilization in our PGI compiler suite. We'll briefly discuss our investigation of cases that can be potentially optimized by the cache directive and then dive into the underlying implementation. Our compiler is evaluated with self-written micro-benchmarks as well as some real-world applications. 

  Back
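The cache directive discussed above can be shown on a small stencil. This is a hedged, generic sketch (function name and coefficients are illustrative): the directive inside the loop asks the compiler to stage the reused window of `in` in on-chip shared memory, which is exactly the kind of case the talk's compiler work targets.

```c
/* Hedged sketch: a 3-point stencil where each input element is read
 * by up to three iterations. The cache directive hints that the
 * sliding window of `in` should be staged in software-managed
 * (shared) memory rather than re-fetched from global memory. */
void stencil3(int n, const double *in, double *out) {
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[1:n-2])
    for (int i = 1; i < n - 1; ++i) {
        #pragma acc cache(in[i - 1:3])
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
    }
}
```

Whether the hint pays off depends on how the compiler tiles the loop and maps the cached window to shared memory, which is precisely what the implementation strategies in this talk address.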
 
Topics:
Performance Optimization, Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7636
Download:
Share:
Programming Languages
Presentation
Media
Abstract:
Learn how to take an application from slow, serial execution to blazing fast GPU execution using OpenACC, a directives-based parallel programming language that works with C, C++, and Fortran. By the end of this session participants will know the basics of using OpenACC to write an accelerated application that runs on multicore CPUs and GPUs with minimal code changes. No prior GPU programming experience is required, but the ability to understand C, C++, or Fortran code is necessary.  Back
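A minimal example in the spirit of this tutorial is a saxpy loop: one directive lets the same source build for serial CPU, multicore CPU, or GPU targets, depending only on compiler flags. The function itself is a standard illustration, not taken from the session.

```c
/* Hedged sketch of the tutorial's starting point: a serial loop made
 * parallel with a single OpenACC directive. Without an OpenACC
 * compiler the pragma is ignored and the loop runs serially; with one,
 * the same source targets multicore CPUs or GPUs. */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The `copyin`/`copy` clauses document the data movement: `x` is only read on the device, while `y` is read and written back, which is the "minimal code changes" workflow the session advertises.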
 
Topics:
Programming Languages, HPC and AI
Type:
Tutorial
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9262
Streaming:
Download:
Share:
 
Abstract:
As GPU computing nodes begin packing in an increasing number of GPUs, programming to maximize performance across GPUs in a system is becoming a challenge. We'll discuss techniques to extend your GPU applications from using one GPU to using many GPUs. By the end of the session, you'll understand the relative trade-offs in each of these approaches and how to choose the best approach for your application. Some prior OpenACC or GPU computing experience is recommended for this talk.  Back
 
Topics:
Programming Languages, HPC and AI
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9263
Streaming:
Download:
Share:
 
Abstract:
We'll discuss the C++17 parallel algorithms, which were designed to support GPU parallel programming. They include parallel versions of many existing algorithms, and a few new algorithms designed for efficient parallel execution of scans and reductions. The PGI C++ compiler has implemented these parallel algorithms for NVIDIA GPUs, making it possible in some cases to run standard C++ on GPUs with no directives, pragmas, or annotations. We will share our experiences and performance results for several of the parallel algorithms. We'll also explain the capabilities of the PGI implementation relative to CUDA, Thrust, and OpenACC.  Back
 
Topics:
Programming Languages, Tools and Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9770
Streaming:
Download:
Share:
 
Abstract:
GPUs are often the fastest way to obtain your scientific results, but many students and domain scientists don't know how to get started. In this tutorial we will take an application from simple, serial loops to a fully GPU-enabled application. Students will learn a profile-guided approach to accelerating applications, including how to find hotspots, how to use OpenACC to accelerate important regions of code, and how to get the best performance they can on GPUs. No prior experience in GPU programming or OpenACC is required, but experience with C, C++, or Fortran is a must. Several books will be given away to attendees who complete this tutorial.  Back
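The profile-guided pattern taught in tutorials like this one usually ends with a structure of the following shape. This is a hedged, generic Jacobi-style sketch (names and sizes are illustrative): the hotspot loops carry `parallel loop` directives, and an enclosing data region keeps both arrays resident on the device across iterations instead of copying them every sweep.

```c
/* Hedged sketch of the profile-guided end state: hotspot loops are
 * parallelized, and the data region in the driver keeps arrays
 * resident on the device for the whole iteration, eliminating the
 * per-iteration transfers a first naive port would incur. */
void jacobi(int n, double *a, double *anew, int iters) {
    #pragma acc data copy(a[0:n]) copyin(anew[0:n])
    for (int it = 0; it < iters; ++it) {
        #pragma acc parallel loop present(a, anew)
        for (int i = 1; i < n - 1; ++i)
            anew[i] = 0.5 * (a[i - 1] + a[i + 1]);

        #pragma acc parallel loop present(a, anew)
        for (int i = 1; i < n - 1; ++i)
            a[i] = anew[i];
    }
}
```

Profiling typically reveals the transfers, not the kernels, as the first bottleneck, which is why adding the `data` region is usually the step with the biggest payoff.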
 
Topics:
Programming Languages, HPC and Supercomputing
Type:
Tutorial
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8382
Streaming:
Share:
 
Abstract:
While OpenACC focuses on coding productivity and portability, CUDA enables extracting the maximum performance from NVIDIA GPUs. OmpSs, on the other hand, is a GPU-aware task-based programming model which may be combined with CUDA, and recently with OpenACC as well. Using OpenACC we will start benefiting from GPU computing, obtaining great coding productivity and nice performance improvements. We can next fine-tune the critical application parts developing CUDA kernels to hand-optimize the problem. OmpSs combined with either OpenACC or CUDA will enable seamless task parallelism leveraging all system devices.  Back
 
Topics:
Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8351
Streaming:
Download:
Share:
 
Abstract:
We'll present our experience using OpenACC to port GTC-P, a real-world plasma turbulence simulation, to the NVIDIA P100 GPU and to SW26010, the Chinese home-grown many-core processor. Meanwhile, we developed the GTC-P code with the native approach on the Sunway TaihuLight supercomputer so that we could analyze the performance gap between OpenACC and the native approach on the P100 GPU and SW26010. The experimental results show that the performance gap between OpenACC and CUDA on the P100 GPU is less than 10% with the PGI compiler. However, the gap on SW26010 is more than 50%, since register-level communication, supported only by the native approach, can avoid low-efficiency main memory accesses. Our case study demonstrates that OpenACC can deliver impressively portable performance on the P100 GPU, but the lack of a software cache via RLC support in the OpenACC compiler on SW26010 results in a large performance gap between OpenACC and the native approach.  Back
 
Topics:
Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8637
Streaming:
Download:
Share:
 
Abstract:
CUDA Unified Memory for NVIDIA Tesla GPUs offers programmers a unified view of memory on GPU-accelerated compute nodes. The CPUs can access GPU high-bandwidth memory directly, the GPUs can access CPU main memory directly, and memory pages migrate automatically between the two when the CUDA Unified Memory manager determines it is performance-profitable. PGI OpenACC compilers now leverage this capability on allocatable data to dramatically simplify parallelization and incremental optimization of HPC applications for GPUs. In the future it will extend to all types of data, and programmer-driven data management will become an optimization rather than a requirement. This talk will summarize the current status and near future of OpenACC programming and optimization for GPU-accelerated compute nodes with CUDA Unified Memory.  Back
 
Topics:
Programming Languages
Type:
Talk
Event:
SIGGRAPH
Year:
2017
Session ID:
SC1718
Download:
Share:
 
Abstract:
MPAS-A is a general circulation (global) model of the Earth's atmosphere that is designed to work down to so-called non-hydrostatic scales, where convective (vertical) cloud processes are resolved. To date, MPAS-A has been used primarily for meteorological research applications, although climate applications in the Community Earth System Model (CESM) are being contemplated. At a high level, MPAS-A consists of a dynamics part, a fluid-flow solver that integrates the non-hydrostatic, time-dependent, nonlinear partial differential equations of the atmosphere, and a physics part, which computes the forcings of these equations due to radiative transport, cloud physics, and surface and near-surface processes. The dynamics is in turn divided into dry dynamics and moist dynamics. Algorithmically, the dynamics uses a finite volume (FV) method on an unstructured centroidal Voronoi mesh (grid, or tessellation) with a C-grid staggering of the state variables as the basis for the horizontal discretization. As part of NCAR's Weather and Climate Alliance (WACA) project, a team consisting of NCAR staff, faculty and students at the University of Wyoming, and a group of NVIDIA and PGI developers produced a portable, multi-GPU implementation of the dry dynamical core using OpenACC. The work began in May 2016 and was completed in May 2017. Benchmarks of the OpenACC version (single source code) of the dry dynamical core (approximately 35,000 lines of code) on the Pascal GPU show that a single P100 is 2.7 times faster than a dual-socket node of 18-core-per-socket Intel Xeon E5-2697V4 Broadwell processors. Put another way, the ported dry dynamics achieves a P100 performance that is 97 times the performance of a single Intel Xeon v4 core, while simultaneously maintaining good performance of the source on traditional Xeon architectures.
The WACA team, together with a new collaboration with the Korea Institute of Science and Technology Information (KISTI), is currently porting the moist dynamics and physics parts of MPAS-A.  Back
 
Topics:
Programming Languages
Type:
Talk
Event:
SIGGRAPH
Year:
2017
Session ID:
SC1726
Share:
 
Abstract:
Discover how the OmpSs programming model enables you to combine different programming models such as OpenACC, multi-threaded programming, CUDA, and OpenCL while providing a single address space and directionality compiler directives. OmpSs is a flagship project at the Barcelona Supercomputing Center, as well as a forerunner of OpenMP. We'll present the advantages in terms of coding productivity and performance brought by our recent work integrating OpenACC kernels within the OmpSs programming model, as a step forward from our previous OmpSs + CUDA support. We'll also present how to use hybrid GPU and CPU execution together without any code modification, via our runtime system.  Back
 
Topics:
Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7192
Download:
Share:
 
Abstract:
We'll dive deeper into using OpenACC and explore potential solutions to the challenges faced while parallelizing an irregular algorithm, the sparse Fast Fourier Transform (sFFT). We'll analyze code characteristics using profilers; discuss the optimizations applied, the things we did right and the things we did wrong; and cover the roadblocks we faced and the steps taken to overcome them. We'll highlight how to compare data reproducibility between accelerators in heterogeneous platforms, and report on the algorithmic changes required to move from sequential to parallel code, especially for an irregular algorithm, while using OpenACC. The results will demonstrate how OpenACC can produce a portable, productive, and maintainable codebase without compromising performance.
 
Topics:
Programming Languages, Algorithms and Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7478
Download:
Share:
 
Abstract:
We'll discuss techniques for using more than one GPU in an OpenACC program: how to address multiple devices, how to mix OpenACC with OpenMP to manage several devices, and how to use multiple devices with OpenACC and MPI.

 
Topics:
Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7546
Download:
Share:
Science and Research
Presentation
Media
Abstract:
We'll guide you step by step through porting and optimizing an oil-and-gas mini-application to efficiently leverage the computing power of NVIDIA GPUs. While OpenACC focuses on coding productivity and portability, CUDA extracts the maximum performance from NVIDIA GPUs. OmpSs, on the other hand, is a GPU-aware, task-based programming model that may be combined with CUDA and, recently, with OpenACC as well. Starting with OpenACC, we'll begin benefiting from GPU computing, obtaining great coding productivity and a nice performance improvement. We'll then fine-tune the critical application parts with hand-optimized CUDA kernels. Finally, OmpSs combined with either OpenACC or CUDA enables seamless task parallelism across all the devices in the system.
 
Topics:
Science and Research
Type:
Instructor-Led Lab
Event:
GTC Europe
Year:
2017
Session ID:
53020
Download:
Share:
Tools and Libraries
Presentation
Media
Abstract:
Get your hands on the latest versions of Score-P and Vampir to profile the execution behavior of your large-scale GPU-accelerated applications. See how these HPC community tools pick up where other tools (such as NVVP) drop off when your application spans multiple compute nodes. Regardless of whether your application uses CUDA, OpenACC, OpenMP, or OpenCL for acceleration, or whether it is written in C, C++, Fortran, or Python, you will get a high-resolution timeline view of all program activity alongside the standard profiles to identify hot spots and avenues for optimization. The new Python support also enables performance studies for optimizing the inner workings of deep learning frameworks.
 
Topics:
Tools and Libraries, HPC and Supercomputing
Type:
Tutorial
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9347
Streaming:
Download:
Share:
 
Abstract:
Debugging and analyzing NVIDIA GPU-based HPC applications requires a tool that supports the demands of today's complex CUDA applications. Debuggers must deal with the extensive use of C++ templates, the STL, many shared libraries, and debugging optimized code. They need to seamlessly support debugging both host and GPU code, Python, and C/C++ mixed-language applications. They must also scale to today's multi-GPU cluster supercomputers such as Summit and Sierra. We'll discuss the advanced technologies provided by the TotalView for HPC debugger and explain how they're used to analyze and debug complex CUDA applications, making code easy to understand and difficult problems quick to solve. We'll also show TotalView's new user interface. Learn how to easily debug multi-GPU environments and OpenACC, and see a unified debugging view for Python applications that leverage C++ Python extensions such as TensorFlow.
 
Topics:
Tools and Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9378
Streaming:
Download:
Share: