In this session we explore how to analyze and optimize the performance of GPU-accelerated applications. Working with a real-world example, attendees will learn how to analyze application performance by measuring data transfers, unified memory page migrations, and inter-GPU communication, and by performing critical path analysis. Using this application and NVIDIA's profiling tools as an example tool set, we will walk through various optimizations and discuss their impact on the performance of the whole application. This session is accompanied by Session S7444, which considers performance optimization of GPU kernels.
On the path to exascale, high performance computing adopts wider and wider processors that need more parallelism. The energy required to move data and the available bandwidth pose significant challenges. See how an efficient implementation of iterative Krylov solvers can help deal with these issues. As an example, we consider the block conjugate gradient solver in QUDA, a library for lattice quantum chromodynamics. We demonstrate how an efficient implementation can overcome scaling issues and achieve a 10X speedup compared to a regular conjugate gradient solver.
We'll present a real CUDA application and use NVIDIA Nsight Eclipse Edition on Linux to optimize the performance of the code. Attendees will learn a method to analyze their codes and how to use the tools to apply that method.
Accelerators have become a key ingredient in HPC. GPUs had a head start and are already widely used in HPC applications, but they now face competition from Intel's Xeon Phi accelerators. The latter promise comparable performance and easier portability, and even feature a higher memory bandwidth, which is key to good performance for a wide range of bandwidth-bound HPC applications. In this session we compare their performance using a Lattice QCD application as a case study. We give a short overview of the relevant features of the architectures and discuss some implementation details. Learn about the effort it takes to achieve great performance on both architectures. See which accelerator is more energy efficient and which one takes the performance crown at about 500 GFlop/s.
Discover how data from experiments at heavy-ion colliders (the Relativistic Heavy Ion Collider at Brookhaven National Lab and the Large Hadron Collider at CERN) can immediately be compared with first-principles simulations of Quantum Chromodynamics (QCD) to quantitatively probe the fundamental properties of strongly interacting matter, i.e., quarks and gluons at high temperature. The conditions realized in the experiments governed the early evolution of the universe. The necessary high precision for these comparisons is obtained by performing our calculations entirely on the GPU. In doing so we simultaneously face a low flop/byte ratio and high register pressure. See how we deal with these complications and achieve high performance on the Bielefeld GPU cluster with 400 Fermi GPUs.
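To illustrate why a low flop/byte ratio matters, here is a minimal sketch of the standard roofline estimate: attainable throughput is capped either by the compute peak or by memory bandwidth times the flop/byte ratio. The function name `attainable_gflops` and all numbers below are hypothetical, chosen only for illustration; they are not measurements from the session or specifications of any particular GPU.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: the smaller of the compute peak and the
    throughput that memory traffic can sustain."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical device parameters, for illustration only.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 150.0

for intensity in (0.5, 1.0, 8.0):
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    regime = "bandwidth-bound" if bound < PEAK_GFLOPS else "compute-bound"
    print(f"{intensity:4.1f} flop/byte: at most {bound:6.1f} GFlop/s ({regime})")
```

Under these assumed numbers, a kernel at 0.5 flop/byte can reach only a small fraction of the compute peak regardless of how well its arithmetic is tuned, which is why such kernels are optimized for data movement first.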