Understanding and characterizing the performance problems of CPU-GPU programs, and providing insightful feedback that guides programmers in tuning their applications, are critical to improving developer productivity. HPCToolkit is a state-of-the-art performance analysis tool that employs statistical sampling of timers and hardware counters and attributes performance metrics to the hierarchical calling context. We extend HPCToolkit to measure and attribute the performance of hybrid CPU-GPU codes. We present CPU-GPU blame shifting, a technique for identifying code regions that underutilize CPU and/or GPU compute resources. We demonstrate the effectiveness of our tools on diverse scientific codes, including hydrodynamics (LULESH), molecular dynamics (LAMMPS), and epidemiology simulation (GPU-EpiSimdemics).
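The abstract above names CPU-GPU blame shifting without describing its mechanics. The following is a minimal sketch of one plausible reading, not HPCToolkit's implementation: at each sample, if the GPU is idle while the CPU is doing useful work, the idle time is charged to the CPU calling context; if the CPU is waiting while kernels run, the wait is split among those kernels. The function name `shift_blame`, the sample tuple layout, and the equal split among kernels are all assumptions made for illustration.

```python
from collections import defaultdict


def shift_blame(samples, period=1.0):
    """Attribute idleness to the code that plausibly caused it.

    samples: iterable of (cpu_context, cpu_waiting_on_gpu, active_kernels)
      cpu_context         -- hypothetical label for the CPU calling context
      cpu_waiting_on_gpu  -- True if the CPU thread is blocked on the GPU
      active_kernels      -- list of kernel names running at sample time
    period: time represented by one sample (assumed constant).
    """
    gpu_idle_blame = defaultdict(float)  # GPU idleness charged to CPU contexts
    cpu_idle_blame = defaultdict(float)  # CPU waiting charged to GPU kernels

    for cpu_context, cpu_waiting, active_kernels in samples:
        if not active_kernels and not cpu_waiting:
            # GPU has nothing to do while the CPU computes: blame the CPU code.
            gpu_idle_blame[cpu_context] += period
        elif cpu_waiting and active_kernels:
            # CPU is stalled on the GPU: split the blame among running kernels
            # (equal split is an assumption of this sketch).
            share = period / len(active_kernels)
            for kernel in active_kernels:
                cpu_idle_blame[kernel] += share
        # Otherwise both sides are busy (or both idle) and nothing is shifted.

    return dict(gpu_idle_blame), dict(cpu_idle_blame)


if __name__ == "__main__":
    trace = [
        ("main/setup", False, []),
        ("main/sync", True, ["kernelA", "kernelB"]),
    ]
    print(shift_blame(trace))
```

Under these assumptions, contexts with large GPU-idle blame are candidates for overlapping CPU work with kernel launches, and kernels with large CPU-idle blame are candidates for GPU-side tuning.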
This talk presents an overview of the implementation of a particle pusher that targets NVIDIA GPUs, obtained by extending a novel energy- and charge-conserving 1D electrostatic particle-pushing algorithm to a 2D electromagnetic version. Energy is conserved by using fully implicit time integration, and particles are carefully treated at cell boundaries to maintain charge conservation. The momentum in the system is controlled by an adaptive orbit integrator that compares a first-order and a second-order integration scheme. The implementation is based on the CUDA 4.1 framework. It exploits the GPU memory hierarchy by using texture memory to access the electric and magnetic fields and shared memory to accumulate the charge and current density before a global accumulation. We evaluate a red-black scheduling scheme for CUDA blocks that reduces contention during global accumulation, and we use multiple GPUs to perform the computation for different particle species. The CUDA implementation is showcased on a two-species (ion, electron) plasma physics application in which the particles are in equilibrium.
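The two-stage accumulation and red-black scheduling mentioned above can be illustrated with a small sequential model. This is a hedged sketch in plain Python, not the talk's CUDA code: a per-block list stands in for CUDA shared memory, the grid is 1D with linear weighting, and adjacent blocks are assumed to share exactly one boundary node. The name `deposit_red_black` and all parameters are hypothetical. The point it shows is that blocks of the same colour never write to the same global node, so a red phase followed by a black phase avoids write contention.

```python
import math


def deposit_red_black(positions, charge, n_cells, cells_per_block):
    """Deposit charge on a 1D grid of n_cells + 1 nodes in two stages.

    Assumes 0 <= x <= n_cells for every particle position x.
    Returns (global_rho, phases), where phases[colour] lists, per block,
    the global nodes that block wrote during that colour's phase.
    """
    n_blocks = math.ceil(n_cells / cells_per_block)
    global_rho = [0.0] * (n_cells + 1)

    # Stage 1: each block accumulates into its own local buffer
    # (a stand-in for CUDA shared memory). The buffer has one extra
    # node because the last node is shared with the next block.
    local = [[0.0] * (cells_per_block + 1) for _ in range(n_blocks)]
    for x in positions:
        cell = min(int(x), n_cells - 1)
        frac = x - cell
        block, offset = divmod(cell, cells_per_block)
        local[block][offset] += charge * (1.0 - frac)
        local[block][offset + 1] += charge * frac

    # Stage 2: global accumulation, even ("red") blocks first, then odd
    # ("black") blocks. Same-coloured blocks touch disjoint node ranges.
    phases = []
    for colour in (0, 1):
        touched = []
        for block in range(colour, n_blocks, 2):
            start = block * cells_per_block
            nodes = []
            for offset, value in enumerate(local[block]):
                node = start + offset
                if node <= n_cells:
                    global_rho[node] += value
                    nodes.append(node)
            touched.append(nodes)
        phases.append(touched)
    return global_rho, phases


if __name__ == "__main__":
    rho, phases = deposit_red_black([0.5, 3.75, 4.0, 7.25], 1.0, 8, 2)
    print(rho)
    print(phases)
```

On a GPU the two phases would correspond to two rounds of block execution separated by a synchronization point; whether the talk's implementation uses separate kernel launches or another mechanism is not stated in the abstract.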