Emerging heterogeneous systems are opening up many new programming opportunities. This panel will discuss the latest developments in accelerator programming, where programmers can choose among OpenMP, OpenACC, CUDA, and Kokkos for GPU programming. The panel will shed light on the primary considerations in choosing a model: its availability across multiple platforms, its rich feature set, its applicability to a certain type of scientific code, compiler stability, or other factors. This will be an interactive Q&A session in which participants can discuss their experiences with programming model experts and developers.
We will present early results from IBM Power8 systems equipped with NVLink-connected NVIDIA P100 GPUs. We will show comparative results with previous NVIDIA GPU generations for a set of synthetic and application benchmarks, highlighting the advances in the memory subsystem of the P100. In particular, the talk will demonstrate the impact of the new double-precision atomic add capabilities and will discuss some early exploration of the behavior of NVLink between the Power8 CPUs and the P100 GPUs.
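To make the double-precision atomic add point concrete: on GPU generations without a native instruction, the usual workaround is a compare-and-swap loop on the 64-bit pattern of the value, as documented in the CUDA programming guide. The following is a minimal host-side C++ analogue of that loop, not device code and not taken from the talk; the function names `atomic_add_cas`, `to_bits`, and `from_bits` are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Reinterpret a double as its 64-bit pattern and back (illustrative helpers).
inline std::uint64_t to_bits(double v) {
    std::uint64_t b;
    std::memcpy(&b, &v, sizeof b);
    return b;
}

inline double from_bits(std::uint64_t b) {
    double v;
    std::memcpy(&v, &b, sizeof v);
    return v;
}

// Emulated atomic add on a double stored as a 64-bit integer.
// Returns the value held before the addition, mirroring the
// convention of CUDA's atomicAdd.
inline double atomic_add_cas(std::atomic<std::uint64_t>& target, double value) {
    std::uint64_t old_bits = target.load();
    for (;;) {
        const double old_val = from_bits(old_bits);
        const std::uint64_t new_bits = to_bits(old_val + value);
        // On failure, compare_exchange_weak reloads old_bits with the
        // current contents, so the sum is recomputed on the next pass.
        if (target.compare_exchange_weak(old_bits, new_bits)) {
            return old_val;
        }
    }
}
```

Under contention every failed exchange repeats the load, add, and exchange, which is the overhead a native double-precision atomic add avoids.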
In this talk we demonstrate how LAMMPS uses Kokkos, a performance portability library for many-core devices, to implement a single code base for CPUs, NVIDIA GPUs, and Intel Xeon Phi co-processors. This portable code base performs as well as or better than LAMMPS' current generation of hardware-specific add-on packages.
The session will present implementation and optimization strategies for molecular dynamics many-body potentials. It will concentrate on the new SNAP potential for LAMMPS, which is based on the GAP bispectrum analysis of Bartók et al. [PRL 104, 136403 (2010)]. SNAP is fit to large amounts of quantum-based DFT data and is capable of reproducing the accuracy of DFT while still exhibiting linear scaling with the system size. By exploiting multiple parallelisation layers, it is possible to mitigate its high cost of 500,000 flops per interaction through excellent strong scaling behaviour down to 16 atoms per GPU. Thus the achievable time to solution on GPU clusters using SNAP is comparable to that of simple Lennard-Jones simulations.
Performance on manycore devices depends on data access patterns, and different devices (NVIDIA GPUs, Intel Xeon Phi, NUMA systems) require different access patterns. A performance-portable programming model does not force a false choice between arrays-of-structures and structures-of-arrays; instead, it defines abstractions that transparently adapt data structures to meet device requirements. The KokkosArray library implements this strategy through simple and intuitive multidimensional array abstractions. Usability and performance portability are demonstrated with proxy applications for finite element and molecular dynamics codes. MiniMD, a proxy application for the LAMMPS molecular dynamics code, has implementations in OpenMP, OpenCL, CUDA, and now KokkosArray. A comparison of miniMD's KokkosArray implementation with the previous three versions demonstrates the relative strengths and weaknesses of KokkosArray and shows that the portable version retains about 95% of the performance of the "native" versions. Multiphysics applications with heterogeneous finite element discretizations have complex and highly irregular data structures. A KokkosArray-based prototype library for unstructured heterogeneous finite element meshes, and its support for heterogeneous manycore parallel computations, will be presented.
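The layout idea described above can be sketched in plain C++. This is an illustrative sketch only, not the KokkosArray API: the names `Array2D`, `LayoutRight`, `LayoutLeft`, and `shift_x` are assumptions made for the example. The kernel is written once against `pos(i, d)` indexing, and the memory layout is chosen by a template parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies mapping (i, j) to a linear offset.
struct LayoutRight {  // last index contiguous: array-of-structures style
    static std::size_t index(std::size_t i, std::size_t j,
                             std::size_t /*n0*/, std::size_t n1) {
        return i * n1 + j;
    }
};

struct LayoutLeft {   // first index contiguous: structure-of-arrays style
    static std::size_t index(std::size_t i, std::size_t j,
                             std::size_t n0, std::size_t /*n1*/) {
        return i + j * n0;
    }
};

// Minimal two-dimensional array whose storage order is set by Layout.
template <class T, class Layout>
class Array2D {
public:
    Array2D(std::size_t n0, std::size_t n1)
        : n0_(n0), n1_(n1), data_(n0 * n1) {}

    T& operator()(std::size_t i, std::size_t j) {
        assert(i < n0_ && j < n1_);
        return data_[Layout::index(i, j, n0_, n1_)];
    }

    const T* data() const { return data_.data(); }

private:
    std::size_t n0_, n1_;
    std::vector<T> data_;
};

// Layout-agnostic kernel: shifts the x coordinate of every atom.
// The same source works for either layout.
template <class Positions>
void shift_x(Positions& pos, std::size_t natoms, double dx) {
    for (std::size_t i = 0; i < natoms; ++i) {
        pos(i, 0) += dx;
    }
}
```

With `Array2D<double, LayoutRight>` the three coordinates of one atom are adjacent in memory, which suits a CPU thread walking one atom at a time. With `Array2D<double, LayoutLeft>` the x coordinates of consecutive atoms are adjacent, which suits devices where consecutive threads handle consecutive atoms. The kernel source does not change between the two.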