GROMACS is a state-of-the-art molecular simulation package that employs extensive multi-level heterogeneous parallelization. Our new CUDA-based algorithms provide a 4x speedup over hand-tuned CPU SIMD assembly and unprecedented absolute performance. However, the heterogeneity of the hardware and the inherent bottlenecks involved make efficient resource utilization and strong scaling very challenging. This advanced session describes our recent efforts on multi-level load balancing, kernel execution strategies, CPU-GPU work splitting, and ways to exploit Kepler features such as Hyper-Q. Join us to talk about the current limits of GPU acceleration in MD, and how to take molecular dynamics simulations to 100 microseconds/iteration, equivalent to 10,000 fps, in the near future!
The Kepler GPU architecture introduces a new instruction: SHFL. This instruction allows the threads of a warp to exchange register values directly, without going through shared memory. In some cases, using the SHFL ("shuffle") instruction can significantly improve performance. In this session we will present code patterns where SHFL helps improve the performance of your applications.
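To make the idea concrete, here is a minimal sketch of one common shuffle pattern, a warp-level sum reduction. It is an illustration only, not code from the session. It assumes a launch with exactly one full warp of 32 threads and uses `__shfl_down_sync`, the CUDA 9+ form of the intrinsic (Kepler-era code used `__shfl_down` without the mask argument). The names `warp_reduce_sum`, `sum_kernel`, and `warp_sum_host` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Sum the values held by the 32 lanes of a warp using register shuffles only.
// No shared memory and no __syncthreads() are needed.
__device__ float warp_reduce_sum(float v) {
    const unsigned full_mask = 0xffffffffu;  // assumption: all 32 lanes are active
    // Each step adds the value held by the lane `offset` positions higher.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(full_mask, v, offset);
    }
    return v;  // only lane 0 is guaranteed to hold the complete sum
}

// Assumes a launch of exactly one block of 32 threads.
__global__ void sum_kernel(const float* in, float* out) {
    float v = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) {
        *out = v;
    }
}

// Hypothetical host helper: sums exactly 32 floats on the GPU.
float warp_sum_host(const float* h_in) {
    float* d_in = nullptr;
    float* d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    sum_kernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `warp_sum_host` on an array of 32 floats returns their sum. The shared-memory version of the same reduction needs a per-block buffer and a barrier between steps, which is the overhead the shuffle variant avoids.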
The goal of this session is to present advanced techniques for optimizing CUDA code. In particular, we will demonstrate the use of advanced CUDA instructions (inline PTX, warp instructions, "extended" syncthreads) and load-balancing strategies to improve the performance of a sparse matrix-matrix multiplication on the GPU.
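As a rough illustration of the three instruction families named above, the sketch below counts the nonzero entries of a small array in two ways: per warp with a vote (`__ballot_sync` plus `__popc`), and per block with the counting barrier `__syncthreads_count`. The lane index is read through inline PTX. This is not the session's sparse matrix-matrix multiplication code. The names `lane_id`, `count_nonzeros`, and `count_nonzeros_host` are hypothetical, the launch is assumed to be a single block of 128 threads with 1 <= n <= 128, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Inline PTX: read the lane index from the special register %laneid.
__device__ unsigned lane_id() {
    unsigned lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

// counts[0]: nonzeros accumulated from per-warp votes.
// counts[1]: nonzeros reported by the block-wide counting barrier.
// Assumes one block whose size is a multiple of 32, so every warp is full.
__global__ void count_nonzeros(const float* vals, int n, int* counts) {
    int i = threadIdx.x;
    // The bounds check short-circuits, so vals[i] is never read out of range.
    int pred = (i < n) && (vals[i] != 0.0f);

    // Warp instruction: one bit per lane whose predicate is true.
    unsigned ballot = __ballot_sync(0xffffffffu, pred);
    if (lane_id() == 0) {
        atomicAdd(&counts[0], __popc(ballot));
    }

    // "Extended" syncthreads: a barrier that also returns the number of
    // threads in the block whose predicate is nonzero.
    int total = __syncthreads_count(pred);
    if (i == 0) {
        counts[1] = total;
    }
}

// Hypothetical host helper; fills h_counts[0] and h_counts[1].
void count_nonzeros_host(const float* h_vals, int n, int* h_counts) {
    float* d_vals = nullptr;
    int* d_counts = nullptr;
    cudaMalloc(&d_vals, n * sizeof(float));
    cudaMalloc(&d_counts, 2 * sizeof(int));
    cudaMemcpy(d_vals, h_vals, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, 2 * sizeof(int));
    count_nonzeros<<<1, 128>>>(d_vals, n, d_counts);
    cudaMemcpy(h_counts, d_counts, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_vals);
    cudaFree(d_counts);
}
```

Both counters should agree. Per-row nonzero counts of this kind are one plausible input to a load-balancing decision for sparse kernels, though the strategy presented in the session may differ.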