We'll introduce cuTT, a tensor transpose library for GPUs that on average achieves over 70% of the attainable memory bandwidth, independent of tensor rank. Tensor transposition is important in many applications, including multi-dimensional Fast Fourier Transforms, deep learning, and quantum chemistry calculations. Until now, no runtime library existed that fully utilized the remarkable memory bandwidth of GPUs and performed well independent of tensor rank. We'll describe two transpose algorithms, "Tiled" and "Packed," which achieve high memory bandwidth in most use cases, as well as variations of them that handle many important corner cases. We'll also discuss a heuristic method based on GPU performance modeling that helps cuTT choose the best algorithm for a particular use case. Finally, we'll present benchmarks for tensor ranks 2 to 12 and show that cuTT, a fully runtime library, performs as well as an approach based on code generation.
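To make concrete what a tensor transpose computes, here is a minimal CPU-side sketch in plain Python. It is not cuTT's API or its GPU implementation: the function names (`strides_for`, `transpose_reference`, `transpose_tiled_2d`) and the permutation convention (output axis k is input axis `perm[k]`) are assumptions chosen for illustration. The second function only illustrates the general idea of processing a rank-2 transpose in small square tiles, which GPU kernels commonly do to keep both reads and writes contiguous; the "Tiled" algorithm described in the talk may differ in its details.

```python
from itertools import product


def strides_for(dims):
    """Row-major strides for a dense tensor with the given dimensions."""
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    return strides


def transpose_reference(data, dims, perm):
    """Permute the axes of a dense row-major tensor stored as a flat list.

    Assumed convention: output axis k is input axis perm[k].
    Returns the permuted flat list and the output dimensions.
    """
    out_dims = [dims[p] for p in perm]
    in_strides = strides_for(dims)
    out = [None] * len(data)
    # product() walks the output indices in row-major order, so `pos`
    # is the linear offset into the output tensor.
    for pos, out_index in enumerate(product(*(range(d) for d in out_dims))):
        src = sum(out_index[k] * in_strides[perm[k]] for k in range(len(perm)))
        out[pos] = data[src]
    return out, out_dims


def transpose_tiled_2d(data, rows, cols, tile=32):
    """Rank-2 transpose that visits the matrix one tile x tile block at a time.

    The tile size of 32 is an illustrative default, not a value taken
    from the talk.
    """
    out = [None] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c * rows + r] = data[r * cols + c]
    return out
```

For a 2 x 3 matrix stored as `list(range(6))`, both `transpose_reference(data, [2, 3], [1, 0])[0]` and `transpose_tiled_2d(data, 2, 3, tile=2)` produce the same 3 x 2 result. The reference version works for any rank but reads memory with arbitrary strides, which is the access pattern a bandwidth-oriented GPU library has to avoid.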
Learn about recent performance improvements in the GPU acceleration of the NAMD biomolecular modeling application. These improvements include performance gains in the non-bonded CUDA kernels and a new GPU-only implementation of the Particle Mesh Ewald (PME) reciprocal computation. We will describe in detail the changes to the non-bonded CUDA kernels that give 1.4 to 1.7 times better performance than the previous version. We will also describe the new PME reciprocal code, which enables computation on multiple GPUs and is 1.4 to 1.8 times faster than the previous code.
Running the latest versions of GPU-accelerated applications maximizes performance and improves user productivity. The latest version, NAMD 2.11, provides up to 7x* speedup on GPUs over CPU-only systems and up to 2x the performance of NAMD 2.10. Watch this on-demand webinar to hear experts from NVIDIA and NAMD answer your NAMD and GPU-related questions, ranging from installation to job optimization. *Dual CPU server, Intel E5-2698, NVIDIA Tesla K80 with ECC off, Autoboost on; STMV dataset.
This is a first snapshot of heterogeneous CPU+GPU Molecular Dynamics (MD) in CHARMM, covering its performance and accuracy. The GPU is used only for the direct part of the force calculation; the CPU computes all other contributions (reciprocal, bonded, SHAKE, etc.). The GPU code was implemented natively in CHARMM using CUDA C. The MD engine is built around the DOMDEC domain decomposition code and therefore naturally enables MD simulations on multiple CPU+GPU nodes. We will present discoveries that used features implemented in DOMDEC_GPU, showing the current usefulness of the code and of GPUs for biomolecular simulation, for advanced sampling techniques, and for enabling DOE/NREL efforts toward affordable consumer biofuels.