Learn strategies for efficiently employing various cascaded compression algorithms on the GPU. Many database input fields are amenable to compression because they contain repeating or gradually increasing patterns, such as dates and quantities. Fast implementations of decompression algorithms such as RLE-Delta will be presented. By utilizing compression, we can achieve 10 times greater effective read bandwidth than the interconnect allows for raw data transfers. However, I/O bottlenecks still play a big role in overall performance, and data has to be moved efficiently into and out of the GPU to sustain an optimal decompression rate. After a deep dive into the implementation, we'll show a real-world example of how BlazingDB leverages these compression strategies to accelerate database operations.
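To make the idea of a cascaded scheme concrete, here is a minimal CPU-side sketch in Python of one possible RLE-Delta pipeline: delta encoding first, then run-length encoding of the deltas. The function names and the choice of storing the first value as the leading delta are assumptions made for illustration; this is not the implementation presented in the talk, and a GPU version would replace the loops with parallel primitives such as prefix sums.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store the difference between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def rle_encode(values):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_decode(runs):
    """Expand each [value, run_length] pair back into a flat list."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def delta_decode(deltas):
    """An inclusive prefix sum over the deltas restores the original values."""
    return list(accumulate(deltas))


def compress(values):
    # Cascade: delta first, so a steadily increasing column becomes long runs.
    return rle_encode(delta_encode(values))


def decompress(runs):
    # Undo the cascade in reverse order.
    return delta_decode(rle_decode(runs))


# Hypothetical column of gradually increasing values, e.g. day numbers.
column = [100, 101, 102, 103, 104, 105, 105, 105]
packed = compress(column)
restored = decompress(packed)
```

In this sketch the eight-value column shrinks to three runs, which is the effect the abstract relies on for dates and quantities: the delta stage turns a steady increase into a constant, and the RLE stage then stores that constant once.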
Running the latest versions of GPU-accelerated applications maximizes performance and improves user productivity. The latest version, NAMD 2.11, provides up to 7x* speedup on GPUs over CPU-only systems and up to 2x the performance of NAMD 2.10. Watch this on-demand webinar to hear experts from NVIDIA and NAMD answer your NAMD and GPU-related questions, ranging from installation to job optimization. *Dual-CPU server (Intel E5-2698), NVIDIA Tesla K80 with ECC off, Autoboost on; STMV dataset.
Learn about various methods and trade-offs in the distributed GPU implementation of a molecular dynamics proxy application that achieves more than 90% weak-scaling efficiency on 512 GPU nodes. CoMD is a reference implementation of classical molecular dynamics algorithms and workloads. It is created and maintained by the Exascale Co-Design Center for Materials in Extreme Environments (ExMatEx) and is part of the R&D100 Award-winning Mantevo 1.0 software suite. In this talk we will discuss the main techniques and methods involved in the GPU implementation of CoMD, including (1) cell-based and neighbor-list approaches to neighbor-particle search, and (2) different thread-mapping strategies and memory layouts. An efficient distributed implementation will be covered in detail. Separating interior cells from boundary cells allows efficient asynchronous processing and concurrent execution of kernels, memory copies, and MPI transfers.
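As a rough illustration of the cell-based approach to neighbor search mentioned above, the following Python sketch bins particles into cubic cells whose edge equals the cutoff radius, then checks only the 27 surrounding cells for each particle. The function names, the non-periodic box, and the sample coordinates are assumptions made for this example; it does not reproduce CoMD's data layout or its GPU thread mapping.

```python
from collections import defaultdict
from itertools import product


def build_cells(positions, cell_size):
    """Map each integer cell coordinate to the indices of particles inside it."""
    cells = defaultdict(list)
    for index, point in enumerate(positions):
        key = tuple(int(c // cell_size) for c in point)
        cells[key].append(index)
    return cells


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def neighbor_pairs(positions, cutoff):
    """Return all index pairs (i, j), i < j, closer than or equal to the cutoff.

    Assumes a non-periodic box; the cell edge is set equal to the cutoff so
    that every neighbor of a particle lies in its own or an adjacent cell.
    """
    cells = build_cells(positions, cutoff)
    cutoff_sq = cutoff * cutoff
    pairs = set()
    for key, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(k + o for k, o in zip(key, offset))
            # .get avoids creating empty cells while iterating over the dict.
            for i in members:
                for j in cells.get(other, ()):
                    if i < j and squared_distance(positions[i], positions[j]) <= cutoff_sq:
                        pairs.add((i, j))
    return pairs


# Hypothetical particle coordinates for demonstration.
particles = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.25, 0.0, 0.0), (5.0, 5.0, 5.0)]
found = neighbor_pairs(particles, cutoff=1.0)
```

The cell pass limits distance checks to nearby particles instead of all pairs. A neighbor-list approach would typically store the resulting pairs and reuse them over several time steps, trading memory for fewer searches.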