In this session we present an innovative systems biology application of GPU computing, as an alternative to molecular dynamics simulation for studying biochemical mechanisms inside cells. For the first time we are able to apply the Chemical Master Equation (CME) stochastic framework at large scale, determining both the probabilistic steady state and the transient dynamics of biochemical reaction networks. Our GPU implementation leverages the structure of the problem to optimize the sparse linear algebra routines needed by the stochastic model. As a result, we achieve an average 15.57x speedup over the optimized Intel MKL library running on a 64-core architecture.
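The abstract does not spell out the sparse routines involved; as a hedged illustration only, the sketch below advances the CME probability vector through one explicit-Euler step of dp/dt = A·p, with the sparse generator A in CSR format, using cuSPARSE for the matrix-vector product and cuBLAS for the vector update. All array names and the time step are placeholders, not the speakers' code.

```cuda
// Minimal sketch (not the speakers' implementation): one explicit-Euler step
// of the CME transient dynamics dp/dt = A*p, with the generator A in CSR.
// Assumes d_rowPtr/d_colInd/d_val/d_p/d_y are device arrays already filled.
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cublas_v2.h>

void cme_euler_step(cusparseHandle_t sp, cublasHandle_t bl,
                    int n, int nnz,
                    int* d_rowPtr, int* d_colInd, double* d_val,
                    double* d_p,   // probability vector, updated in place
                    double* d_y,   // scratch for A*p
                    double dt)
{
    cusparseSpMatDescr_t A;
    cusparseDnVecDescr_t p, y;
    cusparseCreateCsr(&A, n, n, nnz, d_rowPtr, d_colInd, d_val,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);
    cusparseCreateDnVec(&p, n, d_p, CUDA_R_64F);
    cusparseCreateDnVec(&y, n, d_y, CUDA_R_64F);

    double one = 1.0, zero = 0.0;
    size_t bufSize = 0; void* dBuf = NULL;
    cusparseSpMV_bufferSize(sp, CUSPARSE_OPERATION_NON_TRANSPOSE, &one, A, p,
                            &zero, y, CUDA_R_64F, CUSPARSE_SPMV_ALG_DEFAULT,
                            &bufSize);
    cudaMalloc(&dBuf, bufSize);

    // y = A * p
    cusparseSpMV(sp, CUSPARSE_OPERATION_NON_TRANSPOSE, &one, A, p,
                 &zero, y, CUDA_R_64F, CUSPARSE_SPMV_ALG_DEFAULT, dBuf);
    // p = p + dt * y
    cublasDaxpy(bl, n, &dt, d_y, 1, d_p, 1);

    cudaFree(dBuf);
    cusparseDestroySpMat(A);
    cusparseDestroyDnVec(p);
    cusparseDestroyDnVec(y);
}
```

A production solver would exploit the specific problem structure the session describes rather than the generic SpMV shown here.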
Analytical Ultracentrifugation is a technique used to measure attributes of a protein such as gross shape, sample heterogeneity, and size. By applying a centrifugal force to the sample and simultaneously measuring the resulting distribution, we can use first principles to derive relative molecule sizes. Learn how the solution to the resulting regularized least squares problem can be computed in real time with the Tesla K20.
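The session does not state which solver is used; one common GPU formulation, sketched below under that assumption, solves the Tikhonov-regularized normal equations (AᵀA + λI)x = Aᵀb with cuBLAS for the Gram matrix and cuSOLVER for the Cholesky solve. All names are placeholders.

```cuda
// Hedged sketch of a Tikhonov-regularized least-squares solve on the GPU:
// min ||Ax - b||^2 + lambda*||x||^2  =>  (A^T A + lambda*I) x = A^T b.
// A is m-by-n, column-major, already on the device; result overwrites d_x.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

__global__ void add_to_diag(double* G, int n, double lambda) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) G[i * (n + 1)] += lambda;   // diagonal of column-major G
}

void tikhonov_solve(cublasHandle_t bl, cusolverDnHandle_t sol,
                    int m, int n, double lambda,
                    const double* d_A, const double* d_b,
                    double* d_G,       // n*n scratch for the Gram matrix
                    double* d_x)       // output, length n
{
    double one = 1.0, zero = 0.0;
    // G = A^T A (lower triangle suffices for the Cholesky solve below)
    cublasDsyrk(bl, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T,
                n, m, &one, d_A, m, &zero, d_G, n);
    add_to_diag<<<(n + 255) / 256, 256>>>(d_G, n, lambda);
    // x = A^T b
    cublasDgemv(bl, CUBLAS_OP_T, m, n, &one, d_A, m, d_b, 1, &zero, d_x, 1);

    // Cholesky factorization and triangular solves
    int lwork = 0, *d_info; double* d_work;
    cudaMalloc(&d_info, sizeof(int));
    cusolverDnDpotrf_bufferSize(sol, CUBLAS_FILL_MODE_LOWER, n, d_G, n, &lwork);
    cudaMalloc(&d_work, lwork * sizeof(double));
    cusolverDnDpotrf(sol, CUBLAS_FILL_MODE_LOWER, n, d_G, n,
                     d_work, lwork, d_info);
    cusolverDnDpotrs(sol, CUBLAS_FILL_MODE_LOWER, n, 1, d_G, n, d_x, n, d_info);
    cudaFree(d_work); cudaFree(d_info);
}
```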
In this session, we will report efforts and experiences in developing high-performance parallel algorithms and codes on large-scale GPU clusters for the analysis of the large amounts of data generated by present-day high-throughput synchrotron light sources. Such analyses are used in the characterization of macromolecules and particle systems at micro/nano scales. The codes include multi-GPU accelerated implementations of X-ray scattering pattern simulation using Distorted Wave Born Approximation theory, and of structural fitting of such patterns through inverse modeling using a Reverse Monte Carlo simulation algorithm. These codes are designed to be architecture-aware, delivering high performance through dynamic selection of the best-performing computational parameter values, such as computation decomposition parameters and block sizes, for the GPU architecture in use. We will also discuss detailed performance analyses and code optimizations.
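As a hedged illustration of the dynamic parameter selection mentioned above (not the presenters' code), the sketch below times a stand-in kernel over candidate block sizes with CUDA events and keeps the fastest, which is the basic mechanism behind such architecture-aware tuning.

```cuda
// Illustrative block-size autotuner: benchmark candidate launch configurations
// and return the fastest. The kernel is a placeholder workload.
#include <cuda_runtime.h>

__global__ void work_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 0.5f;   // stand-in computation
}

int pick_best_block_size(float* d_x, int n) {
    int candidates[] = {64, 128, 256, 512, 1024};
    int best = 128; float bestMs = 1e30f;
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    for (int b : candidates) {
        int grid = (n + b - 1) / b;
        work_kernel<<<grid, b>>>(d_x, n);       // warm-up launch
        cudaEventRecord(t0);
        work_kernel<<<grid, b>>>(d_x, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms; cudaEventElapsedTime(&ms, t0, t1);
        if (ms < bestMs) { bestMs = ms; best = b; }
    }
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return best;
}
```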
Crystalline silicon is a fundamental material for the IT and green energy industries. Pyrolysis of silane to silicon, which deposits onto seeds in circulating fluidized bed reactors, may revolutionize its greener production. Unfortunately, its commercial application is limited by our poor understanding of the process's complicated hydrodynamics and reaction kinetics. A multiscale simulation of this process, from reactors down to reactions, is carried out using petascale CPU-GPU hybrid computing. Molecular dynamics simulations using Tersoff-family potentials are carried out for gaseous silane molecules and interfacial silicon atoms by multiple threads on multi-core CPUs, while the remaining silicon atoms are computed on GPUs with a fixed neighbor list, reaching sustained petaflops performance. Direct numerical simulation of the gas flow around suspended silicon powders is then carried out on GPUs, coupling the lattice Boltzmann method with an immersed moving boundary, while collisions among the powders are processed on CPUs with the discrete element method (DEM). One million solid particles in 2D and 100 thousand particles in 3D, with about 1 billion lattice nodes, are computed using up to 672 GPUs; this is by far the largest such simulation of gas-solid systems to date, and it enters, for the first time, the scale-independent range where intrinsic constitutive correlations can be obtained. The whole reactor is finally simulated on GPUs with coarse-grained DEM, while the Navier-Stokes equations are solved for the silane flow on coarse grids on CPUs. The simulation has revealed unprecedented details of the silicon production process that are most valuable for its scale-up and optimization.
Learn the techniques that Pacific Northwest National Laboratory (PNNL) computer scientists are applying to enhance the performance of scientific applications such as NWChem (quantum chemistry), STOMP (subsurface flow transport), and Paraflow (multiflow simulation) on large-scale GPU-accelerated clusters (e.g., ORNL Titan). This talk will discuss the approaches we are currently exploring to scale these scientific applications to tens of thousands of GPU-accelerated nodes, including domain-specific languages and auto-tuners for tensor contractions, library-based approaches, dynamic heterogeneous task-based runtimes, and compiler and run-time transformations for GPU code. We will provide initial results on the various approaches, comparing the performance obtained with code restructuring to pragma-based (e.g., OpenACC) and library-based approaches, which keep most of the legacy code intact while still providing considerable speedups.
The goal of this session is to present the advantages of mixing CUDA libraries and CUDA kernels to deliver a robust community package for materials science modeling that fully exploits multi-core systems equipped with GPUs. The Plane-Wave Self-Consistent Field (PWscf) code of the Quantum ESPRESSO suite is the focus of this work. During the session the main computation-dependent components, which also represent fundamental building blocks for many other quantum chemistry codes, will be discussed and analyzed. Subsequently, an in-depth performance assessment of several realistic scientific cases will be presented, ranging from single workstations to large clusters equipped with hundreds of GPUs.
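To make the library/kernel mix concrete, here is a hedged sketch (not PWscf source) of one such building block: applying a local potential to a plane-wave wavefunction by combining cuFFT transforms with a small custom kernel. Names and normalization conventions are illustrative.

```cuda
// Hedged sketch: the local-potential step V(r)*psi of a plane-wave code,
// mixing cuFFT (library) with a hand-written element-wise kernel.
#include <cuda_runtime.h>
#include <cufft.h>

__global__ void apply_vloc(cufftDoubleComplex* psi, const double* v,
                           double scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                        // multiply by V(r); fold in FFT scaling
        psi[i].x *= v[i] * scale;
        psi[i].y *= v[i] * scale;
    }
}

void vloc_times_psi(cufftHandle plan, cufftDoubleComplex* d_psi,
                    const double* d_v, int nx, int ny, int nz) {
    int n = nx * ny * nz;
    cufftExecZ2Z(plan, d_psi, d_psi, CUFFT_INVERSE);          // G -> r
    apply_vloc<<<(n + 255) / 256, 256>>>(d_psi, d_v, 1.0 / n, n);
    cufftExecZ2Z(plan, d_psi, d_psi, CUFFT_FORWARD);          // r -> G
}
// The plan would be created once: cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);
```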
This talk discusses the development of a Domain-Specific Language (DSL), together with the tools and related runtime, for efficiently generating tensor contractions (generalized matrix multiplications), an important part of many quantum chemistry methods (e.g., Coupled Cluster theory). Starting from a high-level description of the computation, the tool analyzes it and generates optimized C, OpenCL, or CUDA implementations. The runtime, which supports a task-based computation model, is then able to execute the generated code on GPU-accelerated heterogeneous large-scale clusters, maximizing the utilization of the processing elements and minimizing communication costs.
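As a concrete illustration of the "tensor contraction as generalized matrix multiplication" idea (the example contraction is ours, not from the talk): with Fortran-order storage, C[i,j,a,b] = Σ_k A[i,j,k]·B[k,a,b] flattens to a single GEMM, since (i,j) index the rows, (a,b) the columns, and k is contracted.

```cuda
// Sketch: a tensor contraction mapped onto one cuBLAS GEMM.
// A flattens to an (I*J) x K matrix, B to a K x (Adim*Bdim) matrix,
// both column-major, so C = A*B is the full contraction.
#include <cublas_v2.h>

void contract_ijk_kab(cublasHandle_t h,
                      int I, int J, int K, int Adim, int Bdim,
                      const double* dA,   // (I*J) x K, column-major
                      const double* dB,   // K x (Adim*Bdim), column-major
                      double* dC)         // (I*J) x (Adim*Bdim)
{
    int m = I * J, n = Adim * Bdim;
    double one = 1.0, zero = 0.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, K, &one, dA, m, dB, K, &zero, dC, m);
}
```

Contractions whose indices are not already contiguous require an index permutation first; generating and fusing those permutations efficiently is precisely the kind of work such a DSL and runtime automate.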
This session will detail the performance and capabilities of GPU-accelerated VASP, explain design decisions made in porting VASP to CUDA, and present a roadmap for GPU-accelerated VASP development. We have achieved performance improvements of up to roughly 20x on systems of around 100 ions and have implemented exact exchange. We are now working on ports of more conventional functionality.
In this session, we will present a series of works on density functional theory (DFT) plane-wave pseudopotential (PWP) calculations on GPU clusters. The GPU version is developed from a CPU DFT-PWP code, PEtot, which can handle ~1000 atoms on thousands of processors. Our tests indicate that the GPU version achieves a ~20x speedup over the CPU code. A detailed analysis of the speedup and of the scaling with the number of CPUs/GPUs (up to 256) will be presented. As far as we know, this is the first GPU DFT-PWP code scalable to a large number of CPUs/GPUs.
In this session we discuss the challenges encountered in developing quantum chemistry software for GPUs from scratch and in optimizing the kernels for best performance. We attempt to create a unified framework for automatic generation of efficient quantum chemistry codes tailored individually to various GPU (NVidia, ATI) and CPU architectures and programming languages (CUDA, OpenCL, C/C++), using a meta-programming approach based on a computer algebra system. We demonstrate its utility by generating highly optimized GPU and CPU kernels for various integrals over Gaussian basis functions implemented in the TeraChem quantum chemistry package.
GROMACS is a state-of-the-art molecular simulation package that employs extensive multi-level heterogeneous parallelization. Our new CUDA-based algorithms provide a 4x speedup over hand-tuned CPU SIMD assembly, and unprecedented absolute performance. However, the heterogeneity of the hardware and the inherent bottlenecks involved make efficient resource utilization and strong scaling very challenging. This advanced session describes our recent efforts on multi-level load balancing, kernel execution strategies, CPU-GPU work splitting, and ways to exploit Kepler features such as Hyper-Q. Join us to talk about the current limits of GPU acceleration in MD, and how to take molecular dynamics simulations to 100 microseconds per iteration, equivalent to 10,000 iterations per second, in the near future!
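For readers unfamiliar with Hyper-Q, the hedged sketch below (not GROMACS source) shows the basic pattern it enables: independent kernels launched into separate CUDA streams can execute concurrently on Kepler instead of serializing on a single hardware work queue.

```cuda
// Illustrative concurrency pattern: independent force contributions in
// separate streams. Kernel bodies are placeholders.
#include <cuda_runtime.h>

__global__ void nonbonded_kernel(float* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] += 1.0f;            // stand-in for the non-bonded work
}
__global__ void bonded_kernel(float* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] += 2.0f;            // stand-in for the bonded work
}

void launch_concurrent(float* d_fnb, float* d_fb, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // On Kepler, Hyper-Q lets these share the GPU rather than queue serially.
    nonbonded_kernel<<<(n + 127) / 128, 128, 0, s1>>>(d_fnb, n);
    bonded_kernel<<<(n + 127) / 128, 128, 0, s2>>>(d_fb, n);
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```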
The session will present implementation and optimization strategies for molecular dynamics many-body potentials. It will concentrate on the new SNAP potential for LAMMPS, which is based on the GAP bispectrum analysis of Bartok et al. [PRL 104, 136403 (2010)]. SNAP is fit to large amounts of quantum-based DFT data and is capable of reproducing the accuracy of DFT while still scaling linearly with system size. By exploiting multiple parallelization layers, it is possible to mitigate its high cost of 500,000 flops per interaction through excellent strong-scaling behavior down to 16 atoms per GPU. Thus the achievable time to solution on GPU clusters using SNAP is comparable to running simple Lennard-Jones simulations.
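As a hedged sketch of one such parallelization layer (a block per atom, with threads cooperating over the neighbor list), the kernel below uses a placeholder pair term; the real per-neighbor SNAP bispectrum evaluation is far more expensive, which is exactly why this decomposition pays off.

```cuda
// Illustrative many-body decomposition: one block per atom, threads stride
// over neighbors, shared-memory reduction of the per-atom force.
// Launch with a power-of-two block size <= 128, e.g. <<<nAtoms, 128>>>.
#include <cuda_runtime.h>

__global__ void manybody_force(const double3* pos, const int* neigh,
                               const int* numNeigh, int maxNeigh,
                               double3* force, int nAtoms) {
    __shared__ double3 partial[128];
    int i = blockIdx.x;
    if (i >= nAtoms) return;
    double3 f = {0.0, 0.0, 0.0};
    for (int jj = threadIdx.x; jj < numNeigh[i]; jj += blockDim.x) {
        int j = neigh[i * maxNeigh + jj];
        double dx = pos[j].x - pos[i].x;     // placeholder pair term; the
        double dy = pos[j].y - pos[i].y;     // real SNAP term costs ~500k flops
        double dz = pos[j].z - pos[i].z;
        double r2 = dx * dx + dy * dy + dz * dz + 1e-12;
        f.x += dx / r2; f.y += dy / r2; f.z += dz / r2;
    }
    partial[threadIdx.x] = f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s) {
            partial[threadIdx.x].x += partial[threadIdx.x + s].x;
            partial[threadIdx.x].y += partial[threadIdx.x + s].y;
            partial[threadIdx.x].z += partial[threadIdx.x + s].z;
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) force[i] = partial[0];
}
```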
This talk will present recent successes in the use of GPUs to accelerate interactive molecular visualization and analysis tasks on hardware platforms ranging from commodity desktop computers to the latest Cray XK7 supercomputers. The talk will focus on recent algorithmic developments and on the applicability and efficient use of new CUDA features on state-of-the-art Kepler GPUs. We will present the latest performance results for GPU-accelerated trajectory analysis runs on the Blue Waters Cray XK7 and other GPU-accelerated HPC platforms, and conclude with a discussion of ongoing work and future opportunities for GPU acceleration, particularly as applied to the analysis of petascale simulations of large biomolecular complexes and long simulation timescales.
This session will present recent results from Folding@home molecular dynamics simulations, discussing both schemes for parallelization across thousands to millions of GPUs and the impact these simulations have had on basic biophysics and biomedical science, with an emphasis on protein folding and Alzheimer's disease.
This session will demonstrate how GPUs were used to accelerate the primary computational bottleneck in explicitly quantum mechanical reactive molecular dynamics simulations in the open-source code LATTE. We focus on single- and multi-GPU implementations of a remarkably simple algorithm for computing the density matrix in electronic structure theory, based on a recursive series of generalized matrix-matrix multiplications. Using CUDA and cuBLAS resulted not only in significantly faster code, but also in density matrices with numerical errors smaller than those obtained from traditional CPU-based algorithms. Real-world applications and timings computed using GPU-accelerated LATTE will be presented.
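The abstract does not name the recursion; assuming an SP2-style purification (one well-known recursion of this form), a single iteration on the GPU reduces to one cuBLAS GEMM plus a trace test, as sketched below with placeholder names.

```cuda
// Hedged sketch of one SP2-style purification step: X2 = X*X, then pick the
// branch (X2 or 2X - X2) whose trace moves toward the electron count Ne.
#include <cmath>
#include <cuda_runtime.h>
#include <cublas_v2.h>

__global__ void trace_kernel(const double* X, int n, double* tr) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(tr, X[i * n + i]);   // double atomicAdd needs sm_60+
}

double device_trace(const double* d_X, int n, double* d_tr) {
    cudaMemset(d_tr, 0, sizeof(double));
    trace_kernel<<<(n + 255) / 256, 256>>>(d_X, n, d_tr);
    double tr; cudaMemcpy(&tr, d_tr, sizeof(double), cudaMemcpyDeviceToHost);
    return tr;
}

void sp2_step(cublasHandle_t h, int n, double Ne,
              double* d_X, double* d_X2, double* d_tr) {
    double one = 1.0, two = 2.0, minusOne = -1.0, zero = 0.0;
    // X2 = X * X  (the generalized matrix-matrix multiply at the core)
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_X, n, d_X, n, &zero, d_X2, n);
    double trX  = device_trace(d_X,  n, d_tr);
    double trX2 = device_trace(d_X2, n, d_tr);
    if (fabs(2.0 * trX - trX2 - Ne) < fabs(trX2 - Ne)) {
        // X = 2X - X2 (raises the trace toward Ne)
        cublasDgeam(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                    &two, d_X, n, &minusOne, d_X2, n, d_X, n);
    } else {
        // X = X2 (lowers the trace toward Ne)
        cublasDgeam(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                    &one, d_X2, n, &zero, d_X, n, d_X, n);
    }
}
```

Iterating this step drives the eigenvalues of X toward 0 or 1, so the recursion needs nothing beyond GEMM-dominated work, which is why it maps so well to GPUs.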
With the plethora of future applications of carbon nanotube materials rapidly being realized and exploited, we are pursuing fundamental studies of the structural, dynamic, and energetic properties of model single-walled carbon nanotubes in pure water and in aqueous solutions of the simple inorganic salts sodium chloride (NaCl) and sodium iodide (NaI). Our transformative research is supported and made possible by a hybrid combination of resources at Oak Ridge National Lab: the GPU cluster Keeneland for FEN ZI GPU molecular dynamics simulations used in mean-force calculations, and the data-intensive cluster Nautilus for the analysis of the GPU-computed potentials of mean force. In this talk we dive deep into the key aspects of CNT simulations on hybrid resources. Come learn some of the underlying challenges and the latest solutions devised to tackle both the algorithmic and scientific challenges of CNT simulations and their heterogeneous workflows with GPUs.
The goal of this session is to present the design and capabilities of GPU-accelerated GPAW, a density functional theory (DFT) code based on the grid-based projector-augmented wave method. It is suitable for large-scale electronic structure calculations and capable of scaling to thousands of cores. We'll discuss how we have accelerated the most computationally intensive components of the program with CUDA. We'll provide a detailed performance and scaling analysis of our multi-GPU-accelerated code, starting from small systems and going up to systems with a few thousand atoms running on large GPU clusters with over 200 GPUs. We've achieved speedups of up to 15x on large systems.
In 2008, NVIDIA demonstrated that CUDA-enabled GPUs accelerated molecular dynamics calculations by nearly three orders of magnitude compared to traditional CPUs, allowing a single GPU to achieve the performance of a supercomputer at this task. Since then, performance has improved by 1.5x to 2x per GPU generation. Despite these obvious benefits, there is still entrenched resistance to porting many existing codes to GPUs because of the work involved in doing so. However, with five years of performance data now in the rear-view mirror, it is clear not only that porting to GPUs now brings huge benefits, but also that failing to do so will simply mean doing it later, once many-core architectures become the standard. Finally, once you have ported your code to GPUs, the next logical step is to make it cloud-accessible, freeing your users from having to purchase any hardware whatsoever and allowing them to take advantage of exponentially improving performance.
Monte Carlo and molecular dynamics simulations are standard tools for analyzing the thermodynamic and statistical behavior of many-particle systems. The first computer experiment performed for the Manhattan Project was a simulation of 12 hard spheres using a Monte Carlo algorithm; now, massive parallelism enables routine simulations of millions of particles. In this talk, we describe our novel GPU Monte Carlo algorithm and compare it with HOOMD-blue, our open-source molecular dynamics code. Recent improvements to HOOMD-blue make parallel multi-GPU simulations possible on workstations and clusters. Applications include polymer dynamics, granular materials, non-equilibrium systems, and hard-particle self-assembly.
The highly parallel molecular dynamics code NAMD was chosen in 2006 as a target application for the NSF petascale supercomputer now known as Blue Waters. NAMD was also one of the first codes to run on a GPU cluster when G80 and CUDA were introduced in 2007. How do the GPU-accelerated Cray XK7 Blue Waters and ORNL Titan machines compare to CPU-based platforms on a hundred-million-atom Blue Waters acceptance test? Come learn the opportunities and pitfalls of taking GPU computing to the petascale, and the importance of CUDA 5 and Kepler features in combining multicore host processors and GPUs in a legacy message-driven application.
Learn how to perform molecular dynamics simulations reaching microsecond-per-day performance on GPUs, how to achieve impressive GPU acceleration of a code that was already extremely hand-tuned for x86 CPUs, and how we hope to take it even further in the future. GROMACS is one of the most widely used programs in the world for simulating biomolecular dynamics, and has long been accelerated for CPUs with hand-tuned assembly code. This session will cover our challenges and successes in achieving significantly higher absolute performance with CUDA in GROMACS compared to extremely tuned CPU code, both on low-end systems and on massively parallel supercomputers. Join us to learn about the overall architectural decisions and features of this heterogeneous multi-level parallelization, see examples of application performance, and participate in a discussion about how future molecular simulation needs to focus on efficient throughput and sampling to achieve scaling.
ROCS (Rapid Overlay of Chemical Structures) is a proprietary algorithm that helped build OpenEye into a pillar of molecular modeling software, thanks to ROCS being both very fast on the CPU and robust as a scientific model. Porting the algorithm to OpenCL achieved over a 100x speed improvement. What has been the effect after three years of experience on the market? Why was it then ported to CUDA? What is the true value of speed, and are there other ways to achieve it?
This session presents a haptic protein-ligand docking (HPLD) application developed in the Molecular Modelling Lab of the Cardiff School of Pharmacy. The talk describes in detail how GPUs enable the application to run with a fully flexible ligand and protein target. The first part of the talk describes the algorithm used to perform the MMFF94s force-field energy and force calculations, with performance benchmarks showing the speedup gained from the presented CUDA algorithms. The second part describes how asynchronous stream processing helped to provide smooth visual rendering as well as force feedback on the haptic device at a rate of 1000 Hz. The session closes by showing how flexible HPLD improves docking results during simulations.
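As a hedged sketch of the asynchronous pattern described above (all helper names are hypothetical), the loop below keeps the 1000 Hz haptic update on the host while force evaluation runs in its own CUDA stream, polling a CUDA event rather than blocking on the GPU.

```cuda
// Illustrative event-polling haptic loop. h_force should be pinned memory
// (cudaMallocHost) for truly asynchronous copies, and a real implementation
// would double-buffer h_force to avoid reading it mid-transfer.
#include <cuda_runtime.h>

extern void launch_force_kernel(double* d_force, cudaStream_t s); // hypothetical
extern void send_haptic_force(const double* h_force);             // hypothetical
extern void wait_for_1khz_tick();                                 // hypothetical

void haptic_loop(double* d_force, double* h_force, size_t bytes, bool* running) {
    cudaStream_t s; cudaStreamCreate(&s);
    cudaEvent_t done; cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    launch_force_kernel(d_force, s);
    cudaMemcpyAsync(h_force, d_force, bytes, cudaMemcpyDeviceToHost, s);
    cudaEventRecord(done, s);
    while (*running) {
        if (cudaEventQuery(done) == cudaSuccess) {   // newest result ready?
            launch_force_kernel(d_force, s);         // queue the next one
            cudaMemcpyAsync(h_force, d_force, bytes,
                            cudaMemcpyDeviceToHost, s);
            cudaEventRecord(done, s);
        }
        send_haptic_force(h_force);   // always feed the device at 1 kHz
        wait_for_1khz_tick();
    }
    cudaStreamDestroy(s); cudaEventDestroy(done);
}
```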
This talk will focus on the impact that GPUs have had on molecular dynamics (MD) simulations. In particular, it will highlight the massive performance improvements that GPUs have brought to MD simulations with AMBER. Kepler-based solutions can routinely provide simulation rates exceeding 100 ns/day on a single GPU in a single desktop, while replica exchange approaches to accelerating convergence enable hundreds of GPUs to be employed in parallel. The GPU revolution has transformed the MD landscape. No longer is access to supercomputer resources required to routinely reach microsecond timescales and beyond. The world of MD research is now flat, with all researchers, young and old, rich and poor, able to run simulations that were previously restricted to those privileged enough to have routine access to supercomputers. This has made it an exciting time for research involving molecular dynamics.
The distributed shared-memory implementation of the coupled-cluster singles and doubles with perturbative triples algorithm, CCSD(T), in the GAMESS chemistry package was ported to the GPU using the directive-based OpenACC standard. The focus of this port was to achieve maximum strong-scaling performance for small molecular systems.
Recent advances in reformulating electronic structure algorithms for stream processors such as graphics processing units have made DFT calculations on systems comprising up to O(10^3) atoms feasible. Simulations on such systems that previously required half a week on traditional processors can now be completed in only half an hour. Join Professor Heather Kulik, Massachusetts Institute of Technology, as she discusses how she leverages these GPU-accelerated quantum chemistry methods in the code TeraChem to investigate large-scale quantum mechanical features in applications ranging from protein structure to mechanochemical depolymerization. In each case, large-scale and rapid evaluation of electronic structure properties is critical for unearthing previously poorly understood properties and mechanistic features of these systems. Professor Kulik will also discuss outstanding challenges in the use of Gaussian localized-basis-set codes on GPUs pertaining to limitations in basis set size, and how she circumvents such challenges to computational efficiency with systematic, physics-based error corrections for basis set incompleteness.
In this session we'll discuss the implementation of SAS (Synthetic Aperture Sonar) processing software on the GPU, running in real time on board an Autonomous Underwater Vehicle (AUV). Current AUVs follow pre-planned survey routes and record all data for offline processing; they lack the flexibility to adapt to environmental conditions and sonar performance. With this new software we can increase the level of autonomy, allowing adaptive behaviors. We'll show the process of designing and implementing the software, as well as the first results of tests carried out with a real AUV.