GTC ON-DEMAND

AEC & Manufacturing
Abstract:
To manufacture a part using a conventional 3-axis CNC machining process, one must determine a set of machining orientations. This process-planning task is carried out manually by the machinist, considering decision parameters such as part visibility, machinability, machining depths, tool geometry, etc. We modeled this task as a linear optimization problem whose solution is a set of machining orientations. The solution methodology employs a greedy algorithm and a heuristic simulated annealing (SA) approach to obtain a globally optimal set of machining orientations automatically. The algorithms yielding the process-planning parameters are ported to a GPU (Tesla C2075).
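
As an illustration of the acceptance loop such an approach relies on, here is a minimal host-side sketch of simulated annealing over candidate orientations; the cost() function and the greedy seed are hypothetical stand-ins, since the poster's actual linear objective is not given.

```cpp
// Hedged sketch: simulated annealing over candidate machining orientations.
// cost() is a hypothetical stand-in for the poster's visibility/machinability
// objective; the real formulation is a linear optimization problem.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical cost of a set of orientations (lower is better): here we just
// penalize the number of setups plus a dummy per-orientation term.
static double cost(const std::vector<int>& chosen) {
    double c = 2.0 * chosen.size();          // setup-change penalty
    for (int o : chosen) c += 1.0 / (1 + o); // placeholder machinability term
    return c;
}

int main() {
    const int numCandidates = 16;            // candidate orientation indices
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, numCandidates - 1);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    std::vector<int> current = {0, 5, 9};    // greedy seed (assumed given)
    double curCost = cost(current);

    for (double T = 10.0; T > 1e-3; T *= 0.995) {    // geometric cooling
        std::vector<int> trial = current;
        trial[pick(rng) % trial.size()] = pick(rng); // perturb one orientation
        double dc = cost(trial) - curCost;
        if (dc < 0 || unif(rng) < std::exp(-dc / T)) { // Metropolis acceptance
            current = trial;
            curCost = cost(trial);
        }
    }
    std::printf("best cost %.3f with %zu orientations\n", curCost, current.size());
    return 0;
}
```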
 
Topics: AEC & Manufacturing
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4216
Artificial Intelligence and Deep Learning
Abstract:
Uncertainty in locomotion and sensing is one of the primary challenges in the robotics domain. GPUs are emerging as powerful new tools for uncertainty quantification through their ability to perform real-time Monte Carlo simulation as part of a closed-loop control system. By coupling GPU-based uncertainty propagation with optimal control laws, robotic vehicles can "hedge their bets" in unknown environments and protect themselves from unexpected disturbances. Examples of GPU-based stochastic controllers will be discussed for several robotic systems of interest, including simulated and experimental results demonstrating unique improvements in obstacle avoidance and accuracy. The theoretical concepts behind GPU-based control will be described, allowing application of these control laws to a wide array of robotic systems.
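
A minimal sketch of the per-sample propagation step such a controller relies on, assuming a simple unicycle model with Gaussian actuation noise (the speaker's actual dynamics and noise models are not given); each thread advances one Monte Carlo sample using the cuRAND device API.

```cuda
// Hedged sketch: propagate many noisy copies of a 2D robot state one control
// step ahead, one Monte Carlo sample per thread. Dynamics and noise scales
// are illustrative assumptions.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void propagate(float2* state, float* heading, int n,
                          float v, float w, float dt, unsigned long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState rng;
    curand_init(seed, i, 0, &rng);
    // Sample actuation noise (assumed Gaussian).
    float vn = v + 0.05f * curand_normal(&rng);
    float wn = w + 0.02f * curand_normal(&rng);
    float th = heading[i];
    state[i].x += vn * cosf(th) * dt;   // unicycle model update
    state[i].y += vn * sinf(th) * dt;
    heading[i]  = th + wn * dt;
}

int main() {
    const int n = 1 << 20;                      // one million samples
    float2* s;  float* th;
    cudaMalloc(&s, n * sizeof(float2));
    cudaMalloc(&th, n * sizeof(float));
    cudaMemset(s, 0, n * sizeof(float2));
    cudaMemset(th, 0, n * sizeof(float));
    propagate<<<(n + 255) / 256, 256>>>(s, th, n, 1.0f, 0.1f, 0.05f, 1234UL);
    cudaDeviceSynchronize();
    cudaFree(s); cudaFree(th);
    return 0;
}
```

The resulting particle cloud can then be reduced (e.g., collision probability per candidate control) to pick the safest action each cycle.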
 
Topics: Artificial Intelligence and Deep Learning
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4261
 
Abstract:
In this presentation we describe an efficient multi-level parallel implementation of the most-significant-bit (MSB) radix-sort-based multi-select algorithm for k-nearest-neighbor (k-NN) search. Our implementation processes multiple queries within a single kernel call, with each thread block/warp simultaneously processing different queries. Our approach is incremental and reduces memory transactions through the use of bit operators, warp voting functions, and shared memory. Benchmarks show significant improvement over a previous implementation of k-NN search on the GPU.
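
A minimal sketch of one bit plane of an MSB radix select using warp voting, one of the primitives the abstract mentions; the surviving-candidate bookkeeping and the multi-query layout are omitted, and all names are illustrative.

```cuda
// Hedged sketch: one warp counts, for one bit plane, how many surviving
// candidates have that bit set, via the warp vote __ballot_sync.
// n is assumed a multiple of 32 so the whole warp stays converged.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void radixSelectStep(const unsigned* keys, int n, int bit,
                                unsigned prefix, unsigned mask, int* onesCount) {
    int ones = 0;
    for (int i = threadIdx.x; i < n; i += 32) {
        unsigned v = keys[i];
        // Candidate survives if it matches the prefix fixed by higher bits.
        bool pred = ((v & mask) == prefix) && ((v >> bit) & 1u);
        unsigned votes = __ballot_sync(0xffffffffu, pred);
        if (threadIdx.x == 0) ones += __popc(votes);
    }
    if (threadIdx.x == 0) *onesCount = ones;
    // Host logic (not shown) compares *onesCount with k to decide whether the
    // k-th smallest key has this bit 0 or 1, then extends prefix/mask.
}

int main() {
    const int n = 1024;
    unsigned h[n];
    for (int i = 0; i < n; ++i) h[i] = (unsigned)(n - i);
    unsigned* d; int* dc;
    cudaMalloc(&d, sizeof(h)); cudaMalloc(&dc, sizeof(int));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    radixSelectStep<<<1, 32>>>(d, n, 31, 0u, 0u, dc);
    int ones; cudaMemcpy(&ones, dc, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("candidates with bit 31 set: %d\n", ones);
    cudaFree(d); cudaFree(dc);
    return 0;
}
```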
 
Topics: Artificial Intelligence and Deep Learning, Big Data Analytics
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4494
 
Abstract:
Random Forests have become an extremely popular machine learning algorithm for making predictions from large and complicated data sets. The highest-performing implementations of Random Forests currently all run on the CPU. We implemented a Random Forest learner for the GPU (using PyCUDA and runtime code generation) which outperforms the currently preferred libraries (scikit-learn and wiseRF). The "obvious" parallelization strategy (using one thread block per tree) results in poor performance. Instead, we developed a more nuanced collection of kernels to handle various trade-offs between the number of samples and the number of features.

Topics: Artificial Intelligence and Deep Learning, Big Data Analytics
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4525
 
Abstract:
The rise of the internet, especially the mobile internet, has accelerated the data explosion - a driving force for the great success of deep learning in recent years. Behind the scenes, heterogeneous high-performance computing is another key enabler of that success. In this talk, we will share some of the work we did at Baidu. We will highlight how big data, deep analytics, and high-performance heterogeneous computing can work together with great success.

Topics: Artificial Intelligence and Deep Learning, Big Data Analytics, HPC and Supercomputing, Video & Image Processing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4651
 
Abstract:
Speeding up machine learning algorithms has often meant tedious, bug-ridden programs tuned to specific architectures, all written by parallel-programming amateurs. But machine learning experts can leverage libraries such as cuBLAS to greatly ease the burden of development and make fast code widely available. We present a case study in parallelizing kernel Support Vector Machines, powerful machine-learned classifiers which are very slow to train on large data. In contrast to previous work, which relied on hand-coded exact methods, we demonstrate that a recent approximate method can be compelling for its remarkably simple implementation, portability, and unprecedented speedup on GPUs.
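
As a flavor of the BLAS-level building blocks the talk advocates, here is a hedged sketch that forms the linear-kernel Gram matrix G = X^T X with a single cublasSgemm call; an RBF kernel can then be derived from these inner products plus per-sample squared norms (omitted), and the talk's specific approximate method is not reproduced here.

```cuda
// Hedged sketch: Gram matrix for a kernel SVM via cuBLAS. X is stored
// column-major with one sample per column, as cuBLAS expects.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int d = 64, n = 256;               // feature dim, number of samples
    float *X, *G;
    cudaMalloc(&X, sizeof(float) * d * n);
    cudaMalloc(&G, sizeof(float) * n * n);
    cudaMemset(X, 0, sizeof(float) * d * n); // ... fill with training data ...

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    // G (n x n) = X^T (n x d) * X (d x n)
    cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_N, n, n, d,
                &alpha, X, d, X, d, &beta, G, n);
    cudaDeviceSynchronize();
    std::printf("Gram matrix computed (%d x %d)\n", n, n);
    cublasDestroy(h);
    cudaFree(X); cudaFree(G);
    return 0;
}
```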
 
Topics: Artificial Intelligence and Deep Learning, Big Data Analytics, Numerical Algorithms & Libraries
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4656
 
Abstract:
See how a cluster of GPUs has enabled our research group to train artificial neural networks with more than 10 billion connections. "Deep learning" algorithms, driven by bigger datasets and the ability to train larger networks, have led to advancements in diverse applications including computer vision, speech recognition, and natural language processing. After a brief introduction to deep learning, we will show how neural network training fits into our GPU computing environment and how this enables us to duplicate deep learning results that previously required thousands of CPU cores.

Topics: Artificial Intelligence and Deep Learning, Computer Vision, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4694
 
Abstract:
In this talk, we compare the implementation of deep learning networks [1] on traditional x86 processors with the implementation on NVIDIA Tesla K20 GPU accelerators for the purposes of training Restricted Boltzmann Machines [2] and for deep-network backpropagation in a large-vocabulary speech recognition task (automatic transcription of TED talks). Two GPU implementations are compared: (1) a high-level implementation using Theano [3] and (2) a native implementation using low-level CUDA BLAS libraries. We describe the scaling properties of these implementations in comparison to a baseline batched x86 implementation as a function of training data size. We also explore the development-time tradeoffs of each implementation.

Topics: Artificial Intelligence and Deep Learning, Performance Optimization, Defense
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4732
 
Abstract:
Machine learning is a powerful tool for processing large amounts of data. Learning to rank plays a key role in many information retrieval problems and constructs a ranking model from training data. Ensemble methods allow us to make a trade-off between the quality of the obtained model and the computational time of the learning process, and many such algorithms lend themselves to parallel processing of data. We describe the task of machine-learned ranking and consider the MatrixNet algorithm, based on decision-tree boosting. We present a GPU-optimized implementation of this method, which performs more than 20 times faster than the CPU-based version while retaining the same ranking quality.

Topics: Artificial Intelligence and Deep Learning
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4739
 
Abstract:
This talk will describe recent progress in object recognition using deep convolutional networks. Over the last 18 months, these have demonstrated significant gains over traditional computer vision approaches and are now widely used in industry (e.g., Google, Facebook, Microsoft, Baidu). Rob Fergus will outline how these models work and describe architectures that produce state-of-the-art results on the leading recognition benchmarks. GPUs are an essential component of training these models. The talk will conclude with a live demo.

Topics: Artificial Intelligence and Deep Learning, Computer Vision
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4753
 
Abstract:
Significant advances have recently been made in the fields of machine learning and image recognition, impacted greatly by the use of NVIDIA GPUs. Leading performance is harnessed from deep neural networks trained on millions of images to predict thousands of categories of objects. Our expertise at Clarifai in deep neural networks helped us achieve the world's best published image-labeling results [ImageNet 2013]. We use NVIDIA GPUs to train large neural networks within practical time constraints and are creating a developer API to enable the next generation of applications in a variety of fields. This talk will describe what these neural networks learn from natural images and how they can be applied to auto-tagging new images, searching large untagged photo collections, and detecting near-duplicates. A live demo of our state-of-the-art system will showcase these capabilities and allow audience interaction.

Topics: Artificial Intelligence and Deep Learning
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4959
 
Abstract:
Many modern 3D range sensors generate on the order of one million data points per second and form the foundation of many modern applications in robotic perception. For real-time performance, it is beneficial to leverage parallel hardware when possible. This poster details work to quickly compress a raw point cloud into a set of parametric surfaces using a GPU-accelerated form of Expectation Maximization. We find that our algorithm is over an order of magnitude faster than the serial C version, while the segmentation provides several orders of magnitude savings in memory while still preserving the geometric properties of the data.
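
A hedged sketch of the E-step pattern such a GPU EM implementation follows, assuming isotropic Gaussian components for brevity (the poster actually fits parametric surfaces, so the density below is only illustrative); one thread computes the responsibilities of one point.

```cuda
// Hedged sketch: EM E-step for a mixture of isotropic Gaussians over 3D
// points, one point per thread; the M-step (parameter updates) is omitted.
#include <cuda_runtime.h>
#include <math.h>

__global__ void eStep(const float3* pts, int n,
                      const float3* mu, const float* var, const float* weight,
                      int k, float* resp /* n x k, row-major */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 p = pts[i];
    float norm = 0.0f;
    for (int j = 0; j < k; ++j) {
        float dx = p.x - mu[j].x, dy = p.y - mu[j].y, dz = p.z - mu[j].z;
        float d2 = dx * dx + dy * dy + dz * dz;
        // Mixture weight times isotropic Gaussian density.
        float r = weight[j] * expf(-0.5f * d2 / var[j])
                  / powf(2.0f * 3.14159265f * var[j], 1.5f);
        resp[i * k + j] = r;
        norm += r;
    }
    for (int j = 0; j < k; ++j)          // normalize responsibilities
        resp[i * k + j] /= fmaxf(norm, 1e-30f);
}

int main() {
    const int n = 1024, k = 4;
    float3 *pts, *mu; float *var, *w, *resp;
    cudaMalloc(&pts, n * sizeof(float3));
    cudaMalloc(&mu, k * sizeof(float3));
    cudaMalloc(&var, k * sizeof(float));
    cudaMalloc(&w, k * sizeof(float));
    cudaMalloc(&resp, n * k * sizeof(float));
    // ... initialize points and mixture parameters (omitted) ...
    eStep<<<(n + 255) / 256, 256>>>(pts, n, mu, var, w, k, resp);
    cudaDeviceSynchronize();
    return 0;
}
```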
 
Topics: Artificial Intelligence and Deep Learning
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4274
 
Abstract:
Our goal was the development of efficient ensemble learning methods that use multiple models to obtain better predictive performance than could be obtained from any of the constituent models. We present the integration of CUDA with Python and R for accelerating the machine learning process on multiple GPUs. Python GPU acceleration is based on the PyCUDA and CUDA math packages used in machine learning environments such as Theano and CudaTree. R acceleration is based on integration of the PLASMA and MAGMA GPU-accelerated math libraries. In our presentation we integrate GPU-accelerated random forests and GPU-accelerated support vector machines in an ensemble learning method that runs on multiple GPUs simultaneously. An example of wrapper code for support vector machines is presented and some benchmark results are shown.

Topics: Artificial Intelligence and Deep Learning
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4277
 
Abstract:
MALDI imaging is a label-free bioanalytical technique which can capture the spatial distribution of hundreds of molecular compounds in a single measurement while maintaining the sample's molecular integrity. Development of statistical methods for MALDI involves preprocessing, segmentation, generating prototype molecular images using PCA, and spectra classification. The large amounts of data and complex machine learning algorithms call for GPU acceleration. In the EU-funded 3D Massomics project, SagivTech aims at developing a GPU-based library for the analysis of MALDI imaging.

Topics: Artificial Intelligence and Deep Learning
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4164
 
Abstract:
This approach aims at aligning, unifying, and expanding the set of sentiment lexicons which are available on the web in order to increase their robustness of coverage. Our USL approach computes the unified strength of polarity of each lexical entry based on the Pearson correlation coefficient, which measures how correlated lexical entries are with a value between 1 and -1 (1 indicates that the lexical entries are perfectly correlated, 0 indicates no correlation, and -1 means they are perfectly inversely correlated), and on the UnifiedMetrics procedure, implemented for both CPU and GPU.
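
For reference, a minimal host-side sketch of the Pearson correlation between the polarity scores of two lexicons' shared entries; the lexicon data shown are hypothetical and the UnifiedMetrics procedure itself is omitted.

```cpp
// Hedged sketch: Pearson correlation r over the shared entries of two
// sentiment lexicons; +1 = perfectly correlated, -1 = inversely correlated.
#include <cmath>
#include <cstdio>
#include <vector>

double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < n; ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double num = n * sxy - sx * sy;
    double den = std::sqrt(n * sxx - sx * sx) * std::sqrt(n * syy - sy * sy);
    return num / den;
}

int main() {
    // Polarity scores of the same five words in two hypothetical lexicons.
    std::vector<double> a = {0.8, -0.5, 0.3, -0.9, 0.6};
    std::vector<double> b = {0.7, -0.4, 0.2, -1.0, 0.5};
    std::printf("r = %.3f\n", pearson(a, b));
    return 0;
}
```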
 
Topics: Artificial Intelligence and Deep Learning
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4124
 
Abstract:
In this poster, we present preliminary results from our rewriting of the source code of NEURON, the de facto standard general neural network simulator in the field of computational neuroscience, using CUDA. In this study, our strategy was to solve, in parallel, the ordinary differential equations for mechanisms, which describe the electrical and chemical properties attached to small areas of the cell membrane. The rewriting was rather straightforward, and we achieved a speedup of up to 27% on a typical benchmark simulation. These results suggest that CUDA provides a simple means to accelerate general neural simulation.

Topics: Artificial Intelligence and Deep Learning
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4184
 
Abstract:
We introduce an algorithm for determining optimal transition paths between given configurations. The solution is obtained by solving variational equations for Freidlin-Wentzell action functionals. One application of the presented method is a system controlling motion and redeployment between unit formations. The efficiency of the algorithm has been evaluated in a simple sandbox environment implemented using NVIDIA CUDA technology.

Topics: Artificial Intelligence and Deep Learning
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4267
 
Abstract:
Multi-robot coalition formation is an NP-hard combinatorial optimization problem. This work models multi-robot coalition formation as a multi-objective optimization problem. Evolutionary approaches have been the preferred choice for finding the set of Pareto-optimal solutions for any multi-objective optimization problem, but due to their high computational complexity, these approaches fail to deliver results in time-critical scenarios such as coalition formation. This work introduces a novel parallel multi-point Pareto Archived Evolution Strategy (PAES) algorithm to solve the multi-robot coalition formation problem. The results show that the proposed algorithm is scalable and produces a better approximation of the Pareto-optimal set than the other approaches investigated.

Topics: Artificial Intelligence and Deep Learning
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4269
 
Abstract:
This work presents the implementation of parallel evaluation of coefficients for a navigation system, providing a solution in the navigation area of autonomous mobile robots. The idea is to make use of parallel computation as a tool to improve performance and accelerate the evaluation of the coefficients in the system.

Topics: Artificial Intelligence and Deep Learning
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4157
Astronomy & Astrophysics
Abstract:
We show that with today's largest supercomputers it is possible to follow the trajectories of billions of particles, computing a unique fingerprint of their dynamics. With the use of 18,000 GPUs we could compute a 'sky map' of the radiation emitted by individual electrons in a large-scale, turbulent plasma, providing unique insight into the relation between the plasma dynamics and observable radiation spectra.

Topics: Astronomy & Astrophysics, Computational Physics, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4139
 
Abstract:
We are implementing a fully GPU-based imager for radio interferometric imaging, targeting high-sensitivity, near-real-time imaging. A modern interferometric radio telescope generates many terabytes of data per observation, which need to be imaged in near-real time; imaging software running on conventional computers currently takes many orders of magnitude longer. In this presentation, we will briefly describe the algorithms and describe in more detail their adaptation for GPUs in particular and for heterogeneous computing in general. We will discuss the resulting run-time performance on the GPU using real data from existing radio telescopes. Tests with our current implementation show a speedup of up to 100x over the CPU implementation in the critical parts of processing, enabling us to reduce the memory footprint by replacing compute-and-cache with on-demand computing on the GPU. For scientific use cases requiring high-resolution, high-sensitivity imaging, such a GPU-based imager is an enabling technology.

Topics: Astronomy & Astrophysics, Big Data Analytics
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4223
 
Abstract:
A wide range of major astrophysical problems can be investigated by means of computational fluid dynamics methods, and performing numerical simulations of magnetohydrodynamics (MHD) flows using realistic setup parameters can be very challenging. We will first report on technical expertise gained in developing the Ramses-GPU code, designed for efficient use of large clusters of GPUs in solving MHD flows. We will illustrate how challenging, state-of-the-art, highly resolved simulations requiring hundreds of GPUs can provide new insights into real applications: (1) the study of the magneto-rotational instability and (2) high-Mach-number MHD turbulent flows.

Topics: Astronomy & Astrophysics, Computational Fluid Dynamics, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4274
 
Abstract:
In this session we demonstrate how we are able to leverage the massive parallelism of thousands of GPUs inside the Titan supercomputer to simulate the past and future of the Milky Way Galaxy on a star-by-star basis in less than 10 days. The audience will learn what it takes to parallelize an advanced hierarchical GPU tree-code to run efficiently on the Titan supercomputer. A gravitational N-body problem is by definition an all-to-all problem, and it is of utmost importance for scalability to hide data communication behind computation. This turned out to be a major challenge on Titan because Bonsai's GPU kernels are ~3x faster on Kepler than on Fermi, which reduced compute time and as a result hampered scalability. We solved this by redesigning the communication strategy to take full advantage of each node's 16 CPU cores while the GPUs were busy computing gravitational forces. This allowed Bonsai to scale to more than 8,192 GPUs.
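
A minimal sketch of the general overlap pattern described here - computing on one CUDA stream while staging data on another - not Bonsai's actual code; in Bonsai the staging side is MPI exchange driven by the CPU cores.

```cuda
// Hedged sketch: hide data movement behind computation with two streams.
#include <cuda_runtime.h>

__global__ void localForces(float4* bodies, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) bodies[i].w += 0.0f;  // placeholder for force accumulation
}

int main() {
    const int n = 1 << 16;
    float4 *dLocal, *dRemote, *hRemote;
    cudaMalloc(&dLocal, n * sizeof(float4));
    cudaMalloc(&dRemote, n * sizeof(float4));
    cudaMallocHost(&hRemote, n * sizeof(float4)); // pinned for async copies

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    // The kernel on the compute stream runs concurrently with the H2D copy
    // on the comm stream; in Bonsai the "copy" is an MPI particle exchange.
    localForces<<<(n + 255) / 256, 256, 0, compute>>>(dLocal, n);
    cudaMemcpyAsync(dRemote, hRemote, n * sizeof(float4),
                    cudaMemcpyHostToDevice, comm);

    cudaStreamSynchronize(comm);     // remote data now resident
    localForces<<<(n + 255) / 256, 256, 0, compute>>>(dRemote, n);
    cudaStreamSynchronize(compute);

    cudaStreamDestroy(compute); cudaStreamDestroy(comm);
    cudaFreeHost(hRemote); cudaFree(dLocal); cudaFree(dRemote);
    return 0;
}
```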
 
Topics: Astronomy & Astrophysics, Numerical Algorithms & Libraries, Computational Physics, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4347
 
Abstract:
The European Southern Observatory is leading the construction of the European Extremely Large Telescope (E-ELT), a 39m-diameter telescope, to provide Europe with the biggest eye on the Universe ever built, with first light foreseen in 2022. The E-ELT will be the first telescope that depends entirely, for routine operations, on adaptive optics (AO), an instrumental technique for the correction of dynamically evolving aberrations in an optical system, used on astronomical telescopes to compensate in real time for the effect of atmospheric turbulence. In this session, we will show how GPUs can provide the throughput required both to simulate at high frame rate and to drive in real time these AO systems, which provide tens of thousands of degrees of freedom activated several hundred times per second.

Topics: Astronomy & Astrophysics, Numerical Algorithms & Libraries, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4357
 
Abstract:
We present the work accomplished to enable the numerical code "RAMSES" on GPUs, in order to efficiently exploit hybrid accelerated HPC architectures. RAMSES is a code designed for the study of astrophysical problems on different scales (e.g., star formation, galaxy dynamics, the large-scale structure of the universe), treating at the same time various components (dark energy, dark matter, baryonic matter, photons) and including a variety of physical processes (gravity, magnetohydrodynamics, chemical reactions, star formation, supernova and AGN feedback, etc.). It is implemented in Fortran 90 and adopts the OpenACC paradigm to offload some of the most computationally demanding algorithms to the GPU. Two different strategies have been pursued for code refactoring, in order to explore complementary solutions and select the most effective approach. The resulting algorithms are presented together with the results of tests, benchmarks, and scientific use cases.

Topics: Astronomy & Astrophysics, Numerical Algorithms & Libraries, Computational Physics, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4365
 
Abstract:
New "telescopes" that directly observe the spacetime fluctuations from black holes will come online within the next few years, but the data they generate will be meaningless unless compared against banks of known signals. Creating these banks requires black hole mergers of many different masses, spins, and orbital eccentricities to be simulated. This is not yet feasible, since even a single simulation may take several months. GPU acceleration offers a theoretical speedup of 50x, but until now has been too laborious to attempt. This is no longer the case: using a combination of hand-coding in CUDA, calls to cuBLAS and cuSPARSE, and our own automatic porting routine "CodeWriter," we have successfully accelerated the C++-based "Spectral Einstein Code". I will discuss our porting strategy, the challenges we encountered, and the new science made possible by the GPU. This talk should be of particular interest to scientists working on GPU ports of their own codes.

Topics: Astronomy & Astrophysics, Programming Languages, Computational Physics, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4423
 
Abstract:
We present our experiences in designing, building, and deploying a massively parallel processing system for the LOFAR radio telescope using off-the-shelf hardware and software. After numerous hurdles, we created a high-throughput system based on CUDA, MPI, and OpenMP, running on multi-GPU, multi-socket servers and InfiniBand. These techniques have established niches; however, due to conflicting memory models and incompatible requirements and abstractions, the otherwise orthogonal techniques do not cooperate well within the same application. Using the project's timeline as a guide, we will answer the following questions: (1) What problems appear when combining these techniques? (2) How did we adjust both the hardware and the software to meet our requirements? (3) How did we robustly develop and deploy to both development boxes and a production cluster? And, most importantly, (4) how does the system perform?

Topics: Astronomy & Astrophysics, Programming Languages, Signal and Audio Processing, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4441
 
Abstract:
Is it worth cooling your GPUs, or should you run them hot? In this session, we discuss how operating temperature affects the computational performance of GPUs. Temperature-dependent leakage-current effects contribute significantly to power dissipation in nanometer-scale circuits; within GPUs this corresponds to decreased performance per watt. We use the CUDA-based xGPU code for radio astronomy to benchmark Fermi and Kepler GPUs while controlling the GPU die temperature, voltage, and clock speed. We report on trends and relate these measurements to physical leakage-current mechanisms.
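
For readers who want to reproduce the measurement side, a hedged sketch that reads die temperature and board power through NVML (clock and voltage control are vendor-tool territory and omitted here); link with -lnvidia-ml.

```cpp
// Hedged sketch: polling GPU die temperature and board power with NVML.
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    unsigned int tempC = 0, powerMw = 0;
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
    nvmlDeviceGetPowerUsage(dev, &powerMw);  // reported in milliwatts

    std::printf("GPU 0: %u C, %.1f W\n", tempC, powerMw / 1000.0);
    nvmlShutdown();
    return 0;
}
```

Sampling these two values while sweeping clocks gives exactly the performance-per-watt-versus-temperature curves the session discusses.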
 
Topics: Astronomy & Astrophysics, Clusters & GPU Management
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4484
 
Abstract:
How do you cross-correlate 10,000 signals 100 million times per second? This is an example of the type of compute-bound problem facing modern radio astronomy, which, paralleling the paradigm shift in computing architectures, has transitioned from monolithic single-dish telescopes to massive arrays of smaller antennas. In this session we will describe how general-purpose HPC installations can be used to scale a cross-correlation pipeline to petascale with all the flexibility of a purely software implementation. Optimisations we will discuss include tuning of the GPU cross-correlation kernel, maximising concurrency between compute and network operations, and minimising bandwidth bottlenecks in a streaming application. GPUs are already powering the world's biggest radio telescope arrays, and this work paves the way for entirely off-the-shelf correlators for the future exascale generation of instruments.
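
A reference sketch of the X-engine inner loop: one thread accumulates the complex visibility of one antenna pair for one frequency channel. Production correlators tile this in registers and shared memory, as the session discusses, so this shows only the computation being optimized.

```cuda
// Hedged sketch: brute-force cross-correlation of all antenna pairs.
#include <cuda_runtime.h>

__global__ void xEngine(const float2* sig /* nAnt x nTime */, int nAnt,
                        int nTime, float2* vis /* nAnt x nAnt */) {
    int a = blockIdx.x * blockDim.x + threadIdx.x;   // antenna i
    int b = blockIdx.y * blockDim.y + threadIdx.y;   // antenna j
    if (a >= nAnt || b >= nAnt || b < a) return;     // upper triangle only
    float2 acc = make_float2(0.f, 0.f);
    for (int t = 0; t < nTime; ++t) {
        float2 x = sig[a * nTime + t], y = sig[b * nTime + t];
        acc.x += x.x * y.x + x.y * y.y;   // x * conj(y), real part
        acc.y += x.y * y.x - x.x * y.y;   // imaginary part
    }
    vis[a * nAnt + b] = acc;
}

int main() {
    const int nAnt = 64, nTime = 1024;
    float2 *sig, *vis;
    cudaMalloc(&sig, nAnt * nTime * sizeof(float2));
    cudaMalloc(&vis, nAnt * nAnt * sizeof(float2));
    cudaMemset(sig, 0, nAnt * nTime * sizeof(float2));
    dim3 block(16, 16), grid((nAnt + 15) / 16, (nAnt + 15) / 16);
    xEngine<<<grid, block>>>(sig, nAnt, nTime, vis);
    cudaDeviceSynchronize();
    cudaFree(sig); cudaFree(vis);
    return 0;
}
```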
 
Topics: Astronomy & Astrophysics, Signal and Audio Processing, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4511
 
Abstract:
Radio frequency interference (RFI) is the primary enemy of sensitive multi-element radio instruments like the Giant Metrewave Radio Telescope (GMRT, India). Signals from radio receivers are corrupted by RFI from power lines, satellite signals, etc. Appearing as spikes and bursts in the raw voltage data, RFI shows up statistically as outliers in a Gaussian distribution. We present an approach to tackle the problem of RFI in real time using a robust scale estimator, the Median Absolute Deviation (MAD). Given the large data rate from each of the 30 antennas, sampled every 16 ns, the filter must work well within real-time limits. To accomplish this, the algorithm has been ported to GPUs to work within the GMRT pipeline. Presently, the RFI-rejection pipeline runs in real time on 0.3-0.7 second data chunks. The GMRT will soon be upgraded to work at 10 times the current data rate, and we are now improving the algorithm further so that the RFI-rejection pipeline is ready for the upgraded GMRT.
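
A minimal sketch of the MAD-based flagging step, assuming the chunk's median and MAD have already been estimated (the abstract does not detail the estimator or the replacement policy, so both are assumptions here).

```cuda
// Hedged sketch: flag samples more than `threshold` robust sigmas from the
// median; 1.4826 * MAD estimates sigma for Gaussian data.
#include <cuda_runtime.h>
#include <math.h>

__global__ void madFlag(const float* x, int n, float median, float mad,
                        float threshold, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sigma = 1.4826f * mad;
    bool rfi = fabsf(x[i] - median) > threshold * sigma;
    // Replace flagged samples with the median (one common policy; assumed).
    y[i] = rfi ? median : x[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    madFlag<<<(n + 255) / 256, 256>>>(x, n, 0.0f, 1.0f, 3.0f, y);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y);
    return 0;
}
```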
 
Topics: Astronomy & Astrophysics, Big Data Analytics, Signal and Audio Processing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4538
 
Abstract:
The history of particle physics is a history of particle detectors, namely developments of new detectors and data analysis tools. For recent experiments, the size of the data coming from particle detectors is huge, and therefore a reconstruction of particle trajectories using GPUs is worth implementing. LHEP Bern pioneered the use of GPUs in this field. Here, we show some applications of GPUs to the reconstruction of particle trajectories. This work is partially related to the talk S4372 - Does Antimatter Fall On The Earth? Measurement Of Antimatter Annihilation with GPU - and more generally to high-energy physics.

Topics: Astronomy & Astrophysics
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4228
 
Abstract:
Information about the period immediately after the Big Bang is lost in most metrics used to study the large-scale structure of the Universe. However, the cosmological three-point correlation function (3ptCF) applied to galaxy positions can provide information about this early time. The 3ptCF scales with the cube of the number of galaxies. Approximation functions can speed this up, but can introduce systematic errors that will be unacceptable in the coming era of large astronomical datasets. Previous work (Bard et al., 2013) established that the full calculation of the two-point correlation function on the GPU reduces computation time by up to a factor of 140 compared to the CPU. In this work we consider the implementation of the full 3ptCF on the GPU, which presents very different challenges both cosmologically and computationally.

Topics: Astronomy & Astrophysics
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4236
 
Abstract:
The general purpose of the project is to provide a volume rendering suite that utilizes graphics cards to interactively visualize large astrophysical data sets. We are working with the open-source packages PyCUDA and PyOpenGL to build interoperation between CUDA and the yt project, which has been optimized to handle various kinds of astrophysical data. The result is a robust tool that provides researchers with an interactive visualization of their data.

Topics: Astronomy & Astrophysics
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4201
 
Abstract:
Solar-physics observations and 3D radiative-MHD simulations of the Sun provide an enormous amount of data that is difficult to analyze. NASA's recently launched Interface Region Imaging Spectrograph (IRIS) provides a very large 4D dataset (2D space, time, and spectra) and enables us to study in great detail the dynamics of one of the most intriguing layers of the Sun, the chromosphere. Moreover, state-of-the-art 3D radiative-MHD simulations are needed to interpret these observations. This poster describes different GPU computing tools that help scientists analyze the immense observational and numerical-modeling data volumes of the Sun, and compare the two by creating synthetic observables from the simulations using GPUs.

Topics: Astronomy & Astrophysics
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4138
 
Abstract:
We present an easily extensible, open-source, GPU-accelerated tool for testing, comparing, and experimenting with multiple approaches to multiframe deconvolution. Currently we provide options for Gaussian, Richardson-Lucy, and damped Richardson-Lucy approaches, as well as wavelet filtering and robust-statistics weighting. Our tool yields an over 20x speedup over the CPU implementation, allowing interactive experimentation with parameters.

Topics: Astronomy & Astrophysics
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4285
 
Abstract:
We present direct astrophysical N-body simulations with up to a few million bodies using our parallel MPI/CUDA code on large GPU clusters in China, Ukraine, and Germany, with different kinds of GPU hardware and, in one case, a first preliminary test with the Intel Xeon Phi. Our clusters are directly linked under the Chinese Academy of Sciences' special GPU cluster program, in cooperation with ICCS (International Center for Computational Science). We reach about half of the peak Kepler K20 GPU performance with our production-ready phiGPU code in a real application scenario with individual hierarchical block time steps, high-order (4th, 6th, and 8th) Hermite integration schemes, and a realistic core-halo density structure of the modeled stellar systems. The code is mainly used to simulate star clusters and galactic nuclei with supermassive black holes, in which correlations between distant particles (two-body relaxation) cannot be neglected.
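
A minimal sketch of the underlying direct-summation force kernel with shared-memory tiling; phiGPU's Hermite schemes additionally accumulate the jerk and higher derivatives and use individual block time steps, all omitted here.

```cuda
// Hedged sketch: O(N^2) gravitational accelerations with tiling.
#include <cuda_runtime.h>

#define TILE 256

__global__ void forces(const float4* pos /* xyz + mass */, float4* acc, int n) {
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 a = make_float3(0.f, 0.f, 0.f);
    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();
        for (int k = 0; k < TILE && base + k < n; ++k) {
            float4 pj = tile[k];
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + 1e-4f; // softening
            float inv = rsqrtf(r2);
            float s = pj.w * inv * inv * inv;   // m / r^3
            a.x += s * dx; a.y += s * dy; a.z += s * dz;
        }
        __syncthreads();
    }
    if (i < n) acc[i] = make_float4(a.x, a.y, a.z, 0.f);
}

int main() {
    const int n = 4096;
    float4 *pos, *acc;
    cudaMalloc(&pos, n * sizeof(float4));
    cudaMalloc(&acc, n * sizeof(float4));
    cudaMemset(pos, 0, n * sizeof(float4));
    forces<<<(n + TILE - 1) / TILE, TILE>>>(pos, acc, n);
    cudaDeviceSynchronize();
    cudaFree(pos); cudaFree(acc);
    return 0;
}
```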
 
Topics: Astronomy & Astrophysics
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4270
 
Abstract:
Recent giant-impact models focus on producing a circumplanetary disk of the proper composition around Earth and defer to earlier works for the accretion of this disk into the Moon. The discontinuity between creating the circumplanetary disk and accreting the Moon is unnatural and lacks simplicity. Here we return to first principles and produce a highly parallelizable model that readily produces stable Earth-Moon systems from a single, continuous simulation. The resulting systems possess an iron-deficient, heterogeneously mixed Moon and an accurate axial tilt of the Earth. This project was made financially feasible by the use of modern GPUs.

Topics: Astronomy & Astrophysics
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4139
 
Abstract:
Investigations of the space plasma environment are necessary to space exploration. MHD simulation has been a powerful tool for modeling space plasmas, but it is computationally expensive. In this poster, large-scale global MHD simulations of the solar wind interacting with a planet's magnetosphere are presented. Simulation results for a 1350 x 900 x 900 domain of the space plasma environment around a planet were produced by our GPU-accelerated MHD simulation code, running on the GPU-rich supercomputer TSUBAME 2.5 using 324 K20X (Kepler) GPUs. Performance tests show our simulation code achieves 7.8 TFLOPS. Simulation results of the solar wind interacting with the Earth's magnetic field, and with dipole magnetic fields with a non-vertical magnetic pole, are presented.

Topics: Astronomy & Astrophysics
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4125
Autonomous Vehicles
Abstract:
An in-depth view of content creation using UI Composer, including the digital asset pipeline, animation, materials, development of state machines, and debugging.

Topics: Autonomous Vehicles, Debugging Tools & Techniques, Product & Building Design, In-Vehicle Infotainment (IVI) & Safety
Type: Tutorial
Event: GTC Silicon Valley
Year: 2014
Session ID: S4616
 
Abstract:
A continuation of Part 1, this is a hands-on, interactive demonstration of content creation using UI Composer. The audience will be guided through the steps to build a data-driven virtual automotive gauge. In order to actively participate in this session, attendees are asked to bring their own Windows laptop with UI Composer installed. UI Composer is available for free from http://uicomposer.nvidia.com/

Topics: Autonomous Vehicles, Debugging Tools & Techniques, Product & Building Design, In-Vehicle Infotainment (IVI) & Safety
Type: Tutorial
Event: GTC Silicon Valley
Year: 2014
Session ID: S4806
 
Abstract:
In this session we present a real-time simulation of electromagnetic wave propagation using OptiX GPU ray tracing. This simulation is used in virtual test drives to allow testing of advanced driver assistance systems that will be based on wireless car-to-car communication. Learn how ray tracing performance can be improved to achieve real-time simulation, and how the ray tracing results are post-processed to perform the electromagnetic calculations on the GPU using the Thrust library.

Topics: Autonomous Vehicles, Computational Physics, Rendering & Ray Tracing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4359
 
Abstract:
Discover how mobile GPUs enable modern driving features in a power-efficient and standardized way, by providing the fundamental building blocks of computer vision to the higher-level reasoning functions that enable the car to detect lanes, park automatically, avoid obstacles, etc. We explain the challenges of having to fit into a given time budget, and how low-level machine vision such as corner detection and feature tracking, and even more advanced functionality such as 3D surround reconstruction, is achieved in the context of the car's systems and its outside environment.

Topics: Autonomous Vehicles, Artificial Intelligence and Deep Learning, Computer Vision, Mobile Applications
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4412
 
Abstract:
People want not only cost-friendly, trouble-free, energy-efficient, but also safe cars. Today's technology provides advanced emergency braking systems that can detect pedestrians and automatically brake just before a collision becomes unavoidable. Our vision is that future advanced driver assistance systems will not just detect pedestrians but recognize how they behave and understand the level of danger, in order to avoid emergency situations. We claim deep convolutional neural networks (CNNs) are the right tools for these highly non-trivial tasks, and Tegra is the best partner. We demonstrate a real-time deep CNN using Tegra.

Topics: Autonomous Vehicles, Artificial Intelligence and Deep Learning, Computer Vision
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4621
 
Abstract:
Learn about two Delphi projects that are pushing the concept of a personalized in-vehicle experience. As drivers bring more of their personal content and personal style into the car, opportunities are emerging for car makers and platform providers to differentiate their offerings. We will explore the infotainment architecture of the future - enabling feature upgrades at the same rate as mobile devices. We will also explore how GPU technology enables "months-to-minutes" user interfaces and greater flexibility in end-user personalization.

Topics: Autonomous Vehicles, In-Vehicle Infotainment (IVI) & Safety
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4659
 
Abstract:
In this session, we will present the contents of the Vision Toolkit, discuss its performance advantages, and demonstrate real-time applications enabled by this library. The Vision Toolkit is an NVIDIA product designed to enable real-life computer vision applications. It leverages state-of-the-art computer vision research and offers a variety of functions to developers, initially targeting advanced driver assistance systems (ADAS) and augmented reality (AR) applications. The toolkit is highly GPU-accelerated on mobile platforms, offering significant speedups and reducing the engineering effort required to design real-time vision applications. The toolkit includes open-source samples and offers a flexible framework that enables users to extend and contribute new functionality. It will be deployed on different operating systems, including Android and Linux on ARM, to registered developers and partners through NVIDIA's web site.

Topics: Autonomous Vehicles, Computational Photography, Computer Vision, Mobile Summit
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4714
 
Abstract:
With recent advances in low-cost, high-performance LiDARs (laser-based Light Detection and Ranging sensors) and GPUs, ultra-accurate GPS-free navigation based on SLAM (Simultaneous Localization and Mapping) is becoming a reality. Learn how the latest 360° field-of-view, long-range 3D mapping LiDARs, capable of generating data streams at gigasample-per-second (GSPS) sampling rates, are used with 192-CUDA-core GPUs based on the Kepler architecture to run artificial intelligence software and deliver advanced vehicular safety and navigation systems capable of real-time object detection, tracking, identification, and classification, as well as offline, full-availability, jam-proof, centimeter-accurate navigation.

Topics: Autonomous Vehicles, Artificial Intelligence and Deep Learning, Combined Simulation & Real-Time Visualization, In-Vehicle Infotainment (IVI) & Safety
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4761
 
Abstract:
The Tegra K1 is a powerful SOC that will be leveraged across many industries. It is based on the same Kepler architecture as the world's fastest gaming systems and most efficient supercomputers, and it brings supercomputing power to mobile and embedded platforms. Jesse Clayton from NVIDIA will articulate the embedded development process for Tegra K1. The talk will cover the platform, programming paradigm, and development tools, and provide details on the Tegra K1 architecture relevant to embedded applications.

Topics: Autonomous Vehicles, Artificial Intelligence and Deep Learning, Defense, Computer Vision
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4938
 
Abstract:
What does it mean to bring supercomputing into the car? Examples of piloted parking systems show what that means for customers as well as for developers: Audi's way into piloted driving for the 21st century.

Topics: Autonomous Vehicles, Video & Image Processing
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4961
 
Abstract:
Fast on-road object detection is an important feature of advanced driver assistance systems (ADAS). We propose a CUDA implementation of a soft cascade detector that allows real-time object detection on the Tegra K1 platform, applicable to pedestrian and vehicle detection.

Topics: Autonomous Vehicles
Type: Poster
Event: GTC Silicon Valley
Year: 2014
Session ID: P4289
Big Data Analytics
Abstract:
Numerous distributed processing models have emerged, driven by (1) the growth in volumes of available data and (2) the need for precise and rapid analytics. The most famous representative of this category is undoubtedly MapReduce; however, other, more flexible models exist based on the DFG processing model. None of the existing frameworks, however, has considered the case where the individual processing nodes are equipped with GPUs to accelerate parallel computations. In this talk, we discuss this challenge and the implications of the presence of GPUs on some of the processing nodes for the DFG model representation of such heterogeneous jobs and for the scheduling of the jobs, with big data mining as the principal use case.

Topics: Big Data Analytics
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4169
 
Abstract:
The large-scale dense-subgraph detection problem has been an active research area for decades, with numerous applications in the web and bioinformatics domains, and numerous algorithms have been designed to tackle this graph kernel. Due to computational limitations, traditional approaches are infeasible when dealing with large-scale graphs with millions or billions of vertices. In this presentation, we propose a GPU-accelerated dense-subgraph detection algorithm to solve the large-scale problem. It successfully maps the irregular graph clustering problem onto the GPGPU platform, and extensive experimental results demonstrate strong scalability on GPU computing platforms.

Topics: Big Data Analytics, Genomics & Bioinformatics
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4215
 
Abstract:
This session will present the Red Fox system. Attendees will leave understanding GPU performance when executing relational queries over large data sets, as typically found in data warehousing applications, and the automatic compilation flow of kernel fusion, which can be applied to other applications.

Topics: Big Data Analytics, Programming Languages
Type: Talk
Event: GTC Silicon Valley
Year: 2014
Session ID: S4222
 
Abstract:
Histograms are an important statistical tool with a wide variety of applications, especially in image processing. Naive CUDA implementations suffer from low performance on degenerate input data due to contention. This presentation will show how to use "privatized" (per-thread) histograms to balance performance of the average case against data-dependent performance of degenerate cases.
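Privatization keeps atomic contention local: each block (or thread) accumulates into its own copy of the histogram, and the copies are merged at the end. A minimal sketch of the per-block variant, assuming 256 bins and unsigned-byte input (the talk's per-thread scheme takes the idea further):

// Each block accumulates into a private shared-memory histogram, so
// degenerate inputs only contend within one block, not across the grid.
__global__ void histogram256(const unsigned char *in, int n, unsigned int *out)
{
    __shared__ unsigned int priv[256];
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        priv[b] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&priv[in[i]], 1u);      // contention stays on-chip
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&out[b], priv[b]);      // merge private copies into the result
}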
 
Topics:
Big Data Analytics, Video & Image Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4249
Streaming:
Share:
 
Abstract:
In high-speed networks, network traffic monitoring and analysis applications may require enormous raw compute power and high I/O throughputs, especially when traffic scrutiny on a per-packet basis is needed. Under those conditions, the applications face tremendous performance and scalability challenges. The GPU architecture fits well with the features of packet-based network monitoring and analysis applications. At Fermilab, we have prototyped a GPU-assisted network traffic monitoring & analysis system, which analyzes network traffic on a per-packet basis. We implemented a GPU-accelerated library for network traffic capturing, monitoring, and analysis. The library consists of various CUDA kernels, which can be combined in various ways to perform monitoring and analysis tasks. In this talk, we will describe our architectural approach in developing a generic GPU-assisted network traffic monitoring and analysis capability. Multiple examples will be given to demonstrate how to use GPUs to analyze network traffic.
 
Topics:
Big Data Analytics, Numerical Algorithms & Libraries, Computational Physics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4320
Streaming:
Download:
Share:
 
Abstract:
This work reports on a power and performance analysis of large-scale graph processing on hybrid (i.e., CPU and GPU), single-node systems. While graph processing on these systems can be accelerated by mapping the graph layout such that the algorithmic tasks exercise the processing units where they perform best, GPUs have a much higher TDP, so their impact on overall energy consumption is unclear. An evaluation on large real-world graphs, as well as on synthetic graphs with as many as 1 billion vertices and 16 billion edges, shows that efficiency, in terms of both performance and power, can be achieved.
 
Topics:
Big Data Analytics, Seismic & Geosciences
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4338
Streaming:
Share:
 
Abstract:
Learn how GPUs can speed up real-time calculation of advanced multidimensional data filters required in data analytics and business intelligence applications. We present the design of a massively parallel "quantification" algorithm which, given a set of dimensional elements, returns all those elements for which ANY (or ALL) numeric cells in the respective slice of a user-defined subcube satisfy a given condition. Such filters are especially useful for the exploration of big data spaces, for zero-suppression in large views, or for top-k analyses. In addition to the main algorithmic aspects, attendees will see how our implementation solves challenges such as economic utilization of the CUDA memory hierarchy or minimization of threading conflicts in parallel hashing.
 
Topics:
Big Data Analytics, Finance
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4395
Streaming:
Download:
Share:
 
Abstract:
We present Rhythm, a framework for high throughput servers that exploits similarity across web service requests to improve server throughput and energy efficiency. Present work in data center efficiency primarily focuses on scale-out, with off-the-shelf hardware used for individual machines, leading to inefficient usage of energy and area. Rhythm improves upon this by harnessing data parallel hardware to execute "cohorts" of web service requests, grouping requests together based on similar control flow and using intelligent data layout optimizations. An evaluation of the SPECWeb Banking workload for future server platforms on the GTX Titan achieves 4x the throughput (reqs/sec) of a Core i7 at efficiencies (reqs/Joule) comparable to a dual-core ARM Cortex A9.
 
Topics:
Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4447
Streaming:
Download:
Share:
 
Abstract:
Given the high cost of enterprise data storage, compression is becoming a major concern for the industry in the age of Big Data. Attendees can learn how to efficiently offload data compression to the GPU, leveraging its superior memory and compute resources. We focus on the DEFLATE algorithm, a combination of the LZSS and Huffman entropy coding algorithms, used in common compression formats like gzip. Both algorithms are inherently serial, and trivial parallelization methods are inefficient. We show how to parallelize these algorithms efficiently on GPUs and discuss trade-offs between compression ratio and increased parallelism to improve performance. We conclude our presentation with a head-to-head comparison to a multi-core CPU implementation, demonstrating up to half an order of magnitude of performance improvement using a single Kepler GPU. This is joint work with IBM researchers Rene Mueller and Tim Kaldewey.
 
Topics:
Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4459
Streaming:
Download:
Share:
 
Abstract:
Regular expression based pattern matching is a key enabling technology for a new generation of big data analytics. We'll describe several key use cases that require high throughput, low latency, regular expression pattern matching. A new GPU-based regular expression technology will be introduced, and its basic performance characteristics will be presented. We'll demonstrate that the GPU enables impressive performance gains in pattern matching tasks and compare its performance against latest-generation processors. Finally, we'll examine the key challenges in using such accelerators in large software products and highlight open problems in GPU implementation of pattern matching tasks.
 
Topics:
Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4462
Streaming:
Share:
 
Abstract:
Performing analytics on data stored in Hadoop can be time consuming. While Hadoop is great at ingesting and storing data, getting timely insight out of the data can be difficult, which reduces effectiveness and time-to-action. The use of NVIDIA GPUs to accelerate analytics on Hadoop is an optimal solution that drives high price-to-performance benefits. In this session, we'll demonstrate a solution using NVIDIA GPUs for the analysis of big data in Hadoop. The demo will show how you can leverage the Hadoop file system, its MapReduce architecture, and GPUs to run computationally intense models, bringing together both data and computational parallelism. Methods demonstrated will include classification techniques such as decision trees, logistic regression, and support vector machines, and clustering techniques like k-means, fuzzy k-means, and hierarchical k-means on marketing, social, and digital media data.
 
Topics:
Big Data Analytics, Genomics & Bioinformatics, Finance
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4471
Streaming:
Share:
 
Abstract:
This session will describe Recursive Interaction Probability (RIP) and why it is a pretty cool algorithm. Time will be spent on benchmark analysis against other algorithms as well as performance within an operational database. The presentation will end with how RIP was implemented on an NVIDIA Kepler K20c, the design choices made, and how these affect performance. Use cases that play to the strengths of RIP, as well as use cases that reveal its weaknesses, will also be shared.
 
Topics:
Big Data Analytics, Numerical Algorithms & Libraries, Clusters & GPU Management
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4483
Streaming:
Share:
 
Abstract:
An index of web documents provides the basis for search and decision making. Traditionally, GPUs are used to run applications that have a lot of parallelism and a small degree of divergence. We show that GPUs are also able to outperform CPUs for an application that has a large degree of parallelism but medium divergence. Specifically, we concentrate on the text processing used to index web documents. We present indexing algorithms for both GPU and CPU and show that the GPU outperforms the CPU on two common workloads. We argue that a medium-sized GPU-enabled cluster would be able to index all internet documents in one day. Indexing of web documents on the GPU opens a new area for GPU computing. Companies that provide search services spend a lot of cycles on indexing. Faster and more energy efficient indexing on the GPU may provide a valuable alternative to the CPU-only clusters used today.
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4506
Streaming:
Download:
Share:
 
Abstract:
This presentation will cover techniques for implementing hashing functions on the GPU. We will describe various parallel implementations of hashing techniques, e.g., cuckoo hashing, partitioned hashing, Bin-Hash, Bloom filters, etc., and then present different ways of implementing these functions on the GPU, with emphasis on data structures that exploit the GPU's data-parallel features as well as its memory constraints.
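As one concrete example of the techniques named above, GPU cuckoo hashing (in the style of Alcantara et al.) resolves collisions by evicting the current occupant of a slot with an atomic exchange. The constants and hash functions below are illustrative assumptions, not the speakers' code:

#define TABLE_SIZE 1000003u
#define EMPTY      0xFFFFFFFFu
#define MAX_ITER   32

__device__ unsigned int h1(unsigned int k) { return (k * 2654435761u) % TABLE_SIZE; }
__device__ unsigned int h2(unsigned int k) { return (k * 40503u + 2531011u) % TABLE_SIZE; }

__global__ void cuckooInsert(const unsigned int *keys, int n, unsigned int *table)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int key  = keys[i];
    unsigned int slot = h1(key);
    for (int it = 0; it < MAX_ITER; ++it) {
        key = atomicExch(&table[slot], key);  // claim the slot, evicting its occupant
        if (key == EMPTY) return;             // slot was free: insertion done
        // the evicted key moves to its alternate location
        slot = (h1(key) == slot) ? h2(key) : h1(key);
    }
    // too many evictions: a full implementation would rebuild with new hash functions
}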
 
Topics:
Big Data Analytics, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4507
Streaming:
Download:
Share:
 
Abstract:
Learn how to process large program inputs at shared-memory speeds on the example of a 2-opt TSP solver. Our implementation employs interesting code optimizations such as biasing results to avoid computation, inverting loops to enable coalescing and tiling, introducing non-determinism to avoid synchronization, and parallelizing each operation rather than across operations to minimize thread divergence and drastically lower the latency of result production. The final code evaluates 68.8 billion moves per second on a single Titan GPU.
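The core of any 2-opt evaluator is an O(1) gain computation per candidate move. A minimal sketch, assuming Euclidean city coordinates and one thread per (i, j) pair; the talk's coalescing, tiling, and non-determinism tricks are not shown:

__device__ float dist(const float *x, const float *y, int a, int b)
{
    float dx = x[a] - x[b], dy = y[a] - y[b];
    return sqrtf(dx * dx + dy * dy);
}

// One thread scores one 2-opt move: reverse the tour segment [i, j].
__global__ void eval2opt(const float *x, const float *y, const int *tour,
                         int n, float *delta)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (i >= j || j >= n - 1) return;
    int a = tour[i - 1], b = tour[i], c = tour[j], d = tour[j + 1];
    // gain from replacing edges (a,b) and (c,d) with (a,c) and (b,d)
    delta[i * n + j] = dist(x, y, a, c) + dist(x, y, b, d)
                     - dist(x, y, a, b) - dist(x, y, c, d);
}

A reduction over delta then picks the best (most negative) move to apply.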
 
Topics:
Big Data Analytics, Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4534
Streaming:
Download:
Share:
 
Abstract:
Learn about the data-parallel programming model in EAVL and how it can be used to write efficient mesh-based algorithms for multi-core and many-core devices. EAVL, the Extreme-scale Analysis and Visualization Library, contains a flexible scientific data model and targets future high performance computing ecosystems. This talk shows how a productive programming API built upon an efficient data model can help algorithm developers achieve high performance with little code. Discussions will include examples and lessons learned.
 
Topics:
Big Data Analytics, Scientific Visualization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4553
Streaming:
Download:
Share:
 
Abstract:
Current application of GPU processors to parallel computing tasks shows excellent results in terms of speed-ups compared to CPU processors. However, no existing middleware framework enables automatic distribution of data and processing across heterogeneous computing resources for structured and unstructured Big Data applications. Thus, we propose a middleware framework for 'Big Data' analytics that provides mechanisms for automatic data segmentation, distribution, execution, and information retrieval across multiple cards (CPU & GPU) and machines, a modular design for easy addition of new GPU kernels at both the analytic and processing layers, and information presentation. The architecture and components of the framework, such as multi-card data distribution and execution, data structures for efficient memory access, algorithms for parallel GPU computation, and results for various test configurations, are shown. Our results show the proposed middleware framework provides an alternative, cheaper HPC solution to users.
 
Topics:
Big Data Analytics, Finance, Video & Image Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4583
Streaming:
Download:
Share:
 
Abstract:
Our objective is to design a high-level data-parallel language extension to Python on GPUs. This language extension cooperates with the CPython implementation and uses Python syntax for describing data-parallel computations. The combination of rich library support and language simplicity makes Python ideal for subject matter experts to rapidly develop powerful applications. Python enables fast turnaround time and flexibility for custom analytic pipelines to react to immediate demands. However, CPython has been criticized as being slow, and the existence of the global interpreter lock (GIL) makes it difficult to take advantage of parallel hardware. To solve this problem, Continuum Analytics has developed LLVM-based JIT compilers for CPython. Numba is the open-source JIT compiler. NumbaPro is the proprietary compiler that adds CUDA GPU support. We aim to extend and improve the current GPU support in NumbaPro to further increase the scalability and portability of Python-based GPU programming.
 
Topics:
Big Data Analytics, Programming Languages, Large Scale Data Analytics, Defense
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4608
Streaming:
Download:
Share:
 
Abstract:
Gunrock is a CUDA library for graph primitives that refactors, integrates, and generalizes best-of-class GPU implementations of breadth-first search, connected components, and betweenness centrality into a unified code base useful for future development of high-performance GPU graph primitives. The talk will share experience on how to design the framework and APIs for computing efficient graph primitives on GPUs. We will focus on the following two aspects: 1) Details of the implementations of several graph algorithms on GPUs. 2) How to abstract these graph algorithms using general operators and functors on GPUs to improve programmer productivity.
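The operator/functor split described above can be pictured as a generic edge "advance" parameterized by a per-edge condition. The interface below is a sketch of the idea under assumed names, not Gunrock's actual API:

// Per-edge work for BFS: label an unvisited destination and admit it to the frontier.
struct BFSFunctor {
    __device__ static bool apply(int dst, int *labels, int iter) {
        return atomicCAS(&labels[dst], -1, iter) == -1;
    }
};

// Generic advance: expand every edge leaving the current frontier (CSR layout).
template <typename Functor>
__global__ void advance(const int *frontier, int frontierSize,
                        const int *rowPtr, const int *colIdx,
                        int *labels, int *nextFrontier, int *nextSize, int iter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;
    int src = frontier[i];
    for (int e = rowPtr[src]; e < rowPtr[src + 1]; ++e) {
        int dst = colIdx[e];
        if (Functor::apply(dst, labels, iter))
            nextFrontier[atomicAdd(nextSize, 1)] = dst;  // enqueue for the next pass
    }
}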
 
Topics:
Big Data Analytics, Large Scale Data Analytics, Defense
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4609
Streaming:
Download:
Share:
 
Abstract:
We demonstrate how describing graph algorithms using the Gather-Apply-Scatter (GAS) approach of GraphLab allows us to implement a general purpose and extremely fast GPU-based framework for describing and running graph algorithms. Most algorithms and graphs demonstrate a large speedup over GraphLab. We show that speedup is possible when using multiple GPUs within a box and that processing of large graphs is possible - with the latest Tesla cards, over 48GB of GPU memory can be available within a single box. Example algorithms will include PageRank, BFS, and SSSP. The precursor to this work serves as the basis for other attempts at a GPU-based GAS framework.
 
Topics:
Big Data Analytics, Performance Optimization, Large Scale Data Analytics, Defense
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4611
Streaming:
Share:
 
Abstract:
We demonstrate how describing graph algorithms using the Gather-Apply-Scatter (GAS) approach of GraphLab allows us to implement a general purpose and extremely fast GPU-based framework for describing and running graph algorithms. Most algorithms and graphs demonstrate a large speedup over GraphLab. We show that speedup is possible when using multiple GPUs within a box and that processing of large graphs is possible - with the latest Tesla cards, over 48GB of GPU memory can be available within a single box. Example algorithms will include PageRank, BFS, and SSSP. The precursor to this work serves as the basis for other attempts at a GPU-based GAS framework.
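In GAS terms, the gather phase is simply a per-vertex reduction over incoming edges. A minimal sketch for a PageRank-style accumulation over a CSR graph, with one thread per vertex (real frameworks balance load far more carefully):

__global__ void gather(const int *rowPtr, const int *colIdx,
                       const float *srcValue, float *accum, int numVerts)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVerts) return;
    float sum = 0.0f;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
        sum += srcValue[colIdx[e]];   // gather contributions from in-neighbors
    accum[v] = sum;                   // the apply phase then combines this with the old value
}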
 
Topics:
Big Data Analytics, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4612
Streaming:
Share:
 
Abstract:
The goal of this session is to demonstrate how our high-level abstraction enables developers to quickly develop high performance graph analytics programs on GPUs, with up to 3 billion edges traversed per second on a Tesla or Kepler GPU. High performance graph analytics are critical for a large range of application domains. The SIMT architecture of GPUs and the irregular nature of graphs make it difficult to develop efficient graph analytics programs. In this session, we present an open source library that provides a high-level abstraction for efficient graph analytics with minimal coding effort. We use several specific examples to show how to use our abstraction to implement efficient graph analytics in a matter of hours.
 
Topics:
Big Data Analytics, Large Scale Data Analytics, Defense
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4617
Streaming:
Download:
Share:
 
Abstract:
We will provide an in-depth analysis of our in-production, GPU-based technology for Big Data analytics, highlighting how our database benefits telco companies. We will do this by explaining the key features of our technology, noting that our database provides close to real-time analytics and up to 100X faster insights, all in a very cost-effective manner. We will elaborate on these features and more in order to provide a clear understanding of how our technology works and why it is beneficial for telco companies.
 
Topics:
Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4644
Streaming:
Download:
Share:
 
Abstract:
Learn strategies to decompose algorithms into parallel and sequential phases. These strategies make algorithmic intent clear while enabling performance portability across device generations. Examples include scan, merge, sort, and join.
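To make the decomposition concrete: an inclusive scan splits into a parallel phase within each block and a small sequential pass over block totals. A sketch of the in-block parallel phase, using the classic Hillis-Steele formulation and assuming a single block with n <= blockDim.x <= 1024:

__global__ void blockScanInclusive(const int *in, int *out, int n)
{
    __shared__ int temp[1024];            // assumes blockDim.x <= 1024
    int t = threadIdx.x;
    temp[t] = (t < n) ? in[t] : 0;
    __syncthreads();
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        int addend = (t >= offset) ? temp[t - offset] : 0;
        __syncthreads();                  // finish all reads before any writes
        temp[t] += addend;
        __syncthreads();
    }
    if (t < n) out[t] = temp[t];
}

Scanning the per-block sums and adding them back is the sequential phase that stitches blocks together.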
 
Topics:
Big Data Analytics, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4674
Streaming:
Share:
 
Abstract:
BIDMach is an open-source library for GPU-accelerated machine learning. BIDMach on a single GPU node exceeds the performance of all other tools (including cluster systems on hundreds of nodes) for the most common machine learning tasks. BIDMach is an easy-to-use, interactive environment similar to SciPy/Matlab, but with qualitatively higher performance. The session will discuss: Performance: BIDMach follows a "Lapack" philosophy of building high-level algorithms on fast low-level routines (like BLAS). It exploits the unique hardware features of GPUs to provide more than order-of-magnitude gains over alternatives. Accuracy: Monte-Carlo methods (MCMC) are the most general way to derive models, but are slow. We have developed a new approach to MCMC which provides two orders of magnitude speedup beyond hardware gains. Our "cooled" MCMC is fast and improves model accuracy. Interactivity: We are developing interactive modeling/visualization capabilities in BIDMach to allow analysts to guide, correct, and improve models in real time.
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning, Genomics & Bioinformatics, Scientific Visualization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4811
Streaming:
Download:
Share:
 
Abstract:
The OpenPOWER Foundation (http://www.open-power.org/) is an open alliance of companies working together to expand the hardware and software ecosystem based on the POWER architecture. This collaboration across hardware and software vendors enables unique innovation across the full hardware and software stack. OpenPOWER ecosystem partners and developers now have more choice, control and flexibility to optimize at any level of the technology from the processor on up for next-generation, hyperscale and cloud datacenters. Integrating support for NVIDIA GPUs on the POWER platform enables high performance enterprise and technical computing applications such as Big Data and analytics workloads. This presentation will cover the software stack and developer tools for OpenPOWER, the planned support for CUDA, and a proof of concept showing GPU acceleration. This proof of concept will be available as a demo in the IBM booth.
 
Topics:
Big Data Analytics, Programming Languages, Debugging Tools & Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4882
Streaming:
Download:
Share:
 
Abstract:
Graphs that model social networks, numerical simulations, and the structure of the internet are enormous and continuously changing with time. Contemporary software packages neglect temporal variations in these networks and can only analyze them statically. This poster presents an optimized GPU implementation of dynamic betweenness centrality, a popular analytic with applications in power grid analysis, the study of protein interactions, and community detection. By avoiding unnecessary accesses to memory, we achieve up to a 110x speedup over a CPU implementation of the algorithm and can update the analytic 45x faster on average than a static recomputation on the GPU.
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4171
Download:
Share:
 
Abstract:
In this study, the recently introduced Ensemble Modeling (EM) approach was used to construct a kinetic model of E. coli metabolism. We put forth a metabolic model composed of 34 reactions and 22 metabolites representing E. coli's core metabolism. We developed a Newton-Raphson based estimation approach to identify the kinetic parameters of a given metabolic network. The solver is designed and implemented using CUDA in order to accelerate the overall process. The application initially parses a large set of equations using the Boost::Spirit C++ framework, finds an analytic Jacobian J, and then iteratively updates the 'best' solution by solving J*delta = -f for the update delta using GMRES from CUSP. Successive updates of the parameter set, the Jacobian matrix, and the function evaluations, as well as the system solve, are all implemented on the GPU.
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4258
Download:
Share:
 
Abstract:
Massive amounts of data are being generated in recent times. Current classification methods are quite accurate but extremely slow on big data. We propose a two-pronged approach: (a) treat data vertically instead of the conventional horizontal treatment and use our vertical-data-specific classification algorithm, and (b) exploit the GPU's fast mathematical computation to process vertical data quickly, which benefits significantly from our data structure called the P-Tree. Our classification algorithm is O(k), where k is the number of attributes, and achieves significantly high accuracy.
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4263
Download:
Share:
 
Abstract:
This poster describes the CUDA version of the PrefixSpan algorithm implemented on NVIDIA Kepler GPUs, which extracts the inherent task parallelism and leverages the dynamic parallelism feature for implementing recursion. The results show that the GPU-accelerated PrefixSpan, i.e., CUDAPrefixSpan, achieves a speedup of ~5x on sequence databases of varying sizes.
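Kepler's dynamic parallelism lets a recursive mining step launch follow-up work directly from the device. A heavily elided sketch of the launch pattern only; the mining logic and all names here are placeholders, not the poster's code:

#define MAX_DEPTH 8
#define MIN_SIZE  256

// Parent kernel mines one projected database, then recurses on the device.
__global__ void minePatterns(const int *projDB, int dbSize, int depth)
{
    // ... count item frequencies in projDB and emit frequent prefixes (elided) ...
    if (depth < MAX_DEPTH && dbSize > MIN_SIZE) {
        // each frequent item would get its own projected database; illustrated
        // here with a single child grid launched from device code
        minePatterns<<<(dbSize + 255) / 256, 256>>>(projDB, dbSize / 2, depth + 1);
    }
}

Compiling device-side launches requires relocatable device code (nvcc -arch=sm_35 -rdc=true) and linking against cudadevrt.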
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4187
Download:
Share:
 
Abstract:
This poster presents the Red Fox system, sponsored by the NVIDIA Graduate Fellowship program. It introduces the compilation flow and performance results of executing relational queries such as TPC-H on GPUs.
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4182
Download:
Share:
 
Abstract:
In this poster, we show a column-based approach using multiple GPUs to quickly produce ad-hoc histograms from previously compiled data. We then compare this approach's histogram building speed to Apache's Lucene-based Solr.
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4230
Download:
Share:
 
Abstract:
Information is one of the most influential forces transforming the growth of business. Companies churn out a burgeoning volume of transactional data, capturing and matching trillions of bytes of information, which has caused data to grow exponentially. Our work progressively researches mechanisms for accelerating SQL query operations using the GPU. The proposed system is able to process large volumes of data exceeding the total size of the GPU RAM. It performs fundamental SQL operations such as select, like, order by, join, sum, min, and others. In addition, it works with PostgreSQL and MySQL.
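The simplest of the listed operations, a select over one column, illustrates the kernel shape such systems use. A minimal sketch under hypothetical names, compacting matching row ids with an atomic counter:

// WHERE column > threshold: each thread tests one row and emits its id on a match.
__global__ void selectGreaterThan(const int *column, int numRows, int threshold,
                                  int *outRowIds, int *outCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRows) return;
    if (column[i] > threshold)
        outRowIds[atomicAdd(outCount, 1)] = i;   // output order is not preserved
}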
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4149
Download:
Share:
 
Abstract:
Several recent supercomputers have GPUs on each compute node to accelerate computation. However, not all applications can be accelerated by GPUs. For example, the performance of I/O-bound applications is limited by the performance of the underlying I/O devices. Such I/O-bound applications require more I/O bandwidth rather than computational power. If we execute such non-GPU applications, the GPUs are not utilized and we waste the resources. To accelerate I/O-bound applications, we developed a GPU-accelerated I/O interface (gmfs). Our experimental results show gmfs can accelerate sequential reads/writes, utilizing 82% of PCIe-gen2 peak bandwidth and 50% of PCIe-gen3 peak bandwidth.
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4185
Download:
Share:
 
Abstract:
Recent supercomputers deploy not only many-core accelerators such as GPUs but also Non-Volatile Memory (NVM) such as flash memory as external memory, in order to handle large-scale data processing for a wide range of applications. However, how to configure NVM as local disks at low cost and large volume for heterogeneous supercomputers is not clear. In order to clarify the I/O characteristics between GPU and NVM, we comparatively investigate I/O strategies on a GPU and multiple mini SATA SSDs. Our preliminary results exhibit 3.06 GB/s of throughput from 8 mini SATA SSDs to the GPU by using RAID0 with an appropriate stripe size.
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4251
Download:
Share:
 
Abstract:
This research project focuses on a GPU implementation of the commonly used K-means clustering algorithm. Our implementation minimizes the overhead caused by copying data between CPU and GPU. We were able to implement the entire algorithm on the GPU, which greatly improved performance over the CPU, reaching up to a 15x speedup. Our work also analyzes an improved version of the algorithm called K-means++, which builds on the original K-means with a more careful initialization that leads to better performance. We adapted the K-means++ algorithm to work on the GPU, which led to a 9x speedup.
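The assignment step dominates K-means runtime and parallelizes naturally with one thread per point. A minimal sketch, assuming d-dimensional points stored row-major; the centroid update would run as a separate reduction kernel:

__global__ void assignClusters(const float *points, const float *centroids,
                               int *labels, int n, int k, int d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int best = 0;
    float bestDist = 3.402823e38f;               // FLT_MAX
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = points[i * d + j] - centroids[c * d + j];
            dist += diff * diff;                 // squared Euclidean distance
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }
    labels[i] = best;
}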
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4264
Download:
Share:
 
Abstract:
This poster shows the space compression and speed gain achieved while processing 'Big Data' using pTrees, which are vertical bit slices of the columns of a data set. Our experiment shows a 92% speed gain over traditional processing on data sets on the order of a billion records.
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4268
Download:
Share:
 
Abstract:
Column-store in-memory databases have received a lot of attention because of their fast query processing response times on modern multi-core machines. As part of our research, we are developing a high performance GPU library for costly database operations. Our work will leverage the latest NVIDIA GPU features (i.e., Unified Virtual Addressing, Multi-Streaming) and various host-side partitioning algorithms to run database operations on large tables. The focus of this work is on the prototype for GroupBy/Aggregate operations that we created to exploit GPUs. The algorithm has two main steps. In the first step, we create a hash table by doing coalesced reads from the table on which we run the GroupBy/Aggregate query; the aggregation operations occur at the same time as building the hash table. After creating the hash table, we only need a probe phase to retrieve results from it. Our results indicate that by using GPU shared memory we can get a 28x speedup over the CPU implementation.
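The build step can be pictured as a linear-probing hash table in which aggregation happens in place. A global-memory sketch of GroupBy/SUM under assumed names (the poster's shared-memory variant is what yields the quoted speedup); tableKeys must be pre-initialized to the sentinel:

#define EMPTY_KEY (-1)

// Each thread inserts one (key, value) pair, summing values per group in place.
__global__ void groupBySum(const int *keys, const float *vals, int n,
                           int *tableKeys, float *tableSums, int tableSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int key  = keys[i];
    int slot = ((unsigned int)key) % tableSize;
    while (true) {
        int prev = atomicCAS(&tableKeys[slot], EMPTY_KEY, key);
        if (prev == EMPTY_KEY || prev == key) {   // claimed a slot or found our group
            atomicAdd(&tableSums[slot], vals[i]); // aggregate as we build
            return;
        }
        slot = (slot + 1) % tableSize;            // linear probing
    }
}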
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4211
Download:
Share:
 
Abstract:
Growing rates of collected data present a challenge when it comes to scalable solutions for data transmission and processing. Even more challenging is the problem of real-time stream processing. In such applications, the system needs to react to the incoming data within given time bounds. This poster presents the challenges in processing multiple real-time data streams on CPU/GPU systems, and the results of our efforts for dealing with these challenges. The work addresses various issues related to single- and multi-GPU systems, including resource sharing in computation and communication under real-time constraints.
 
Topics:
Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4141
Download:
Share:
Climate, Weather & Ocean Modeling
Presentation
Media
Abstract:
The Non-hydrostatic Icosahedral Model (NIM) is a next-generation global weather model being developed at NOAA to improve 0-100 day weather predictions. Since development began in 2008, the model has been designed to run on highly parallel computer architectures such as GPUs. GPU parallelization has relied on the directive-based Fortran-to-CUDA ACCelerator (F2C-ACC) compiler developed at NOAA. Recent work has focused on parallelization of model physics, evaluating the OpenACC compilers, and preparing the model to run at the full 3.5 km resolution on 5000 nodes of Titan. This talk will report on the development of the NIM model, describe our efforts to improve parallel performance on Titan, and report on our experiences using the OpenACC compilers.
 
Topics:
Climate, Weather & Ocean Modeling
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4157
Streaming:
Share:
 
Abstract:
This session presents QUIC EnvSim, a scientific tool for modeling the complex interactions between the environment and urban form. The talk will focus on the simulation of radiative heat transfer in urban environments with vegetation (such as trees, parks, or green rooftops) using the GPU accelerated NVIDIA OptiX ray tracing engine. Attend this session to learn how we utilize OptiX to efficiently and accurately simulate radiative transport in urban domains. Topics include: (1) The physical properties of surfaces and vegetation and how they interact with longwave and shortwave radiation; (2) Efficient and scalable discretization of large urban domains; (3) Strategies we employed for overcoming challenges such as atomic operations, multiple GPUs, and more; and (4) Results that illustrate the validity, efficiency, and scalability of the system.
 
Topics:
Climate, Weather & Ocean Modeling, Rendering & Ray Tracing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4312
Streaming:
Download:
Share:
 
Abstract:
ASUCA is the next-generation non-hydrostatic Japanese mesoscale weather prediction model, currently developed at the Japan Meteorological Agency. To complement the successful GPU port of its Dynamical Core by Shimokawabe et al., the Physical Core has now been fully ported as well. In order to achieve a unified codebase with high usability as well as high performance on both GPU and CPU, a new directive-based open source language extension called 'Hybrid Fortran' has been used (as introduced at GTC 2013). Using a Python-based preprocessor, it automatically creates CUDA Fortran code for GPU and OpenMP Fortran code for CPU, with two separate horizontal loop orders in order to maintain performance. Attendees of this session will learn how to create a hybrid codebase with high usability as well as high performance on both CPU and GPU, how we used a preprocessor to achieve our goals, and how to use macros for memory optimizations while following the DRY principle.
 
Topics:
Climate, Weather & Ocean Modeling
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4352
Streaming:
Download:
Share:
 
Abstract:
Numerical weather prediction is one of the major applications in high-performance computing and is accelerated on GPU supercomputers. Obtaining good parallel efficiency on more than a thousand GPUs often requires skillful programming, for example, both MPI for inter-node communication and NVIDIA GPUDirect for intra-node communication. The Japan Meteorological Agency is developing a next-generation high-resolution mesoscale weather prediction code, ASUCA. We are implementing it on a multi-GPU platform by using a high-productivity framework for mesh-based applications. Our framework automatically translates user-written functions that update a grid point and generates both GPU and CPU code. The framework can also hide the complicated implementation of the efficient communications described above. In this presentation, we will show the implementation of the weather prediction code using this framework and the performance evaluation on the TSUBAME 2.5 supercomputer at Tokyo Institute of Technology.
 
Topics:
Climate, Weather & Ocean Modeling, Computational Fluid Dynamics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4565
Streaming:
Share:
 
Abstract:
ETH-Zurich is proposing a new concept for wave propagation laboratories in which the physical experiment is linked with a numerical simulation in real time. Adding live experimental data to a larger numerical simulation domain creates a virtual lab environment never before realized, enabling the study of frequencies inherent in important seismological and acoustic real-world scenarios. The resulting environment is made possible by a real-time computing system under development. This system must perform computations typically reserved for traditional (offline) HPC applications but produce results in a matter of microseconds. To do so, National Instruments is using the LabVIEW platform to leverage NI's fastest data acquisition and FPGA hardware with NVIDIA's most powerful GPU processors to build a real-time heterogeneous simulator.
 
Topics:
Climate, Weather & Ocean Modeling, Big Data Analytics, Numerical Algorithms & Libraries, Signal and Audio Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4682
Streaming:
Download:
Share:
 
Abstract:
GPU-based supercomputers are the most energy efficient and among the most powerful computing systems in use today. We show, with examples from computational physics and climate simulations, how this performance is delivered today to solve real-world problems. You will see how application software has been structured in order to port seamlessly across hardware platforms, what aspects of current hybrid CPU-GPU platforms matter, and how such architectures should best develop so that applications continue to benefit from exponential performance increases in the future.
 
Topics:
Climate, Weather & Ocean Modeling, Numerical Algorithms & Libraries, Computational Physics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4719
Streaming:
Download:
Share:
 
Abstract:
Geostatistical techniques are widely used for the spatial characterization of phenomena in the Earth Sciences. Many of the estimation and simulation techniques proposed decades ago are still in use; however, little effort has been made to rethink their programming structure to take advantage of the languages and hardware available today. This poster shows a parallel implementation of the Turning Bands Method for conditional simulation of random fields, using Graphics Processing Units (GPUs).
 
Topics:
Climate, Weather & Ocean Modeling
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4248
Download:
Share:
 
Abstract:
We present an efficient GPU-accelerated numerical method for modeling tsunami wave propagation. We use the two-dimensional shallow water equations to model tsunami waves and a high-order accurate discontinuous Galerkin method for the numerical solution of the model. We describe the inherent fine-grain parallel nature of our algorithms and their implementation on GPUs and CPUs using OCCA, a portable threading language. Kernels written in OCCA are cross-compiled with CUDA, OpenCL, or OpenMP at runtime, enabling portability of the code among several hardware architectures. We compare the performance of these kernels across different threading languages on GPUs and CPUs.
 
Topics:
Climate, Weather & Ocean Modeling
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4135
Download:
Share:
Clusters & GPU Management
Presentation
Media
Abstract:
GPI2 for GPUs is a PGAS framework for efficient communication in heterogeneous clusters. In this session you will learn how multi-GPU programs can benefit from an RDMA-based programming model. We will introduce the industry-proven PGAS communication library GPI2 and its support for GPUs. GPUDirect RDMA technology allows true one-sided communication between multiple GPUs on different nodes; an RDMA-based programming model therefore suits this technology best. Due to the very low communication overhead of one-sided operations, an inter-node data transfer latency of 3 microseconds can be reached. GPI2 for GPUs is not only optimized for inter-node communication; intra-node communication is also optimized by combining the different GPUDirect technologies.
 
Topics:
Clusters & GPU Management, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4183
Streaming:
Share:
 
Abstract:
Managing a multi-user heterogeneous HPC cluster can be challenging, but there are ways to make it easier. This session will cover the GPU-aware cluster software stack from the perspective of a system administrator, from driver installation through resource manager integration and centrally-managed development tools such as MPI libraries. This will include an overview of NVIDIA's tools for GPU management and monitoring, a survey of third-party tools with GPU integration, and a number of "lessons learned" from managing HPC clusters inside NVIDIA.
 
Topics:
Clusters & GPU Management
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4253
Streaming:
Download:
Share:
 
Abstract:
Learn how to deploy both Fermi and Kepler-based GPUs in an OpenStack cloud. In this session we describe the latest HPC features for the OpenStack cloud computing platform, including Kepler and Fermi GPU support, high speed networking, bare metal provisioning, and heterogeneous scheduling. The features are based on OpenStack Grizzly and Havana, with upcoming support for OpenStack Icehouse. Using examples drawn from signal and image processing, we will characterize the performance and versatility of LXC and Xen GPU support for both regular and irregular computations. We'll also characterize the performance improvements due to support for high speed networking in the OpenStack cloud. The session will conclude with a discussion of the next steps in HPC OpenStack development.
 
Topics:
Clusters & GPU Management, Desktop & Application Virtualization, Signal and Audio Processing, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4257
Streaming:
Download:
Share:
 
Abstract:
The Research Computing and Cyberinfrastructure (RCC) Unit at The Pennsylvania State University (PSU) has a strong commitment to GPU-enabled research and is currently a CUDA Research Center. The main GPU cluster, Lion-GA, consists of Tesla M2070 and M2090 GPU cards, and newer devices including the K20 are available for interactive use. Lion-GA is capable of delivering over thirty teraflops during peak usage, and delivered almost twenty GPU-years in 2012. This presentation will detail experiences in establishing GPU-enriched teaching and research at Penn State, covering a broad range of topics including benchmarking, administration, and high-level code development.
 
Topics:
Clusters & GPU Management
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4298
Streaming:
Download:
Share:
 
Abstract:
In this session, you will get familiar with vACC, a virtual accelerator/GPU library that virtualizes remote and local GPUs installed across a cluster of compute nodes. The main objective is to provide efficient virtualized access to GPUs from any host in the system. GPU virtualization brings new opportunities for effective management of GPU resources by decoupling them from host applications. In addition to access to remote GPUs, the vACC framework offers power-aware physical/virtual accelerator mapping, fault tolerance with transparent migration, efficient integration with virtual machines in Cloud environments and support for both CUDA and OpenCL paradigms. vACC can enable GPU service providers to offer cost-effective, flexible and fault-tolerant access to GPUs in the Cloud. Such capabilities are crucial in facilitating the adoption of GPU-based services across academia and industry. During the session, we will demonstrate how using vACC can improve GPU access experience and maintenance cost in a local cluster or a Cloud.
 
Topics:
Clusters & GPU Management, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4321
Streaming:
Share:
 
Abstract:
Learn how to correctly profile the power and energy consumption of your kernels using the built-in power sensor of K20 compute GPUs. The measurements do not directly follow the GPU activity but lag behind and are distorted. This can cause large inaccuracies, especially for short running kernels, when taking the power samples at face value. This session explains how to compute the true power and energy consumption and provides general guidelines on how to best profile the power draw of GPU kernels using NVIDIA's Management Library.
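The sampling itself is done on the host through NVML. A minimal sketch of the polling loop, assuming device 0 and omitting the averaging and lag correction the session derives:

#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int mw;                       /* power in milliwatts */
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    /* poll while the kernel of interest runs; the K20 sensor updates
       only a few times per second, hence the distortion for short kernels */
    for (int s = 0; s < 100; ++s) {
        nvmlDeviceGetPowerUsage(dev, &mw);
        printf("%u mW\n", mw);
        usleep(10000);                     /* ~10 ms between samples */
    }
    nvmlShutdown();
    return 0;
}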
 
Topics:
Clusters & GPU Management, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4454
Streaming:
Download:
Share:
 
Abstract:
We describe the design of a runtime component that enables the effective use of GPUs in cluster environments. In particular, our system allows: (1) abstraction of GPUs from end-users; (2) different GPU sharing and scheduling mechanisms; (3) virtual memory management; (4) load balancing and dynamic recovery in case of GPU failure, upgrade, and downgrade; (5) integration with existing cluster-level schedulers and resource managers for CPU clusters.
 
Topics:
Clusters & GPU Management, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4473
Streaming:
Download:
Share:
 
Abstract:
In modern heterogeneous HPC architectures, several computing resources (CPUs, accelerators) and I/O resources (InfiniBand cards, PCIe links, QPI links) must be used simultaneously to get the best from the hardware. This observation is even more true with the rise of technologies such as GPU Direct RDMA, which communicates directly between GPUs and InfiniBand links. In this context, resource affinity (i.e., resource selection and process placement) can have a strong impact on performance. The aim of the presentation is, first, to identify the main affinity issues that occur in current heterogeneous architectures (e.g., which CPU core should be chosen when a particular GPU is used? Which IB interface should be chosen when a GPU Direct RDMA transfer is launched?) and show their visible impact on performance. We then propose solutions to handle these issues. We argue that affinity selection should be managed globally at the cluster resource manager level (with SLURM in our work), and not by HPC programmers.  Back
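As a sketch of one ingredient of such affinity handling (our illustration, not the authors' SLURM integration), NVML can report which CPUs are topologically closest to a GPU so that a process can be bound near the device it drives:

```cuda
/* Sketch: bind the calling process to the CPU cores nearest GPU 0,
   using NVML topology information. Error handling omitted. */
#define _GNU_SOURCE
#include <nvml.h>
#include <sched.h>

int bind_near_gpu0(void) {
    nvmlDevice_t dev;
    unsigned long mask[2] = {0, 0};   /* room for 128 logical CPUs */

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    /* Fills 'mask' with the CPUs sharing the GPU's NUMA node /
       PCIe root complex -- the cores a resource manager should
       hand to processes driving this device. */
    nvmlDeviceGetCpuAffinity(dev, 2, mask);

    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 128; ++cpu)
        if (mask[cpu / 64] & (1UL << (cpu % 64)))
            CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);  /* bind this process */

    nvmlShutdown();
    return 0;
}
```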
 
Topics:
Clusters & GPU Management, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4491
Streaming:
Download:
Share:
 
Abstract:
Open MPI is an open source implementation of the Message Passing Interface (MPI) library used to support parallel applications. With GPUs being used more and more in large clusters, there has been work done to make CUDA and MPI work seamlessly together. In this talk, we will cover new features added to the library to support sending and receiving of GPU buffers directly.   Back
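A minimal sketch of what this looks like from the application side, assuming a CUDA-aware Open MPI build: the device pointer is handed straight to MPI, with no explicit staging through host memory.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *d_buf;
    cudaMalloc(&d_buf, n * sizeof(double));   /* GPU-resident buffer */

    /* With CUDA-aware Open MPI, device pointers are valid arguments. */
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```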
 
Topics:
Clusters & GPU Management
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4589
Streaming:
Download:
Share:
 
Abstract:

In today's fast changing business environment, companies are looking for ways to deliver better designs faster and cheaper while creating high quality products across an ecosystem of partners. To succeed, a company must transform its design processes by converting engineering silos into shared engineering clouds that improve collaboration, standardize processes and create a secure environment for sharing designs across operations and organizations including partners and suppliers. The 3D Engineering Cloud Solution is a high performance visual computing environment for organizations that have large 3D intensive graphics requirements and want to improve collaboration while protecting their assets and reducing costs. The 3D Engineering Cloud Solution is made possible due to a partnership between IBM, Citrix, and NVIDIA. This combination creates a unique 3D engineering environment in the Cloud.

  Back
 
Topics:
Clusters & GPU Management, GPU Virtualization, Graphics and AI, Computer Aided Engineering
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4849
Streaming:
Download:
Share:
 
Abstract:
We present a new abstraction for provisioning resources on high-performance heterogeneous GPGPU-based clusters: slices. Slices represent aggregated subsets of resources across a cluster for use by an application and target an environment where diverse applications co-run on shared cluster resources. Our poster presents studies examining application scalability and limitations, efficiency in mapping applications to slices using a novel GPGPU 'sensitivity' metric, and gains in multi-application throughput when mapping slices to underlying cluster resources, guided by application profiles. We evaluate the behavior of representative HPC codes: LAMMPS, NAS-LU and SHOC's S3D application kernel on clusters of 48 and 72 nodes.  Back
 
Topics:
Clusters & GPU Management
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4150
Download:
Share:
 
Abstract:
Traditional throughput-oriented GPGPU-based platforms are primarily designed to support a single GPGPU process at a time. This is problematic in deadline-oriented (real-time) systems when multiple processes compete for GPU resources. System-level services are necessary to schedule competing work according to priority to ensure that deadlines are met. GPUSync is a framework for implementing such schedulers in multi-GPU, multicore, real-time systems. GPUSync enables GPUs to be shared among processes in safety-oriented applications, such as advanced driver assistance systems (ADAS) and autonomous vehicles, since timing constraints can be guaranteed to be met.  Back
 
Topics:
Clusters & GPU Management
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4286
Download:
Share:
 
Abstract:
The poster presents an initial concept of research on Dynamic Intelligent Kernel Assignment in Heterogeneous Multi-GPU Systems: given an application using the StarPU framework, our scheduler selects custom scheduling policies and executes the kernels intelligently, seamlessly mapping each kernel to the corresponding device so as to minimize execution time.  Back
 
Topics:
Clusters & GPU Management
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4200
Download:
Share:
 
Abstract:
The poster presents visualization tools for applications executed in a parallel environment consisting of a collection of clusters with multiple GPUs and CPUs. The system allows modeling applications as acyclic workflow graphs with customizable algorithms for scheduling onto the underlying network of clusters. The poster depicts visualization tools that provide three distinct runtime views: (1) the hardware infrastructure with clusters, nodes and computing devices, along with resource monitoring showing computing loads, memory usage, etc.; (2) the progress of execution of particular stages of the application workflow graph; (3) the application state, which can graphically represent the progress of numerical computations or physical phenomena.  Back
 
Topics:
Clusters & GPU Management
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4250
Download:
Share:
Collaborative & Large Resolution Displays
Presentation
Media
Abstract:
Large-format, high-resolution displays are being utilized everywhere from corporate conference rooms to supercomputing facilities. NVIDIA Quadro SVS solutions provide many features that make it easier to install and utilize these large-scale displays. Attendees of this tutorial will learn how to configure Quadro graphics for thin-bezel panels, edge-blended projectors, and stereoscopic and immersive displays.  Back
 
Topics:
Collaborative & Large Resolution Displays
Type:
Tutorial
Event:
GTC Silicon Valley
Year:
2014
Session ID:
SIG4113
Streaming:
Download:
Share:
 
Abstract:
We describe how to put together VR CAVEs that used to cost $250K for a whole lot less using NVIDIA NVAPI, and provide case studies, pictures, and diagrams of how to go about it. We believe that a substantial expansion of the VR market is occurring and that these kinds of systems will become more commonplace as the market expands, both through more effective use of the Quadro cards in the system and through use of the warp and blend APIs.  Back
 
Topics:
Collaborative & Large Resolution Displays, Product & Building Design, Virtual Reality & Augmented Reality
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4452
Streaming:
Download:
Share:
 
Abstract:

Learn how Mechdyne leverages video compression and streaming to create remote collaboration solutions, connecting CAVEs, Powerwalls and other ultra-resolution displays to enable multi-site, multi-display sharing and decision making. We will explore multiple customer use-cases: immersive-to-immersive, desktop-to-immersive, immersive-to-desktop, monoscopic and stereoscopic.

  Back
 
Topics:
Collaborative & Large Resolution Displays, Graphics and AI, Virtual Reality & Augmented Reality, Video & Image Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4631
Streaming:
Download:
Share:
Combined Simulation & Real-Time Visualization
Presentation
Media
Abstract:

With GPUs, large-scale plasma simulations can run at frames-per-second speeds. We present interactive, in-GPU rendering of large-scale particle-in-cell simulations running on GPU clusters. The user can choose which data are visualized and change the direction of view while the simulation is running. A remote visualization client can connect to the running simulation, allowing live visualization even when bandwidth is limited.

  Back
 
Topics:
Combined Simulation & Real-Time Visualization, Graphics and AI, Large Scale Data Visualization & In-Situ Graphics, Scientific Visualization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4140
Streaming:
Download:
Share:
 
Abstract:
Create your own world and use the power of GPU programming to visualize it. Build a unique landscape with your hands with the help of a device called an "interactive sandbox," and study real-time modelled and realistically visualized natural phenomena such as volcanic eruptions, floods, and changing weather and seasons. You will learn about using GPUs to increase the performance of modelling and visualization, find out how to implement real-time simulation of fluid flow over varying bottom topography, and discover an efficient and fast method of filtering Microsoft Kinect data.  Back
 
Topics:
Combined Simulation & Real-Time Visualization, Virtual Reality & Augmented Reality, Computational Fluid Dynamics, Real-Time Graphics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4269
Streaming:
Download:
Share:
 
Abstract:
In this presentation, some recent features of OpenGL for GPGPU programming are presented through the implementation of a real-time, physically-based deformable objects simulation application. Atomic operations and barrier objects are employed to manage thread execution and obtain correct simulation results. The performance achieved for solving the numerical simulation is more than ten times that of a CPU implementation. This work considers solid objects represented as tetrahedral meshes, and results for huge meshes running at interactive rates will be shown.  Back
 
Topics:
Combined Simulation & Real-Time Visualization, Real-Time Graphics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4276
Streaming:
Share:
 
Abstract:
The Visual Simulation Laboratory (VSL) is an open-source framework for visual simulation. One of its core features is a visual programming interface that enables users to dynamically create simulation pipelines at run time. The pipeline framework has been built such that developers can quickly create nodes based on user needs. VSL has implemented CUDA based nodes that compute a traditional ballistic simulation pipeline for survivability analysis.  Back
 
Topics:
Combined Simulation & Real-Time Visualization
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4202
Download:
Share:
 
Abstract:
Myriad is a GPU-based simulation engine for networks of biologically detailed neurons. Its core architecture is based on three design principles. First, the model hierarchy is flattened into (a) isometric compartments and (b) arbitrary mechanisms that couple exactly two compartments. Second, parameters are passed via a shared memory matrix, eliminating message-passing overhead in favor of the lockstep time synchronization of the GPU architecture. Third, a "minimally object-oriented" framework is designed, enabling inheritance, type inferencing, and method architecture. Myriad is expected to outperform large computational clusters when executing realistic models in which many nodes are densely coupled at every timestep.   Back
 
Topics:
Combined Simulation & Real-Time Visualization
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4169
Download:
Share:
Computational Fluid Dynamics
Presentation
Media
Abstract:
Nearest neighbor search (NNS) is the key to efficient simulation of many discrete physical models. This talk focuses on a novel, efficient fixed-radius NNS that introduces a counting sort accelerated with atomic GPU operations and requires only two kernel calls. As a sample application, fluid simulations based on smoothed particle hydrodynamics (SPH) use NNS to determine interacting fluid particles. The counting-sort NNS achieves a performance gain of 3-5x over the previous radix-sort NNS, which allows interactive SPH fluids of 4 million particles at 4 fps on current hardware. The technique presented is generic and easily adapted to other domains, such as molecular interactions or point cloud reconstruction.  Back
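As a rough illustration of the counting phase (not the presenters' code), one atomicAdd per particle builds the per-cell histogram; a prefix sum over the counts then yields the offsets used by the second, reordering kernel:

```cuda
/* Sketch: phase one of counting-sort cell binning. Grid geometry and
   the float4 position layout are illustrative assumptions. */
__global__ void countParticles(const float4 *pos, int *cellCount,
                               int numParticles, float cellSize,
                               int3 cells) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 p = pos[i];
    int cx = (int)(p.x / cellSize);
    int cy = (int)(p.y / cellSize);
    int cz = (int)(p.z / cellSize);
    int cell = (cz * cells.y + cy) * cells.x + cx;

    atomicAdd(&cellCount[cell], 1);   /* histogram of particles per cell */
}
```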
 
Topics:
Computational Fluid Dynamics, Numerical Algorithms & Libraries, Performance Optimization, Molecular Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4117
Streaming:
Download:
Share:
 
Abstract:
Discover how GPUs are being used to accelerate high-fidelity computational fluid dynamics (CFD) simulations on unstructured grids. In this talk I will (i) introduce the flux reconstruction approach to high-order methods, a discretization that is particularly well-suited to many-core architectures; (ii) introduce our massively parallel implementation, PyFR, which through a combination of symbolic manipulation and run-time code generation is able to easily target NVIDIA GPU hardware; and (iii) showcase some of the high-fidelity, unsteady simulations undertaken using PyFR on both desktop and HPC systems.  Back
 
Topics:
Computational Fluid Dynamics, Numerical Algorithms & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4250
Streaming:
Download:
Share:
 
Abstract:
Learn about the challenges and possibilities of applying CUDA to a Multi-Phase Particle-In-Cell code base through (1) An applied approach to parallelizing Barracuda VR, a CAE MP-PIC code, (2) Achieved speed-ups of operation types specific to MP-PIC codes (in double-precision), (3) Focused discussion on the crux of MP-PIC, i.e. mapping Lagrangian data to the Eulerian grid and (4) Demonstrated speed-up and future expectations.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4417
Streaming:
Download:
Share:
 
Abstract:
Explore the latest techniques for accelerating combustion simulations with finite-rate chemical kinetics using GPUs. In this session we will compare the performance of different numerical methods for solving stiff and non-stiff ODEs and discuss the compromises that must be made between parallel throughput and numerical efficiency. Learn techniques used to (1) manage variable integration costs across the concurrent ODEs and (2) reduce thread divergence caused by non-linear iterative solvers.  Back
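A toy sketch of the per-cell integration pattern under discussion; the explicit substepping and the stand-in rhs() are our simplifications (production stiff solvers are implicit), but they show why per-cell cost varies and threads diverge:

```cuda
#define MAX_SPECIES 64

/* Toy stand-in chemistry: a linear decay chain. A real mechanism with
   thousands of reactions would go here. */
__device__ void rhs(const double *y, double *f, int n) {
    for (int k = 0; k < n; ++k)
        f[k] = -y[k] + (k > 0 ? 0.5 * y[k - 1] : 0.0);
}

__global__ void integrateCells(double *Y, int nSpecies, int nCells,
                               double dtGlobal, double tol) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= nCells || nSpecies > MAX_SPECIES) return;

    double *y = &Y[cell * nSpecies];
    double t = 0.0;
    while (t < dtGlobal) {
        double f[MAX_SPECIES];
        rhs(y, f, nSpecies);

        /* Substep shrinks where the state changes fast, so stiff cells
           iterate more than quiescent ones: variable per-thread cost,
           hence warp divergence across neighbouring cells. */
        double fmx = 1e-300;
        for (int k = 0; k < nSpecies; ++k)
            fmx = fmx > fabs(f[k]) ? fmx : fabs(f[k]);
        double dt = fmin(dtGlobal - t, tol / fmx);

        for (int k = 0; k < nSpecies; ++k)
            y[k] += dt * f[k];
        t += dt;
    }
}
```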
 
Topics:
Computational Fluid Dynamics, Numerical Algorithms & Libraries, Computational Physics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4418
Streaming:
Download:
Share:
 
Abstract:
Are you interested in decreasing the runtime of your 24-hour flow simulation to nine minutes? This is the story of how GPUs achieved a 150x speedup and made Physalis into a viable computational tool for investigating the behavior of large fluid-particle systems. The Physalis method is the only known means of applying near-perfect boundary conditions to spherical particles in a coarse Cartesian finite-difference flow solver, but it suffers from a debilitating computational requirement. GPU technology enables us to overcome this limitation so we can investigate the underlying physics behind natural phenomena like dust storms and energy-generation technologies such as fluidized bed reactors. We will discuss concepts of the design of a GPU finite-difference incompressible Navier-Stokes flow solver, introduce the algorithm behind the Physalis method, and evaluate the current and future capabilities of this GPU fluid-particle interaction code.  Back
 
Topics:
Computational Fluid Dynamics, Numerical Algorithms & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4544
Streaming:
Download:
Share:
 
Abstract:
Learn about a new approach to developing large-scale Computational Fluid Dynamics (CFD) software for parallel processors such as GPUs. The session focuses on two topics: (1) the use of automatic source code generation for CFD kernels on unstructured grids to achieve close to optimal performance while maintaining code readability, and (2) case studies of advanced gas turbine simulations on clusters with 100s of GPUs.   Back
 
Topics:
Computational Fluid Dynamics, Computer Aided Engineering, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4594
Streaming:
Share:
 
Abstract:

CFD calculations in an industrial context prioritize fast turn-around times - a requirement that can be addressed by porting parts of the CFD calculation to the GPU, leading to a hybrid CPU/GPU approach. In a first step, the GPU library Culises was developed, allowing the GPU-based solution of large-scale linear systems of equations that are in turn set up by MPI-parallelized CFD codes (e.g., OpenFOAM) on the CPU. In this session we will address a second step, which consists of porting the construction of the linear system to the GPU as well, while pre- and post-processing remain on the CPU. Aiming for industrial applications in the automotive sector, the approach is aligned with the simpleFOAM solver of OpenFOAM. As the setup of the linear system consumes up to 40-50% of the computational time in typical automotive-industry cases, this approach can further increase the acceleration of CFD computations.

  Back
 
Topics:
Computational Fluid Dynamics, Autonomous Vehicles, AEC & Manufacturing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4598
Streaming:
Download:
Share:
 
Abstract:
Learn how to develop efficient highly-scalable GPU codes faster through use of the Python programming language. In this talk I will describe our accelerated massively parallel computational fluid dynamics (CFD) code, PyFR, and outline some of the techniques employed to reduce development time and enhance performance. Specifically, it will be shown how even complex algorithms - such as those employed for performing CFD on unstructured grids - can be constructed in terms of efficient matrix-matrix multiplications. Moreover, general advice will be given on how best to integrate CUDA and MPI. Furthermore, I will demonstrate how Python can be used both to simplify development and bring techniques such as run-time kernel generation to the mainstream. Examples of these techniques, as utilized in PyFR, will be given throughout.  Back
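The matrix-multiplication formulation might look like the following sketch (our illustration; names, layout and single precision are assumptions): the per-element operator application across the whole mesh is batched into one large GEMM.

```cuda
#include <cublas_v2.h>

/* Apply a dense nOut x nIn operator to the states of every mesh element
   at once: d_out = d_op * d_in, with elements as columns. */
void apply_operator(cublasHandle_t h,
                    const float *d_op,   /* nOut x nIn operator matrix */
                    const float *d_in,   /* nIn  x nElem input states  */
                    float       *d_out,  /* nOut x nElem output states */
                    int nOut, int nIn, int nElem) {
    const float one = 1.0f, zero = 0.0f;
    /* Column-major GEMM; batching all elements into one multiplication
       keeps the GPU busy with a single bandwidth/compute-friendly call. */
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                nOut, nElem, nIn,
                &one, d_op, nOut, d_in, nIn,
                &zero, d_out, nOut);
}
```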
 
Topics:
Computational Fluid Dynamics, Numerical Algorithms & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4649
Streaming:
Download:
Share:
 
Abstract:
The solution of the linear equation systems arising from the discretization of the flow equations can be a major time-consuming portion of a flow simulation. In the ANSYS FLUENT flow solver, especially when using the coupled solver, the linear solver takes a major chunk of the simulation time. In order to improve performance and let users take advantage of available GPU hardware, we provide a mechanism in ANSYS FLUENT to offload the linear solver onto a GPU using NVIDIA's AmgX multi-grid solver. In this talk we present a top-level view of the architectural design of integrating the AmgX solver into ANSYS FLUENT. We also present some preliminary performance results obtained from our first offering of AmgX inside ANSYS FLUENT release 15.0.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4672
Streaming:
Download:
Share:
 
Abstract:
Fluid-structure interaction (FSI) is one of the most challenging areas for numerical simulation. Modeling fluid flow is complicated enough by itself, but adding the interaction with a deformable structure makes it even more challenging. One particular method, SPH, is especially suited to GPU processing. SPH stands for Smoothed Particle Hydrodynamics; it is a particle-based Lagrangian continuum method that can run completely on the GPU. Improvements to the classic SPH solver have led to an extremely accurate and robust solver that can better capture the pressure field for violent water impacts. To solve complicated FSI problems, an equally robust and accurate finite element solver needs to be part of the coupled solution. One particular application is modelling a real human heart valve, something that has not been done until now. Results using the latest NVIDIA GPU, the K40, will be shown for the heart valve model along with other FSI applications.  Back
 
Topics:
Computational Fluid Dynamics, Computational Structural Mechanics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4762
Streaming:
Download:
Share:
 
Abstract:

There is a growing need in internal combustion (IC) engine design to resolve the complicated combustion kinetics in simulations. Without more predictive simulation tools in the design cycle, the cost of development will consume new concepts as it becomes harder to meet the performance and emission targets of the future. The combustion kinetics of real transportation fuels involve thousands of components - each of which can react through thousands of intermediate species and tens of thousands of reaction paths. GPUs show promise in delivering more physical accuracy (per $) to the IC engine design process. Specifically, GPU acceleration of nearly a factor of ten is demonstrated for the integration of multiple chemical source terms in a reacting fluid dynamics simulation. This speedup is achieved by reorganizing the thermodynamics and chemical reaction functions and by updating the sparse matrix functions using NVIDIA's latest GLU library.

  Back
 
Topics:
Computational Fluid Dynamics, Autonomous Vehicles
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4881
Streaming:
Download:
Share:
 
Abstract:
In this poster we present the solution adopted to accelerate STOMP, a reactive flow transport tool, on heterogeneous clusters. We present two components focused on accelerating the most compute-intensive part of the application: the solution of the reaction systems for each grid node with the Newton-Raphson technique. The first is an accelerated batched LU solver for GPUs, to speed up the solution of the many small linear systems. The second is a load balancer that appropriately divides the workload between CPU and GPU. The batched GPU LU solver is designed to comply with STOMP's requirements (full pivoting, and matrices up to 100x100 elements), unlike other currently available solvers. These solutions, integrated into the full application, provide speedups of 6 to 7 times on large problems, executed on up to 16 nodes of a cluster with two AMD Opteron 6272 CPUs and a Tesla M2090 per node.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4224
Download:
Share:
 
Abstract:
The Lattice Boltzmann Method (LBM) is used to study fluid flow in a 2D lid-driven cavity. In LBM, a linearized Boltzmann equation is solved on a discrete lattice. Since the algorithm is massively parallel, the GPGPU is a good choice for such an implementation. We implemented this problem in CUDA C and optimized it to obtain a maximum performance of 315 million grid updates per second.  Back
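For readers unfamiliar with the method, here is a compact D2Q9 BGK collide-and-stream kernel in the spirit of such an implementation (a sketch with periodic boundaries, not the authors' optimized code):

```cuda
#define Q 9
__constant__ int   c_ex[Q] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
__constant__ int   c_ey[Q] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
__constant__ float c_w [Q] = { 4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                               1.f/36, 1.f/36, 1.f/36, 1.f/36 };

__global__ void collideStream(const float *fin, float *fout,
                              int nx, int ny, float omega) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    int n = y * nx + x;

    /* Moments: density and velocity of this lattice node. */
    float rho = 0.f, ux = 0.f, uy = 0.f, f[Q];
    for (int q = 0; q < Q; ++q) {
        f[q] = fin[q * nx * ny + n];   /* SoA: one plane per direction */
        rho += f[q];
        ux  += f[q] * c_ex[q];
        uy  += f[q] * c_ey[q];
    }
    ux /= rho;  uy /= rho;

    /* BGK relaxation toward equilibrium, then stream to neighbours. */
    float usq = 1.5f * (ux * ux + uy * uy);
    for (int q = 0; q < Q; ++q) {
        float cu  = 3.f * (c_ex[q] * ux + c_ey[q] * uy);
        float feq = c_w[q] * rho * (1.f + cu + 0.5f * cu * cu - usq);
        float fq  = f[q] - omega * (f[q] - feq);
        int xs = (x + c_ex[q] + nx) % nx;   /* periodic wrap */
        int ys = (y + c_ey[q] + ny) % ny;
        fout[q * nx * ny + ys * nx + xs] = fq;
    }
}
```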
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4189
Download:
Share:
 
Abstract:
GPU is used for virtual design of a high-performance electrode in an electrochemical energy storage system. Results show good scalability and speedup compared to traditional CPU/MPI implementations.
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4234
Download:
Share:
 
Abstract:
This work describes the application of morphological evolution by parallel genetic algorithms for optimizing the construction of earth barriers to divert a case-study lava flow, simulated by the Cellular Automata model SCIARA-fv2. Evaluating genotype fitness implies massive use of the numerical simulator; therefore, a multi-GPGPU approach was developed to accelerate GA execution by running large numbers of simultaneous lava flow simulations. The study produced extremely positive results: solutions provided by the implemented Decision Support System were extremely efficient, and the parallel speedups were significant, shortening execution by a factor of 67.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4127
Download:
Share:
 
Abstract:
Restrictions on aircraft noise and engine emissions are constantly being tightened, so new low-noise and low-emission aircraft engines are required. To design these eco-friendly engines it is essential to perform precise and robust numerical simulations, which require huge computational resources. To resolve this problem a new in-house solver called GHOST CFD was developed. The solver utilizes GPUs in conjunction with high-order accuracy schemes and allows one to perform quick and precise numerical simulations. The solver also supports trans- and supersonic gas flow simulations with multiple gas species and overset ("CHIMERA") computational grids.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P3160
Download:
Share:
 
Abstract:
We present a framework for efficiently coupling a GPU-based Discrete Element Method (DEM) solver for real-world particle problems with a CPU-based Computational Fluid Dynamics (CFD) solver provided by one of our partners.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4143
Download:
Share:
 
Abstract:
Hydra is an industrial CFD application used for the design of turbomachinery, now automatically accelerated by GPUs through the OP2 domain-specific "active library" for unstructured grid algorithms. From the high-level definition, GPU code is generated, automatically applying optimisations such as conversion to structure-of-arrays layout, use of the read-only cache, and tuning of block sizes. A single GPU is over 2 times faster than the original code on a server-class CPU, and we demonstrate excellent strong and weak scaling, evaluated on up to 16 GPUs.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4181
Download:
Share:
 
Abstract:
Low-fidelity aerodynamic codes, including panel, lifting-line and vortex lattice methods, are used in preliminary aerodynamic studies, typically in the early stages of aircraft design, and constitute an important and compute-intensive part of the aircraft design process. This preliminary design phase is usually very time consuming as it involves parametric studies counting tens of thousands of computations. Speeding up low-fidelity aerodynamic codes should result in reduced research costs and faster product time to market. We present a GPU-based (Tesla K20) implementation of a non-linear vortex lattice code and demonstrate its superior performance over sequential execution.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4137
Download:
Share:
 
Abstract:
Using an optimised lattice Boltzmann model (LBM) on a GPU, a real-time indoor airflow solver has been implemented. LBM is an alternative approach to solving the Navier-Stokes equations; it is more stable, allows for a better GPU implementation, and achieves the same accuracy. Our tool can solve 2D and 3D time-dependent flows in real time, including thermal and turbulent effects. An integrated visualisation tool allows for real-time interaction with the fluid.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4239
Download:
Share:
 
Abstract:
OpenACC is a new accelerator programming interface that provides a set of OpenMP-like loop directives for programming accelerators. This poster shows performance and optimization studies with kernel benchmarks and a real-world CFD application. We port these applications with OpenACC and CUDA and execute them on a K20X GPU to compare performance. We found that transforming the data structures is a very effective optimization for the real-world application in both the OpenACC and CUDA versions.  Back
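The data-structure transformation in question is typically an array-of-structures to structure-of-arrays conversion; a minimal CUDA sketch of the two layouts (our illustration, with hypothetical field names):

```cuda
struct CellAoS { float rho, u, v, w, p; };   /* one struct per cell */

struct FieldSoA {                            /* one array per field */
    float *rho, *u, *v, *w, *p;
};

__global__ void updateAoS(CellAoS *cells, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    /* Adjacent threads are 5 floats apart in memory: strided, slow. */
    cells[i].rho += 1.0f;
}

__global__ void updateSoA(FieldSoA f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    /* Threads i, i+1, ... touch adjacent addresses: coalesced, fast. */
    f.rho[i] += 1.0f;
}
```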
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4144
Download:
Share:
 
Abstract:
The Lattice Boltzmann Method (LBM) is implemented through a finite-volume approach to perform 2-D, incompressible, laminar fluid flow analyses on structured grids. Once the serial version is implemented and validated through the laminar 2-D lid-driven cavity problem and 2-D flow over a circular cylinder, the flow solver is accelerated on the GPU by porting compute- and memory-bandwidth-intensive functions to CUDA. The CUDA-accelerated implementation is compared against the serial implementation and a multi-threaded version running on dual Intel Xeon processors.  Back
 
Topics:
Computational Fluid Dynamics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4180
Download:
Share:
Computational Photography
Presentation
Media
Abstract:
The proposed method creates panoramic images using pixel-precise alignment of frames from the camera. Though computationally expensive, it can be implemented on a mobile device with a CUDA-capable GPU. As a result, the application works in real time on a Tegra K1 tablet.  Back
 
Topics:
Computational Photography
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4288
Download:
Share:
 
Abstract:
The computational strength of modern devices brings personal photography to a new level. We implement advanced photo effects using stacks of images. Object Removal is a technique for removing occluding objects. Object Cloning creates collages showing the movement of an object.  Back
 
Topics:
Computational Photography
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4287
Download:
Share:
Computational Physics
Presentation
Media
Abstract:
The radial distribution function (RDF) is a fundamental tool in the validation and analysis of particle simulation data. Computing the RDF is very time-consuming: it may take days or even months to process a moderately sized data set (millions of points) on a CPU. We present an efficient technique to compute the RDF on GPUs which takes advantage of shared memory, registers, and special instructions. Recent GPU architectures support the shuffle instruction, which can be used to share data between threads via registers. We exploit these features of the new architecture to improve the performance of the RDF algorithm. Further, we present the benefits of using different GPU optimization techniques to improve performance. The effect of algorithm behavior on the speedup is also presented in detail with the help of examples.  Back
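An illustrative fragment of the shuffle technique (not the paper's kernel), assuming the particle array is padded to a multiple of the warp size: each lane loads one position and rotates it around the warp, so 32x32 pair distances are formed with no shared memory traffic.

```cuda
__global__ void rdfTile(const float4 *pos, unsigned long long *hist,
                        int n, float binWidth, int nBins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;            /* n assumed a multiple of 32 */
    float4 pi = pos[i];
    int lane = threadIdx.x & 31;

    float4 pj = pi;                /* start with our own particle */
    for (int s = 1; s < 32; ++s) {
        /* Rotate positions one lane to the left each step, via registers. */
        pj.x = __shfl_sync(0xffffffff, pj.x, (lane + 1) & 31);
        pj.y = __shfl_sync(0xffffffff, pj.y, (lane + 1) & 31);
        pj.z = __shfl_sync(0xffffffff, pj.z, (lane + 1) & 31);
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r  = sqrtf(dx * dx + dy * dy + dz * dz);
        int bin  = (int)(r / binWidth);
        if (bin < nBins) atomicAdd(&hist[bin], 1ULL);
    }
}
```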
 
Topics:
Computational Physics, Big Data Analytics, Molecular Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4149
Streaming:
Share:
 
Abstract:
Learn how to port SNAP, a multi-dimensional, multi-group discrete ordinates neutron transport code, to GPU clusters. We will show that the GPU is a good fit for this class of applications: GPUs enable both faster throughput at small scale and better scalability at large scale. The porting strategy and a performance model on the GPU will be described.  Back
 
Topics:
Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4164
Streaming:
Share:
 
Abstract:
Learn how to scale Metropolis Monte Carlo simulations of hard particles to many GPUs. Prior codes run only in serial on the CPU, limiting researchers' abilities to study complex systems. We implement Monte Carlo for arbitrary hard shapes in HOOMD-blue, a GPU-accelerated particle simulation tool, to enable million particle simulations in a field where thousands is the norm. In this talk, we present the basic parallel algorithms, optimizations that maximize GPU performance, and communication patterns for scaling to multiple GPUs. Research applications include finding densest packings, self-assembly studies, and other uses in materials design, biological aggregation, and operations research.  Back
 
Topics:
Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4166
Streaming:
Download:
Share:
 
Abstract:
Monte Carlo neutron transport is an approach to simulating radiation transport and nuclear reaction physics by simulating the individual lifespans of many millions of unbound neutrons. OpenMC is a recently developed Monte Carlo neutron transport application intended to allow future reactor designers to leverage extremely low-level simulation of new reactors years before they are built. The presenter, Tony Scudiero, has adapted OpenMC from its original incarnation as 27k lines of single-threaded Fortran 90 to a parallel CUDA C/C++ implementation optimized for the GPU. This talk covers computational considerations of Monte Carlo neutron transport, the design and process of porting OpenMC to CUDA, and the results and lessons learned in the process. Along with OpenMC, its miniapp benchmark XSBench will be discussed.  Back
 
Topics:
Computational Physics, Numerical Algorithms & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4170
Streaming:
Download:
Share:
 
Abstract:
The monitoring of waste water for dioxins is important because these compounds are extremely toxic. One possible way to detect dioxins is with an analytical instrument called a gas chromatograph / mass spectrometer. This session summarizes our research aimed at increasing the sensitivity of a commercially available time-of-flight mass spectrometer without sacrificing resolution, mass range, or acquisition rate. In brief, we configured the mass spectrometer to pulse ions into the flight tube 20 times faster than originally designed, causing more ions to strike the detector per unit time, increasing sensitivity. However, because lighter, faster ions from one pulse overtake heavier ions from a previous pulse, the resulting mass spectra are severely intertwined, or multiplexed. Our work included developing a demultiplexing algorithm, which computes the theoretical source spectrum from the multiplexed data. Because the instrument generates 1.2GB/s, we designed and coded all algorithms for execution on a GTX Titan.  Back
 
Topics:
Computational Physics, Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4245
Streaming:
Download:
Share:
 
Abstract:
This presentation describes an R&D effort sponsored by NASA's Earth Sciences Division that involves GPU implementation of a topologically-realistic numerical simulation of earthquakes occurring on the fault systems of California. The computationally intensive modules include (i) generation of a large-scale stress influence matrix (Green's functions) from fault element data, and (ii) calculation of stress from the strain vector using a large-scale matrix-vector multiply during the rupture propagation phase. Identification of the computational bottlenecks, the CUDA code implementation, and various code optimizations that led to a 45x speedup over a multi-core CPU implementation for a 30,000-year earthquake simulation will be discussed.  Back
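Module (ii) is, at its core, a dense matrix-vector product; a minimal cuBLAS sketch of that step (names and single precision are our assumptions, not the project's code):

```cuda
#include <cublas_v2.h>

/* Stress from slip via the precomputed Green's-function influence
   matrix: d_tau = G * d_slip. */
void stress_from_slip(cublasHandle_t h,
                      const float *d_G,    /* nElem x nElem, column-major */
                      const float *d_slip, /* slip-rate vector            */
                      float       *d_tau,  /* stress output               */
                      int nElem) {
    const float one = 1.0f, zero = 0.0f;
    cublasSgemv(h, CUBLAS_OP_N, nElem, nElem,
                &one, d_G, nElem, d_slip, 1, &zero, d_tau, 1);
}
```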
 
Topics:
Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4258
Streaming:
Download:
Share:
 
Abstract:
Learn about optimization efforts in G4CU, a CUDA Monte Carlo code for radiation therapy. G4CU is based on the core algorithm and physics processes in Geant4, a toolkit for simulating particles traveling through and interacting with matter. The techniques covered will include the use of texture references for look-up tables, device configuration for different simulation components, and scheduling of work for different particle types.  Back
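A sketch of the look-up-table technique using texture objects, the CUDA 5+ counterpart of the texture references the talk mentions (helper name, table contents and the coordinate mapping are illustrative):

```cuda
#include <cuda_runtime.h>

/* Bind a 1D table to a texture object: reads go through the texture
   cache and get hardware linear interpolation for free. */
cudaTextureObject_t makeLutTexture(const float *hostTable, int n) {
    cudaArray_t arr;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &desc, n);
    cudaMemcpyToArray(arr, 0, 0, hostTable, n * sizeof(float),
                      cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc tex = {};
    tex.addressMode[0]   = cudaAddressModeClamp;
    tex.filterMode       = cudaFilterModeLinear;  /* hw interpolation */
    tex.readMode         = cudaReadModeElementType;
    tex.normalizedCoords = 0;

    cudaTextureObject_t lut;
    cudaCreateTextureObject(&lut, &res, &tex, NULL);
    return lut;
}

__global__ void lookupXs(cudaTextureObject_t lut, const float *e,
                         float *xs, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) xs[i] = tex1D<float>(lut, e[i] * scale + 0.5f);
}
```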
 
Topics:
Computational Physics, Numerical Algorithms & Libraries, Medical Imaging & Radiology
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4259
Streaming:
Download:
Share:
 
Abstract:
In this talk, you will learn how to use the game and visualization wizard's tool chest to accelerate your scientific computing applications. NVIDIA's game physics engine PhysX and the ray tracing framework OptiX offer a wealth of functionality often needed in scientific computing applications. However, due to the different target audiences, these frameworks are generally not very well known to the scientific computing communities. High-frequency electromagnetic simulations, particle simulations in complex geometries, and discrete element simulations are all examples of applications that could immediately benefit from these frameworks. Based on examples, we will talk about the basic concepts of these frameworks, introduce their strengths and their approximations, and show how to take advantage of them from within a scientific application.  Back
 
Topics:
Computational Physics, Numerical Algorithms & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4260
Streaming:
Download:
Share:
 
Abstract:
See how subdividing and preprocessing the static parts of your simulation system beyond the obvious can significantly increase performance. As an example we use our micromagnetism simulator TetraMag, whose linear equation solvers and field calculation parts rely heavily on sparse matrix-vector multiplications. The sizes of the matrices involved in large-scale simulations often outgrow the memory capacity of a single GPU. In our case, these matrices are constant over a program run, which can mean millions of iterations. This talk will show how analyzing, reordering and splitting our original matrices in a checkerboard style enables us to reduce expensive data transfers between GPUs and helps to reduce transfer overhead through fine-grained streaming.  Back
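For context, the building block underneath is a sparse matrix-vector product such as the CSR kernel sketched below (our illustration); the talk's contribution lies in how the matrix is reordered and split across GPUs around this kernel, not in the kernel itself:

```cuda
/* One row per thread; each GPU would run this on its block of rows,
   overlapping the halo exchange of x with the purely local work. */
__global__ void spmvCsr(const int *rowPtr, const int *colIdx,
                        const float *val, const float *x, float *y,
                        int nRows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    float sum = 0.0f;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        sum += val[k] * x[colIdx[k]];
    y[row] = sum;
}
```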
 
Topics:
Computational Physics, Numerical Algorithms & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4283
Streaming:
Download:
Share:
 
Abstract:
Obtaining a set of eigenvalues and eigenvectors of the matrix to be inverted is the key to accelerating matrix inversion, and the Lanczos algorithm is one of the well-known methods for this problem. But this routine is heavily dominated by data-access I/O, so it can become another bottleneck in the whole sequence. Even though the FLOPS/bandwidth ratio of the GPU is not ideal, the GPU still has an advantage in memory bandwidth compared with the CPU. We are implementing the Lanczos algorithm in CUDA and will show preliminary performance results on multi-GPU clusters.  Back
 
Topics:
Computational Physics, Numerical Algorithms & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4311
Streaming:
Share:
 
Abstract:
Learn how to easily construct Monte Carlo procedures on GPUs with PRNGCL, a new open-source OpenCL library of pseudo-random number generators (PRNGs). We will introduce our OpenCL implementation of the most popular uniform PRNGs and briefly discuss general techniques for PRN generation on GPUs. A performance comparison of existing PRNG libraries with PRNGCL will be provided, along with some examples of applying the PRNGCL library to high-energy physics lattice simulations.  Back
 
Topics:
Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4313
Streaming:
Download:
Share:
 
Abstract:
Graphics Processing Units (GPUs) are an increasingly popular platform upon which to deploy lattice quantum chromodynamics calculations. While there has been much progress to date in developing solver algorithms to improve strong scaling on such platforms, there has been less focus on deploying 'mathematically optimal' algorithms. A good example of this are hierarchical solver algorithms such as adaptive multigrid, which are known to solve the Dirac operator with optimal O(N) complexity. We describe progress to date in deploying adaptive multigrid solver algorithms to NVIDIA GPU architectures and discuss in general the suitability of heterogeneous architectures for hierarchical algorithms.  Back
 
Topics:
Computational Physics, Numerical Algorithms & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4327
Streaming:
Download:
Share:
 
Abstract:
One of the most important unanswered questions in physics is: does antimatter fall in the same way as matter? At the European Organization for Nuclear Research (CERN, Geneva) the AEgIS experiment is underway to measure the gravitational force on antimatter, and it has to reach nanometric precision in determining the free fall of antimatter. In particular, the 3D reconstruction of particle tracks produced in matter-antimatter annihilations requires a huge amount of computing resources: processing 30 TB of tomographic images per day. In this talk, the application of GPUs to the 3D tracking of particles in photo-emulsion detectors will be reported.  Back
 
Topics:
Computational Physics, Astronomy & Astrophysics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4372
Streaming:
Share:
 
Abstract:
See how advances in GPU computing enable us to simulate Quantum Chromodynamics and learn about fundamental properties of strongly interacting matter, i.e., quarks and gluons at finite temperatures. With the advances in hardware and algorithms, these simulations have reached a level that allows for a quantitative comparison with experimental data from heavy-ion colliders. Discover how the Kepler architecture helps us boost the performance of the simulations and reach a new level of precision. I will discuss selected optimizations for the Kepler K20 cards and modifications to prepare the code for the Titan supercomputer. Furthermore, I compare and discuss the pros and cons of our in-house code in comparison to available libraries such as the QUDA library.  Back
 
Topics:
Computational Physics, Numerical Algorithms & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4453
Streaming:
Download:
Share:
 
Abstract:
We present the GooFit maximum likelihood fit framework, which has been developed to run effectively on general purpose graphics processing units (GPUs) to enable next-generation experimental high energy physics (HEP) research. Most analyses of data from HEP experiments use maximum likelihood fits. Some of today's analyses use fits which require more than 24 hours on traditional multi-core systems. The next generation of experiments will require computing power two orders of magnitude greater for analyses which are sensitive to New Physics. Our GooFit framework, which has been demonstrated to run on NVIDIA GPU devices ranging from high-end Teslas to laptop GeForce GTs, uses CUDA and the Thrust library to massively parallelize the per-event probability calculation. For realistic physics fits we achieve speedups, relative to executing the same algorithm on a single CPU, of several hundred.  Back
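The per-event pattern can be sketched with Thrust as follows (our illustration with a stand-in Gaussian PDF, not GooFit's API): the negative log-likelihood is one big parallel transform-reduce over the event array.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cmath>

/* Per-event -log(PDF); a real fit would swap in its physics model. */
struct NegLogGauss {
    double mu, sigma;
    __host__ __device__ double operator()(double x) const {
        double z = (x - mu) / sigma;
        return 0.5 * z * z + log(sigma)
             + 0.918938533204672742;   /* 0.5 * log(2*pi) */
    }
};

double nll(const thrust::device_vector<double> &events,
           double mu, double sigma) {
    NegLogGauss f;                 /* evaluated once per event, in parallel */
    f.mu = mu;  f.sigma = sigma;
    return thrust::transform_reduce(events.begin(), events.end(),
                                    f, 0.0, thrust::plus<double>());
}
```

The minimizer on the host then calls nll() repeatedly with new parameter values, which is where the several-hundred-fold per-iteration speedup pays off.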
 
Topics:
Computational Physics, Numerical Algorithms & Libraries, Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4488
Streaming:
Download:
Share:
 
Abstract:
We discuss the CUDA and OpenCL implementation of the hierarchical equations of motion (GPU-HEOM) method for tracking quantum-mechanical effects in photosynthesis. The hierarchy of coupled equations yields the time evolution of the density matrix of a photosynthetic network and is efficiently mapped to the GPU architecture by assigning one thread to each hierarchy member, while storing time-independent information in constant memory. This makes the GPU architecture the optimal choice compared to conventional pthread-based parallelization schemes, which suffer from higher thread latency, and allows one to connect theoretical simulations directly with experimental images of the energy flow in photosynthesis. It addresses the outstanding questions in the field: why is transport in photosynthesis so efficient, and how can artificial devices be designed? The ready-to-run GPU-HEOM tool is installed on the publicly accessible nanoHUB platform, where users share data and sessions while performing computations on the connected NVIDIA M2090 GPU cluster.
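A minimal sketch of the thread mapping described above, under illustrative assumptions (toy sizes, and a placeholder update in place of the real coupled right-hand side; this is not the GPU-HEOM code):

```cuda
// One thread advances one hierarchy member; the time-independent data
// sits in constant memory so all threads read it via broadcast.
#include <cuda_runtime.h>

#define NSITES 7        // sites in the network (illustrative)
#define NMEMBERS 4096   // hierarchy members (illustrative)

__constant__ double d_H[NSITES * NSITES];  // time-independent information

__global__ void heom_step(double* rho, double dt) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per member
    if (m >= NMEMBERS) return;
    double* r = rho + (size_t)m * NSITES * NSITES;  // this member's matrix
    for (int i = 0; i < NSITES * NSITES; ++i)
        r[i] += dt * d_H[i] * r[i];  // toy placeholder for the HEOM coupling
}
```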
 
Topics:
Computational Physics, Quantum Chemistry, Desktop & Application Virtualization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4490
Streaming:
Download:
Share:
 
Abstract:
The porting process of a large-scale particle-in-cell solver (GTS) to the GPU using CUDA is described. We present weak scaling results run at scale on Titan which show a speedup of 3-4x for the entire solver. Starting from a performance analysis of computational kernels, we systematically proceed to eliminate the most significant bottlenecks in the code - in this case, the PUSH step, which constitutes the 'gather' portion of the gather-scatter algorithm that characterizes this PIC code. Points that we think might be instructive to developers include: (1) using the PGI CUDA Fortran infrastructure to interface between CUDA C and Fortran; (2) memory optimizations - creation of a device memory pool, and pinned memory; (3) a demonstration of how communication causes performance degradation at scale, with implications for shifter performance in general PIC solvers, and why we need algorithms that handle communication in particle shifters more effectively; (4) use of textures and LDG for irregular memory accesses.
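A small sketch of point (4): routing the irregular, read-only field reads of the gather phase through __ldg(), which uses the read-only (texture-path) data cache on Kepler and later. The array names and field layout here are illustrative assumptions, not the GTS data structures.

```cuda
// Gather side of a PIC push: each particle reads the field of the cell
// it happens to sit in (an irregular index) and updates its velocity.
__global__ void push(const double* __restrict__ efield,
                     const int* __restrict__ cell_of_particle,
                     double* velocity, double qm_dt, int nparticles) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nparticles) return;
    int c = cell_of_particle[p];   // irregular, particle-dependent index
    double e = __ldg(&efield[c]);  // cached read-only load
    velocity[p] += qm_dt * e;      // gather and update
}
```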
 
Topics:
Computational Physics, Computational Fluid Dynamics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4495
Streaming:
Share:
 
Abstract:
PANDA is a next-generation particle physics experiment involving a novel data acquisition mechanism. Commonly, particle physics experiments read out the full detector response of particle collisions only when a fast hardware-level trigger fires. In contrast, PANDA uses a sophisticated event filtering scheme which involves reconstructing the whole incoming data stream in real time (online) to distinguish signal from background events. At a rate of about 20 million events per second, a massive amount of computing power is needed to sufficiently reduce the incoming data rate of 100 GB/s to 2 PB/year for permanent storage. We explore the feasibility of using GPUs for this task. This talk outlines the challenges PANDA faces with data acquisition and presents the status of the GPU investigations. Different reconstruction (tracking) algorithms running on NVIDIA GPUs are shown and their features and performance highlighted.
 
Topics:
Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4499
Streaming:
Download:
Share:
 
Abstract:
We present a highly innovative GPU implementation of a particle-in-cell code for plasma dynamics simulation on 3D unstructured grids. Starting from a proven codebase, we integrate solutions and ideas drawn from a thorough study of the state of the art in parallel plasma simulation and other fields, adding original contributions in areas such as workload management, particle ordering, and domain decomposition. The result is a novel, flexible simulation pipeline, capable of performing more than an order of magnitude faster than the CPU implementation it originates from, while still presenting exciting opportunities for future development. Moreover, all the concepts presented apply not only to particle-in-cell simulation but, in general, to any simulation relying on the interaction between Lagrangian particles and a spatial grid.
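As one concrete instance of the particle-ordering idea mentioned above (a common approach, not necessarily the authors' exact pipeline), particles can be sorted by the index of the cell that contains them, so that cell-mates become contiguous in memory and the gather/scatter phases gain locality:

```cuda
// Sort particle IDs by their containing-cell index on the GPU.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void order_particles(thrust::device_vector<int>& cell_index,
                     thrust::device_vector<int>& particle_id) {
    // After this call, particle_id is permuted so that particles sharing
    // a cell are adjacent in memory.
    thrust::sort_by_key(cell_index.begin(), cell_index.end(),
                        particle_id.begin());
}
```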
 
Topics:
Computational Physics, Numerical Algorithms & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4500
Streaming:
Download:
Share:
 
Abstract:
Programmed self-assembly of nucleic acids offers the unique opportunity to engineer geometrically complex, megadalton-scale macromolecular architectures with atomic-level accuracy. The sequence specificity of DNA renders these nanoassemblies spatially addressable structural scaffolds that can host secondary molecules, including light-harvesting dyes and chemically functional groups. These properties may be exploited to rationally design biomimetic light-harvesting antennas that replicate aspects of bacterial photosynthesis. Here, I present our computational design framework CanDo (http://cando-dna-origami.org), which quantitatively predicts the 3D solution structure of megadalton-scale DNA-based nanoassemblies from the underlying DNA sequence, as well as their emergent light-harvesting properties when decorated with dyes. This computational framework enables the in silico design and optimization of functional DNA-based light-harvesting devices prior to time-consuming and costly synthesis and experimental validation.
 
Topics:
Computational Physics, Seismic & Geosciences, Genomics & Bioinformatics, Computer Aided Engineering
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4521
Streaming:
Share:
 
Abstract:
The session will describe the CUDA implementation of a variational Monte Carlo method for the study of strongly correlated quantum systems, including high-temperature superconductors, magnetic semiconductors, and metal-oxide heterostructures. The presentation will cover different tuning and optimization strategies implemented in the GPU code. To eliminate bandwidth-limited performance we have used caching and a novel restructuring of the computation and data access patterns. We also perform two specific optimizations for Kepler. The code uses dynamic compilation to improve performance, especially in parts with limited parallelism. Using Kepler, our code achieves speedups of 22x and 176x compared to 8-core and single-core CPU implementations, respectively. The GPU code allows us to obtain accurate results for large lattices, which are crucial for developing predictive capabilities for materials properties. Our techniques for matrix inverse and determinant updates can be reused in other quantum Monte Carlo methods.
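For context, the determinant and inverse updates referenced above are, in many QMC codes, instances of the standard Sherman-Morrison rank-1 identity (the session's specific restructuring is not reproduced here). When row k of the Slater matrix D is replaced to give D', the acceptance ratio and the new inverse follow directly:

```latex
R = \frac{\det D'}{\det D} = \sum_j D'_{kj}\,(D^{-1})_{jk},
\qquad
(D'^{-1})_{ij} =
\begin{cases}
(D^{-1})_{ik}/R, & j = k,\\[4pt]
(D^{-1})_{ij} - \dfrac{(D^{-1})_{ik}}{R}\displaystyle\sum_l D'_{kl}\,(D^{-1})_{lj}, & j \neq k,
\end{cases}
```

so an accepted move costs O(N^2) rather than the O(N^3) of a full re-inversion.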
 
Topics:
Computational Physics, Quantum Chemistry
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4554
Streaming:
Share:
 
Abstract:
Lattice Quantum ChromoDynamics (QCD) is a numerical treatment of the theory of the strong nuclear force. Calculations in this field can answer fundamental questions about the nature of matter, provide insight into the evolution of the early universe, and play a crucial role in the search for new theories of fundamental physics. However, massive computational resources are needed to achieve these goals. In this talk, we describe how NVIDIA GPUs are powering Lattice QCD calculations involving the MILC code suite and the QUDA library. This code base has allowed lattice applications to access unparalleled compute power on leadership-class facilities such as Blue Waters and Titan.
 
Topics:
Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4641
Streaming:
Download:
Share:
 
Abstract:
Learn how to leverage the power of GPUs to accelerate the solution of large sparse linear systems with multiple right-hand sides by means of the incremental eigCG algorithm. For a given Hermitian system with multiple right-hand sides, this algorithm allows one (1) to incrementally compute a number of small-magnitude eigenvalues and corresponding eigenvectors while solving the first few systems with standard Conjugate Gradient (CG), and then (2) to reuse the computed eigenvectors to deflate the CG solver for the remaining systems. In this session we will discuss implementation aspects of the technique and analyse its efficiency on the example of lattice QCD fermion matrix inversions.
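In outline, the deflation in step (2) is a Galerkin projection (the notation here is ours, not necessarily the speakers'): with approximate eigenpairs (lambda_i, v_i) of the Hermitian matrix A accumulated during the first solves, each new right-hand side b receives the initial guess

```latex
x_0 = \sum_{i=1}^{m} \frac{v_i^{\dagger} b}{\lambda_i}\, v_i ,
\qquad
r_0 = b - A x_0 ,
```

which removes the slowly converging components along the smallest eigenvectors, so the subsequent CG iteration behaves as if the spectrum started near \(\lambda_{m+1}\) and converges in far fewer iterations.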
 
Topics:
Computational Physics, Numerical Algorithms & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4693
Streaming:
Download:
Share:
 
Abstract:
Multi-scale molecular dynamics of systems of nanomagnets is investigated by numerical simulation using parallel algorithms. The Fortran code Magnetodynamics-F supports the following types of studies: the possibility of regulating the switching time of the magnetic moment of a nanostructure; the role of nanocrystal geometry in the super-radiance of 1-, 2-, and 3-dimensional objects; the magnetodynamics of nanodots inductively coupled to a passive resonator; and the dependence of the solution on the initial orientation of the magnetic moment, in order to find the configurations for which super-radiance and radiative damping are maximal. The parallel programs were created using the OpenMP and OpenACC application programming interfaces. Estimates of the speedup and efficiency of the implemented algorithms in comparison with sequential algorithms have been obtained. It is shown that the use of NVIDIA Tesla GPUs accelerates simulations for the study of magnetic dynamics in systems comprising thousands of magnetic nanoparticles.
 
Topics:
Computational Physics, Numerical Algorithms & Libraries, Molecular Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4493
Download:
Share:
 
Abstract:
This work is devoted to the processing and analysis of focal-series images of amorphous alloys obtained by transmission electron microscopy (TEM). We discuss methods of processing microscopic images based on orthogonal transformations, and a method for separating morphological inhomogeneities on the sample surface by means of NVIDIA GPUs. Reconstructing the profile map of the sample surface from electron microscope images requires calculating a huge number of orthogonal transformations of small sub-images. We developed a device function for these orthogonal transformations that reduces the computation time by a factor of a few dozen.
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4261
Download:
Share:
 
Abstract:
Hybrid CPU/GPU systems for HPC require the adoption of a scalable, efficient network architecture: we designed APEnet+ as a point-to-point, low-latency, 3D-torus network integrated on a PCI-Express Gen2 board based on a high-end FPGA. APEnet+ implements an RDMA protocol leveraging the GPUDirect capabilities of Fermi- and Kepler-class NVIDIA GPUs, achieving minimal-latency GPU-to-GPU transfers. The GPUDirect implementation is the basis for NaNet: a real-time data acquisition system enabling the development of GPU-based HEP low-level trigger systems. During 2014 we will release a new APEnet/NaNet generation featuring a PCIe Gen3 host interface and faster off-board communication channels: 10 GbE for NaNet and 56 Gbps for the 3D-torus link.
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4213
Download:
Share:
 
Abstract:
Scaling molecular dynamics simulations from one to many GPUs presents unique challenges. Due to the high parallel efficiency of a single GPU, communication processes become a bottleneck when multiple GPUs are combined in parallel and limit scaling. We show how the fastest general-purpose molecular dynamics code currently available for single GPUs, HOOMD-blue, has been extended using spatial domain decomposition to run efficiently on tens or hundreds of GPUs.
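A minimal sketch of the idea behind spatial domain decomposition, under the simplifying assumption of a cubic box split into a uniform grid of sub-domains (illustrative, not HOOMD-blue's implementation): each particle is owned by the rank whose sub-box contains it.

```cuda
// Map a particle position in [0, L)^3 to the rank owning that sub-box
// of an nx * ny * nz decomposition of the simulation domain.
__host__ __device__ int owner_rank(double x, double y, double z,
                                   double L, int nx, int ny, int nz) {
    int ix = (int)(x / L * nx); if (ix == nx) --ix;  // clamp the upper edge
    int iy = (int)(y / L * ny); if (iy == ny) --iy;
    int iz = (int)(z / L * nz); if (iz == nz) --iz;
    return ix + nx * (iy + ny * iz);  // linear rank index of the sub-box
}
```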
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4275
Download:
Share:
 
Abstract:
The particle-in-cell code PATRIC (Particle Tracking Code) is used to simulate particles in a circular accelerator at the GSI Helmholtz Centre for Heavy Ion Research in Darmstadt, Germany. Parallelization of PIC codes is an open research field, and solutions depend very much on the specific problem. As the topic of a diploma thesis, the possibilities and limits of a GPU integration into the existing simulation codes are evaluated. The focus lies on general GPU aspects and on the problems arising from collective particle effects. The aim is clear and maintainable code, as well as reuse of existing code where possible.
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4196
Download:
Share:
 
Abstract:
This poster shows three solutions to three bottlenecks common to every Ising spin simulation on cubic lattices: nearest-neighbour access, random number generation, and spin storage. The first is addressed via a grid-wide access pattern that automatically separates the colours of the checkerboard decomposition. For the second, we present an access pattern useful for lagged-Fibonacci-like pseudo-random number generators that makes it possible to double the speed of the CURAND library's Mersenne Twister implementation. Finally, we present a new asynchronous multi-spin coding implementation of the Metropolis spin flip, encoding 32 spins in one unsigned word.
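A minimal sketch of the multi-spin-coding flip itself (illustrative, not the poster's kernel): bit b of each word holds one lattice site's spin in replica b, so a single XOR updates 32 independent systems at once. Building the Metropolis acceptance mask from the local fields and random numbers is omitted here.

```cuda
// Flip, in one instruction per word, exactly those replicas whose
// Metropolis move was accepted; accept_mask has a 1 bit per accepted flip.
__global__ void multispin_flip(unsigned* spins, const unsigned* accept_mask,
                               int nwords) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nwords) return;
    spins[i] ^= accept_mask[i];  // XOR flips only the accepted bits
}
```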
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4199
Download:
Share:
 
Abstract:
Future high-powered electron accelerators require increasing bunch charge densities and beam currents. At high bunch charge density, self-collective effects such as space charge and coherent synchrotron radiation (CSR) wakefields have a significant effect on the beam dynamics. Simulating the full beam dynamics with these effects is computationally intense. The PARADOKS (Parallel Accelerator Design Optimization Kernels) code was developed to utilize advanced algorithms and the GPU to accelerate electron beam simulations where self-collective effects are large. The new code was built specifically for performing the intense computation required to develop and optimize new accelerator designs.
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4146
Download:
Share:
 
Abstract:
DEM simulations are useful in a number of engineering disciplines, such as mining and agriculture. The computational cost of discrete methods limits the number and detail of particles that can be simulated in a reasonable time frame without the use of a dedicated CPU cluster. Here, we present a GPU framework for a DEM code that takes particle shape into account by using polyhedra, while also allowing millions of spherical particles to be simulated.
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4126
Download:
Share:
 
Abstract:
We present a GPU-based trigger algorithm which can be used in LHC experiments to detect new event signatures previously too time-consuming to reconstruct, making new physics searches possible.
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4238
Download:
Share:
 
Abstract:
GPUs are gradually becoming mainstream in supercomputing, as their ability to significantly accelerate a large spectrum of scientific applications has been clearly identified and proven. Moreover, with the introduction of high-level programming models such as OpenACC and OpenMP 4.0, these devices are becoming more accessible and practical for a larger scientific community. In this work, an explicit time-domain volume integral equation solver is ported to multiple GPUs by applying OpenACC directives to the MPI version of the code. The MPI and OpenACC implementation achieved a significant speedup of up to 11.2x over the MPI and OpenMP CPU code.
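The porting approach amounts to annotating existing compute loops with directives while the surrounding MPI structure stays unchanged. A sketch in C++ with OpenACC directives, using an illustrative dense matrix-vector product rather than the authors' integral-equation kernel:

```cpp
// Offload a loop nest with OpenACC; data clauses move the arrays to the
// device for the duration of the region. Compile with an OpenACC compiler.
void matvec(const double* A, const double* x, double* y, int n) {
    #pragma acc parallel loop copyin(A[0:n*n], x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        #pragma acc loop reduction(+:sum)
        for (int j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}
```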
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4162
Download:
Share:
 
Abstract:
GPU computing is a modern, effective, and fast way to perform ab-initio mathematical modeling of nanostructures and, in particular, of their properties. This poster presents some results of ab-initio research into the hardness properties of several stable binary compounds: combinations of Al, Si, Mg, Cu, and Fe in "fcc", "nacl", "cu2mg", "mgcu2", "zns", "caf2", "cscl", "alfe3", and other structures (B1, B2, B3, C1, C15, A15, D03, L12, ...). We find equilibrium volumes, elastic moduli, and the total energy per atom for the most stable compounds, and compare these ab-initio simulations with experimental data. We also estimate the performance gain of GPU calculations with CUDA technology. All calculations were performed with the DFT-based GPAW and Abinit software packages and their GPU versions.
 
Topics:
Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4121
Download:
Share:
Computational Structural Mechanics
Presentation
Media
Abstract:
Nowadays, ceramics are often used in the automotive and aeronautics industries, but the simulation of dynamic cracks and fractures in these materials is difficult because of bifurcations at the crack tips. In this session we present the benefits of GPUs for simulating dynamic cracks and fractures in solids, e.g., ceramic materials, using the peridynamics technique. (1) Most discrete equations of particle-based methods depend on finding neighborhoods; therefore, we present our novel library for finding the k-nearest neighbors efficiently on the GPU. (2) Exploiting the high parallelism of the GPU allows increasing the number of particles, which improves the reliability of the simulation. To validate our GPU implementation, we simulate a common high-velocity impact scenario and compare our results with experimental data.
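A brute-force sketch of the neighbor-search building block (illustrative; the presented library finds the k nearest neighbors and would keep the k best candidates per query, e.g., in registers or shared memory):

```cuda
// Each thread scans all points and records its single closest neighbor;
// extending this to the k best per query is the essence of GPU kNN.
__global__ void nearest_neighbor(const float3* pts, int n,
                                 int* nearest, float* best_d2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float bd = 3.4e38f;  // larger than any squared distance
    int bj = -1;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pts[j].x - pts[i].x;
        float dy = pts[j].y - pts[i].y;
        float dz = pts[j].z - pts[i].z;
        float d2 = dx * dx + dy * dy + dz * dz;
        if (d2 < bd) { bd = d2; bj = j; }
    }
    nearest[i] = bj;
    best_d2[i] = bd;
}
```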
 
Topics:
Computational Structural Mechanics, Digital Manufacturing, Visual Effects & Simulation
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4255
Streaming:
Download:
Share:
 
Abstract:
This poster presents how to prototype medical simulations on GPUs without sacrificing performance: (1) the development of a patient-specific respiratory simulation; (2) how the actual implementation harnesses the power of GPUs with the Julia language. Examples will be given using the proposed GPU-harnessed nonlinear finite element solver. A new generation of composite simulation on highly parallel GPU and multicore heterogeneous architectures provides a new power tool for challenging engineering problems.
 
Topics:
Computational Structural Mechanics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2014
Session ID:
P4247
Download:
Share:
Computer Aided Engineering
Presentation
Media
Abstract:
The goal of this session is to show how to create geometric shapes on GPUs, by taking advantage of the GPU's tessellation feature, using the state-of-the-art spline technique called PSP splines (PSPS). PSPS are simpler than B-splines in their mathematical form, but much more powerful than NURBS in geometric design. Compared with Bezier, B-spline, and NURBS curves, designing a geometric shape using PSPS is much more efficient, flexible, and intuitive. In this session we will describe what PSPS are and demonstrate how to implement PSPS directly in GLSL or HLSL in the tessellation stages to create new geometries.
 
Topics:
Computer Aided Engineering, Product & Building Design, Gaming and AI
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4240
Streaming:
Share:
 
Abstract:
Come learn about the technology and partnership between HP and NVIDIA that empowers users around the world to design and create without limitations.
 
Topics:
Computer Aided Engineering, Digital Manufacturing, Media and Entertainment, Rendering & Ray Tracing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4883
Streaming:
Share: