GTC On-Demand

AI Application Deployment and Inference
Presentation
Media
Monte Carlo Methods and Neural Networks
Noah Gamboa (Stanford University)
The average human brain has about 100 billion nerve cells. We therefore investigate whether there are algorithms for artificial neural networks that are linear in the number of neurons while the number of connections incident to a neuron is bounded by a constant. We offer two approaches to answer this question: first, we derive an algorithm that quantizes a trained artificial neural network such that the resulting complexity is linear; second, we demonstrate that networks whose connections are determined by uniform sampling can be trained to a precision similar to that of fully connected layers. Due to their upfront sparsity, these networks can be trained much faster. Both approaches are made plausible by relating artificial neural units to Monte Carlo integration. We'll demonstrate the results on classic test datasets.
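As a sketch of the Monte Carlo connection (our notation, not the talk's): a unit's pre-activation over N inputs can be read as N times an expectation over a uniformly drawn connection index, which a sparse network estimates from n << N sampled connections,

    \sum_{i=1}^{N} w_i x_i \;=\; N\,\mathbb{E}[\,w_I x_I\,] \;\approx\; \frac{N}{n}\sum_{k=1}^{n} w_{i_k} x_{i_k}, \qquad I,\, i_k \sim \mathcal{U}\{1,\dots,N\}

Keeping n fixed per neuron makes the connection count, and hence the evaluation cost, linear in the number of neurons.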
 
Keywords:
AI Application Deployment and Inference, AI and DL Research, GTC Silicon Valley 2018 - ID S8780
Streaming:
Download:
 
Accelerating AI Adoption and Impact (Presented by Dell EMC)
Jay Boisseau (Dell)
Attendees will learn why AI techniques are so powerful, why developing and deploying optimal AI solutions is complex, why using AI techniques effectively is still difficult, and what Dell Technologies is doing to remove these difficulties and bring easier, effective AI to everyone. Dell Technologies comprises seven companies with a comprehensive portfolio of technology products, services, and solutions for global industry, government, and education markets, and aims to be the leader in designing and delivering the best AI solutions for every customer, of every type and scale. From Dell Precision workstations for developers and Gateways for edge sensors, to Dell EMC GPU-optimized PowerEdge servers, Ready Solutions for Deep Learning, and hybrid cloud offerings, Dell is leveraging its leadership in technology and in enterprise relationships to design a world-class portfolio of AI solutions for diverse customer workloads, requirements, and objectives. This presentation will cover AI and deep learning in an enterprise context, including customer challenges and needs, and then discuss Dell AI solutions and strategy to empower people to use AI rapidly and effectively.
 
Keywords:
AI Application Deployment and Inference, GTC Silicon Valley 2018 - ID S81046
Streaming:
 
Accelerate TensorFlow Inference with New TensorRT Integration
Julie Bernauer (NVIDIA)
TensorFlow is an open source software library for numerical computation using data flow graphs. NVIDIA TensorRT is an inference optimizer and runtime for production deployment. TensorRT provides optimizations for deep neural networks and uses reduced precision to increase throughput and reduce latency while maintaining accuracy. NVIDIA has announced tighter TensorRT integration in TensorFlow, with new TensorFlow APIs, sub-graph optimizations, and INT8 calibration that automatically leverage Tensor Cores on Volta GPUs. TensorRT delivers 2.5x faster inference throughput compared to inference without TensorRT. In this session, NVIDIA developers will use an example-based workflow to show how to use this new capability.
 
Keywords:
AI Application Deployment and Inference, Deep Learning and AI Frameworks, GTC Silicon Valley 2018 - ID S81009
Streaming:
Download:
 
Low-Latency GPU Accelerated Inferencing with TensorRT
Prethvi Kashinkunti (NVIDIA)
Come learn how you can optimize the deployment of your trained neural networks using the GPU-accelerated inferencing library TensorRT. TensorRT is a high-performance tool for low-latency, high-throughput deep neural network (DNN) inference that runs on NVIDIA GPUs. The latest release of TensorRT introduces a novel, framework-agnostic network definition format called the Universal Framework Format (UFF), allowing TensorRT to support and optimize DNN models trained in multiple deep learning frameworks like Caffe and TensorFlow. It also provides the capability to run inference at reduced precision, giving developers the ability to take advantage of new GPU hardware features like the Volta Tensor Core architecture. This session will be a combination of lecture and live demos.
 
Keywords:
AI Application Deployment and Inference, Tools and Libraries, Performance Optimization, Data Center and Cloud Infrastructure, GTC Silicon Valley 2018 - ID S8496
Streaming:
AI and DL Research
Presentation
Media
Training Neural Networks with Mixed Precision: Real Examples
Benjamin Barsdell (NVIDIA), Michael O'Connor (NVIDIA), Christian M. Sarofeen (NVIDIA)
We will cover the techniques for training DNNs with Tensor Cores described in "S8923 - Training Neural Networks with Mixed Precision: Theory and Practice". These methods were introduced for AI processing with the Volta GPU architecture. Tensor Cores provide up to 120 TFlops throughput, mixing operations on IEEE half- and single-precision floats. Techniques used will include loss scaling, a master weights copy, and choosing the proper precision for a given operation. For both TensorFlow and PyTorch, we will describe an fp32 network definition and then demonstrate the same network using mixed-precision techniques.
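As a refresher on the two techniques named above (our summary, not the speakers' slides): the loss is scaled by a factor S before backpropagation so that small fp16 gradients do not flush to zero, and the update is applied, after unscaling, to an fp32 master copy of the weights,

    L_s = S \cdot L, \qquad g = \nabla_{w} L_s \ \ (\text{fp16 backprop}), \qquad w_{32} \leftarrow w_{32} - \eta\, g / S

The fp16 weights used in the forward pass are then re-cast from w_{32} at each iteration.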
 
Keywords:
AI and DL Research, Algorithms and Numerical Techniques, GTC Silicon Valley 2018 - ID S81012
Streaming:
Download:
 
Model Architectures and Training Techniques for High-Precision Landmark Localization
Sina Honari (University of Montreal - MILA), Pavlo Molchanov (NVIDIA)

We'll discuss training techniques and deep learning architectures for high-precision landmark localization. In the first part of the session, we'll talk about ReCombinator Networks, which aim at maintaining pixel-level image information for high-accuracy landmark localization. This model combines coarse-to-fine features to first observe global (coarse) image information and then recombine local (fine) information. By using this model, we report state-of-the-art results on three facial landmark datasets. This model can be used for other tasks that require pixel-level accuracy (for example, image segmentation and image-to-image translation). In the second part, we'll talk about improving landmark localization in a semi-supervised setting, where less labeled data is provided. Specifically, we consider a scenario where few labeled landmarks are given during training, but lots of weaker labels (for example, face emotions, hand gestures) that are easier to obtain are provided. We'll describe training techniques and model architectures that can leverage weaker labels to improve landmark localization.

 
Keywords:
AI and DL Research, Computer Vision, GTC Silicon Valley 2018 - ID S8406
Streaming:
Download:
 
Synthetic Data Generation for an All-in-One Driver Monitoring System
Sagar Bhokre (NVIDIA)
Driver monitoring systems are used to detect driver attributes like gaze, head pose, eye openness, and other features pertaining to attention and assistance. We'll present a synthetic method of generating data for training DNNs that caters to the above-mentioned features of the subject. We use Blender, powered by NVIDIA GPUs, for generating synthetic images, which can be scaled to match training needs. Synthetic data generation allows precise control over data points that are difficult to control in a real environment, like pupil dilation. This approach avoids noisy measurements and results in high accuracy without the need for a high-precision 3D sensor.
 
Keywords:
AI and DL Research, Autonomous Vehicles, Advanced AI Learning Techniques (incl. GANs and NTMs), GTC Silicon Valley 2018 - ID S8324
Streaming:
Download:
 
Accelerating Scientific Simulation with Generative Adversarial Networks
Luke de Oliveira (Vai Technologies), Benjamin Nachman (Lawrence Berkeley National Laboratory), Michela Paganini (Yale University)
Many scientific and engineering fields increasingly rely on complex and time-consuming computational simulation as part of the modern scientific workflow. In many applications, such as high energy particle physics, cosmology, geophysics, and others, simulations are the computational bottleneck for producing and testing results. We introduce the use of generative adversarial networks (GANs) as a potential tool for speeding up expensive theoretical models and simulations in scientific and engineering applications, ushering in a new era of deep learning-powered scientific discovery. We will show that using a GAN-based high energy physics fast simulator on GPUs can provide speedups of up to 100,000x when compared to traditional simulation software, while retaining high levels of precision. Finally, we will discuss modeling and architectural considerations in this domain with the hope of directly empowering scientists and engineers in other fields to experiment with GANs in order to speed up simulation across scientific domains.
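For context, fast simulators of this kind are trained with the standard adversarial objective of Goodfellow et al., in which a generator G learns to map noise z to samples that a discriminator D cannot distinguish from real (here, conventionally simulated) events:

    \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] \;+\; \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

Once trained, sampling from G is a single forward pass, which is the source of the quoted speedups over step-by-step physics simulation.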
 
Keywords:
AI and DL Research, Advanced AI Learning Techniques (incl. GANs and NTMs), HPC and AI, GTC Silicon Valley 2018 - ID S81001
Streaming:
Download:
 
Training Neural Networks with Mixed Precision: Theory and Practice
Paulius Micikevicius (NVIDIA)
We'll cover the theory and practice for training DNNs with Tensor Cores, introduced for AI processing with the Volta GPU architecture. Tensor Cores provide up to 120 TFlops throughput, mixing operations on IEEE half- and single-precision floats. In the theory portion of the talk, we'll review the half-precision format, the values that arise in DNN computations, and techniques that maximize utilization of the fp16 format by these values. Techniques include loss scaling, master weights, and choosing the proper precision for a given operation. In the practice portion of this talk, we'll survey various models that have been trained in mixed precision, matching the accuracy of fp32 training sessions while using the same hyperparameters. Models include various architectures (feed-forward, recurrent, generative) and cover diverse tasks (image, speech, and language processing). We'll also provide network design and training guidelines to maximize speed when using Tensor Cores.
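For reference, the IEEE half-precision format reviewed here has 1 sign, 5 exponent, and 10 mantissa bits, giving

    x_{\max} = 65504, \qquad x_{\min}^{\text{normal}} = 2^{-14} \approx 6.1\times10^{-5}, \qquad x_{\min}^{\text{subnormal}} = 2^{-24} \approx 6.0\times10^{-8}

Gradient values below roughly 2^{-24} flush to zero in fp16, which is why loss scaling shifts them into the representable range before backpropagation.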
 
Keywords:
AI and DL Research, Algorithms and Numerical Techniques, GTC Silicon Valley 2018 - ID S8923
Streaming:
Download:
 
Accelerating Cancer Research with Deep Learning
Fernanda Foertter (NVIDIA)
The Department of Energy (DOE) entered into a partnership with the National Cancer Institute (NCI) of the National Institutes of Health (NIH) to accelerate cancer research. This "Cancer Moonshot" aims to tackle three main objectives: better understand the mechanisms of cancer, use large amounts of diverse medical data for predictive models, and enable precision medicine by providing guidance for treatment to individual patients. Leveraging DOE's expertise in high performance computing (HPC) and new methods for deep learning in artificial intelligence, this HPC+AI approach aims to create a single scalable deep neural network code called CANDLE (CANcer Distributed Learning Environment) that will be used to address all three challenges. This talk gives an overview of the project and highlights how GPU-accelerated systems in the DOE ecosystem, Summit and Sierra, have contributed to the project.
 
Keywords:
AI and DL Research, HPC and AI, Medical Imaging and Radiology, GTC Silicon Valley 2018 - ID S81033
Streaming:
 
GPU Performance Testing and PowerAI on IBM Cloud (Presented by IBM Cloud)
Alex Hudak (IBM), Brian Wan (IBM)
In this session, you'll learn about the latest IBM PowerAI solution and IBM Cloud GPU offerings, and see a price-performance comparison, with supporting data, on the number of CPUs required to optimize GPU performance. We've also aggregated extensive test data to determine general best practices, such as half-precision deep learning advantages on the Tesla V100 and the implications of neural-network model variable distribution and gradient aggregation techniques on your performance results. Join us to see why NVIDIA GPUs on IBM Cloud offer superior results.
 
Keywords:
AI and DL Research, Accelerated Analytics, GTC Silicon Valley 2018 - ID S81013
Streaming:
Download:
AI in Healthcare
Presentation
Media
Extreme Computing, Clinical Medicine and GPUs
Joel Saltz (Stony Brook Medicine and College of Engineering and Applied Sciences Stony Brook)
Images and sensors provide crucial information needed to make treatment decisions, and machine learning methods are increasingly employed to supplement subjective human image interpretation and to integrate heterogeneous collections of information. We'll describe the rapidly changing landscape of medical images and sensors from computing, data, and medical points of view. We'll then do a deep dive into pathology image analytics and the contributions made by deep learning methods to precision medicine and clinical diagnostics. Finally, we'll address the pivotal role of GPUs in supporting all of these computations and describe the roles of GPU-related tools, languages, and libraries in medical image and sensor analytics.
 
Keywords:
AI in Healthcare, Medical Imaging and Radiology, GTC Washington D.C. 2017 - ID DC7248
Download:
 
Targeted Sequencing for All on S5 and S5 XL: GPUs Make It Happen
Mohit Gupta (Thermo Fisher Scientific)
We'll disscuss how GPUs are playing a central role in making advances in Ion Torrent's targeted sequencing workflow and talk about the S5 DNA sequencer from Ion Torrent that is enabling democratization of sequencing market and accel ...Read More

We'll discuss how GPUs are playing a central role in making advances in Ion Torrent's targeted sequencing workflow, and talk about the Ion Torrent S5 DNA sequencer, which is democratizing the sequencing market and accelerating research in precision medicine at a breathtaking pace with the help of GPUs. We'll highlight our work in liquid biopsy and non-invasive prenatal testing, and how the breadth of our semiconductor chip offerings gives us a range of sequencing scales from small panels to exomes. We'll discuss our analysis pipeline and the latest in algorithm development and acceleration on GPUs, as well as our experiences ranging from the Fermi to the Pascal GPU architecture.

 
Keywords:
AI in Healthcare, Bioinformatics & Genomics, GTC Silicon Valley 2018 - ID S8419
Streaming:
Download:
 
Machine Learning in Precision Medicine: Quantitative Medical Imaging, Artificial Intelligence, GPU Efficiency
Milan Sonka (University of Iowa)

Machine Learning in Precision Medicine: Patient-Specific Treatment Enabled by Quantitative Medical Imaging, Artificial Intelligence, and GPU Efficiency. Attendees will learn about the need for and use of machine learning in today's patient-centered healthcare. The talk will focus on general approaches requiring machine learning to obtain image-based quantitative features, reach patient diagnoses, predict disease outcomes, and identify proper precision-treatment strategies. While the presented methods are general in nature, examples from cardiovascular disease management will be used to demonstrate the need for and power of machine learning enabled by the performance advantages of GPU computation.

 
Keywords:
AI in Healthcare, Medical Imaging and Radiology, GTC Silicon Valley 2018 - ID S8892
Streaming:
Download:
 
Computational Precision Medicine: What Healthcare May Look Like in 10 Years Thanks to GPUs
Alejandro Frangi (CISTIB / The University of Sheffield)

This talk will overview the fields of Personalised Computational Medicine and In Silico Clinical Trials, which are revolutionizing medicine and medical product development. The talk will introduce these concepts, provide examples of how they can transform healthcare, and emphasize why artificial intelligence and machine learning are relevant to them. We will also explain the limitations of these approaches and why it is paramount to engage in both phenomenological (data-driven) and mechanistic (principle-driven) modelling. Both areas are in desperate need of better infrastructures (software and hardware) giving access to computational and storage resources. The talk will be thought-provoking and eye-opening as to the opportunities in this space for researchers and industries alike.

 
Keywords:
AI in Healthcare, Deep Learning and AI, Medical Imaging and Radiology, GTC Silicon Valley 2018 - ID S8887
Streaming:
 
Computer-Augmented Healthcare: Opportunities and Challenges
Gregory Hager (The Malone Center for Engineering in Healthcare, Johns Hopkins University)

The Role of Data in Achieving Precision and Value in Healthcare. The goal of healthcare is to provide the most effective treatment to every patient in the most efficient way. Data plays a key role in every aspect of this process, from decision support systems that provide a clinician with the right information at the right time, to scheduling algorithms that predict patient flow and schedule accordingly, to analytics that coach and support patients in achieving or maintaining a healthy lifestyle. Achieving the vision of a data-informed healthcare system will require fundamental advances in many areas, including causal inference, inference on complex, high-dimensional, and heterogeneous data, missing data, process modeling, bias reduction, statistical validation, and model adaptation, to name a few. In this talk, I will illustrate some of these challenges through concrete examples from within the Malone Center.

 
Keywords:
AI in Healthcare, Medical Imaging and Radiology, GTC Silicon Valley 2018 - ID S8891
Streaming:
Download:
Accelerated Analytics
Presentation
Media
Converging HPC and BD/AI: Tokyo Tech. TSUBAME3.0 and AIST ABCI
Satoshi Matsuoka (Tokyo Tech)
The TSUBAME3 supercomputer at Tokyo Institute of Technology came online in August 2017 and became the greenest supercomputer in the world on the Green500 at 14.11 GFlops/W. The other aspect of TSUBAME3 is to embody various BYTES-oriented features that allow for HPC-to-BD/AI convergence at scale, including significant scalable horizontal bandwidth, support for a deep memory hierarchy and large capacity, and high flops in low-precision arithmetic for deep learning.
 
Keywords:
Accelerated Analytics, SIGGRAPH 2017 - ID SC1720
Download:
Advanced AI Learning Techniques (incl. GANs and NTMs)
Presentation
Media
Application of Generative Deep Neural Networks for Mass Customization of Patient-Specific Products
Sergei Azernikov (Glidewell Dental), Jyh-Jing Hwang (UC Berkeley)
We'll show how generative adversarial networks (GANs) running on GPUs are about to revolutionize mass customization of patient-specific products at Glidewell Dental. Every day, our labs produce thousands of patient-specific items, such as dental restorations, implants, and appliances. To deliver functional and aesthetic products, high levels of precision and consistency are essential. Traditionally, the dental restoration design and manufacturing process was very labor intensive and required highly skilled dental professionals. Today, with advances in CAD/CAM, the amount of manual labor has been significantly reduced; however, many aspects of the process still require human intervention because they are hard to formalize and therefore impossible to automate with traditional tools. The convergence of several technologies, such as deep learning, GPGPU, and cloud computing, has allowed us to effectively train generative models on historical data. These models are now capable of automatically generating high-quality patient-specific designs.
 
Keywords:
Advanced AI Learning Techniques (incl. GANs and NTMs), Consumer Engagement and Personalization, GTC Silicon Valley 2018 - ID S8155
Streaming:
Download:
Algorithms and Numerical Techniques
Presentation
Media
Not Just a Universal Crutch: Other Useful Things to Do with atomicCAS
Elmar Westphal (Forschungszentrum Julich GmbH)
There is more to atomicCAS than the double-precision atomicAdd loop from the programming guide, and more than the universal atomic-operation loop it represents. We'll show how to build shared-memory-based hash function loops to solve different counting and grouping problems at warp and block level. Variations of this loop can be used to count unique elements in a block, find threads sharing common data elements, or speed up histogram building for large numbers of bins. With atomic operations on shared memory now implemented natively on Maxwell, these functions can be significantly faster than algorithms optimised for other architectures.
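For orientation, the programming-guide loop the title refers to is reproduced below, followed by a minimal hash-insert loop of the kind the talk builds on. The hash sketch is our own illustrative assumption (the names hashInsert and EMPTY are ours), not the speaker's code.

    // The "universal crutch": double-precision atomicAdd built on atomicCAS,
    // as given in the CUDA C Programming Guide (needed before sm_60).
    __device__ double atomicAddDouble(double* address, double val)
    {
        unsigned long long int* address_as_ull = (unsigned long long int*)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);   // retry if another thread changed the value
        return __longlong_as_double(old);
    }

    // A different use of the same primitive: claim a slot in a shared-memory
    // hash table with linear probing -- a building block for counting unique
    // elements or grouping threads by key at block level.
    #define EMPTY 0xFFFFFFFFu
    __device__ unsigned int hashInsert(unsigned int* table, unsigned int size,
                                       unsigned int key)
    {
        unsigned int slot = key % size;
        for (;;) {
            unsigned int prev = atomicCAS(&table[slot], EMPTY, key);
            if (prev == EMPTY || prev == key)
                return slot;               // inserted new key, or key already present
            slot = (slot + 1) % size;      // probe the next slot
        }
    }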
 
Keywords:
Algorithms and Numerical Techniques, Performance Optimization, GTC Silicon Valley 2016 - ID S6220
Streaming:
Download:
 
XMP Library Internals: Modular Multiplication on Kepler and Maxwell
Niall Emmart (University of Massachusetts)
We'll present an overview of the internals of the XMP multiple-precision library, take a detailed look at the low-level algorithms used for modular squaring and modular multiplication on Kepler, and present novel algorithms for Maxwell. Modular multiplication is a performance-critical primitive widely used in cryptographic algorithms, from prime testing and factorization to public-key algorithms such as RSA, Diffie-Hellman, and digital signatures.
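As background (our addition; the abstract does not spell out the algorithm), fast modular multiplication of this kind is typically built on Montgomery reduction. With an odd modulus N, a radix R = 2^k > N, and precomputed N' = -N^{-1} mod R,

    \mathrm{REDC}(T) = \frac{T + (T N' \bmod R)\,N}{R} \;\equiv\; T R^{-1} \pmod{N}

The numerator is divisible by R by construction, and the result lies below 2N for T < RN, so a single conditional subtraction completes the modular product.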
 
Keywords:
Algorithms and Numerical Techniques, Tools and Libraries, GTC Silicon Valley 2016 - ID S6349
Streaming:
Download:
 
Training Recurrent Neural Networks in FP16
Erich Elsen (Baidu USA, Inc.)
Reducing training time allows us to learn from our experiments more quickly and make new innovations based on what we've learned. Using fewer than the standard 32 bits to represent a number can help reduce training times. We'll talk about how to use 16-bit floating point, because it is starting to have wide hardware support with the release of Pascal. Unfortunately, naively converting all datatypes from 32 to 16 bits doesn't work, as training stability and accuracy are compromised. We'll discuss the reasons for the difficulties and their solutions. Finally, we'll show performance and scalability improvements due to using reduced precision.
 
Keywords:
Algorithms and Numerical Techniques, Deep Learning and AI, Performance Optimization, GTC Silicon Valley 2016 - ID S6661
Streaming:
Download:
 
Half Precision Benchmarking for HPC
Piotr Luszczek (University of Tennessee)
With the Tegra X1 and the Pascal architecture Tesla P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers, also called half-precision arithmetic. We'll introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating-point standards and existing HPC benchmarks. The discussion will focus on the performance and numerical stability issues that are important for this kind of benchmarking and how they relate to NVIDIA platforms.
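As a minimal illustration of the arithmetic being benchmarked (our sketch, not the benchmark code): native fp16 instructions on these GPUs operate on packed half2 pairs, so a kernel processes two values per operation.

    #include <cuda_fp16.h>

    // FP16 axpy on packed half2 values (requires compute capability 5.3+):
    // each __hfma2 performs two half-precision fused multiply-adds.
    __global__ void haxpy(int n2, __half2 a, const __half2* x, __half2* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2)
            y[i] = __hfma2(a, x[i], y[i]);   // y = a * x + y, two lanes at once
    }

Here n2 is the element count in half2 units, i.e., half the number of scalar values.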
 
Keywords:
Algorithms and Numerical Techniques, Deep Learning and AI, HPC and Supercomputing, GTC Silicon Valley 2017 - ID S7676
Download:
 
CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA
Andrew Kerr (NVIDIA)
Audience members will learn how to implement efficient deep learning computations using CUDA C++ in the context of CUTLASS. CUTLASS is an open-source collection of C++ template abstractions for implementing high-performance matrix multiplication (GEMM) at all levels of the CUDA thread hierarchy. We will describe many of the algorithmic strategies used by cuBLAS and cuDNN, and how they can be implemented using C++ templates to cover an extensive space of problem sizes, data layouts, and data types. In particular, we will emphasize how to support alternative and mixed-precision math operations such as Pascal's integer DP4A operation and Volta's Tensor Cores. Finally, we will illustrate how CUTLASS primitives can be combined with custom functionality to implement related algorithms such as convolution. Although this talk highlights CUTLASS, the architecture concepts and algorithm details are relevant to any CUDA programmer focused on deep learning.
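As a taste of the Tensor Core path discussed here (a minimal sketch using the public CUDA 9 WMMA API, not CUTLASS's internal abstractions): one warp cooperatively multiplies a single 16x16x16 tile with fp16 inputs and fp32 accumulation.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes C (16x16, fp32) += A (16x16, fp16) * B (16x16, fp16)
    // on Volta Tensor Cores. Leading dimensions are 16 for this single tile.
    __global__ void wmma_tile(const half* A, const half* B, float* C)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
        wmma::fill_fragment(c, 0.0f);
        wmma::load_matrix_sync(a, A, 16);
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(c, a, b, c);      // the Tensor Core matrix multiply-accumulate
        wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
    }

CUTLASS wraps this same hardware operation in templates that tile it across the full thread hierarchy and across problem sizes, layouts, and data types.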
 
Keywords:
Algorithms and Numerical Techniques, Tools and Libraries, GTC Silicon Valley 2018 - ID S8854
Streaming:
 
Datasets and Algorithms for Road Identification Via Satellite Imagery
Adam Van Etten (In-Q-Tel)

Road identification and route prediction in near real time remains a challenging problem for many geographic regions, particularly in the case of natural disasters or crisis situations. Existing methods such as manual road labeling or aggregation of mobile GPS track data are currently insufficient in dynamic scenarios. The frequent revisits of satellite imaging constellations may accelerate efforts to rapidly update road networks and optimal path prediction, provided routing information can be extracted from imaging pixels. We'll demonstrate deep learning segmentation methods for identifying road center lines and intersections from satellite imagery, and for inferring networks from these road segments. We'll also explore data quality requirements by comparing open source labels with high-precision labels created as part of the SpaceNet Roads challenge.

 
Keywords:
Algorithms and Numerical Techniques, HD Mapping, Federal, GTC Silicon Valley 2018 - ID S8384
Streaming:
 
New Frontiers for Dense Linear Solvers: Towards Extreme Performance and Energy Efficiency
Ahmad Abdelfattah (Innovative Computing Laboratory University of Tennessee), Azzam Haidar (Innovative Computing Laboratory, University of Tennessee)
Learn how to develop fast and energy-efficient linear solvers using GPUs. Hybrid CPU-GPU techniques achieve high performance at the cost of extra power consumption. The new advancements in GPU architectures enable full GPU solutions that are high performance, energy efficient, and CPU-independent. In addition, new technologies such as half precision arithmetic (FP16) help the design of new solvers that are significantly faster and even more energy efficient. While FP16 arithmetic has been a powerful tool for deep learning applications, our designs show that it is also very useful for boosting performance and energy efficiency of linear solvers. The new developments complement the hybrid algorithms in the MAGMA library, and provide users with a wide variety of designs that fit different requirements of performance, energy efficiency, and numerical accuracy.
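The core idea, as in the classical mixed-precision iterative refinement such solvers build on: factorize and solve in low precision, then recover full accuracy with cheap high-precision corrections,

    r_k = b - A x_k \ (\text{fp64}), \qquad A d_k = r_k \ (\text{solved via the fp16/fp32 factorization}), \qquad x_{k+1} = x_k + d_k

Each iteration costs only a matrix-vector product and triangular solves, so for reasonably conditioned systems the refined solution approaches fp64 accuracy at close to low-precision factorization speed.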
 
Keywords:
Algorithms and Numerical Techniques, Performance Optimization, GTC Silicon Valley 2018 - ID S8478
Streaming:
Download:
Astronomy and Astrophysics
Presentation
Media
Gravitational N-body Simulations: How Massive Black Holes Interact with Stellar Systems
Alessandra Mastrobuono, Roberto Capuzzo-Dolcetta (Sapienza University of Rome)
Astrophysics is a field where supercomputing is a must to obtain new scientific results. In particular, the study of the interaction between massive black holes and surrounding stars is a hot topic, which requires heavy computation to obtain a good representation of what happens in the inner regions of galaxies. We present the results obtained with our high-precision N-body code, NBSymple, which exploits the joint power of a multi-core CPU system together with high-performance NVIDIA Tesla C1060 GPUs. The code is available at: astrowww.phys.uniroma1.it/dolcetta/nbsymple.html
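For reference, the direct (particle-particle) force sum that dominates such simulations, in the softened form commonly used by collisional N-body codes (the softening term is our illustrative addition):

    \ddot{\mathbf{r}}_i = -G \sum_{j \neq i} m_j \frac{\mathbf{r}_i - \mathbf{r}_j}{\left(\lVert \mathbf{r}_i - \mathbf{r}_j \rVert^2 + \varepsilon^2\right)^{3/2}}

Evaluating all N such sums is O(N^2) per time step, which is why the pairwise force kernel maps so well onto GPUs.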
 
Keywords:
Astronomy and Astrophysics, Developer - Algorithms, GTC Silicon Valley 2010 - ID S102000
Streaming:
Download:
 
GRASSY: Leveraging GPU Texture Units for Asteroseismic Data Analysis
Matt Sinclair

Learn how to use the hidden computation capability of GPU texture units for general-purpose computation. We describe GRASSY, a system for stellar spectral synthesis where the core problem is interpolation between pre-computed intensity values. We map these pre-computed tables to the GPU's texture memory. Interpolation then becomes a texture lookup where the hardware automatically performs the interpolation, albeit at very low precision. Our mathematical framework reasons about the impact of this precision, and our performance results show 500x speedups. This work generalizes the GPU texture units as computation engines and opens up new problems for GPU acceleration.

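As a minimal sketch of the trick (using today's CUDA texture-object API rather than the 2010-era texture references the original work would have used): bind the pre-computed table to a texture with linear filtering, and each lookup returns a hardware-interpolated value.

    // 'tex' is assumed to be a cudaTextureObject_t created over a 2D float
    // array with cudaFilterModeLinear, so tex2D itself performs the bilinear
    // interpolation in hardware.
    __global__ void lookup(cudaTextureObject_t tex, const float2* q,
                           float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex2D<float>(tex, q[i].x + 0.5f, q[i].y + 0.5f);
    }

The 0.5f offsets address texel centers; the hardware's 9-bit fixed-point interpolation weights are the precision limitation the abstract's framework reasons about.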
 
Keywords:
Astronomy and Astrophysics, HPC and AI, GTC Silicon Valley 2010 - ID S10044
Download:
 
Acceleration of a 3D WENO Scheme for Large-Scale Cosmological Simulations on GPU
Long Wang (Supercomputing Center, Computer Network Information Center, Chinese Academy of Sciences)
We present our implementation of a 3D fifth-order finite-difference WENO scheme in double precision on CPU/GPU clusters, which targets large-scale cosmological hydrodynamic flow simulations involving both shocks and complicated smooth solution structures. At the level of MPI parallelization, we subdivided the domain cubically. Then, on each process, we ported the WENO computation to the GPU. To work around the memory limitations of GPUs, we performed a series of optimizations. Our tests on Fermi and Kepler GPUs indicate that the GPU version achieves a 12-19x speedup, and the computation part is about 19-36x faster than the serial Fortran code. Finally, we discuss future work.
 
Keywords:
Astronomy and Astrophysics, Computational Fluid Dynamics, GTC Silicon Valley 2013 - ID P3157
Download:
 
GPU-enabled Precision Measurements of the Structure of the Universe
Deborah Bard (SLAC National Accelerator Laboratory)
Future astronomical surveys will characterize tens of billions of galaxies. Calculating cosmological observables, such as correlation functions, over such vast datasets poses a significant computational challenge, and such calculations are ideally suited to parallelization. This poster describes the implementation of the full two-point correlation function on the GPU and demonstrates the improvement in accuracy compared to current fast approximation methods. We take advantage of the scaling capabilities of GPUs by showing how systematic errors can only be fully explored using the compute power of many GPUs.
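For context, the full two-point correlation function is typically estimated with the Landy-Szalay estimator from normalized pair counts between the data catalog (D) and a random catalog (R):

    \hat{\xi}(r) = \frac{DD(r) - 2\,DR(r) + RR(r)}{RR(r)}

The pair counts DD, DR, and RR are the O(N^2) kernels that parallelize so naturally across many GPUs.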
 
Keywords:
Astronomy and Astrophysics, GTC Silicon Valley 2013 - ID P3164
Download:
 
Black Holes and Star Clusters in Galactic Nuclei simulated with more than 100k GPU cores
Rainer Spurzem (National Astronomical Observatories, Chinese Academy of Sciences)
We present 100k-GPU-core benchmark simulations of galactic nuclei and star clusters with high-precision direct N-body methods, on the path to a million cores and exascale.
 
Keywords:
Astronomy and Astrophysics, Developer - Algorithms, GTC Silicon Valley 2013 - ID P3242
Download:
 
Powering Real-Time Radio Astronomy Signal Processing with Latest GPU Architectures
Harshavardhan Reddy (NCRA)

We'll present a summary of ongoing work that targets the use of newer GPU architecture (Pascal and Volta) features in real-time signal processing applications in radio astronomy telescopes, and outline the future growth path for this exciting new application of GPUs. For the Pascal and Volta architectures, we'll discuss the advantage of using higher memory bandwidth, half and single precision, and integer arithmetic in existing GPU-based correlator pipeline code. This is an ongoing effort between the National Centre for Radio Astrophysics and NVIDIA. We'll look at the various processing stages involved in the pipeline to explore optimization possibilities, and highlight interesting results that were achieved. We'll address in detail the effect of using half precision with respect to accuracy, performance, and required library changes.

 
Keywords:
Astronomy and Astrophysics, GTC Silicon Valley 2018 - ID S8339
Streaming:
Download:
Bioinformatics & Genomics
Presentation
Media
High-Throughput Epistasis Screening Using GPUs
Mark Seligman (Insilicos LLC)

Epistasis is the interaction of two or more genes in coding for a biological property. Epistasis is believed to be an important factor in an individual's susceptibility to disease, and the search for epistasis is a major component in the development of personalized approaches to genomic medicine. Statistical tests for epistasis are typically confounded by the multiple-testing problem, that is, the aggregated loss of precision incurred through repeated hypothesis testing. One way to circumvent this problem is to simulate a false-discovery rate via resampling. We report success in using GPUs to accelerate these highly compute-intensive resampling techniques.

 
Keywords:
Bioinformatics & Genomics, GTC Silicon Valley 2012 - ID S2337
Streaming:
Download:
Climate, Weather, Ocean Modeling
Presentation
Media
GPU-based Operational Weather Model with Horizontal 500m Resolution
Takayuki Aoki

Numerical weather prediction is one of the major applications in high performance computing and demands fast, high-precision simulation over fine-grained grids. In order to drastically shorten the runtime of a weather prediction code, we have rewritten the entire code from scratch in CUDA for GPU computing. The code, ASUCA, is a high-resolution meso-scale atmosphere model being developed by the Japan Meteorological Agency (JMA) for its next-generation weather forecasting service. A benchmark on 3996 GPUs of TSUBAME 2.0 achieves extremely high performance of 145 Tflops in single precision on a 14368 × 14284 × 48 mesh. With the initial data and boundary conditions currently used in the JMA weather forecast, we have carried out a run with a 500m horizontal mesh of 4792 × 4696 × 48, covering the whole of Japan, on 437 GPUs.

 
Keywords:
Climate, Weather, Ocean Modeling, GTC China 2011 - ID GTCA1173
Streaming:
Computational Biology and Chemistry
Presentation
Media
Speed Without Compromise: Precision and Methodology Innovation in the AMBER GPU MD Software
Ross Walker (University of California San Diego)

The AMBER molecular dynamics (MD) package is one of the fastest MD packages on commodity hardware and was one of the first widely used packages to exploit GPUs. We'll discuss the history of AMBER on NVIDIA GPUs and then highlight some of the newest advances in MD simulation that feature in the latest version 16 of AMBER. This includes extremely high-throughput thermodynamic integration free energy methods, explicit solvent constant pH simulations, advanced umbrella sampling restraints, multi-dimensional replica exchange methods, and asymmetric boundary conditions. We'll also discuss the development and validation of our latest precision model, SPXP, which is focused on maximizing the performance achievable from Maxwell-generation hardware without sacrificing accuracy.

 
Keywords:
Computational Biology and Chemistry, HPC and Supercomputing, GTC Silicon Valley 2016 - ID S6278
Streaming:
Download:
 
CANDLE: Predicting Tumor Cell Response to Drug Treatments
Fangfang Xia (Argonne National Laboratory)

We'll focus on one of the three pilots of the DOE and NCI partnership on precision oncology and the Cancer Moonshot, namely predicting tumor cell response to drug treatments with deep learning. Predicting tumor cell response to drug treatments is a critical challenge for accomplishing the promise of precision medicine in oncology. As part of a joint project between DOE and NCI to develop advanced computing solutions for cancer, we are developing a deep learning-based framework for modeling tumor-drug interaction and predicting dose response in pre-clinical screening.

 
Keywords:
Computational Biology and Chemistry, Deep Learning and AI, GTC Silicon Valley 2017 - ID S7788
Download:
Computational Fluid Dynamics
Presentation
Media
Reynolds Equation Solver on GPGPU for Gas Film Lubrication Problem
Ji-Hoon Kang (KISTI)
In the present study, we implemented a Reynolds equation solver on GPGPU for the gas film lubrication problem. By using a Red-Black Gauss-Seidel iteration scheme, we achieved a 106x speedup for the core calculation part and an overall 12x speedup (double precision), relative to one core of an AMD Llano A8-3850. A small serial part becomes a critical bottleneck and degrades the overall speedup as the problem size grows and GPU efficiency increases. Future work will include the development of a general gas film analysis solver and a parallelization scheme for the remaining serial parts, such as integration and error checking.
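As a sketch of the scheme's parallel structure (our example uses a generic 2D Poisson five-point stencil; the actual Reynolds-equation stencil has variable coefficients): grid points are colored like a checkerboard so that all points of one color can be updated concurrently.

    // One red-black Gauss-Seidel sweep over the interior of an nx-by-ny grid.
    // color is 0 (red) or 1 (black); same-color points share no neighbors,
    // so every thread in a sweep may update its point independently.
    __global__ void rbgs_sweep(double* u, const double* f,
                               int nx, int ny, double h2, int color)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return;
        if (((i + j) & 1) != color) return;
        int k = j * nx + i;
        u[k] = 0.25 * (u[k - 1] + u[k + 1] + u[k - nx] + u[k + nx] - h2 * f[k]);
    }

Two launches (color 0, then color 1) complete one Gauss-Seidel iteration without any intra-sweep data races.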
 
Keywords:
Computational Fluid Dynamics, GTC Silicon Valley 2012 - ID P2447
Download:
 
Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM
Alexander Monakov (Institute for System Programming of RAS), Arutyun Avetisyan (Institute for System Programming of RAS)

Learn about optimizations that significantly improve performance of our CUDA conjugate gradient linear solver developed for OpenFOAM, a popular open-source CFD software toolbox. We describe the challenges present in porting iterative solvers to CUDA: overhead from data structures conversion, the need for a fast GPU preconditioner, and our approaches to tackling them. We explain our optimizations: reusing the preconditioner from previous time-steps, always storing the preconditioner in low precision, etc., and their impact on performance. Finally, we show how our implementation handles solving in parallel when the number of MPI processes per node exceeds the number of GPUs.

 
Keywords:
Computational Fluid Dynamics, Developer - Algorithms, Manufacturing Technical, GTC Silicon Valley 2013 - ID S3220
Streaming:
Download:
 
Quickly Applying GPU Acceleration to Barracuda: An MP-PIC CAE Software
Andrew Larson (CPFD Software)
Learn about the challenges and possibilities of applying CUDA to a multi-phase particle-in-cell (MP-PIC) code base through (1) an applied approach to parallelizing Barracuda VR, a CAE MP-PIC code, (2) achieved speedups of operation types specific to MP-PIC codes (in double precision), (3) a focused discussion on the crux of MP-PIC, i.e., mapping Lagrangian data to the Eulerian grid, and (4) demonstrated speedups and future expectations.
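To make the "crux" concrete, here is a minimal particle-to-grid deposition kernel of the kind such codes need (our illustrative sketch with nearest-cell weighting; Barracuda's actual interpolation scheme is not described in this abstract):

    // Scatter particle mass onto an Eulerian grid; concurrent particles
    // landing in the same cell are reconciled with atomicAdd.
    // (Native double-precision atomicAdd requires compute capability 6.0+.)
    __global__ void deposit(const float3* pos, const float* mass, double* grid,
                            int nx, int ny, int nz, float h, int np)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= np) return;
        int i = min(max(__float2int_rd(pos[p].x / h), 0), nx - 1);
        int j = min(max(__float2int_rd(pos[p].y / h), 0), ny - 1);
        int k = min(max(__float2int_rd(pos[p].z / h), 0), nz - 1);
        atomicAdd(&grid[(k * ny + j) * nx + i], (double)mass[p]);
    }

The atomic contention in this scatter is exactly what makes the Lagrangian-to-Eulerian mapping the performance crux of MP-PIC on GPUs.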
 
Keywords:
Computational Fluid Dynamics, GTC Silicon Valley 2014 - ID S4417
Streaming:
Download:
 
Pore-Network Simulation of Fluid Flow and Transport in Porous Media on GPUs
Hassan Dashtian (University of Southern California)
Networks of interconnected resistors, springs and beams, or pores are standard models for studying scalar and vector transport processes in heterogeneous materials and media, such as fluid flow in porous media, and conduction, deformation, and electric and dielectric breakdown in heterogeneous solids. We developed an algorithm for using the computational power of GPUs to speed up calculations with pore and resistor networks; the same algorithm can be used with networks of springs or beams. A mixed-precision algorithm, together with the conjugate-gradient method, has been implemented on a single-GPU solver. We achieve a speedup factor of 60x and are able to simulate very large networks with several million sites.
 
Keywords:
Computational Fluid Dynamics, Energy Exploration, GTC Silicon Valley 2016 - ID P6209
Download:
 
How to Prepare Weather and Climate Models for Future HPC Hardware
Peter Dueben (European Centre for Medium-Range Weather Forecasts (ECMWF))

Learn how one of the leading institutes for global weather prediction, the European Centre for Medium-Range Weather Forecasts (ECMWF), is preparing for exascale supercomputing and the efficient use of future HPC hardware. I will name the main reasons why it is difficult to design efficient weather and climate models and provide an overview of the ongoing community effort to achieve the best possible model performance on existing and future HPC architectures. I will present the EU H2020 projects ESCAPE and ESiWACE and discuss recent approaches to increasing computing performance in weather and climate modelling, such as the use of reduced numerical precision and deep learning.

 
Keywords:
Computational Fluid Dynamics, HPC and AI, HPC and Supercomputing, GTC Europe 2017 - ID 23348
Download:
 
Advances in Discrete Element Particle Modelling Using the GPU Based Code Blaze-DEM
Nicolin Govender (RCPE/University of Surrey), Daniel Wilke (Department of Mechanical and Aeronautical Engineering, University of Pretoria)
In this talk, we will look at advances in the simulation of particulate systems in computer-aided engineering (CAE) applications. We will in particular focus on the discrete element method (DEM) and the strides made in terms of the number of particles and particle shapes using the GPU-based code Blaze-DEM. A variety of industrial applications, ranging from mining, agriculture, and civil engineering to pharmaceuticals, will be discussed. We will also touch on how we can leverage the next wave of GPU computing, namely half precision and Tensor Cores, in scientific computing, which is still predominantly double-precision based. Finally, we will look at the work being done by various groups to create a multi-physics GPU-based platform using Blaze-DEM.
 
Keywords:
Computational Fluid Dynamics, Computer Aided Engineering, GTC Silicon Valley 2018 - ID S8348
Streaming:
Computational Physics
Presentation
Media
The Fast Multipole Method on CPU and GPU Processors
Eric Darve (Stanford)

The fast multipole method (FMM) is a widely used numerical algorithm in computational engineering. Accelerating the FMM on CUDA-enabled GPUs is challenging because the FMM has a complicated data access pattern, mostly during the so-called multipole-to-local (M2L) operation. We have created several schemes to optimize the M2L and have attained a performance of over 350 (resp. 160) Gflop/s for single (double) precision arithmetic. The optimal algorithm was incorporated into a complete FMM code, which can accept any smooth kernel as specified by the user, making it very flexible. We have also developed a highly efficient CPU version.

 
Keywords:
Computational Physics, GTC Silicon Valley 2012 - ID S2334
Streaming:
Download:
 
Accelerating Discontinuous Galerkin Method by Using Multiple Graphic Processing Units (GPUs)
Dawei Mu (University of Wyoming)
We have successfully ported the discontinuous Galerkin (ADER-DG) method for solving the three-dimensional elastic seismic wave equation on unstructured tetrahedral meshes to NVIDIA Tesla GPUs using the NVIDIA CUDA programming model. On average, our implementation obtained a speedup factor of about 24.3 for the single-precision version of our single-GPU code and about 28.2 for the single-precision version of our multi-GPU code. The implementation of this discontinuous Galerkin method on GPU systems has greatly enhanced performance, which could make the method even more competitive among forward simulation algorithms.
 
Keywords:
Computational Physics, HPC and Supercomputing, GTC Silicon Valley 2013 - ID P3124
Download:
 
GPUs Immediately Relating Lattice QCD to Collider Experiments
Mathias Wagner (University of Bielefeld)

Discover how data from experiments at heavy-ion colliders (the Relativistic Heavy Ion Collider at Brookhaven National Lab and the Large Hadron Collider at CERN) can immediately be compared with first-principles simulations of Quantum Chromodynamics (QCD) to quantitatively probe the fundamental properties of strongly interacting matter, i.e., quarks and gluons at high temperature. The conditions realized in the experiments governed the early evolution of the universe. The necessary high precision for these comparisons is obtained by performing our calculations entirely on the GPU. In doing so, we simultaneously face a low flop/byte ratio and high register pressure. See how we deal with these complications and achieve high performance on the Bielefeld GPU cluster with 400 Fermi GPUs.

 
Keywords:
Computational Physics, GTC Silicon Valley 2013 - ID S3153
Streaming:
Download:
 
Columbia Physics System with QUDA
Hyung-Jin Kim (Brookhaven National Lab)

Learn how GPUs can be used to accelerate our understanding of sub-atomic physics. In this work we deploy GPUs to probe the structure of the nucleus using lattice quantum chromodynamics (LQCD). While LQCD is the only known method that provides a non-perturbative study of quarks (the particles that make up the nucleus), it requires extremely powerful computational resources to achieve high precision. We accelerate the CPS application (Columbia Physics System, developed by Columbia University, Brookhaven National Laboratory, and UKQCD) using the QUDA (QCD on CUDA) library. QUDA is a low-level library built on the CUDA platform, designed to accelerate the algorithms that form the basis of LQCD applications. The conjugate gradient (CG) algorithm used to solve quark propagators with the 5-d domain-wall Dirac operator is one of the most time-consuming parts of LQCD calculations, and it is this algorithm that this work focuses on. Running on the Kepler-enabled K20 GPU cluster at Thomas Jefferson National Lab, we demonstrate sustained teraflops-scale CG performance with fewer than 10 GPUs. Furthermore, we have developed an alternative 4-d preconditioner for the domain-wall Dirac operator to implement a more efficient version of the operator (called Mobius). Lastly, we are exploring the use of eigenvectors, computed with the Lanczos algorithm, to further accelerate CG convergence.

  Back
 
Keywords:
Computational Physics, Developer - Algorithms, Developer - Tools & Libraries, GTC Silicon Valley 2013 - ID S3562
Streaming:
Download:
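For context, the overall structure of a conjugate gradient solver like the one this session accelerates can be sketched in a few kernels. The following is a minimal, hypothetical CUDA/Thrust sketch that solves a simple SPD stencil system; it stands in for the quark-propagator solve and is not QUDA or CPS code.

```cpp
// Minimal CG sketch: solves A x = b for a diagonally dominant 1D stencil,
// illustrating the solver structure the session discusses. Not QUDA code.
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <cstdio>

__global__ void apply_A(const double* x, double* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double left  = (i > 0)     ? x[i - 1] : 0.0;
    double right = (i < n - 1) ? x[i + 1] : 0.0;
    y[i] = 4.0 * x[i] - left - right;   // diagonally dominant -> SPD
}

__global__ void axpy(double a, const double* x, double* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

__global__ void xpay(const double* r, double beta, double* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = r[i] + beta * p[i];
}

static double dot(const thrust::device_vector<double>& a,
                  const thrust::device_vector<double>& b) {
    return thrust::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    thrust::device_vector<double> x(n, 0.0), b(n, 1.0), r = b, p = r, Ap(n);
    double rr = dot(r, r);
    for (int k = 0; k < 1000 && rr > 1e-20; ++k) {
        apply_A<<<blocks, threads>>>(thrust::raw_pointer_cast(p.data()),
                                     thrust::raw_pointer_cast(Ap.data()), n);
        double alpha = rr / dot(p, Ap);
        axpy<<<blocks, threads>>>( alpha, thrust::raw_pointer_cast(p.data()),
                                   thrust::raw_pointer_cast(x.data()), n);
        axpy<<<blocks, threads>>>(-alpha, thrust::raw_pointer_cast(Ap.data()),
                                   thrust::raw_pointer_cast(r.data()), n);
        double rr_new = dot(r, r);
        xpay<<<blocks, threads>>>(thrust::raw_pointer_cast(r.data()), rr_new / rr,
                                  thrust::raw_pointer_cast(p.data()), n);
        rr = rr_new;
    }
    printf("final |r|^2 = %e\n", rr);
    return 0;
}
```

In the session's setting, the operator application (apply_A here) is the domain-wall Dirac operator and dominates the runtime, which is why the optimization effort concentrates there.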
 
Does Antimatter Fall On The Earth? Measurement Of Antimatter Annihilation with GPUs
Akitaka Ariga (University of Bern)
One of the most important unanswered questions in physics is: Does antimatter fall in the same way as matter? At the European Organization for Nuclear Research (CERN, Geneva), the AEgIS experiment is underway to measure the gravitational force on antimatter, and it has to reach nanometric precision in determining the free fall of antimatter. In particular, the 3D reconstruction of particle tracks produced in matter-antimatter annihilations requires a huge amount of computing resources: about 30 TB of tomographic images must be processed per day. In this talk, the application of GPUs to the 3D tracking of particles in photo-emulsion detectors will be reported.  Back
 
Keywords:
Computational Physics, Astronomy and Astrophysics, GTC Silicon Valley 2014 - ID S4372
Streaming:
 
GPU-Based Lattice QCD Simulations as Thermometer for Heavy-Ion Collisions
Mathias Wagner (Bielefeld University & Indiana University)
See how advances in GPU computing enable us to simulate Quantum Chromodynamics and learn about fundamental properties of strongly interacting matter, i.e., quarks and gluons at finite temperatures. With the advances in hardware and algorithms, these simulations have reached a level that allows for a quantitative comparison with experimental data from heavy-ion colliders. Discover how the Kepler architecture helps us boost the performance of the simulations and reach a new level of precision. I will discuss selected optimizations for the Kepler K20 cards and modifications to prepare the code for the Titan supercomputer. Furthermore, I will compare and discuss the pros and cons of our in-house code in comparison to available libraries such as QUDA.   Back
 
Keywords:
Computational Physics, Numerical Algorithms & Libraries, HPC and Supercomputing, GTC Silicon Valley 2014 - ID S4453
Streaming:
Download:
 
Optimization of an Explicit Finite Differences Solver for Enabling Faster Studies of Spintronic Effects
David Claudio-Gonzalez (University of Guanajuato & Intel Corporation)
We report the acceleration of spintronic simulations in double precision, based on the implementation of an explicit finite-differences solver, by factors of 1.6x to 13x. The smaller factor was observed when comparing a single-thread implementation running on an Intel Xeon E5620 @ 2.4 GHz against an NVIDIA GeForce GTX 670M. The highest value was observed when comparing an Intel i7-2760QM @ 2.4 GHz against an NVIDIA Tesla M2070. Optimizations consisted of reducing accesses to global device memory through increased use of registers and shared memory.  Back
 
Keywords:
Computational Physics, GTC Silicon Valley 2015 - ID P5226
Download:
 
Efficient Simulation of Multiple Mie-scattering Events Using NVIDIA CUDA
Simon Streicher (Heilbronn University)
We present a method using the NVIDIA Thrust library to calculate multiple Mie-scattering events on a GPU. Mie scattering describes the scattering of electromagnetic waves by spheroidal particles whose diameter is close to the wavelength of the incident radiation. At its current state, our implementation shows a speedup of 19.5x compared to a quad-core CPU. This enables us to simulate complex optical scenarios with our software. Additionally, we have verified the simulation results using a high-precision measurement device in the laboratory.  Back
 
Keywords:
Computational Physics, Developer - Algorithms, GTC Silicon Valley 2015 - ID P5281
Download:
 
Revolutionizing Lattice QCD Physics with Heterogeneous Multigrid
Kate Clark (NVIDIA), Alexei Strelchenko (Fermilab National Laboratory)
Learn how combining GPUs with advanced multigrid solvers is revolutionizing the study of lattice quantum chromodynamics (LQCD). LQCD is a computational tool for probing nuclear and particle physics; however, it can require thousands of GPUs working in tandem for months due to the computationally prohibitive linear solver. Using the QUDA framework, we describe how the solver can be accelerated using an adaptive multigrid method. The optimization techniques employed are fine-grained parallelization, mixed precision, communication-reducing solvers, and reformulation of the algorithm to allow the CPU and GPU to work in parallel. Using this multitude of algorithmic innovations, we demonstrate that a 5x speedup can be realized over present state-of-the-art methods using GPUs.  Back
 
Keywords:
Computational Physics, Algorithms and Numerical Techniques, Performance Optimization, GTC Silicon Valley 2016 - ID S6667
Streaming:
Download:
 
How GPU Can Help High Energy Physics Experiments
Gianluca Lamanna (INFN)
We aim to show how the online data selection in high-energy physics experiments could benefit from real-time GPU processing. The computing power of GPUs fits the requirements to increase the ability of the trigger systems to reduce the data bandwidth. We designed a system for online processing exploiting commercial GPUs for the NA62 experiment at CERN. In particular we will show different techniques to reduce and control the latency due to data transfer in order to have synchronous response from the system. We will show recent results obtained in a physics run at CERN with high data rate. Attendees will learn how a high-energy physics trigger system works and how GPUs can increase the discovery potential of high-precision experiments.  Back
 
Keywords:
Computational Physics, Algorithms and Numerical Techniques, Press-Suggested Sessions: HPC & Science, GTC Silicon Valley 2016 - ID S6438
Streaming:
Download:
 
FMM with Periodic Boundaries Support on GPU
Bartosz Kohnke (Max Planck Institute for Biophysical Chemistry)
The direct solution of the N-body problem is a simple, yet scientifically important and ubiquitous showcase algorithm for modern GPUs. However, its computational complexity is O(N^2). The fast multipole method (FMM) is an algorithm that reduces the runtime and complexity to an optimal O(N) for any required precision. We'll present an optimized, fully NVIDIA CUDA-enabled, templated C++ implementation of the FMM that covers all stages of the method, from particle input to force extraction. We compare different parallelization approaches and show the performance improvement when going from dynamic parallelization to a presorted, list-based approach that fits particular system constraints such as periodic boundary conditions. We'll discuss how to exploit the FMM operators such that both memory access overhead and the number of complex multiplications are minimized. This moves the kernels into the compute-bound regime and increases performance.  Back
 
Keywords:
Computational Physics, HPC and Supercomputing, GTC Silicon Valley 2017 - ID S7196
Download:
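The O(N^2) direct summation the abstract mentions is the baseline the FMM replaces. A minimal all-pairs CUDA kernel (hypothetical, not the presenters' code) makes the quadratic cost concrete:

```cpp
// Hypothetical O(N^2) direct-summation baseline that the FMM improves upon.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void direct_forces(const float4* pos, float3* acc, int n, float eps2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 a = {0.f, 0.f, 0.f};
    for (int j = 0; j < n; ++j) {                  // all-pairs: O(N^2) total
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2;  // softened distance
        float inv_r = rsqrtf(r2);
        float s = pj.w * inv_r * inv_r * inv_r;         // m_j / r^3
        a.x += s * dx; a.y += s * dy; a.z += s * dz;
    }
    acc[i] = a;
}

int main() {
    const int n = 4096;
    float4* pos; float3* acc;
    cudaMallocManaged(&pos, n * sizeof(float4));
    cudaMallocManaged(&acc, n * sizeof(float3));
    for (int i = 0; i < n; ++i) pos[i] = make_float4(i % 17, i % 29, i % 31, 1.0f);
    direct_forces<<<(n + 255) / 256, 256>>>(pos, acc, n, 1e-4f);
    cudaDeviceSynchronize();
    printf("acc[0] = (%f, %f, %f)\n", acc[0].x, acc[0].y, acc[0].z);
    cudaFree(pos); cudaFree(acc);
    return 0;
}
```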
Computer Vision
Presentation
Media
Efficient Local Shape Features Matching using CUDA
Leonardo Chang (Instituto Nacional de Astrofisica, Optica y Electronica (INAOE), Mexico)
LISF is an invariant local shape feature descriptor that can be used to obtain a discriminative and compact representation of an object. For occlusions of up to 60%, the LISF method outperformed other popular shape description methods, with an about 20% higher bull's eye score and 25% higher precision and recall in classification. In this work, we present a massively parallel GPU implementation of the LISF algorithm. A 34x speedup is achieved compared to the CPU implementation when matching 290 vs. 290 features. The GPU implementation shows linear scaling as the number of features increases, in contrast to the CPU implementation, which shows exponential growth.   Back
 
Keywords:
Computer Vision, GTC Silicon Valley 2013 - ID P3116
Download:
 
Shaking and Shot Video Augmentation in Real-time
Francisco J. Hernandez-Lopez (Center of Research in Mathematics)
This session will present a method for augmenting shaking and shot videos, which is an extension of VScreen: a tool that replaces a region of any video with another image or video in real time. The technique is first introduced for the photo augmentation task, then extended to video augmentation, and finally applied to shaking and shot videos. Moving objects in the foreground (Fg) may occlude the augmented region in the background (Bg), so we use a procedure for Fg/Bg video segmentation implemented on NVIDIA video cards to fulfill the real-time requirement. Finally, we will show a quantitative evaluation in which we compare the precision and runtime of our binary segmentation method (QMMF) against the Graph Cut method (available in the NPP library).

  Back
 
Keywords:
Computer Vision, Video and Image Processing, GTC Silicon Valley 2013 - ID S3147
Streaming:
Download:
 
Terrestrial 3D Mapping with Parallel Computing Approach
Janusz Bedkowski (Institute of Mathematical Machines)
This work concerns the parallel implementation of a 3D mapping algorithm. Several nearest-neighborhood search strategies are compared, and the accuracy of the final 3D map is evaluated with geodetic precision. This work can be used in several applications such as mobile robotics and spatial design. Attendees will learn how to choose a proper nearest-neighbor search strategy for 3D data registration, how to build accurate 3D maps, how to evaluate a 3D mapping system with geodetic precision, and what the influence of parallel programming is on performance and accuracy.  Back
 
Keywords:
Computer Vision, Big Data Analytics, GTC Silicon Valley 2014 - ID S4353
Streaming:
Computer Vision and Machine Vision
Presentation
Media
Massively Accelerating Iterative Gauss-Newton Fitting
To measure three-dimensional shape data of objects, we built a measurement system that assigns three-dimensional coordinates to the positions of projected measurement labels in a camera image. To achieve high measurement accuracy across large numbers of measurement points, we need a very fast routine that localizes measurement labels with high precision. To speed up the computation, we evaluate the fits using the CUDA architecture. The final implementation speeds up the fitting of 10^4 two-dimensional Gauss functions by a factor of 90.  Back
 
Keywords:
Computer Vision and Machine Vision, Stereoscopic 3D, GTC Silicon Valley 2010 - ID S102065
Streaming:
Download:
 
GPU + Drones + 3D Imaging for Precision Farming and Construction
Bingcai Zhang (BAE Systems)
Agriculture and construction are two of the largest industries in the world. The democratization of 3-D imaging technology with drones, digital cameras, and GPUs is applicable to precision farming and construction. Precision farming can increase crop yields, reduce pollution, save water, and increase productivity. The demand for precision farming continues to increase as more people live on a planet with fixed natural resources. Timely, precise 3-D measurements are also important for construction, yet today most of these 3-D measurements are obtained manually. BAE Systems is developing GPU-accelerated 3-D imaging technology with drone images for precision farming and construction.  Back
 
Keywords:
Computer Vision and Machine Vision, Video and Image Processing, GTC Silicon Valley 2015 - ID S5373
Streaming:
Download:
 
How to Get Regulatory Approval for an AI-based Autonomous Car?
Alex Haag (Autonomous Intelligent Driving GmbH)
Proving that a system as complex as an autonomous car is safe cannot be done using existing standards. A new method needs to be invented that is much more data-driven and probability-based. Traditional redundant solutions don't apply when trying to optimize a precision-recall curve. Getting acceptance from the regulatory bodies and the public will be much easier if the industry converges on what this new method shall be.

  Back
 
Keywords:
Computer Vision and Machine Vision, Self-Driving Cars, GTC Europe 2017 - ID 23166
Consumer Engagement and Personalization
Presentation
Media
Juicing Up Ye Olde GPU Monte Carlo Code
Richard Hayden (JP Morgan Chase), Oleg Rasskazov (JP Morgan Chase)
We'll discuss the GPU-accelerated Monte Carlo compute at JP Morgan, which was architected for C1060 cards and revamped a few times as new architectures were released. The key features of the code are exclusive use of double precision, data caching, and a code structure in which a significant amount of CPU pre-compute is followed by running multiple GPU kernels. On the latest devices, memory per flop is a throughput-limiting factor for a class of our GPU-accelerated models. As the byte/flop ratio continues to fall from one generation of GPU to the next, we are exploring ways to re-architect the Monte Carlo simulation code to decrease memory requirements and improve the TCO of the GPU-enabled compute. Obvious next steps are to store less, recalculate more, and use unified memory.

  Back
 
Keywords:
Consumer Engagement and Personalization, Finance - Quantitative Risk Management, GTC Silicon Valley 2018 - ID S8802
Download:
Debugging Tools & Techniques
Presentation
Media
Efficient Search for Inputs Causing High Floating-Point Errors
Wei-Fan Chiang (School of Computing, University of Utah)
This poster shows a new approach, guided search, for floating-point error estimation. Precision (floating-point error) estimation could help GPU software developers handle the performance/precision trade-off. Traditional abstract analyses cannot scale to real GPU kernels. On the other hand, our new guided search method could be applied to real GPU routines. This error estimation technique can also be used in conjunction with auto-tuning and algorithm selection.   Back
 
Keywords:
Debugging Tools & Techniques, GTC Silicon Valley 2014 - ID P4223
Download:
Deep Learning and AI
Presentation
Media
High-Performance GPU Programming for Deep Learning
Scott Gray (Nervana Systems)
This session goes over many of the techniques we use at Nervana in GPU programming to achieve state-of-the-art performance for deep learning networks. The main focus will be on the customization of dense linear algebra kernels: Winograd 3x3 convolution, direct convolution, and small-tile GEMM (matrix multiply). In particular, we'll look at how we achieve high utilization at very small mini-batches, which is important for multi-GPU scaling and inference. In addition, we'll talk about where and how you can effectively leverage lower and mixed precision to further increase performance without loss in accuracy.  Back
 
Keywords:
Deep Learning and AI, Performance Optimization, Algorithms and Numerical Techniques, GTC Silicon Valley 2016 - ID S6485
Streaming:
Download:
 
Efficient Deep Networks for Real-Time Classification in Constrained Platforms
Dr Jose Alvarez (Computer Vision Researcher, Data61 at CSIRO)
 
Keywords:
Deep Learning and AI, AI Conference Australia 2016 - ID AUS6105
Streaming:
 
Deep Neural Networks in Medical Imaging and Radiology: Preventative and Precision Medicine Perspectives
Dr Le Lu (Scientist, Department of Radiology and Imaging Sciences, National Institutes of Health, Clinical Center, USA)
Employing deep learning (DL), especially deep neural networks, for high-performance radiological and medical image computing is the main focus of this talk. We'll present the motivation, technical details, and quantitative results of our recent work at NIH on three core problems: 1) improving computer-aided detection (CAD) using convolutional neural networks and decompositional image representations; 2) robust bottom-up, multi-level deep convolutional networks for automated organ segmentation; 3) text/image deep mining on a large-scale radiology image database for automated image interpretation. We validate some very promising observations: DL both significantly improves upon traditional CAD tasks in (1) and enables exciting new research directions in (2, 3). This presentation is based on 11 recent papers published in MICCAI/CVPR/TMI/JMLR and three filed patents. We expect this work to have positive impacts in both preventative and precision medicine.

  Back
 
Keywords:
Deep Learning and AI, AI Conference Australia 2016 - ID AUS6102
Streaming:
 
3D DeepObject for Precision 3D Mapping
Bingcai Zhang (BAE Systems)
3D DeepObject achieves mapping-level positional accuracy. In the geospatial intelligence space, positional accuracy is as important as precision and recall. Unfortunately, convolutional networks in deep learning are invariant to translation; in other words, the positional accuracy of deep learning object detection is inherently poor. By combining deep learning and 3D model fitting, our 3D DeepObject has the best of both worlds: deep learning can detect an object (a bounding box) with close to human-level accuracy, while 3D model fitting can achieve pixel-level positional accuracy. The outputs (bounding boxes) from deep learning are the inputs to 3D model fitting, and a bounding box from deep learning can significantly reduce the search space for 3D model fitting. Our latest tests indicate that 3D DeepObject can achieve much higher positional accuracy than deep learning or 3D model fitting alone.  Back
 
Keywords:
Deep Learning and AI, Federal, Computer Vision and Machine Vision, GTC Silicon Valley 2017 - ID S7149
Download:
 
Training of Deep Networks with Half-Precision Float
Boris Ginsburg (NVIDIA)
We'll describe new algorithms used to train very deep networks with half-precision float. Float16 has two major potential benefits: better training speed and a reduced memory footprint. But float16 has a very narrow numerical range (0.00006 to 65504). This narrow range can result in overflow (the "inf/nan" problem) or underflow (the "vanishing gradient" problem) during training of deep networks. We'll describe the new scaling algorithm, implemented in nvcaffe, which prevents these negative effects. With this algorithm, we successfully trained networks such as AlexNet, GoogLeNet, Inception_v3, and ResNets without any loss in accuracy. Other contributors to this work are S. Nikolaev, M. Houston, A. Kiswani, A. Gholaminejad, S. Migacz, H. Wu, A. Fit-Florea, and U. Kapasi.

  Back
 
Keywords:
Deep Learning and AI, Algorithms and Numerical Techniques, Performance Optimization, GTC Silicon Valley 2017 - ID S7218
Download:
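The scaling idea can be illustrated with a small, hedged sketch (not the nvcaffe implementation): the loss is multiplied by a scale factor S so float16 gradients stay inside the representable range, and the update kernel divides by S again, flagging steps that still overflowed. All names below are illustrative.

```cpp
// Hedged loss-scaling sketch. Gradients arrive in float16 computed on a loss
// scaled by S; the update unscales them into fp32 master weights. A real
// implementation checks all gradients for inf/nan *before* applying the step
// and lowers S when an overflow is detected.
#include <cuda_fp16.h>
#include <cmath>
#include <cstdio>

__global__ void sgd_unscale_update(float* w, const __half* grad_f16,
                                   float inv_scale, float lr,
                                   int n, int* overflow) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float g = __half2float(grad_f16[i]) * inv_scale;  // undo the loss scale
    if (!isfinite(g)) { atomicExch(overflow, 1); return; }
    w[i] -= lr * g;                                   // fp32 master weights
}

int main() {
    const int n = 1024; const float scale = 1024.f, lr = 0.01f;
    float* w; __half* g; int* overflow;
    cudaMallocManaged(&w, n * sizeof(float));
    cudaMallocManaged(&g, n * sizeof(__half));
    cudaMallocManaged(&overflow, sizeof(int));
    for (int i = 0; i < n; ++i) { w[i] = 1.f; g[i] = __float2half(0.5f * scale); }
    *overflow = 0;
    sgd_unscale_update<<<(n + 255) / 256, 256>>>(w, g, 1.f / scale, lr, n, overflow);
    cudaDeviceSynchronize();
    // A training driver would skip the step and reduce S when *overflow is set.
    printf("overflow=%d  w[0]=%f\n", *overflow, w[0]);
    return 0;
}
```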
 
ZipML: Faster Machine Learning via Low-Precision Communication and Computation
Dan Alistarh (IST Austria & ETH Zurich), Ce Zhang (ETH Zurich)
We'll present new techniques for training machine learning models using low-precision computation and communication. We'll start by briefly outlining new theoretical results proving that, surprisingly, many fundamental machine learning tools, such as dense generalized linear models, can be trained end-to-end (samples, model, and gradients) using low precision (as little as one bit per value), while still guaranteeing convergence. We'll then explore the implications of these techniques with respect to two key practical applications: multi-GPU training of deep neural networks, and compressed sensing for medical and astronomical data.

  Back
 
Keywords:
Deep Learning and AI, Performance Optimization, GTC Silicon Valley 2017 - ID S7580
Download:
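A hedged sketch of one ingredient the session describes, unbiased stochastic 1-bit quantization: each value g in [-M, M] is rounded to +M with probability (g + M) / (2M) and to -M otherwise, so the expected value equals g. The kernel below is illustrative (with a toy hash in place of a real RNG), not the ZipML code.

```cpp
// Unbiased 1-bit stochastic quantization sketch: one sign bit per value plus
// a shared scale M. E[q_i] = g_i, so convergence guarantees can survive.
#include <cuda_runtime.h>
#include <cstdio>

__device__ float hash_uniform(unsigned s) {        // toy RNG stand-in only
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;
    return (s & 0xFFFFFF) / 16777216.0f;
}

__global__ void quantize_1bit(const float* g, float* q, float M, int n,
                              unsigned seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float p = (g[i] + M) / (2.0f * M);             // probability of +M level
    q[i] = (hash_uniform(seed + i) < p) ? M : -M;
}

int main() {
    const int n = 8;
    float *g, *q;
    cudaMallocManaged(&g, n * sizeof(float));
    cudaMallocManaged(&q, n * sizeof(float));
    float vals[n] = {0.9f, -0.9f, 0.1f, -0.1f, 0.5f, -0.5f, 0.0f, 1.0f};
    for (int i = 0; i < n; ++i) g[i] = vals[i];
    quantize_1bit<<<1, 32>>>(g, q, 1.0f, n, 42u);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i) printf("%+.2f -> %+.1f\n", g[i], q[i]);
    cudaFree(g); cudaFree(q);
    return 0;
}
```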
 
Diet Networks: Thin Parameters for Fat Genomics
Adriana Romero Soriano (University of Montreal, Montreal Institute for Learning Algorithms)
Learning tasks such as those involving genomic data often pose a serious challenge: the number of input features can be orders of magnitude larger than the number of training examples, making it difficult to avoid overfitting when training deep learning models. Improving the ability of deep learning to handle such datasets could have an important impact in precision medicine, where high-dimensional data regarding a particular patient is used to make predictions of interest. We propose a novel neural network parameterization, which we call Diet Networks, that considerably reduces the number of free parameters in the model. The Diet Networks parametrization is based on the idea that we can first learn or provide an embedding for each input feature, and then learn how to map a feature's representation to the parameters linking the value of the feature to each of the hidden units of the classifier network. We experiment on a population stratification task of interest to medical studies and show that the proposed approach can significantly reduce both the number of parameters and the error rate of the classifier. This work was accepted at ICLR 2017.  Back
 
Keywords:
Deep Learning and AI, Computational Biology and Chemistry, GTC Silicon Valley 2017 - ID S7643
Download:
 
Deep Learning in the Healthcare Enterprise
Mark Michalski (MGH & Brigham Women's Hospital Center for Clinical Data Science)
Deep learning tools present a tremendous opportunity to improve healthcare. By increasing the efficiency and accuracy of diagnostic testing, and by extracting meaning from vast troves of clinical data, deep learning provides a pathway to true precision care. However, there are challenges in translating this technology to the clinic: model performance, infrastructure development, data privacy, hospital policy, and vendor relationships are all critical components of this effort. We'll discuss the early experience of the MGH & BWH Center for Clinical Data Science in supporting the translation of deep learning technologies in medicine, touching upon many of the existing and emerging technical, clinical, and cultural challenges that this work presents.

  Back
 
Keywords:
Deep Learning and AI, AI in Healthcare, Healthcare and Life Sciences, Medical Imaging and Radiology, GTC Silicon Valley 2017 - ID S7722
Download:
 
Mixed Precision Training of Deep NN with Volta
Boris Ginsburg (NVIDIA)
We'll describe training of very deep networks with mixed-precision float ("float16") using Volta Tensor Cores. Float16 has two major potential benefits: high training speed and a reduced memory footprint. But float16 has a smaller numerical range than regular single-precision float, which can result in overflow or underflow ("vanishing gradient") during training. We'll describe a simple rescaling mechanism that solves these potential issues. With this rescaling algorithm, we successfully used mixed-precision training for networks such as AlexNet, GoogLeNet, Inception_v3, and ResNets without any loss in accuracy. Other contributors to this work are S. Nikolaev, M. Houston, A. Kiswani, A. Gholaminejad, S. Migacz, H. Wu, A. Fit-Florea, and U. Kapasi.

  Back
 
Keywords:
Deep Learning and AI, GTC Israel 2017 - ID SIL7116
Download:
 
Improving Cancer Treatment with Genomics and AI on NVIDIA
Nicholas Therkelsen-Terry (Max Kelsen)
Learn how NVIDIA is enabling a new wave of cancer treatment powered by AI, immunotherapy, and genomics. A collaboration between a large-scale sequencing company, world-leading researchers, and AI machine learning teams is tackling the problem of whole-genome precision medicine by predicting whether a patient will respond to new cancer drugs. NVIDIA supports these world-leading efforts, giving the collaborators the ability to overcome current limitations. This innovative project is a partnership between two SMEs, one of Australia's largest medical research institutes, and a global sequencing corporation. Each partner brings complementary strengths: genomiQa specialises in the analysis of genomic data; Max Kelsen uses AI to mine big data and explore novel insights; QIMR Berghofer is a leader in immunology and cancer genomics research; and BGI is a major supplier of genomic sequencing. We will use sophisticated artificial intelligence approaches to integrate genomic, transcriptomic, and patient clinical information to identify a predictor and develop a test of treatment response. The classifier will be developed using genomic data from a large melanoma project. We will then validate and refine the classifier in a second cohort of 400 lung cancer patients collected through routine practice within the Australian health system.  Back
 
Keywords:
Deep Learning and AI, AI Conference Australia 2018 - ID AUS80030
Download:
Developer - Algorithms
Presentation
Media
Accelerating Explicit FEM Shock & Blast Simulations
Nachiket Gokhale (Weidlinger Associates Inc)
Explicit finite element codes are widely used to simulate the response of structures and mechanical equipment subjected to shock, blast, and wave propagation phenomena. High-resolution models with run times ranging from a few seconds to a few months are common, and hence the payoff from GPU acceleration is tremendous. We describe the acceleration of our commercial finite element code NLFLEX using CUDA. We developed GPU kernels in CUDA, based on our production code NLFLEX, for linear elasticity, explosives, elasto-plasticity, and large-deformation elasticity. We attained an order-of-magnitude (10x) acceleration in single precision and approximately 5x in double-precision mode.   Back
 
Keywords:
Developer - Algorithms, Computational Fluid Dynamics, Physics Simulation, GTC Silicon Valley 2010 - ID S102061
Streaming:
Download:
 
CUDA Research Roundtable: Mixed Precision GPU Computing
Steve McMillan
Many algorithms used in computational physics can be greatly accelerated by the use of GPUs. However, the full double-precision floating point operations to which scientists are accustomed can prove costly, especially in compute-intensive applications where floating-point computations rather than memory bandwidth limit performance. In fact, many scientific problems actually require double precision in only a small subset of the code. In these cases, the development of mixed-precision algorithms can bring substantial improvements without sacrificing overall accuracy. Double precision is used only where it is required; the remaining calculations are carried out in single precision. Although the task of identifying "necessary" double-precision code may be non-trivial, the performance payoff can be considerable. This approach also facilitates software emulation of double precision in key portions of the code, which can be effective in accelerating GPU double-precision performance. In practice, the emulation is neither complete nor IEEE-compliant, and is cumbersome to code without support at the compiler or processor levels, but impressive improvements in speed have been obtained. This roundtable will discuss progress made so far in mixed-precision calculations and emulation techniques for GPUs, and will consider the prospects for future development of these approaches.

  Back
 
Keywords:
Developer - Algorithms, GTC Silicon Valley 2009 - ID S09021
Download:
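One classical emulation technique alluded to above represents a value as an unevaluated sum of two floats and uses Knuth's two-sum to capture the rounding error of each addition. A minimal, hypothetical CUDA sketch (not code from the roundtable):

```cpp
// Double-single ("two-float") addition sketch: value = hi + lo, where lo
// holds the rounding error that a plain float sum would discard. Two-sum
// uses only adds/subtracts, so FMA contraction cannot break it.
#include <cuda_runtime.h>
#include <cstdio>

struct float2x { float hi, lo; };

__device__ float2x ds_add(float2x a, float2x b) {
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);   // exact error of the hi sum
    e += a.lo + b.lo;                          // fold in the low parts
    float hi = s + e;                          // renormalize
    return { hi, e - (hi - s) };
}

__global__ void demo(float* plain, float2x* ds) {
    float s = 1.0f;
    float2x d = {1.0f, 0.0f};
    for (int i = 0; i < 10000; ++i) {
        s += 1e-8f;                            // each add is lost in float
        float2x t = {1e-8f, 0.0f};
        d = ds_add(d, t);                      // retained by the two-float sum
    }
    *plain = s;
    *ds = d;
}

int main() {
    float* p; float2x* d;
    cudaMallocManaged(&p, sizeof(float));
    cudaMallocManaged(&d, sizeof(float2x));
    demo<<<1, 1>>>(p, d);
    cudaDeviceSynchronize();
    printf("float: %.10f  two-float: %.10f\n", *p, (double)d->hi + d->lo);
    return 0;
}
```

The plain float accumulator never moves off 1.0 because each 1e-8 increment is below half an ulp, while the two-float sum recovers the expected ~1.0001.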
 
GPU Accelerated Solvers for ODEs Describing Cardiac Membrane Equations
Fred Lionetti
Mathematical models describing cellular membranes form the basis of whole-tissue models describing the electrical activity of entire organs, such as the heart. Numerical simulations based on these models are useful both for basic science and, increasingly, for clinical diagnostic and therapeutic applications such as targeting ablation therapy for atrial arrhythmias, defibrillator design, and cardiac resynchronization therapy. A common bottleneck in such simulations arises from solving large, stiff systems of ordinary differential equations (ODEs) thousands of times for numerous integration points (representing cells) throughout a three-dimensional tissue or organ model. For some electrophysiology simulations, over 80% of the time is spent solving these systems of ODEs. While a cluster provides the required interactive response time to solve the ODEs, a desktop-sized platform would enhance the usability of the software in a laboratory setting. The audience will learn how a real-world, complex HPC application can directly benefit from the use of CUDA technology, and which optimization techniques yielded the best performance results on an actual application. We will also explore the benefits and limits of the use of single precision in certain scientific applications.  Back
 
Keywords:
Developer - Algorithms, Life & Material Science, Medical Imaging and Radiology, Visualization, GTC Silicon Valley 2009 - ID S09036
Streaming:
Download:
 
Using the GPU for Gradient Reconstruction of Unstructured Meshes
Michael Heck
Impressive performance gains have been obtained on field calculations for large volumes of seismic survey data which are hierarchically represented [1]. Gradient reconstruction, for both scalar and vector unstructured fields, is yet another performance-critical task in engineering simulations such as computational fluid dynamics (CFD) and finite element analysis (FEA). The latest GPU hardware has improved significantly in terms of memory capacity and memory random-access efficiency, which makes GPU computing attractive for engineering simulation [2]. Based on requirements drawn from cross-disciplinary fields including geophysical modelling, material analysis, and manufacturing, this study continues to investigate double-precision performance and its scalability across multiple GPUs. A software framework is designed in which algorithms can be conveniently implemented in a heterogeneous computing environment with mixed CPU and GPU configurations. Attention has also been steered toward the integration of GPU algorithms in the end-user application software Avizo, which will enable the application of the algorithm in industrial aerodynamic simulations where mixed-element unstructured meshes dominate.  Back
 
Keywords:
Developer - Algorithms, Computational Fluid Dynamics, Physics Simulation, Visualization, GTC Silicon Valley 2009 - ID S09110
Download:
 
Accelerating Quantum Chemistry Research Using GPUs-Two Electron Repulsion Integrals in GAMESS
Veerendra Allada
We present an implementation of Rys quadrature algorithms for Electron Repulsion Integrals (ERI) of up to g functions on Graphical Processing Units (GPUs). We outline the general GPU programming model, challenges associated with implementing the Rys quadrature on highly parallel emerging architectures, and our approach to implementing the quadrature. The performance of the implementation is evaluated for single and double precision on two GPU devices. The performance obtained is on par with the matrix-vector routine from the CUDA Basic Linear Algebra Subroutines (CUBLAS) library.   Back
 
Keywords:
Developer - Algorithms, GTC Silicon Valley 2009 - ID S09449
Download:
 
A Hybrid Method for Solving Tridiagonal Systems on GPU
Yao Zhang (University of California, Davis)
Tridiagonal linear systems are of importance to many problems in numerical analysis and computational fluid dynamics, as well as to computer graphics applications in video games and computer-animated films. This poster presents our study on the performance of multiple tridiagonal algorithms on a GPU. We design a novel hybrid algorithm that combines a work-efficient algorithm with a step-efficient algorithm in a way well-suited for a GPU architecture. Our hybrid solver achieves 8x and 2x speedup respectively in single precision and double precision over a multi-threaded highly-optimized CPU solver and a 2x speedup over a basic GPU solver.  Back
 
Keywords:
Developer - Algorithms, GTC Silicon Valley 2010 - ID P10A07
Download:
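For reference, the work-efficient end of this design space is the sequential Thomas algorithm; when many independent systems are available (as in ADI sweeps), one system per thread already parallelizes well. A hypothetical sketch, not the poster's hybrid code:

```cpp
// Batched Thomas algorithm: each thread solves one tridiagonal system
// (a = sub-, b = main, c = super-diagonal, d = rhs). Systems are stored
// contiguously here for clarity; a production kernel would interleave them
// so that global memory accesses coalesce.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void thomas_batch(const float* a, const float* b, const float* c,
                             float* d, float* x, float* cp, int n, int batch) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= batch) return;
    const float *A = a + s * n, *B = b + s * n, *C = c + s * n;
    float *D = d + s * n, *X = x + s * n, *CP = cp + s * n;
    CP[0] = C[0] / B[0];                       // forward elimination
    D[0]  = D[0] / B[0];
    for (int i = 1; i < n; ++i) {
        float m = 1.0f / (B[i] - A[i] * CP[i - 1]);
        CP[i] = C[i] * m;
        D[i]  = (D[i] - A[i] * D[i - 1]) * m;
    }
    X[n - 1] = D[n - 1];                       // back substitution
    for (int i = n - 2; i >= 0; --i)
        X[i] = D[i] - CP[i] * X[i + 1];
}

int main() {
    const int n = 64, batch = 1024, total = n * batch;
    float *a, *b, *c, *d, *x, *cp;
    cudaMallocManaged(&a, total * sizeof(float));
    cudaMallocManaged(&b, total * sizeof(float));
    cudaMallocManaged(&c, total * sizeof(float));
    cudaMallocManaged(&d, total * sizeof(float));
    cudaMallocManaged(&x, total * sizeof(float));
    cudaMallocManaged(&cp, total * sizeof(float));
    for (int i = 0; i < total; ++i) { a[i] = -1.f; b[i] = 2.f; c[i] = -1.f; d[i] = 1.f; }
    thomas_batch<<<(batch + 127) / 128, 128>>>(a, b, c, d, x, cp, n, batch);
    cudaDeviceSynchronize();
    printf("system 0: x[0]=%f x[1]=%f\n", x[0], x[1]);
    return 0;
}
```

The step-efficient alternatives the poster compares (e.g., cyclic reduction) instead use many threads per system, which is what the hybrid scheme balances against this per-thread approach.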
 
GPU Algorithms for NURBS Minimum Distance and Clearance Computations
Adarsh Krishnamurthy (University of California, Berkeley)
We present GPU algorithms and strategies for accelerating distance queries and clearance computations on models made of trimmed NURBS surfaces. We provide a generalized framework for using GPUs as co-processors in accelerating CAD operations. The accuracy of our algorithm is based on the model space precision, unlike earlier graphics algorithms that were based only on image space precision. Our algorithms are at least an order of magnitude faster and about two orders of magnitude more accurate than the commercial solid modeling kernel ACIS.  Back
 
Keywords:
Developer - Algorithms, GTC Silicon Valley 2010 - ID P10A15
Download:
 
Floating Point and IEEE 754 Compliance for NVIDIA GPUs: Precision & Performance
Alex Fit-Florea (NVIDIA)
As a result of continuing improvements, NVIDIA offers GPU-accelerated floating-point performance in compliance with IEEE 754. It is our experience that a number of issues related to floating-point accuracy and compliance are a frequent source of confusion on both CPUs and GPUs. The purpose of this talk is to discuss the most common ones related to NVIDIA GPUs and to supplement the documentation in the CUDA C Programming Guide.

  Back
 
Keywords:
Developer - Algorithms, GTC Silicon Valley 2012 - ID S2085
Streaming:
Download:
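One frequent source of the confusion mentioned above is fused multiply-add, which rounds a*b+c once instead of twice. A small illustrative program (an assumption-laden demo, not from the talk) computes x*x - 1 with x = 1 + 2^-12 both ways:

```cpp
// FMA vs. separate multiply-add: the fused version keeps the tiny x^2 term
// (2^-24) that the separately rounded product x*x has already discarded.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

__global__ void fma_demo(float x, float* out) {
    out[0] = __fmaf_rn(x, x, -1.0f);           // single rounding
    out[1] = __fmul_rn(x, x) - 1.0f;           // two roundings: x*x first
}

int main() {
    float* out;
    cudaMallocManaged(&out, 2 * sizeof(float));
    float x = 1.0f + ldexpf(1.0f, -12);        // x = 1 + 2^-12
    fma_demo<<<1, 1>>>(x, out);
    cudaDeviceSynchronize();
    printf("fma    : %.10e\n", out[0]);        // 2^-11 + 2^-24
    printf("mul+add: %.10e\n", out[1]);        // 2^-11 exactly
    return 0;
}
```

Neither answer is "wrong"; the fused result is actually closer to the exact value, which is why compiler contraction of a*b+c into FMA changes results without violating IEEE 754.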
 
GPU-Accelerated 3-D Electromagnetic Particle-in-Cell Implementations in VORPAL
Keegan Amyx (Tech-X Corporation)
We present recent developments in implementing 3D GPU-accelerated electromagnetic particle-in-cell particle updates in the plasma physics framework VORPAL. The primary challenge in PIC methods on GPUs is thread contention during the current deposition stage: we resolve these thread contentions by sorting particles into tiles of many cells each time step. Multiple thread blocks may be assigned to each tile, and each block accumulates the contribution from a moderate number of particles via an unsegmented Esirkepov 1st-order scheme. We achieve update times of 50 ns per particle per timestep for a variety of realistic self-consistent double-precision EM simulations.   Back
 
Keywords:
Developer - Algorithms, GTC Silicon Valley 2012 - ID P2512
Download:
 
Acceleration of Finite-Element Matrix-Generation on Single and Multiple GPUs
Adam Dziekonski (Gdansk University of Technology, Poland)
We demonstrate an iterative algorithm dedicated to the fast generation of large sparse finite-element matrices on multiple GPUs. The proposed approach allowed us to overcome the memory size limitation of GPU RAM, so the problem size is limited only by CPU RAM. Our approach was verified on several workstations equipped with GPUs (1x Tesla K20, 2x Tesla C2075, 2x GeForce GTX 590) and CPUs (2x 8-core Intel Xeon Sandy Bridge E5-2687W). Compared with single-threaded and 32-threaded Intel Xeon Sandy Bridge implementations, our GPU-based (Tesla K20) implementation reduces matrix-generation time in double precision by factors of 26 and 10, respectively.  Back
 
Keywords:
Developer - Algorithms, GTC Silicon Valley 2013 - ID P3150
Download:
 
Hybrid GPU-CPU Multilevel Preconditioner for Solving Large Systems of FEM Equations
Adam Dziekonski (Gdansk University of Technology, Poland)
We demonstrate a fast implementation of the conjugate gradient iterative method with V-cycle multilevel (two- and three-level) preconditioners applied to solving real, symmetric, sparse systems obtained with the finite element method. Our approach was verified on two workstations equipped with GPUs (1x Tesla K20, 1x GTX 580) and CPUs (1x Intel Xeon 5680 (6 cores), 2x 8-core Intel Xeon Sandy Bridge E5-2687W). Compared with a 6-threaded Intel Xeon 5680 implementation, the hybrid GPU-CPU (GTX 580, Intel Xeon 5680) implementation reduced the solution time by a factor of 4.4 and allowed us to reach remarkable convergence of the iterative method in single and double precision.   Back
 
Keywords:
Developer - Algorithms, GTC Silicon Valley 2013 - ID P3165
Download:
 
Implementation of Nearest Neighbor Search on GPGPU Systems
Akiyoshi Wakatani (Konan University)
A nearest neighbor search with product quantization is a prominent method that achieves a high-precision search with less memory. However, in order to accomplish a large-scale search, it has to be accelerated using parallel systems such as GPGPU systems. The distance calculation is easily parallelized, but the reduction computation cannot be completely parallelized, which leads to performance degradation. We present a parallel method with a good parameter selection and our search policy, and show the effectiveness of the autotuning.  Back
 
Keywords:
Developer - Algorithms, Signal and Audio Processing, GTC Silicon Valley 2015 - ID P5103
Download:
 
Accelerating the Dense Singular Value Decomposition Using High Performance Polar Decomposition Implementation
Dalal Sukkari (KAUST)
The poster describes the QDWH-SVD framework for computing the singular value decomposition via the polar decomposition and the eigendecomposition. QDWH-SVD performs significantly more floating-point operations but exposes more concurrency and is richer in Level 3 BLAS than the standard singular value solvers. Thanks to GPU-based acceleration, QDWH-SVD is still able to outperform existing state-of-the-art implementations. Mixed-precision techniques allow performance to be improved further while maintaining numerical accuracy.  Back
 
Keywords:
Developer - Algorithms, Developer - Performance Optimization, GTC Silicon Valley 2015 - ID P5120
Download:
 
Faster Compressed Sparse Row (CSR)-based Sparse Matrix-Vector Multiplication Using CUDA
Jorge González-Domínguez (University of Mainz (Germany))
LightSpMV is a novel CUDA-compatible sparse matrix-vector multiplication (SpMV) algorithm using the standard compressed sparse row (CSR) sparse matrix storage format. Performance evaluation reveals that on a single Tesla K40c GPU, LightSpMV is superior to the state-of-the-art CUSP and cuSPARSE libraries, with speedups of up to 2.60x and 2.63x over CUSP, and up to 1.93x and 1.79x over cuSPARSE, for single and double precision, respectively. The source code of LightSpMV is available at http://lightspmv.sourceforge.net.  Back
 
Keywords:
Developer - Algorithms, Developer - Performance Optimization, GTC Silicon Valley 2015 - ID P5267
Download:
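For orientation, the baseline CSR "scalar" kernel (one thread per row) that load-balancing schemes such as LightSpMV improve upon can be sketched as follows; this is illustrative, not LightSpMV source:

```cpp
// Baseline CSR SpMV, one thread per row. Irregular row lengths leave warps
// imbalanced, which is exactly the problem smarter CSR schemes address.
#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

__global__ void spmv_csr_scalar(int rows, const int* row_ptr, const int* col_idx,
                                const float* vals, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float sum = 0.0f;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += vals[k] * x[col_idx[k]];        // gather along the row
    y[row] = sum;
}

int main() {
    // 3x3 example: [[2,0,1],[0,3,0],[4,0,5]] in CSR form.
    int h_ptr[] = {0, 2, 3, 5}, h_col[] = {0, 2, 1, 0, 2};
    float h_val[] = {2, 1, 3, 4, 5}, h_x[] = {1, 1, 1};
    int *ptr, *col; float *val, *x, *y;
    cudaMallocManaged(&ptr, sizeof(h_ptr)); cudaMallocManaged(&col, sizeof(h_col));
    cudaMallocManaged(&val, sizeof(h_val)); cudaMallocManaged(&x, sizeof(h_x));
    cudaMallocManaged(&y, 3 * sizeof(float));
    memcpy(ptr, h_ptr, sizeof(h_ptr)); memcpy(col, h_col, sizeof(h_col));
    memcpy(val, h_val, sizeof(h_val)); memcpy(x, h_x, sizeof(h_x));
    spmv_csr_scalar<<<1, 32>>>(3, ptr, col, val, x, y);
    cudaDeviceSynchronize();
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);   // expect [3, 3, 9]
    return 0;
}
```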
 
Accurate Floating-Point Summation in CUB
Uri Verner (NVIDIA)
We address the problem of accurate parallel floating-point summation. Two issues with current methods for parallel summation of floating-point numbers on GPUs are (1) loss of precision due to error propagation, and (2) the bitwise-exact result is not reproducible with a different architecture or configuration. We present a new efficient method for parallel accurate summation of an array of floating point numbers in CUB. The method computes a full-precision sum by recovering and keeping track of the round-off error. The method is implemented using parallel primitives such as sort and scan, and so it takes advantage of future optimizations of these primitives to new architectures. Our method can reduce the number of iterations in some iterative linear solvers, such as lattice QCD.  Back
 
Keywords:
Developer - Algorithms, Developer - Tools & Libraries, GTC Silicon Valley 2015 - ID S5211
Streaming:
Download:
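The core idea of recovering round-off error can be shown with Kahan compensated summation; the session's method builds a parallel, reproducible reduction from similar error-tracking building blocks. A minimal sequential-kernel sketch, illustrative only:

```cpp
// Kahan summation vs. naive summation. The compensation term c recovers the
// rounding error of each addition; the CUB method described above achieves a
// full-precision, reproducible result with parallel sort/scan primitives.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void sum_demo(const float* v, int n, float* out_naive, float* out_kahan) {
    float s = 0.f;                 // naive accumulator
    float k = 0.f, c = 0.f;        // Kahan accumulator + compensation
    for (int i = 0; i < n; ++i) {
        s += v[i];
        float y = v[i] - c;        // subtract previously lost error
        float t = k + y;
        c = (t - k) - y;           // error of this addition
        k = t;
    }
    *out_naive = s;
    *out_kahan = k;
}

int main() {
    const int n = 1 << 20;
    float* v; float* out;
    cudaMallocManaged(&v, n * sizeof(float));
    cudaMallocManaged(&out, 2 * sizeof(float));
    v[0] = 1.0f;
    for (int i = 1; i < n; ++i) v[i] = 1e-8f;  // each too small to register
    sum_demo<<<1, 1>>>(v, n, out, out + 1);
    cudaDeviceSynchronize();
    printf("naive: %.8f  kahan: %.8f  (exact: %.8f)\n",
           out[0], out[1], 1.0 + (n - 1) * 1e-8);
    return 0;
}
```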
 
Numerical Reproducibility Challenges on Extreme Scale Multi-Threading GPUs
Michela Taufer (University of Delaware)
Learn how to mitigate rounding errors that can hamper result reproducibility when concurrent executions burst and workflow determinism vanishes. This talk unveils the power of mathematical methods to model rounding-errors in scientific applications and illustrates how these methods can mitigate error drifting on new generation, many-core GPUs. We will discuss performance and accuracy issues for a diverse set of scientific applications that rely on floating point arithmetic. In particular, our experimental study will cover the following exploration space: floating point format and precision (e.g., single, double, and composite precision), numerical range used by the computation, degree of multi-threading, thread scheduling scheme, and algorithmic variant.  Back
 
Keywords:
Developer - Algorithms, Computational Physics, HPC and Supercomputing, GTC Silicon Valley 2015 - ID S5245
Streaming:
Download:
 
Synthesizing Effective Data Compression Algorithms for GPUs
Martin Burtscher (Texas State University)
Learn how to automatically generate high-performance lossless compression algorithms that are suitable for massively parallel execution on a GPU. Our technique requires no user guidance and can even be employed to synthesize a compressor that is optimized for a specific file or data set. While we target single- and double-precision floating-point data, our approach is equally applicable to other domains. This talk explains how the algorithm generator works and demonstrates how it can create completely novel and GPU-friendly algorithms that achieve heretofore unreached compression ratios.  Back
 
Keywords:
Developer - Algorithms, GTC Silicon Valley 2015 - ID S5260
Streaming:
Download:
Developer - Performance Optimization
Presentation
Media
High-Performance GEMV and SYMV with Auto-Tuning for Performance Stabilization on Multiple GPU Generations
Daichi Mukunoki (RIKEN Advanced Institute of Computational Sciences)
We are developing high-performance level-2 BLAS routines (GEMV and SYMV) with auto-tuning for NVIDIA GPUs. Our goal is to achieve the best performance at any matrix size, on any given configuration (precision, real/complex, trans, uplo), and on any GPU architecture/product. Our GEMV and SYMV are implemented with different auto-tuning strategies. Both routines exhibit competitive performance, in terms of both throughput and performance stability with respect to matrix size, on multiple GPU generations when compared to the latest major implementations.  Back
 
Keywords:
Developer - Performance Optimization, Developer - Tools & Libraries, GTC Silicon Valley 2015 - ID P5320
Download:
 
Accelerating Cholesky-based Dense Matrix Inversion on GPUs
Ali Charara (King Abdullah University of Science and Technology)
We present an accelerated CUBLAS operation (TRMM) and its application in speeding up the inversion of symmetric positive definite (SPD) matrices. Our implementation of TRMM brings a 6x speedup on a Kepler K20 GPU card in double-precision arithmetic. Inverting an SPD matrix (POTRI) gains a 5x speedup. Further enhancement of POTRI with a tile-based, statically scheduled algorithm brings additional speedup and facilitates a multi-GPU implementation for matrix inversion.  Back
 
Keywords:
Developer - Performance Optimization, Developer - Tools & Libraries, GTC Silicon Valley 2015 - ID P5324
Download:
 
GPU Source-Code Optimizations: Increase Performance, Reduce Energy
Jared Coplin (Texas State University)
Learn how source-code optimizations can work alone and in combination to improve not only the performance but also the energy consumption and power draw of a modern compute GPU. In addition, understand how lowering the GPU frequency, enabling ECC, and switching from single to double precision affects runtime, energy, and power.

  Back
 
Keywords:
Developer - Performance Optimization, HPC and Supercomputing, GTC Silicon Valley 2015 - ID S5110
Streaming:
Download:
 
Voting And Shuffling For Fewer Atomic Operations
Elmar Westphal (Forschungszentrum Jülich GmbH)
Even though atomic operations became much faster with the introduction of the Kepler architecture, they are still a bottleneck in many algorithms and applications. This is especially true for operations that are not natively supported on the device and have to be implemented using atomicCAS loops (e.g. double precision additions), because modifying the same data by multiple threads within the same warp will, due to warp divergence, also stall the threads already done. This talk will show how to use warp votes and shuffle operations to pre-combine data within a warp by destination-address, in parallel. This can significantly reduce the total number of atomic operations in a kernel call and eliminates CAS loop iterations caused within the same warp.  Back
 
Keywords:
Developer - Performance Optimization, Developer - Algorithms, GTC Silicon Valley 2015 - ID S5151
Streaming:
Download:
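The simplest instance of this pre-combining idea is when all lanes of a warp target the same address: a shuffle reduction collapses 32 atomics into one. The talk's technique additionally groups lanes by destination address using warp votes; the hypothetical sketch below shows only the uniform-address case.

```cpp
// Warp-level pre-combination: reduce 32 per-lane values with shuffles, then
// issue a single atomicAdd from lane 0 (1 atomic per warp instead of 32).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void warp_combined_add(float* target, const float* vals) {
    int lane = threadIdx.x & 31;
    float v = vals[blockIdx.x * blockDim.x + threadIdx.x];
    for (int off = 16; off > 0; off >>= 1)        // in-warp tree reduction
        v += __shfl_down_sync(0xffffffffu, v, off);
    if (lane == 0) atomicAdd(target, v);          // one atomic per warp
}

int main() {
    const int n = 1024;
    float *vals, *target;
    cudaMallocManaged(&vals, n * sizeof(float));
    cudaMallocManaged(&target, sizeof(float));
    for (int i = 0; i < n; ++i) vals[i] = 1.0f;
    *target = 0.0f;
    warp_combined_add<<<n / 256, 256>>>(target, vals);
    cudaDeviceSynchronize();
    printf("sum = %f (expect %d)\n", *target, n);
    return 0;
}
```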
Developer - Programming Languages
Presentation
Media
Accelerating miniFE: A Finite Element Mini-application
Justin Luitjens (NVIDIA)
The Mantevo performance project is a collection of self-contained proxy applications that illustrate the main performance characteristics of important algorithms. miniFE is intended to be an approximation of an unstructured implicit finite element or finite volume application. Our work investigated algorithms for assembling a matrix on the GPU. Parallelization algorithms using both 1 thread and 8 threads per element were investigated. Using these approaches, we achieved a significant speedup (over 60x for double precision) compared to the serial algorithm.

  Back
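A hypothetical sketch of the one-thread-per-element strategy (illustrative names, not the miniFE code), here scattering element contributions into a global right-hand side; atomics resolve conflicts between elements sharing a node, and need native FP64 atomicAdd (sm_60+) or the CAS fallback shown earlier in this catalog:

    __global__ void assembleRhs(const int *elemNodes, const double *elemRhs,
                                double *globalRhs, int numElems, int nodesPerElem) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (e >= numElems) return;
        for (int a = 0; a < nodesPerElem; ++a) {
            int node = elemNodes[e * nodesPerElem + a];
            // Atomics serialize only where elements share a node.
            atomicAdd(&globalRhs[node], elemRhs[e * nodesPerElem + a]);
        }
    }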
 
Keywords:
Developer - Programming Languages, GTC Silicon Valley 2012 - ID S2302
Streaming:
Download:
Developer - Tools & Libraries
Presentation
Media
Tridiagonal Solvers on the GPU and Applications to Fluid Simulation
Nikolai Sakharnykh
This presentation will explore an efficient GPU implementation of direct numerical simulation of turbulent, viscous, incompressible fluid in a 3D domain. We will discuss solving the full system of Navier-Stokes and energy equations using the Alternating Direction Implicit (ADI) numerical method, as well as implementation details of a fast tridiagonal matrix solver in CUDA. Finally, we will compare the performance of GPU and CPU on a particular modeling problem in which the GPU outperforms the latest multicore CPUs by an order of magnitude in double precision on the whole solver.
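For illustration, a minimal one-thread-per-system Thomas solver of the kind ADI methods batch by the thousands (a sketch under simple assumptions, not the presented implementation; each system's coefficients are stored contiguously, and cp is per-system scratch of length n):

    __global__ void thomasSolve(const float *a, const float *b, const float *c,
                                float *d, float *cp, int n, int numSystems) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;   // one system per thread
        if (s >= numSystems) return;
        const float *aa = a + s * n, *bb = b + s * n, *cc = c + s * n;
        float *dd = d + s * n, *cpp = cp + s * n;
        // Forward elimination.
        cpp[0] = cc[0] / bb[0];
        dd[0]  = dd[0] / bb[0];
        for (int i = 1; i < n; ++i) {
            float m = 1.0f / (bb[i] - aa[i] * cpp[i - 1]);
            cpp[i] = cc[i] * m;
            dd[i]  = (dd[i] - aa[i] * dd[i - 1]) * m;
        }
        // Back substitution; the solution overwrites d.
        for (int i = n - 2; i >= 0; --i)
            dd[i] -= cpp[i] * dd[i + 1];
    }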
 
Keywords:
Developer - Tools & Libraries, Astronomy and Astrophysics, Computational Fluid Dynamics, GTC Silicon Valley 2009 - ID S09058
Streaming:
Download:
 
Improving Host-GPU Communication with Buffering Schemes
Guillermo Marcus (University of Heidelberg, ZITI)
This session will present the Buffer Management Library, a collection of C++ templates to simplify and improve data transfers between a GPU and a host computer. The library provides multiple known algorithms for data transfers, including chunked, double, and pooled buffers. In addition, the library allows data transformations to be performed concurrently with the transfers, simplifying the conversion of data formats between host and GPU, as is common for double-to-single precision conversions as well as AOS/SOA rearrangements. Using the library significantly reduces the amount of pinned memory required for transfers and removes the limitation of having huge buffers locked for use by the GPU. Overlapping data transformations with data transfers can result in more than 20% performance improvement over separate convert and copy operations.
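A minimal double-buffering sketch with hypothetical names (not the library's API): while chunk k executes in one stream, chunk k+1 is transferred in the other; hPinned must be page-locked (cudaHostAlloc) for the copies to be truly asynchronous:

    __global__ void process(float *data, int n);   // user-supplied kernel

    void pipelined(const float *hPinned, float *dBuf[2],
                   int numChunks, int chunkElems) {
        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);
        size_t bytes = chunkElems * sizeof(float);
        for (int k = 0; k < numChunks; ++k) {
            int b = k & 1;   // alternate between the two device buffers
            cudaMemcpyAsync(dBuf[b], hPinned + (size_t)k * chunkElems, bytes,
                            cudaMemcpyHostToDevice, s[b]);
            // In-stream ordering guarantees the copy finishes before the kernel
            // and before buffer b is reused for chunk k+2.
            process<<<(chunkElems + 255) / 256, 256, 0, s[b]>>>(dBuf[b], chunkElems);
        }
        cudaDeviceSynchronize();
        for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    }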
 
Keywords:
Developer - Tools & Libraries, GTC Silicon Valley 2013 - ID S3160
Streaming:
Download:
 
Floating-point Precision vs Performance Trade-offs
Wei-Fan Chiang (University of Utah)
Learn how to write high-performance yet precise GPU programs by understanding the potential floating-point imprecision your programs could have. Floating-point accuracy is often a neglected issue in GPU program development, but it plays a critical role in assuring reliability. In this session, we will describe how performance-tuning techniques such as changing the floating-point type or changing the underlying algorithm affect floating-point precision. Furthermore, see how to estimate floating-point imprecision, and how our novel affine-arithmetic-based, control-flow-sensitive analysis can help programmers make informed decisions regarding the performance/precision trade-off in their programs. These concepts will be illustrated with real examples from public GPU benchmark sets.
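A toy host-side illustration of the type-choice effect discussed here (not from the session): 0.1 is not exactly representable in binary, and float's 24-bit significand lets the rounding error grow visibly in a long accumulation:

    #include <cstdio>

    int main() {
        float fs = 0.0f;
        double ds = 0.0;
        for (int i = 0; i < 10000000; ++i) { fs += 0.1f; ds += 0.1; }
        // The float sum drifts far from the exact value of 1000000;
        // the double sum stays much closer.
        printf("float: %f   double: %f   (exact: 1000000)\n", fs, ds);
        return 0;
    }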
 
Keywords:
Developer - Tools & Libraries, Developer - Programming Languages, GTC Silicon Valley 2013 - ID S3309
Streaming:
Download:
 
CUDA Toolkit 8 Performance Overview
Pramod Ramarao (NVIDIA)
Learn how updates to the CUDA toolkit improve the performance of GPU-accelerated applications. Through benchmark results, we will review the impact of new libraries, updates to memory management and mixed precision programming. The session will cover performance of CUDA toolkit components including libraries and the compiler.
 
Keywords:
Developer - Tools & Libraries, GTC Webinars 2016 - ID GTCE120
Streaming:
Download:
Earth Systems Modeling
Presentation
Media
Unleashing the Performance Potential of GPU for Atmospheric Dynamic Solvers
Haohuan Fu (Tsinghua University)
We'll demonstrate our efforts to develop highly efficient solvers for atmospheric dynamics on GPU platforms. Besides general optimizations for GPU-based scientific computing applications, we apply optimization strategies that are specifically customized for atmospheric dynamic solvers. We'll show that by combining both algorithmic and architectural considerations, our optimization improves computational efficiency from the original 2.24% to around 16% at the peak, with a sustained double-precision performance of 1.04 Tflops within one CPU-GPU node. We think this work demonstrates a huge potential for performing more efficient climate modeling on GPU platforms.
 
Keywords:
Earth Systems Modeling, Performance Optimization, HPC and Supercomputing, GTC Silicon Valley 2016 - ID S6354
Streaming:
Download:
Embedded
Presentation
Media
Early Evaluation of the Jetson TK1 Development Board for Power and Performance
Jee Choi (Georgia Tech)
In this session, we will describe our experience evaluating the Jetson TK1 development board for performance, energy, and power. We first describe the benchmarks used in our evaluation and present performance and power results for various workloads, including single- and double-precision compute, memory bandwidth, and more. We will also present our model for predicting the energy costs of various operations under different frequency and voltage settings, and show how different settings map to different arithmetic-intensity regimes in terms of performance and energy efficiency. Finally, we present preliminary results in using the Jetson TK1 for computing the fast multipole method (FMM) kernel and compare its performance and energy efficiency against that of high-end Tesla GPUs.
 
Keywords:
Embedded, GTC Silicon Valley 2015 - ID S5407
Streaming:
Download:
 
A High-Precision GPU, CPU and Memory Power Model for the Tegra K1 SoC
Kristoffer Robin Stokke (University of Oslo, Simula Research Laboratory)
Learn how to build a high-precision power model for the Tegra K1 heterogeneous multicore SoC. Our methodology considers actual (measured) rail voltages, which vary with GPU, CPU, and RAM operating frequency; power gating; hardware configurations; leakage currents; and dynamic power, as well as traditional hardware performance counters. Our model is therefore able to predict the power usage of individual units such as the Kepler-based GPU, dual-cluster CPU, and memory (RAM) with an estimation accuracy above 98% for several CUDA kernels and software workloads. The main take-away from our talk is learning how the Tegra K1 consumes energy under various software workloads, which can be used to optimise code and applications.
 
Keywords:
Embedded, Performance Optimization, Intelligent Machines and IoT, GTC Silicon Valley 2016 - ID S6324
Streaming:
Download:
Energy Exploration
Presentation
Media
GPU-Accelerated Parallel Computing for Simulation of Seismic Wave Propagation
Taro Okamoto (Department of Earth and Planetary Sciences, Tokyo Institute of Technology)
We adopted GPUs to accelerate large-scale, parallel finite-difference (FDTD) simulation of seismic wave propagation. An effective parallel implementation is needed because the memory of a single GPU is too small for real applications. We therefore describe the memory optimization, the three-dimensional domain decomposition, and the overlapping of communication and computation adopted in our program. So far we have achieved a high performance of about 61 Tflops (single precision) using 1200 GPUs of TSUBAME-2.0, the GPU supercomputer at Tokyo Institute of Technology, Japan. As an important application, we show the results of the simulation of the 2011 Tohoku-Oki mega-quake.
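A schematic of the communication/computation overlap described here, with hypothetical kernel and variable names (a sketch, not the authors' program): interior points compute on one stream while the boundary layer is downloaded, exchanged over MPI, and uploaded on another:

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Hypothetical kernels: the interior needs no halo data, the boundary does.
    __global__ void computeInterior(float *field);
    __global__ void computeBoundary(float *field, const float *ghost);

    void step(float *dField, const float *dBoundary, float *dGhost,
              float *hSend, float *hRecv, int haloCount, int neighbor,
              cudaStream_t sCompute, cudaStream_t sCopy) {
        size_t haloBytes = haloCount * sizeof(float);
        // Interior work starts immediately on its own stream ...
        computeInterior<<<1024, 256, 0, sCompute>>>(dField);
        // ... while the halo is exchanged concurrently on the copy stream.
        cudaMemcpyAsync(hSend, dBoundary, haloBytes,
                        cudaMemcpyDeviceToHost, sCopy);
        cudaStreamSynchronize(sCopy);
        MPI_Sendrecv(hSend, haloCount, MPI_FLOAT, neighbor, 0,
                     hRecv, haloCount, MPI_FLOAT, neighbor, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpyAsync(dGhost, hRecv, haloBytes,
                        cudaMemcpyHostToDevice, sCopy);
        cudaStreamSynchronize(sCopy);
        // Same stream as the interior kernel, so ordering is guaranteed.
        computeBoundary<<<64, 256, 0, sCompute>>>(dField, dGhost);
    }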
 
Keywords:
Energy Exploration, GTC Silicon Valley 2012 - ID S2352
Streaming:
Download:
Federal
Presentation
Media
Optimized Deep Learning Deployment with TensorRT
Robert Keating (Solution Architect, NVIDIA)
NVIDIA TensorRT™ is a high-performance neural network inference engine for production deployment of deep learning applications. TensorRT can be used to rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data centers, embedded, or automotive product platforms. Developers can use TensorRT to deliver fast inference using INT8 or FP16 optimized precision that significantly reduces latency, as demanded by real-time services such as streaming video categorization in the cloud or object detection and segmentation on embedded and automotive platforms. With TensorRT, developers can focus on developing novel AI-powered applications rather than performance tuning for inference deployment. The TensorRT runtime ensures optimal inference performance that can meet the needs of even the most demanding throughput requirements.
 
Keywords:
Federal, GTC Washington D.C. 2016 - ID DCS16193
Streaming:
Finance
Presentation
Media
GPU Implementation of Explicit and Implicit Finite Difference Methods in Finance
Mike Giles (University of Oxford)
This talk will explain how to achieve excellent performance with GPU implementations of standard explicit and implicit finite difference methods in computational finance. Implicit methods are much harder to implement efficiently, but the task is made easier through the development of library software for the solution of multiple tridiagonal systems in parallel. The implementation strategies depend on the size and dimensionality of the problems being solved. 1D problems can be solved within one SMX unit of a GPU, 2D problems usually require more than one SMX, and 3D / 4D problems require the entire GPU for their solution. Computational performance results will be given for Kepler GPUs, and the talk will also discuss whether single precision arithmetic provides sufficient accuracy.
 
Keywords:
Finance, GTC Silicon Valley 2014 - ID S4227
Streaming:
 
Recent Progress in Accelerating Monte Carlo Simulation on GPU for Pricing and Risk Management of Financial Instruments
Serguei Issakov (Numerix)
Learn about recent progress in accelerating Monte Carlo simulation on the GPU in applications for pricing financial instruments and risk management. We'll focus on forward Monte Carlo simulation, which allows for a natural parallelization across CUDA cores, and present a recent extension of our implementation to a broad selection of industry-standard valuation models for different asset classes, including hybrid models that can be used to price multi-currency and multi-asset portfolios. Even with increasing complexity and dimensionality of valuation models, our benchmarks show stable GPU speedup factors in the range of 20x and 30x for calculations in double precision (FP64) and single precision (FP32), respectively. We also briefly summarize our most recent research project on the more complex backward (American / Least Squares) Monte Carlo simulation method, based on regression algorithms used to price general financial instruments with optionality. The latter method relies heavily on matrix calculations and benefits from GPU-accelerated libraries: cuBLAS for linear algebra and cuSOLVER for solvers.
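For illustration only (a sketch, not Numerix's implementation), a minimal forward Monte Carlo kernel pricing a European call under geometric Brownian motion, one thread per path:

    #include <curand_kernel.h>

    // Each thread simulates one terminal price and its discounted payoff.
    __global__ void pricePaths(float S0, float K, float r, float sigma, float T,
                               unsigned long long seed, float *payoff, int nPaths) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nPaths) return;
        curandState st;
        curand_init(seed, i, 0, &st);          // independent subsequence per path
        float z  = curand_normal(&st);         // standard normal draw
        float ST = S0 * expf((r - 0.5f * sigma * sigma) * T
                             + sigma * sqrtf(T) * z);
        payoff[i] = expf(-r * T) * fmaxf(ST - K, 0.0f);
    }
    // The price estimate is the mean of payoff[], e.g. via a GPU reduction.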
 
Keywords:
Finance, Finance - Quantitative Risk Management, GTC Silicon Valley 2018 - ID S8587
Streaming:
GPU Virtualization
Presentation
Media
Explore Dell Wyse Datacenter's Graphics Options for Virtual Desktop Computing (Presented by Dell)
Gary Radburn (Dell, Inc.)
In this session, we will explore the various options Dell supports in its Dell Wyse Datacenter solution offerings. We will describe various platform offerings, such as the PowerEdge 720, PowerEdge C8220x, and Precision 7610, with the various graphics card options. In addition, we will discuss the solution offerings around VMware View, Citrix XenDesktop, and Microsoft with Dell vWorkspace. Lastly, we will detail the capabilities of those solution offerings with various hypervisors, such as VMware vSphere, Citrix XenServer, and Microsoft Windows Server 2012. This will provide attendees with an overall view of what Dell can offer, giving customers multiple options to pick from.
 
Keywords:
GPU Virtualization, Big Data Analytics, GTC Silicon Valley 2014 - ID S4850
Streaming:
HPC and AI
Presentation
Media
Acceleration of SIMULIA's Abaqus Solver on NVIDIA GPUs
Chris Mason
- Acceleware
Learn about Acceleware's and Dassault Systemes' integrated solution that performs an LDL^T factorization on GPUs within the Abaqus software package. We will discuss efficient GPU parallelization of the factorization algorithm and enabling the CPU and GPU to overlap their computations and data transfers. Includes an end-user simulation case study and GPU performance measurements, including 300 GFlops in single precision and 145 GFlops in double precision on an NVIDIA Tesla C2050.
 
Keywords:
HPC and AI, GTC Silicon Valley 2010 - ID S102208
Download:
 
Accelerating LS-DYNA with MPI, OpenMP, and CUDA
Bob Lucas
- USC
When solving implicit problems, the computational bottleneck in LS-DYNA is the multifrontal linear solver. These operations are performed with double-precision arithmetic, hence until the arrival of the Tesla C2050, experiments with GPU acceleration were only a curiosity. This is no longer the case, and in this talk we will describe how LS-DYNA's hybrid (MPI and OpenMP) solver is further accelerated by using GPUs to factor large dense frontal matrices.
 
Keywords:
HPC and AI, Developer - Algorithms, GTC Silicon Valley 2010 - ID S102240
Streaming:
Download:
 
An MPI/CUDA Implementation of Discontinuous Galerkin Time Domain Method for Maxwell's Equations
Stylianos Dosopoulos
- Ohio State University
We describe an MPI/CUDA approach to solving Maxwell's equations in the time domain by means of an Interior Penalty Discontinuous Galerkin Time Domain method and a local time-stepping algorithm. We show that MPI/CUDA provides a 10x speedup versus MPI/CPU in double precision. Moreover, we present scalability results and an 85% parallelization efficiency up to 40 GPUs on the Glenn cluster of the Ohio Supercomputing Center. Finally, we study an electromagnetic cloaking example for a broadband signal (8-11 GHz) to show the potential of our approach to solve real-life examples in short simulation times.
 
Keywords:
HPC and AI, GTC Silicon Valley 2010 - ID P10I20
Download:
 
Deep Patient: Predict the Medical Future of Patients with Deep Learning
Joel Dudley (Associate Professor, Icahn School of Medicine at Mount Sinai, New York)
This talk focuses on advances in deep learning applied to precision medicine and, especially, on "deep patient", a general-purpose patient representation derived from the electronic health records (EHRs) that facilitates clinical predictive modeling. Precision medicine raises big challenges in dealing with large and massive data from heterogeneous sources, such as EHRs, genomics, and wearables. Deep learning provides a unique opportunity to retrieve information from these complex and heterogeneous sources. Here, in particular, we show how a deep architecture was able to process aggregated EHRs from the Mount Sinai Health System data warehouse to derive domain-free patient representations that can improve automatic medical predictions given the patient clinical status.
 
Keywords:
HPC and AI, GTC Washington D.C. 2016 - ID DCS16115
Streaming:
 
Deep Neural Networks in Medical Imaging and Radiology: Preventative and Precision Medicine Perspectives
Le Lu (Staff Scientist, National Institutes of Health)
Our main focus is employing deep learning (DL), especially deep neural networks, for high-performance radiological image computing. We'll present the motivation, method details, and quantitative results for three core problems: 1) improving computer-aided detection (CAD) using convolutional neural networks and decompositional image representations; 2) integrated bottom-up deep convolutional networks for robust automated organ segmentation; 3) text/image deep mining on a large-scale radiology image database for automated interpretation. We validate some very promising observations on using DL both to significantly improve CAD tasks in (1) and to enable exciting new research directions in (2, 3). We'll discuss their positive impacts on both preventative and precision medicine.
 
Keywords:
HPC and AI, GTC Washington D.C. 2016 - ID DCS16103
Streaming:
HPC and Supercomputing
Presentation
Media
Automatic Generation of FFT Libraries for GPUs
Christos Angelopoulos (Carnegie Mellon University)
In this poster we present an extension of the Spiral code generation system to GPUs. We address the key problems of GPU memory hierarchy and parallelism, and we introduce a variety of FFT algorithms that avoid shared-memory bank conflicts without wasting space on padding, and that optimize global-memory transfers with minimal register allocation even at low occupancy. We demonstrate high-performance results against cuFFT 1D and 2D DFTs in single precision. This research is still in progress, but at the moment we are able to match and beat the cuFFT library on the sizes for which we have generated optimized code.
 
Keywords:
HPC and Supercomputing, GTC Silicon Valley 2012 - ID P2399
Download:
 
Beyond Tsubame 2.0
Satoshi Matsuoka (Tokyo Institute of Technology)
Tsubame2.0 has been in successful production for the last two years, producing numerous research results and accolades. With a possible upgrade of the GPUs to Kepler 2s, it will have the capability to surpass the 10-petaflops-class supercomputers in single-precision applications, without any increase in its average power consumption of 1 MW.
 
Keywords:
HPC and Supercomputing, Supercomputing 2012 - ID SC2031
Download:
 
Being Very Green with Tsubame 2.5 Towards 3.0 and Beyond to Exascale
Satoshi Matsuoka (Tokyo Institute of Technology)
TSUBAME 2.5 succeeded TSUBAME 2.0 by upgrading all 4224 Tesla M2050 GPUs to Kepler K20x GPUs, achieving 5.76 / 17.1 petaflops peak in double / single precision respectively, the latter the fastest in Japan. By overcoming several technical challenges, TSUBAME 2.5 exhibits a 2-3x speedup and multi-petaflops performance for many applications, leading to TSUBAME 3.0 in 2015-16.
 
Keywords:
HPC and Supercomputing, Supercomputing 2013 - ID SC3105
Streaming:
Download:
 
PARALUTION - A Library for Iterative Sparse Methods on Multi-Core CPU, GPU and MIC
Dimitar Lukarski (Uppsala University)
PARALUTION is a library that lets you run various sparse iterative solvers and preconditioners on multi/many-core CPU and GPU devices. Based on C++, it provides a generic and flexible design that allows seamless integration with other scientific software packages. PARALUTION contains Krylov subspace and multigrid solvers, fixed-point iteration schemes, mixed-precision schemes, and fine-grained parallel preconditioners based on splitting, ILU factorization with levels, multi-elimination ILU factorization, and approximate inverses.
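To illustrate the mixed-precision idea in miniature (a toy stand-in, not PARALUTION's API): the outer defect-correction loop stays in FP64 while the inner "solve" runs in FP32, here a single Jacobi sweep on a 2x2 system:

    #include <cstdio>
    #include <cmath>

    int main() {
        const double A[2][2] = {{4.0, 1.0}, {1.0, 3.0}};
        const double b[2] = {1.0, 2.0};
        double x[2] = {0.0, 0.0};
        for (int k = 0; k < 30; ++k) {
            double r[2];                         // FP64 residual r = b - A x
            for (int i = 0; i < 2; ++i)
                r[i] = b[i] - (A[i][0] * x[0] + A[i][1] * x[1]);
            if (std::hypot(r[0], r[1]) < 1e-12) break;
            float d[2];                          // FP32 inner "solve": Jacobi sweep
            for (int i = 0; i < 2; ++i)
                d[i] = (float)r[i] / (float)A[i][i];
            for (int i = 0; i < 2; ++i)          // FP64 correction x += d
                x[i] += (double)d[i];
        }
        printf("x = (%.12f, %.12f)\n", x[0], x[1]);
        return 0;
    }

The cheap low-precision inner solve does the bulk of the work, while the double-precision residual and update keep the final answer at full accuracy.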
 
Keywords:
HPC and Supercomputing, GTC Silicon Valley 2014 - ID P4155
Download:
 
The LAZARUS GPU Cluster
John Calhoun (Texas Tech University)
Beginning in the summer of 2013, the mathematics department of Texas Tech University, together with several NSF grant programs, went through the process of building a GPU cluster for the simulation of diseases. Code-named LAZARUS, this cluster provides needed computational resources and experience for students and faculty, and serves as an outreach tool for prospective students. With a budget of just under $30,000, a water-cooled GPU cluster with 24 computing nodes was built with over 100 teraflops of single-precision computing power.
 
Keywords:
HPC and Supercomputing, GTC Silicon Valley 2014 - ID P4242
Download:
 
FMM Goes GPU: Smooth Trip or Bumpy Ride?
Bartosz Kohnke (Max Planck Institute for Biophysical Chemistry)
The N-body problem provides a very simple yet scientifically important algorithm for utilizing modern GPUs. However, its computational complexity is O(N^2). An algorithm reducing runtime and complexity to the optimal O(N) for any required precision is the Fast Multipole Method (FMM). In this talk, we present our CUDA-enabled, templated C++ implementation. The algorithm requires several operators, partly depending upon each other, to exchange information in a tree-like data structure. We especially focus on the use of unified memory to minimize porting effort and on the employment of dynamic parallelism to achieve a better computational workload. We will present timings/scalings for all FMM operators and will discuss remaining bottlenecks, such as tree dependencies or redundancies in the kernel setup.
 
Keywords:
HPC and Supercomputing, Developer - Algorithms, Computational Physics, GTC Silicon Valley 2015 - ID S5548
Streaming:
Download:
 
Early Experience with P100 on Power8
Christian Trott (Sandia National Laboratory)
We will present early results from IBM Power8 systems equipped with NVLink-connected NVIDIA P100 GPUs. We will show comparative results with previous NVIDIA GPU generations for a set of synthetic and application benchmarks, highlighting in particular the advances in the memory subsystem of the P100. The talk will also demonstrate the impact of the new double-precision atomic add capability, and will discuss some early exploration of the behavior of NVLink between the Power8 CPUs and the P100 GPUs.
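A trivial sketch of the capability in question (naive on purpose; a real code would pre-reduce per warp or block, as in the warp-shuffle example earlier in this catalog): on sm_60 (P100) and later, atomicAdd accepts double natively, where earlier architectures required an atomicCAS loop:

    __global__ void dot(const double *a, const double *b, double *result, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(result, a[i] * b[i]);   // native FP64 atomic on sm_60+
    }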
 
Keywords:
HPC and Supercomputing, Supercomputing 2016 - ID SC6120
Streaming:
 
CANDLE the CANcer Distributed Learning Environment -- Scalable Deep Learning for Precision Medicine
Brian Van Essen (Lawrence Livermore National Lab)
In this talk we will describe our recently funded effort to create the CANcer Distributed Learning Environment (CANDLE toolkit) to address one of the challenges identified in the presidential "Precision Medicine Initiative" (PMI). The DOE laboratories in this project are drawing on their strengths in HPC, machine learning, and data analytics, and coupling those to the domain strengths of the NCI, particularly in cancer biology and cancer healthcare delivery, to bring the full promise of exascale computing to the problem of cancer and precision medicine. This project will focus on three driver cancer problems: the RAS protein pathway, drug response, and treatment strategies. We will provide a highlight of these problems, as well as a roadmap for the project's intended research and development efforts.
 
Keywords:
HPC and Supercomputing, Deep Learning and AI, Supercomputing 2016 - ID SC6124
Streaming:
 
Understanding Neutrino Interactions Using Deep Learning
Adam Aurisano (University of Cincinnati)
The 2015 Nobel Prize in Physics was awarded for the discovery of neutrino oscillations, which indicates that neutrinos have mass. This phenomenon was unexpected and is one of the clearest signs of new physics beyond the Standard Model. The NOvA experiment aims to deepen our understanding of neutrino oscillations by measuring the properties of a muon neutrino beam produced at Fermi National Accelerator Laboratory at a Near Detector close to the beam source, and measuring the rate at which muon neutrinos oscillate into electron neutrinos over an 810 km trip to a 14,000-ton Far Detector in Ash River, MN. Understanding this process may explain why the universe is made of matter instead of antimatter. Performing this measurement requires a high-precision method for classifying neutrino interactions. To this end, we developed a convolutional neural network that gave a 30 percent improvement in electron neutrino selection over previous methods, equivalent to increasing the Far Detector mass by 4,000 tons.
 
Keywords:
HPC and Supercomputing, GTC Washington D.C. 2017 - ID DC7230
Download:
 
Accelerated Deep Learning Advances in HPC
William Tang (Princeton University)
Recent advances in the deployment of deep learning recurrent nets have been demonstrated in scaling studies of Princeton's new deep learning code, FRNN (Fusion Recurrent Neural Net), on modern GPU systems. This is a "big-data" project in that it has access to the huge EUROFUSION/JET disruption database of over half a petabyte to drive these studies. FRNN implements a distributed data-parallel synchronous stochastic gradient approach with TensorFlow and Theano libraries at the backend and MPI for communication. This deep learning software has recently demonstrated excellent scaling up to 6,000 GPUs on Titan at Oak Ridge National Lab. The associated accomplishments exhibit clear progress toward the goal of establishing the practical feasibility of using leadership-class supercomputers to greatly enhance the training of neural nets for transformational impact on key discovery-science application domains such as fusion energy science. Powerful systems expected to be engaged for near-future deployment of this deep learning software include: (1) NVIDIA's SATURN V, featuring nearly 1,000 Pascal P100 GPUs; (2) Switzerland's Piz Daint Cray XC50 system with 4,500 P100 GPUs; (3) Japan's Tsubame 3 system with 3,000 P100 GPUs; and (4) OLCF's Summit-Dev system. In summary, deep learning software trained on large scientific datasets holds exciting promise for delivering much-needed predictive tools capable of accelerating knowledge discovery. The associated creative methods being developed, including a new half-precision capability, also have significant potential for cross-cutting benefit to a number of important application areas in science and industry.
 
Keywords:
HPC and Supercomputing, GTC Washington D.C. 2017 - ID DC7243
Download:
 
The DOE and NCI Partnership on Precision Oncology and the Cancer Moonshot
Fangfang Xia (Argonne National Lab)
The Cancer Moonshot was established in 2016 with the goal to double the rate of progress in cancer research -- to do in five years what normally would take 10. A major area for the acceleration of progress is the strategy to use modeling, simulation, and machine learning to advance our understanding of cancer biology and to integrate what is known into predictive models that can inform research and guide therapeutic developments. In 2015, the U.S. Department of Energy formed a collaboration with the National Cancer Institute for the joint development of advanced computing solutions for cancer.
 
Keywords:
HPC and Supercomputing, SIGGRAPH 2017 - ID SC1711
Healthcare and Life Sciences
Presentation
Media
Improving Medicine, Saving Lives: Developing Visual Computing Technologies for Health Care
Amitabh Varshney (Professor and Director, University of Maryland)
Come and learn how we're using GPUs to enable advances in a wide variety of healthcare technologies, ranging from stem cell classification for regenerative medicine, to modeling the optical forces applied by lasers onto micro-particles for the assembly of nano-component devices, to understanding traumatic brain injury via the study of patterns in diffusion kurtosis imaging (DKI) brain data. I will conclude with some of our ongoing research in the use of GPUs to create high-precision, next-generation virtual and augmented reality environments for surgery, medical training, and telemedicine.
 
Keywords:
Healthcare and Life Sciences, HPC and AI, GTC Washington D.C. 2016 - ID DCS16109
Streaming:
 
Deep Patient: Predict the Medical Future of Patients with Deep Learning
Joel Dudley (Icahn School of Medicine at Mount Sinai, New York), Riccardo Miotto (Icahn School of Medicine at Mount Sinai, New York)
Precision medicine initiatives bring tremendous opportunities to speed up scientific discovery and promote quality improvement in medicine. However, they also raise big challenges in dealing with massive data from heterogeneous sources, such as electronic health records (EHRs), -omics, and wearables. Traditional data mining and statistical learning methods tend to favor clean and structured data, and may not be able to effectively utilize the rich information embedded in biomedical data. The latest breakthroughs in deep learning technologies provide a unique opportunity to retrieve information from complex and heterogeneous sources. We'll review advances in deep learning applied to precision medicine and next-generation healthcare, with a special focus on Deep Patient, a general-purpose patient representation from EHRs that facilitates clinical predictive modeling and medical analysis.
 
Keywords:
Healthcare and Life Sciences, AI in Healthcare, Deep Learning and AI, GTC Silicon Valley 2017 - ID S7563
Download:
 
The DOE and NCI Partnership on Precision Oncology and the Cancer Moonshot
Rick Stevens (Argonne National Laboratory and University of Chicago)
The Cancer Moonshot was established in 2016 with the goal to double the rate of progress in cancer research -- to do in five years what normally would take 10. A major area for the acceleration of progress is the strategy to use modeling, simulation, and machine learning to advance our understanding of cancer biology and to integrate what is known into predictive models that can inform research and guide therapeutic developments. In 2015, the U.S. Department of Energy formed a collaboration with the National Cancer Institute for the joint development of advanced computing solutions for cancer.
 
Keywords:
Healthcare and Life Sciences, Deep Learning and AI, HPC and Supercomputing, GTC Silicon Valley 2017 - ID S7782
Download:
Intelligent Machines and IoT
Presentation
Media
Autonomous Capabilities of the Joint Tactical Aerial Resupply Vehicle
Shawn Recker (Survice)
We'll discuss the latest updates and features of the autonomous precision landing and GPS-denied navigation capabilities of the Joint Tactical Aerial Resupply Vehicle (JTARV) platform. These capabilities are enabled by our high-performance computer vision libraries, Sentinel and HawkEye, both of which capitalize on NVIDIA's mobile GPUs and optimized deep learning frameworks. Autonomous navigation for aerial vehicles demands that core algorithms provide not only relevant, actionable information, but that they do so in a timely manner -- that is, the algorithms must operate in real time. We'll discuss how Sentinel object detection networks limit the processing requirements of the autonomous precision landing capability. The requirement for high performance dictates optimization at every level, which is the focus of our ongoing research and development efforts.
 
Keywords:
Intelligent Machines and IoT, Computer Vision and Machine Vision, GTC Washington D.C. 2017 - ID DC7143
Download:
Intelligent Video Analytics and Smart Cities
Presentation
Media
Robust Moving Object Detection on Tegra K1
Cevahir Cigla (Aselsan Inc.), Burak Ozkalayci (Aselsan Inc.)
We present a novel background-modeling-based moving object detection and segmentation approach with a real-time implementation on the recent NVIDIA Tegra K1 mobile GPU platform. The proposed solution introduces pixel-wise adaptive background learning rates as well as reinforced re-learning of the models. In this manner, dynamic backgrounds in particular are modeled robustly: for regions producing false alarms due to irrelevant motion, the learning rate is increased. Detection is followed by shadow removal and dual background modeling to detect abandoned objects with high precision. Each algorithmic step is implemented on the GPU, and real-time performance (detection, shadow removal, and abandoned object detection) is achieved on Jetson TK1 for 720x576 videos.
 
Keywords:
Intelligent Video Analytics and Smart Cities, Intelligent Machines and IoT, Video and Image Processing, GTC Silicon Valley 2016 - ID P6147
Download:
Leadership in AI
Presentation
Media
Precision Healthcare
Andrea DeSouza (NVIDIA), Vern De Biasi (GSK), Thomas Fuchs (Memorial Sloan Kettering Cancer Center), Nathan Hubbard (Roche Partnering), Joel Saltz (Stony Brook Medicine and College of Engineering and Applied Sciences Stony Brook)
As industry and government collect massive amounts of data to help provide faster and more accurate clinical care, the gap in managing and exchanging these data types remains a challenge for the industry. Key highlights of this panel discussion will include how AI can advance treatment and prevention, what scientific and regulatory hurdles remain for industry success, and how Congress can best address possible privacy and security issues.
 
Keywords:
Leadership in AI, AI in Healthcare, Medical Imaging and Radiology, GTC Washington D.C. 2017 - ID DC7159
Download:
Life & Material Science
Presentation
Media
Single Precision Hybrid Model for Molecular Dynamics Simulations
Ross Walker (University of California San Diego), Scott LeGrand (Amazon)
In this talk we will highlight the work we have done to develop what we term the SPXP precision model. This is the first fully single- and fixed-precision hybrid model to provide conservation of energy in MD simulations equivalent to full double-precision runs, but without the need for double-precision arithmetic. By exploiting the nature of fixed-precision arithmetic and custom machine-code accumulator functions, we can effectively emulate double-precision performance on the latest generation of GPUs.
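A hypothetical illustration of the single/fixed-precision hybrid idea (a sketch, not the SPXP code): each single-precision contribution is scaled into a 64-bit fixed-point integer, whose addition is associative and therefore deterministic regardless of summation order:

    // 2^40 fractional bits; contributions must stay within the integer range.
    #define FIXED_SCALE ((double)(1ll << 40))

    __global__ void accumulateFixed(const float *f, unsigned long long *acc, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            long long q = (long long)((double)f[i] * FIXED_SCALE);
            // Unsigned wraparound handles negative q via two's complement.
            atomicAdd(acc, (unsigned long long)q);
        }
    }
    // Host side: double total = (double)(long long)(*acc) / FIXED_SCALE;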
 
Keywords:
Life & Material Science, Developer - Performance Optimization, GTC Silicon Valley 2015 - ID S5226
Streaming:
Download:
Machine Learning & Deep Learning
Presentation
Media
Optimized GPU Kernels for Deep Learning
Amir Khosrowshahi (Nervana Systems)
Deep learning has recently achieved great success in domains such as images, speech, and text. These gains have been made possible by efficient GPU implementations such as cuDNN. We show optimizations at the assembly level that result in significant performance improvements over existing methods. In particular, we show how operations such as convolutions and dense matrix multiplies can be efficiently implemented using a custom assembler to attain state-of-the-art performance on the NVIDIA Maxwell GPU architecture. Additionally, we can significantly reduce memory bandwidth and run much larger models by using limited precision with a minimal trade-off in model accuracy.
 
Keywords:
Machine Learning & Deep Learning, Developer - Performance Optimization, Computer Vision and Machine Vision, GTC Silicon Valley 2015 - ID S5873
Streaming:
Download:
Medical Imaging and Radiology
Presentation
Media
GPU Acceleration of Non-Iterative and Iterative Algorithms in Fluorescence Lifetime Imaging Microscopy
Gang Wu (University of Sussex)
Fluorescence lifetime imaging microscopy (FLIM) plays a significant role in biological sciences, chemistry, and medical research. We propose a GPU-based FLIM analysis tool suitable for high-speed and flexible FLIM applications. With a large number of parallel processors, GPUs can significantly speed up lifetime calculations compared to CPU-OpenMP (parallel computing with multiple CPU cores) based analysis. The implemented algorithms have been tested on both synthesized and experimental FLIM data. The results show that at the same precision the GPU analysis can be up to 24x faster than its CPU-OpenMP counterpart.
 
Keywords:
Medical Imaging and Radiology, Algorithms and Numerical Techniques, GTC Silicon Valley 2016 - ID P6114
Download:
 
From Bits to Bedside: Translating Large-Scale Routine Clinical Datasets into Precision Mammography
Dexter Hadley (UCSF)
We'll demonstrate how to use deep learning (DL) approaches to translate big data from routine clinical care into medical innovation that directly improves routine clinical care. Typically, large healthcare institutions have sufficient quantities of clinical data to facilitate precision medicine through a DL paradigm. However, this clinical data is hardly ever translated into direct clinical innovation because computer algorithms cannot readily ingest or reason over it. Using routine mammographic screening data for breast cancer as an example, we first downloaded over 30,000 free-text pathology reports and used long short-term memory DL algorithms to infer cancer outcomes for individual patients. We then labeled over 700,000 mammographic views of breast imaging with our inferred pathology outcomes. Finally, we trained convolutional neural network DL algorithms to directly predict pathology outcomes from breast imaging. With our approach, we demonstrate how to leverage DL to realize precision oncology and significantly improve the interpretation of routine screening mammography for millions of women using routine clinical big data.
 
Keywords:
Medical Imaging and Radiology, GTC Silicon Valley 2018 - ID S8471
Streaming:
Molecular Dynamics
Presentation
Media
GPU-Accelerated Molecular Dynamics Simulation of Solid Covalent Crystals
Wei Ge (Institute of Process Engineering, Chinese Academy of Sciences)
An efficient and highly scalable algorithm for molecular dynamics (MD) simulation of solid covalent crystals (using sophisticated many-body potentials) is presented. Its effective memory throughput on a single C2050 GPU board reached 102 GB/s (81% of the peak), the instruction throughput reached 412 Ginstr/s (80% of the peak), and 27% of the peak flops of a single GPU was obtained. The parallel efficiency of the algorithm can be as high as 95% on all 7168 GPUs of Tianhe-1A, reaching 1.87 Pflops in single precision, possibly a record for high-performance MD simulations.
 
Keywords:
Molecular Dynamics, GTC Silicon Valley 2012 - ID S2057
Streaming:
Download:
 
Single vs. Double Precision MD Simulations: Correlation is Length-Scale Dependent
Anqi Zou (Wake Forest University)
This poster evaluates how single- vs. double-precision operations affect molecular dynamics simulations, using GPU-optimized MD simulation software to perform coarse-grained MD simulations of many biologically relevant systems of various sizes. Three different measures of structural similarity are used to analyze the structure of trajectories and to determine when single-precision calculations would be appropriate and when they would not. The conclusion is that single-precision implementations, with their increased performance, make no significant difference in the accuracy and precision of MD simulations if the system size is sufficiently large.
 
Keywords:
Molecular Dynamics, GTC Silicon Valley 2012 - ID P2443
Download:
Numerical Algorithms & Libraries
Presentation
Media
Fast Evaluation of the Inverse Poisson Cumulative Distribution Function
Mike Giles (University of Oxford)
The inverse of the Poisson cumulative distribution function maps uniformly-distributed random numbers to Poisson random variates. This talk describes a fast implementation for GPUs which is based on some novel approximations of the inverse of the closely-related incomplete gamma function for the case of large Poisson rates. Both single-precision and double-precision versions have been developed, and in each case the computational cost is not much more than the cost of the corresponding function for inverting the Normal cumulative distribution function. The software is freely available as open source from http://people.maths.ox.ac.uk/gilesm/poissinv/
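For orientation, here is the naive inversion that the talk's approximations replace for large rates (a baseline sketch, not the released software): walk the Poisson CDF until it exceeds the uniform sample. Its cost grows with the rate, and expf(-lambda) underflows for large lambda, which is exactly the regime the talk addresses:

    // u in (0,1); returns the smallest k with CDF(k) >= u.
    __device__ int poissinvNaive(float u, float lambda) {
        float p = expf(-lambda);   // P(X = 0); underflows for lambda >~ 88
        float cdf = p;
        int k = 0;
        while (u > cdf && k < 1000) {
            ++k;
            p *= lambda / k;       // P(X = k) from P(X = k-1)
            cdf += p;
        }
        return k;
    }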
 
Keywords:
Numerical Algorithms & Libraries, GTC Silicon Valley 2014 - ID S4173
Streaming:
Download:
 
Multifrontal Sparse QR Factorization on the GPU
Tim Davis (University of Florida)
Sparse matrix factorization involves a mix of regular and irregular computation, which is a particular challenge when trying to obtain high performance on the highly parallel general-purpose computing cores available on graphics processing units (GPUs). We present a sparse multifrontal QR factorization method that meets this challenge, and is up to eleven times faster than a highly optimized method on a multicore CPU. Our method is unique compared with prior methods, since it factorizes many frontal matrices in parallel and keeps all the data transmitted between frontal matrices on the GPU. A novel bucket-scheduler algorithm extends the communication-avoiding QR factorization for dense matrices, by exploiting more parallelism and by exploiting the staircase form present in the frontal matrices of a sparse multifrontal method. Peak performance is over 80 Gflops on a Fermi Tesla C2070, in double precision. This is joint work with Nuri Yeralan and Sanjay Ranka.
 
Keywords:
Numerical Algorithms & Libraries, GTC Silicon Valley 2014 - ID S4204
Streaming:
 
Parallelizing a Real-Time 3D Finite Element Algorithm using CUDA: Limitations, Challenges and Opportunities
Vukasin Strbac (KULeuven University, Leuven)
Learn about the challenges of parallelizing a finite element problem using the Total Lagrangian Explicit Dynamic formulation. We examine the algorithm and perform a detailed analysis of the performance-limiting factors of parallelization using CUDA. Potential optimization benefits are elucidated in terms of register-usage thresholds and other factors for better performance. Results of a larger usability study are presented on a simple problem examining the single/double precision trade-off on a wide range of GPUs and problem sizes. Discover the impact that real-time FE, with in-the-loop computation facilitating surgical robotics, can bring to the intraoperative surgical setting.
 
Keywords:
Numerical Algorithms & Libraries, Computational Physics, Computational Structural Mechanics, GTC Silicon Valley 2014 - ID S4497
Streaming:
Download:
 
Linear Algebra Operations Using Quadruple-Precision Arithmetic on GPU
Daichi Mukunoki (Japan Society for the Promotion of Science)
This poster presents the performance of linear algebraic operations, such as BLAS, SpMV and Krylov subspace methods, using quadruple-precision floating-point arithmetic on a Tesla K20c GPU.
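Software quadruple precision on GPUs is typically built from double-double arithmetic; a minimal sketch of its core addition (the standard Knuth/Dekker two-sum, not necessarily the poster's implementation; it must be compiled without value-unsafe floating-point optimizations):

    struct dd { double hi, lo; };   // value = hi + lo, |lo| << |hi|

    __device__ dd ddAdd(dd a, dd b) {
        double s = a.hi + b.hi;                      // leading sum
        double v = s - a.hi;
        double err = (a.hi - (s - v)) + (b.hi - v);  // exact rounding error of s
        err += a.lo + b.lo;                          // fold in the low parts
        dd r;
        r.hi = s + err;                              // renormalize (fast two-sum)
        r.lo = err - (r.hi - s);
        return r;
    }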
 
Keywords:
Numerical Algorithms & Libraries, GTC Silicon Valley 2014 - ID P4177
Download:
 
Adaptivity and Compression: A Recipe for Sparse Matrix-Vector Multiplication on GPUs
Marco Maggioni (University of Illinois at Chicago)
In this work, we propose a one-size-fits-all format that combines adaptivity and compression into an ELL-based data structure to improve the state of the art of sparse matrix-vector multiplication (SpMV) on GPUs, a fundamental computational kernel used in science and engineering. These two techniques directly target the two fundamental challenges of the SpMV kernel: matrix irregularity and memory boundedness. Embedding a differential index-compression scheme into our adaptive ELL format, we create a novel sparse format called CoAdELL. Finally, we tested our work against the state-of-the-art framework clSpMV, showing a 31% performance improvement for double-precision calculations on a GTX 580.
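For reference, a plain ELL SpMV kernel of the kind CoAdELL builds on (a sketch; the poster's adaptivity and index compression are omitted). The column-major layout makes consecutive threads read consecutive memory words:

    __global__ void spmvEll(const int *cols, const double *vals, const double *x,
                            double *y, int nRows, int maxNnzPerRow) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
        if (row >= nRows) return;
        double sum = 0.0;
        for (int j = 0; j < maxNnzPerRow; ++j) {
            int idx = j * nRows + row;        // column-major ELL storage
            int c = cols[idx];
            if (c >= 0)                       // -1 marks padding entries
                sum += vals[idx] * x[c];
        }
        y[row] = sum;
    }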
 
Keywords:
Numerical Algorithms & Libraries, GTC Silicon Valley 2014 - ID P4266
Download:
Science and Research
Presentation
Media
Deployment of Semantic Segmentation Network Using TensorRT
Joohoon Lee (NVIDIA), Chethan NINGARAJU (NVIDIA)
NVIDIA TensorRT is a high-performance neural network inference engine for production deployment of deep learning applications. This lab provides hands-on experience using TensorRT to convert a neural network model to INT8 precision, then calibrate, validate, and deploy it for inference in a self-driving car application.
 
Keywords:
Science and Research, GTC Europe 2017 - ID 53021
Download:
Self-Driving Cars
Presentation
Media
Creating Unique Customers Relationships with Deep Learning in the Cloud and in the Car
Nick Black (CloudMade)
The car presents a particular challenge for creators of learning systems -- it is incredibly rich in data and context, its hardware and software environments are heterogeneous and fragmented, and drivers expect incredible precision from its interactions. CloudMade has pioneered an approach to machine learning in the automotive context that leverages the richness of car data, the emerging computational power of the car, and the existing computational power of the cloud to deliver an automotive-grade machine learning toolset. With CloudMade's solutions, automotive OEMs can deliver personalized experiences to customers that together create a self-learning car that anticipates the needs and desires of the user.
 
Keywords:
Self-Driving Cars, Deep Learning and AI, Embedded, Automotive, GTC Silicon Valley 2016 - ID S6565
Streaming:
Download:
Signal and Audio Processing
Presentation
Media
A Flexible IIR filtering Implementation for Audio Processing
Juergen Schmidt (Technicolor Research & Innovation)
Infinite impulse response (IIR) filters are used in almost every signal processing area. In the field of audio applications, they are used for loudspeaker equalization, crossover filtering, or sound control in mixing consoles. Modern audio applications like 3D sound require many audio channels to be processed in parallel at high precision, often implemented using high-order IIR filter chains. A straightforward implementation of IIR filters would lead to poor utilization of the GPU, though, because OpenCL cannot express recursive processing directly. In this contribution, an efficient implementation will be presented that circumvents this recursive-processing problem. It allows the processing of more than 64 audio channels with IIR filters of order 40 or more with scalable latency. It is implemented for all major operating systems in a flexible OpenCL/C++ framework.
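To make the parallelization strategy concrete, here is a sketch in CUDA for illustration (the session itself uses OpenCL): the recursion stays sequential in time within each filter, but the many channels run in parallel. Shown is a single biquad section per channel in Direct Form II transposed; higher-order filters chain such sections:

    // coef holds {b0, b1, b2, a1, a2} per channel; state holds {z1, z2}.
    // Samples are interleaved: in[n * nChannels + ch].
    __global__ void biquad(const float *in, float *out, const float *coef,
                           float *state, int nSamples, int nChannels) {
        int ch = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per channel
        if (ch >= nChannels) return;
        float b0 = coef[5*ch], b1 = coef[5*ch+1], b2 = coef[5*ch+2];
        float a1 = coef[5*ch+3], a2 = coef[5*ch+4];
        float z1 = state[2*ch], z2 = state[2*ch+1];
        for (int n = 0; n < nSamples; ++n) {              // sequential recursion
            float x = in[n * nChannels + ch];
            float y = b0 * x + z1;
            z1 = b1 * x - a1 * y + z2;
            z2 = b2 * x - a2 * y;
            out[n * nChannels + ch] = y;
        }
        state[2*ch] = z1;                                  // persist filter state
        state[2*ch+1] = z2;
    }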
 
Keywords:
Signal and Audio Processing, GTC Silicon Valley 2014 - ID S4382
Streaming:
Download:
Simulation for Autonomous Vehicles
Presentation
Media
Mixed-precision GPU Krylov solver for lattice QCD
Clark, Mike Roberts
- Boston University (United States)
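The poster's abstract is missing from the catalog entry. For context, a mixed-precision Krylov solver is typically built as a double-precision defect-correction loop around a low-precision inner solve; the host-side sketch below illustrates that generic construction only. matvec_fp64 and solve_fp32 are assumed user-supplied, and the poster's actual algorithm (for instance, reliable updates inside CG) may differ.

// Generic mixed-precision defect correction: compute the residual in FP64,
// solve the correction equation A e ≈ r approximately in FP32, accumulate
// the correction in FP64, and repeat until converged.
#include <vector>
#include <cmath>

void mixed_precision_solve(int n,
                           void (*matvec_fp64)(const double*, double*),
                           void (*solve_fp32)(const float*, float*),
                           const double* b, double* x, double tol, int max_outer)
{
    std::vector<double> r(n), Ax(n);
    std::vector<float> r32(n), e32(n);
    for (int k = 0; k < max_outer; ++k) {
        matvec_fp64(x, Ax.data());                          // r = b - A x in double
        double rnorm = 0.0;
        for (int i = 0; i < n; ++i) { r[i] = b[i] - Ax[i]; rnorm += r[i] * r[i]; }
        if (std::sqrt(rnorm) < tol) return;                 // converged in FP64
        for (int i = 0; i < n; ++i) r32[i] = (float)r[i];   // demote the residual
        solve_fp32(r32.data(), e32.data());                 // inner Krylov solve in FP32
        for (int i = 0; i < n; ++i) x[i] += (double)e32[i]; // correct in FP64
    }
}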
 
Keywords:
Simulation for Autonomous Vehicles, GTC Silicon Valley 2009 - ID P0944
Download:
Tools and Libraries
Presentation
Media
CLBlast: A Tuned BLAS Library for Faster Deep Learning
Cedric Nugteren (TomTom)
We'll demonstrate how to accelerate dense linear algebra computations using CLBlast, an open-source OpenCL BLAS library providing optimized routines for a wide variety of devices. It is targeted at deep learning training and inference and thus provides a fast matrix-multiplication routine (GEMM) to accelerate the convolutional layers: the computational heart of all deep learning frameworks (TensorFlow, Caffe, etc.). CLBlast has three main advantages over other BLAS libraries: 1) it can be explicitly tuned for specific matrix sizes and hardware platforms, 2) it runs fast on less common devices such as embedded and low-power GPUs, and 3) it can perform operations in half-precision FP16 format, saving precious bandwidth, time, and power.
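For illustration, a hedged sketch of a single-precision GEMM call through CLBlast's C++ interface (header clblast.h); the OpenCL context, queue, and buffer setup are assumed to exist already, and the argument order follows the CLBlast documentation.

// C = alpha * A * B + beta * C, row-major, no transposes.
#include <clblast.h>

void run_sgemm(cl_command_queue queue, size_t m, size_t n, size_t k,
               cl_mem A, cl_mem B, cl_mem C)   // row-major m*k, k*n, m*n buffers
{
    cl_event event = nullptr;
    clblast::Gemm<float>(clblast::Layout::kRowMajor,
                         clblast::Transpose::kNo, clblast::Transpose::kNo,
                         m, n, k,
                         1.0f,             // alpha
                         A, 0, k,          // buffer, offset, leading dimension
                         B, 0, n,
                         0.0f,             // beta
                         C, 0, n,
                         &queue, &event);
    clWaitForEvents(1, &event);            // block until the GEMM finishes
}

For the tuned half-precision path advertised in the abstract, the same call with clblast::Gemm<half> (and FP16 buffers) applies, hardware permitting.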
 
Keywords:
Tools and Libraries, Deep Learning and AI, GTC Silicon Valley 2017 - ID S7280
Download:
Video and Image Processing
Presentation
Media
Projected Conjugate Gradient Solvers on GPU and its Applications
Youzuo Lin
- Arizona State University
This work focuses specifically on how to speed up the projected conjugate gradient (CG) algorithm on the GPU. It is shown that the projected CG method can operate within the single-precision accuracy of current GPUs. One benefit of the projected CG is that it reduces the total number of matrix-vector multiplications, which is usually the bottleneck of an efficient GPU-based Krylov algorithm. A modified projection-based CG algorithm is further proposed and shows better performance. Numerical results obtained on the GPU support the proposed algorithm.
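The poster does not spell out its modified algorithm; the sketch below is only a generic projected CG skeleton in single precision, with matvec applying A and project applying the projection onto the feasible subspace (both assumed user-supplied). Note the single matrix-vector product per iteration, the cost the work aims to minimize.

// Generic projected CG: standard CG recurrences with the residual projected
// each iteration, keeping all search directions inside the feasible subspace.
#include <vector>
#include <cmath>

void projected_cg(int n,
                  void (*matvec)(const float*, float*),
                  void (*project)(const float*, float*),
                  const float* b, float* x, float tol, int max_iter)
{
    std::vector<float> r(n), g(n), d(n), Ad(n);
    matvec(x, Ad.data());
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ad[i];
    project(r.data(), g.data());                      // g = P r
    d = g;
    float rg = 0.f; for (int i = 0; i < n; ++i) rg += r[i] * g[i];
    for (int k = 0; k < max_iter && std::sqrt(rg) > tol; ++k) {
        matvec(d.data(), Ad.data());                  // the one matvec per iteration
        float dAd = 0.f; for (int i = 0; i < n; ++i) dAd += d[i] * Ad[i];
        float alpha = rg / dAd;
        for (int i = 0; i < n; ++i) { x[i] += alpha * d[i]; r[i] -= alpha * Ad[i]; }
        project(r.data(), g.data());
        float rg_new = 0.f; for (int i = 0; i < n; ++i) rg_new += r[i] * g[i];
        float beta = rg_new / rg; rg = rg_new;
        for (int i = 0; i < n; ++i) d[i] = g[i] + beta * d[i];
    }
}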
 
Keywords:
Video and Image Processing, GTC Silicon Valley 2010 - ID P10J03
Download:
 
GPUs Revolutionizing the Diamond Industry
Rupali Deshpande (NVIDIA)
Throughout the ages, the diamond's dance of light has captured our fascination. This beauty does not happen by chance but is determined by the design and craft of the diamond's cut. This presentation provides an overview of how GPUs are accelerating the process of converting rough stones into polished, sparkling diamonds. Technology has brought a huge spike in throughput, from a meager 5 pieces per worker per day to around 150 pieces cleared every day by each worker. The industry challenge is to convert rough stones into diamonds with high precision and quality, minimal wastage, and maximum speed. The computational visualization techniques where NVIDIA GPUs have helped address these challenges include GPU-based reconstruction, CUDA-based ray tracing, surface rendering of voxel data, density contrast, and OpenGL rendering. In this presentation, attendees will learn how GPU technologies are determining the four Cs of the diamond industry: Clarity, Color, Cut, and Carat.
 
Keywords:
Video and Image Processing, Rendering and Ray Tracing, GTC Silicon Valley 2013 - ID S3541
Streaming:
Download:
Virtual Reality and Augmented Reality
Presentation
Media
CanvoX: High-Resolution VR Painting for Large Volumetric Canvas
Yeojin Kim (Ewha Womans University)
Tilt Brush and Quill are not voxel based, so a new VR voxel painting system with a canvas that is both large (40 km^3) and detailed (0.3 mm^3) is an interesting prospect. We develop an array of octrees of depth 24, using 5 indices per cell (parent, child, and 3 neighbors) to accelerate ray traversal. We adaptively refine or coarsen the octree on the CPU, sync it with the GPU, and then ray cast front to back. To accelerate rendering further, we develop a foveated rendering algorithm: we design a quadtree render target whose resolution is dynamically adjusted to a heat map, traverse rays, and then interpolate colors in screen space. Rays traverse upper-level cells as the ray cone widens. We analyze floating-point error propagation to thoroughly understand precision problems in deep cells and ray intersections.
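As a reading aid, a minimal CUDA sketch of the per-cell layout the abstract describes: 5 indices per cell, with neighbor links letting a ray step into the adjacent cell without restarting traversal from the root. Field names, and the choice of which 3 faces carry links, are assumptions.

// 5 indices per cell, as described in the abstract; -1 marks a missing link.
struct OctreeCell {
    int parent;        // parent cell index (-1 at the root)
    int child;         // index of the first of 8 children (-1 for leaves)
    int neighbor[3];   // cells across 3 faces (assumed +x/+y/+z), possibly coarser
};

// During front-to-back ray casting, the neighbor link replaces a full
// root-to-leaf descent when the ray exits a cell: jump across the face,
// then descend only within that neighbor until a leaf contains the ray.
__device__ int next_cell(const OctreeCell* cells, int cell, int exit_axis)
{
    return cells[cell].neighbor[exit_axis];   // -1 when the ray leaves the canvas
}

Because a neighbor may sit at a coarser level, a real traversal would follow this lookup with a short descent through child links toward the ray's exit point; that selection logic is elided here.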
 
Keywords:
Virtual Reality and Augmented Reality, GTC Silicon Valley 2017 - ID S7698
Download: