GTC ON-DEMAND

 
Abstract:
Most large companies use online analytical processing (OLAP) to gain insight from available data and guide business decisions. To support time-critical business decisions, companies must answer queries as quickly as possible. For OLAP, the performance bottlenecks are joins of large relations. GPUs can significantly accelerate these joins, but a single GPU often lacks the memory capacity to hold the input tables or the speed to join them quickly enough. We'll discuss how we're addressing these problems by proposing join algorithms that scale to multiple GPUs.
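The session abstract doesn't give the algorithm itself, but the usual building block for scaling joins across GPUs is a partitioned hash join: both tables are split on the join key so each partition pair can be joined independently on its own device. A minimal sketch, in plain Python with hypothetical helper names (partitions processed sequentially here stand in for per-GPU work):

```python
# Hypothetical sketch of a partitioned hash join. Both tables are hash-
# partitioned on the join key so that matching keys land in the same
# partition; each partition pair can then be joined independently
# (one GPU per partition in the real multi-GPU setting).

def partition(rows, key, n_parts):
    """Split rows into n_parts buckets by hashing the join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def hash_join(left, right, key):
    """Build a hash table on one side, probe it with the other."""
    table = {}
    for row in left:
        table.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]

def partitioned_join(left, right, key, n_parts=4):
    lp = partition(left, key, n_parts)
    rp = partition(right, key, n_parts)
    out = []
    for i in range(n_parts):  # each iteration would run on its own GPU
        out.extend(hash_join(lp[i], rp[i], key))
    return out
```

For example, `partitioned_join([{"id": 1, "a": "x"}], [{"id": 1, "b": "y"}], "id")` yields the single merged row `{"id": 1, "a": "x", "b": "y"}`. The key property is that no cross-partition communication is needed after the initial shuffle.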
 
Topics:
Accelerated Data Science, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9557
 
Abstract:
Unified Memory significantly improves productivity, while explicit memory management often provides better performance. We'll examine the performance of Unified Memory applications from key AI domains and describe memory-optimization techniques to find the right balance of productivity and performance when you're developing applications. Unified Memory was designed for data analytics, to keep frequently accessed data in GPU memory. We'll analyze the performance of large analytic workloads and review bottlenecks for GPU oversubscription on PCIe and NVLink systems. We'll also discuss results from our study integrating Unified Memory in PyTorch for training deep neural networks. We found that Unified Memory matches explicit cudaMalloc for workloads that fit in GPU memory. In addition, applications can oversubscribe the GPU, which facilitates using bigger batch sizes or training deeper models.
 
Topics:
Performance Optimization, Deep Learning & AI Frameworks, Accelerated Data Science
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9726
 
Abstract:
Confused about how Unified Memory works on modern GPU architectures? Did you try Unified Memory some time ago and never wanted to return to it? We'll explain how the last few generations of GPU architectures and software improvements have opened up new ways to manage CPU and GPU memories. We will dive into the advantages and disadvantages of various OS and CUDA memory allocators, explore how memory is managed by the driver, and examine user controls to tune it. Learn about software enhancements for Unified Memory developed over the past year, how HMM is different from ATS, and how to use Unified Memory with multiple processes.
 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9727
 
Abstract:
We'll cover all the things you need to know about Unified Memory: fundamental principles, important use cases, advances in the latest GPU architectures, HMM and ATS details, performance considerations and optimization ideas, and new application results, including data analytics and deep learning. 2018 is going to be the year of Unified Memory. Both HMM and ATS will be available and developers will start using the true Unified Memory model with the system allocator "the way it's meant to be played." We'll discuss all the caveats and differences between cudaMallocManaged and malloc. A big part of the talk will be related to performance aspects of Unified Memory: from migration throughput optimizations to improving the overlap between kernels and prefetches.
 
Topics:
Performance Optimization, Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8430
 
Abstract:

Learn strategies for efficiently employing various cascaded compression algorithms on the GPU. Many database input fields are amenable to compression since they have repeating or gradually increasing patterns, such as dates and quantities. Fast implementations of decompression algorithms such as RLE-Delta will be presented. By utilizing compression, we can achieve 10 times greater effective read bandwidth than the interconnect allows for raw data transfers. However, I/O bottlenecks still play a big role in overall performance, and data has to be moved efficiently in and out of the GPU to ensure an optimal decompression rate. After a deep dive into the implementation, we'll show a real-world example of how BlazingDB leverages these compression strategies to accelerate database operations.

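To make the RLE-Delta cascade concrete, here is a minimal CPU reference sketch (not the GPU implementation from the talk): a sorted column such as day numbers is delta-encoded, and the resulting runs of identical deltas are then run-length-encoded, which is where the large compression ratios come from.

```python
# CPU sketch of a cascaded RLE + Delta codec. Delta encoding turns a
# gradually increasing column into small repeated values; run-length
# encoding then collapses those repeats into (value, count) runs.

def delta_encode(values):
    """First value kept as-is, then differences between neighbors."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def rle_encode(values):
    """Collapse consecutive repeats into [value, count] runs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_delta_decode(runs):
    """Expand the runs, then prefix-sum the deltas back to values."""
    deltas = [v for v, n in runs for _ in range(n)]
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

column = [100, 101, 102, 103, 110, 117, 124]   # e.g. day numbers
runs = rle_encode(delta_encode(column))        # [[100, 1], [1, 3], [7, 3]]
assert rle_delta_decode(runs) == column
```

Seven values compress to three runs here; on real date and quantity columns the runs are far longer, which is what makes the decoded effective bandwidth exceed the raw interconnect bandwidth.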
 
Topics:
Accelerated Data Science, AI Startup, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8417
 
Abstract:
Memory bandwidths more than an order of magnitude higher than those of conventional processors have long made GPUs an attractive platform for data-intensive applications. While there are many success stories about GPU-accelerated databases built from scratch, GPU-accelerated operations for large-scale, general-purpose databases are the exception rather than the norm. We characterize fundamental database operators like scan, filter, join, and group-by based on their memory access patterns. From these characteristics, we derive their potential for GPU acceleration, such as upper bounds for performance on current and future architectures. Starting from basic GPU implementations, we deep dive into aspects such as optimizing data transfers, memory access, and data layout.
 
Topics:
Accelerated Data Science, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8289
 
Abstract:
Learn about the new features of the Unified Memory programming model for heterogeneous architectures. We'll deep dive into architecture and software changes related to Unified Memory, what they mean for developers, and how they enable easier data management and new capabilities for your applications. We'll cover in detail Unified Memory features such as on-demand paging, memory oversubscription, memory coherence, and system-wide atomics. Use cases in HPC, deep learning, and graph analytics will be provided along with initial performance results. We'll also discuss common pitfalls and optimization guidelines so you can take full advantage of Unified Memory to increase your productivity.
 
Topics:
Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7285
 
Abstract:
Maximum flow is one of the most important graph problems and has numerous applications across various computational domains: transportation networks, power routing, image segmentation, social network clustering, and recommendation systems. Many efficient algorithms have been developed for this problem, most of them trying to minimize computational complexity. However, not all of these algorithms map well to massively parallel architectures like GPUs. We'll present a novel GPU-friendly approach based on the MPM algorithm that achieves a 5 to 20 times speedup over the state-of-the-art multithreaded CPU implementation from the Galois library on general graphs with various diameters. We'll also discuss some real-world applications of the maximum flow problem in computer vision for image segmentation and in data analytics to find communities in social networks.
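The MPM algorithm itself is involved; as a minimal CPU reference for the maximum-flow problem the talk addresses, here is the classic Edmonds-Karp method (shortest augmenting paths via BFS). This illustrates the problem being solved, not the GPU-friendly MPM approach from the talk.

```python
# Edmonds-Karp maximum flow: repeatedly find a shortest augmenting path
# in the residual graph with BFS and push the bottleneck capacity along it.

from collections import deque

def max_flow(cap, s, t):
    """cap: dict-of-dicts of edge capacities; returns the max s->t flow."""
    # Build residual capacities, including zero-capacity reverse edges.
    res = {u: dict(vs) for u, vs in cap.items()}
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow          # no augmenting path left: flow is maximal
        # Recover the path, find its bottleneck, and push flow along it.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push

# Small example: two disjoint s->t routes of capacity 2 each.
cap = {"s": {"a": 3, "b": 2}, "a": {"t": 2}, "b": {"t": 3}}
assert max_flow(cap, "s", "t") == 4
```

MPM instead works level graph by level graph, saturating whole nodes rather than single paths, which exposes the bulk parallelism that maps well to GPUs.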
 
Topics:
Accelerated Data Science, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7370
 
Abstract:
Learn about the new Unified Memory programming model for heterogeneous architectures. We'll deep dive into architecture and software changes in Unified Memory, what they mean for developers, and how they enable new features for GPU applications, including on-demand paging and memory oversubscription. Use cases in HPC and other domains will be provided with initial performance projections. Unified Memory performance optimizations, such as data prefetching and location hints, will be covered along with real-world application examples.
 
Topics:
Programming Languages, Performance Optimization, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6216
 
Abstract:
Join us for a deep-dive analysis of geometric multigrid methods on GPUs. These methods have numerous applications in computer science, including combustion codes based on adaptive mesh refinement techniques. High-order multigrid schemes are actively being explored for production in many linear algebra packages, and may become a commodity within the next few years. We'll discuss performance bottlenecks and key implementation choices for current and future generation GPUs. Our analysis is based on a well-known high-performance multigrid benchmark (HPGMG). Hybrid CPU/GPU implementation using unified memory, high-order stencil optimizations, and multi-GPU scaling will be covered in detail.
 
Topics:
HPC and Supercomputing, Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6248
 
Abstract:

Running the latest versions of GPU-accelerated applications maximizes performance and improves user productivity. The latest version, NAMD 2.11, provides up to 7x* speedup on GPUs over CPU-only systems and up to 2x performance over NAMD 2.10. Watch this on-demand webinar to hear experts from NVIDIA and NAMD answer your NAMD and GPU-related questions, ranging from installation to job optimization.  *Dual CPU server, Intel E5-2698 v3@2.3GHz, NVIDIA Tesla K80 with ECC off, Autoboost on; STMV dataset.

 
Topics:
Computational Biology & Chemistry
Type:
Webinar
Event:
GTC Webinars
Year:
2016
Session ID:
GTCE126
 
Abstract:
We present a novel sparse matrix formulation that uses modified merge algorithms. In contrast to conventional sparse matrix algorithms, which suffer from data divergence within large work arrays, this method allows us to maintain contiguous data layouts at all stages of the process. This also allows us to take advantage of ideas from optimized parallel merge algorithms for efficient GPU performance. Performance comparisons are presented. We are motivated by quantum mechanical simulations of atomic systems, which are limited by the computational cost of the eigenvalue solution. Linear scaling methods have been developed which require multiplication of large sparse matrices, where the number of non-zeros per row can be relatively large although still much less than the matrix dimension.
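The core merge idea can be shown with a tiny CPU sketch (illustrative only, not the paper's formulation): combining several sorted sparse rows, stored as (column, value) pairs, with a k-way merge keeps the data contiguous and in column order, rather than scattering entries into a large work array.

```python
# Sketch of merge-based sparse row accumulation: the kind of step needed
# when several sparse rows contribute to one output row of a sparse
# matrix-matrix product. heapq.merge streams the sorted inputs in
# column order; duplicate columns are accumulated on the fly.

from heapq import merge

def merge_sparse_rows(rows):
    """rows: iterable of sorted (col, val) lists; returns their sorted sum."""
    out = []
    for col, val in merge(*rows):      # entries arrive in column order
        if out and out[-1][0] == col:
            out[-1][1] += val          # duplicate column: accumulate
        else:
            out.append([col, val])
    return [(c, v) for c, v in out]

# e.g. the contributions of two input rows to one output row
print(merge_sparse_rows([[(0, 1.0), (3, 2.0)], [(0, 0.5), (2, 4.0)]]))
# -> [(0, 1.5), (2, 4.0), (3, 2.0)]
```

Because every stage consumes and produces sorted, contiguous streams, the same pattern maps onto optimized parallel merge primitives on the GPU.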
 
Topics:
Life & Material Science, Developer - Algorithms, Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2015
Session ID:
S5443
 
Abstract:

Learn about various methods and trade-offs in the distributed GPU implementation of a molecular dynamics proxy application that achieves more than 90% weak scaling efficiency on 512 GPU nodes. CoMD represents a reference implementation of classical molecular dynamics algorithms and workloads. It is created and maintained by the Exascale Co-Design Center for Materials in Extreme Environments (ExMatEx) and is part of the R&D100 Award-winning Mantevo 1.0 software suite. In this talk we will discuss the main techniques and methods involved in the GPU implementation of CoMD, including (1) cell-based and neighbor-list approaches for neighbor particle search, and (2) different thread-mapping strategies and memory layouts. An efficient distributed implementation will be covered in detail. Interior/boundary cell separation is used to allow efficient asynchronous processing and concurrent execution of kernels, memory copies, and MPI transfers.

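A minimal CPU sketch of the cell-based neighbor search mentioned above (2D here for brevity; CoMD works in 3D): space is divided into cells at least as wide as the interaction cutoff, so each particle only needs to examine its own and adjacent cells instead of all pairs.

```python
# Cell-list neighbor search: bin particles into grid cells of side
# >= cutoff, then test only pairs from the 3x3 neighborhood of each
# cell (3x3x3 in 3D), turning an O(N^2) search into roughly O(N).

from collections import defaultdict
from itertools import product

def neighbors_within(positions, cutoff):
    """Return the set of pairs (i, j), i < j, closer than cutoff."""
    cells = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        cells[(int(x // cutoff), int(y // cutoff))].append(i)
    pairs = set()
    for (cx, cy), members in cells.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            for i in members:
                for j in cells.get((cx + dx, cy + dy), []):
                    if i < j:
                        xi, yi = positions[i]
                        xj, yj = positions[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= cutoff ** 2:
                            pairs.add((i, j))
    return pairs
```

On the GPU the same structure also dictates the thread mapping the talk discusses (e.g. one thread or warp per cell versus per particle), and the interior/boundary cell split lets boundary cells be exchanged over MPI while interior cells are processed.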
 
Topics:
Molecular Dynamics, Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4465
 
 