GTC ON-DEMAND

 
Abstract:

NVIDIA's DGX-2 system offers a unique architecture that connects 16 GPUs via the high-speed NVLink interface and NVSwitch, which together enable unprecedented bandwidth between processors. This talk takes an in-depth look at the properties of this system, along with programming techniques for taking maximum advantage of the system architecture.
 
Topics:
Performance Optimization, Programming Languages, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9241
 
 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6176
 
Abstract:
In this session, we present a CUDA implementation of a backprojection kernel for cone-beam computed tomography (CBCT) and study the kernel's performance from a GPU architectural point of view. We explain how we measured the utilization of different GPU components by this kernel, with a focus on the Kepler and Maxwell architectures.
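For readers unfamiliar with the operation being profiled, here is a rough, illustrative sketch of backprojection, reduced to 2D parallel-beam geometry on the CPU (the session's actual kernel targets cone-beam geometry in CUDA; all names and the geometry simplification here are my own):

```cpp
#include <cmath>
#include <vector>

// Illustrative 2D parallel-beam backprojection (not the session's kernel).
// sino[a * nDet + d] holds the projection value at angle index a, detector bin d.
// For each image pixel and each angle, project the pixel onto the detector
// and accumulate the corresponding sinogram sample.
std::vector<float> backproject(const std::vector<float>& sino,
                               int nAngles, int nDet, int n) {
    const float kPi = 3.14159265358979f;
    std::vector<float> img(n * n, 0.0f);
    float c  = (n - 1) / 2.0f;      // image center
    float dc = (nDet - 1) / 2.0f;   // detector center
    for (int a = 0; a < nAngles; ++a) {
        float th = a * kPi / nAngles;
        float ct = std::cos(th), st = std::sin(th);
        for (int y = 0; y < n; ++y) {
            for (int x = 0; x < n; ++x) {
                // Detector coordinate of this pixel at this view angle.
                float t = (x - c) * ct + (y - c) * st + dc;
                int d = (int)std::lround(t);  // nearest-neighbor interpolation
                if (d >= 0 && d < nDet)
                    img[y * n + x] += sino[a * nDet + d];
            }
        }
    }
    return img;
}
```

The per-pixel accumulation is independent across pixels, which is what makes the operation map naturally onto one GPU thread per voxel.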
 
Topics:
Medical Imaging & Radiology, Signal and Audio Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2015
Session ID:
S5534
 
Abstract:
With computational rates in the teraflops, GPUs can accumulate round-off errors at an alarming rate. The errors are no different from those on other IEEE-754-compliant hardware, but GPUs are commonly used for much more intense calculations, so the concern for error is, or should be, significantly greater. In this talk, we examine the accumulation of round-off errors in the n-body application from the CUDA SDK, showing how varied the results can be depending on the order of operations. We then explore a solution that tracks the accumulated errors, motivated by the methods suggested by Kahan (Kahan summation) and by Gustavson, Moreira & Enekel (from their work on stability and accuracy regarding Java portability). The result is a dramatic reduction in round-off error, typically yielding the nearest floating-point value to the infinitely precise answer. Furthermore, we show that the performance impact of tracking the errors is small, even for numerically intense algorithms such as the n-body algorithm.
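The Kahan summation mentioned in the abstract is a well-known compensated-summation technique. As a minimal host-side sketch (the function name is my own; the talk's GPU variant is more elaborate):

```cpp
#include <vector>

// Kahan (compensated) summation: carries the running round-off error in a
// separate compensation term so it can be folded back into later additions.
float kahan_sum(const std::vector<float>& xs) {
    float sum = 0.0f;
    float c   = 0.0f;                  // accumulates the lost low-order bits
    for (float x : xs) {
        float y = x - c;               // apply the correction from last step
        float t = sum + y;             // big + small: low-order bits of y drop
        c = (t - sum) - y;             // algebraically zero; recovers what was dropped
        sum = t;
    }
    return sum;
}
```

Summing 1.0f followed by ten million values of 1e-8f illustrates the effect: naive single-precision accumulation never moves off 1.0, because each addend is below half an ulp of the running sum, while the compensated version recovers the expected 1.1.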
 
Topics:
Numerical Algorithms & Libraries, Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4370
 
Abstract:
In an effort to explore the power of atomic memory operations on the GPU, we have created a sorting algorithm based on insertion-sort, using concurrent skiplists implemented with atomic memory operations. Skiplists provide two key features: insertion requires only log(n) work, and skiplists can be implemented in a lock-free style, allowing thousands of concurrent threads to perform insertions with minimal interference. Atomic memory operations are essential for achieving lock-free and wait-free algorithms. We present the skiplist-insertion-sort algorithm, its performance, and some key statistics about atomic memory operations on NVIDIA GPUs.
 
Topics:
Developer - Algorithms, Programming Languages
Type:
Poster
Event:
GTC Silicon Valley
Year:
2013
Session ID:
P3254
 
Abstract:

Atomic memory operations provide powerful communication and coordination capabilities for parallel programs, including the well-known compare-and-swap and fetch-and-add operations. Atomic operations enable parallel algorithms and data structures that would otherwise be very difficult (or impossible) to express - for example, shared parallel data structures, parallel data aggregation, and control primitives such as semaphores and mutexes. In this talk, we use examples to describe atomic operations, explain how they work, and discuss performance considerations and pitfalls when using them.
 
Topics:
Programming Languages, Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S2313
 
 