GTC ON-DEMAND

Abstract:
We'll discuss cuTENSOR, a high-performance CUDA library for tensor operations that efficiently handles the ubiquitous presence of high-dimensional arrays (i.e., tensors) in today's HPC and DL workloads. This library supports highly efficient tensor operations such as tensor contractions (a generalization of matrix-matrix multiplications), point-wise tensor operations such as tensor permutations, and tensor decompositions (a generalization of matrix decompositions). While providing high performance, cuTENSOR also allows users to express their mathematical equations for tensors in a straightforward way that hides the complexity of dealing with these high-dimensional objects behind an easy-to-use API. CUDA 10.1 enables CUDA programmers to utilize Tensor Cores directly with the new mma.sync instruction. In this presentation, we describe the functionality of mma.sync and present strategies for implementing efficient matrix multiply computations in CUDA that maximize performance on NVIDIA Volta GPUs. We then describe how CUTLASS 1.3 provides reusable components embodying these strategies. CUTLASS 1.3 demonstrates a median 44% speedup of CUDA kernels executing layers from real-world Deep Learning workloads.
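For context on the mma.sync portion of this talk: the sketch below uses the CUDA C++ WMMA API (nvcuda::wmma, available since CUDA 9), which compiles down to the same Volta Tensor Core operations that mma.sync exposes at the PTX level. This is a minimal illustrative kernel, not code from the session; the 16x16x16 half-precision tile is one of the fragment shapes WMMA supports on Volta (sm_70).

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B.
// A is 16 x K row-major, B is K x 16 column-major, C is 16 x 16 row-major.
// Launch with exactly one warp, e.g. wmma_tile<<<1, 32>>>(A, B, C, K);
__global__ void wmma_tile(const half* A, const half* B, float* C, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March down the K dimension in 16-wide steps; each mma_sync issues
    // Tensor Core instructions for the whole warp cooperatively.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + k, K);        // leading dimension K
        wmma::load_matrix_sync(b_frag, B + k, K);        // leading dimension K
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c += a * b
    }
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

CUTLASS builds its Volta Tensor Core GEMMs from warp-level tiles like this one, adding the shared-memory staging and software pipelining needed to keep the Tensor Cores fed.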
 
Topics:
Computational Biology & Chemistry, Tools & Libraries, HPC and AI
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9593
 
Abstract:
Audience members will learn how to implement efficient Deep Learning computations using CUDA C++ in the context of CUTLASS. CUTLASS is an open-source collection of C++ template abstractions for implementing high-performance matrix multiplication (GEMM) at all levels of the CUDA thread hierarchy. We will describe many of the algorithmic strategies used by cuBLAS and cuDNN, and how they can be implemented using C++ templates to cover an extensive space of problem sizes, data layouts, and data types. In particular, we will emphasize how to support alternative and mixed-precision math operations such as Pascal's integer DP4A operation and Volta's Tensor Cores. Finally, we will illustrate how CUTLASS primitives can be combined with custom functionality to implement related algorithms such as convolution. Although this talk highlights CUTLASS, the architecture concepts and algorithm details are relevant to any CUDA programmer focused on Deep Learning.
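For a sense of how these template abstractions compose at the outermost level, here is a single-precision GEMM through CUTLASS's device-wide interface. One caveat: this sketch uses the cutlass::gemm::device::Gemm API of later CUTLASS 2.x releases rather than the 1.x traits-based interface the talk describes; the underlying thread-block/warp/thread decomposition is the same idea.

```cuda
#include <cutlass/gemm/device/gemm.h>

// C = alpha * A * B + beta * C, with all matrices column-major float.
// Element types, layouts, and tile sizes are template parameters, which is
// how one code base covers a large space of problem sizes, data layouts,
// and data types.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C

cudaError_t cutlass_sgemm(int M, int N, int K,
                          float alpha, float const* A, int lda,
                          float const* B, int ldb,
                          float beta, float* C, int ldc) {
    Gemm gemm_op;
    // A single host-side call configures and launches the tiled GEMM kernel.
    cutlass::Status status = gemm_op({{M, N, K},
                                      {A, lda}, {B, ldb},
                                      {C, ldc}, {C, ldc},   // source and destination C
                                      {alpha, beta}});
    return status == cutlass::Status::kSuccess ? cudaSuccess : cudaErrorUnknown;
}
```

Swapping the element types, layouts, and the architecture/math-operation template parameters selects correspondingly specialized kernels, for example int8 operands targeting DP4A or half-precision operands targeting Tensor Cores.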
 
Topics:
Algorithms & Numerical Techniques, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8854
 
Abstract:
GPU Ocelot is an open-source dynamic JIT compilation framework for GPU compute applications targeting a range of GPU and non-GPU execution targets. Ocelot supports CUDA applications and provides an implementation of the CUDA Runtime API, enabling seamless integration with existing CUDA applications. Its JIT compiler supports four backend execution targets: (1) an emulator that implements NVIDIA's Parallel Thread Execution (PTX) instruction set architecture, (2) NVIDIA GPUs, (3) AMD GPUs, and (4) a translator to LLVM for efficient parallel execution of GPU kernels on multicore CPUs.
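Because Ocelot re-implements the CUDA Runtime API, retargeting an application is a link-time decision rather than a source change. The program below is deliberately ordinary CUDA; the build commands in the trailing comment are illustrative of how an Ocelot build differs from a stock one (exact flags and configuration details vary by Ocelot version).

```cuda
// saxpy.cu -- nothing Ocelot-specific appears in the source.
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}

// Stock CUDA build:  nvcc saxpy.cu -o saxpy             (links libcudart)
// Ocelot build:      nvcc -c saxpy.cu && g++ saxpy.o -locelot -o saxpy
// The backend (PTX emulator, NVIDIA GPU, AMD GPU, or LLVM-on-CPU) is then
// selected through Ocelot's runtime configuration file, not by recompiling.
```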
 
Topics:
Programming Languages
Type:
Poster
Event:
GTC Silicon Valley
Year:
2012
Session ID:
P2534
 
Speakers:
Andrew Kerr, Gregory Diamos, Sudhakar Yalamanchili (Georgia Institute of Technology)
Abstract:
Learn how to debug and profile CUDA applications using GPU-Ocelot. Ocelot is a compilation and emulation framework for CUDA that includes debugging and profiling tools as well as backend compilers for NVIDIA GPUs and x86 CPUs. We will present examples of applications developed on x86 CPUs and deployed on NVIDIA GPUs. We will also discuss memory checking, race detection, and deadlock detection tools available within Ocelot.
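The class of defect these tools target is easy to illustrate. The kernel below contains a one-element out-of-bounds write that real hardware may execute silently; running the same binary on Ocelot's PTX emulator with memory checking enabled can trap the bad store. This sketch is illustrative and not from the session.

```cuda
#include <cstdio>

// Off-by-one bug: with <=, the thread with i == n stores one element past
// the end of a buffer holding n floats. On hardware this can silently
// corrupt adjacent memory; an emulator-based memory checker can trap the
// store and report the faulting thread and address.
__global__ void fill(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n)              // bug: should be i < n
        buf[i] = 1.0f;
}

int main() {
    const int n = 1000;
    float* buf;
    cudaMalloc(&buf, n * sizeof(float));
    fill<<<(n + 255) / 256, 256>>>(buf, n);  // 1024 threads for 1000 elements
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```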
 
Topics:
Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2010
Session ID:
S102210
 
 