GTC ON-DEMAND

Abstract:

Less code, more performance! Runtime compilation with NVRTC offers many potential benefits to new and existing codes, but also presents challenges when it comes to implementation. To help solve this dilemma, we've developed a small C++ library called "Jitify" that hides the complexities of runtime compilation behind a simple, high-level interface. Jitify takes care of issues like kernel caching, template instantiation, type reflection, and compilation of host code for the device. It also provides a convenient parallel_for function and lambda wrapper that enables dynamic runtime selection of host or device execution. Since source code passed to NVRTC does not require CUDA-specific annotations, porting a large C++ code to CUDA using Jitify can be as simple as replacing a for loop with Jitify's parallel_for construct. We'll present some examples of Jitify in action, demonstrating how it enables better code generation, faster compilation times, and rapid code porting.
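
For illustration, here is a minimal sketch of the kind of usage Jitify enables, in the style of the library's public examples. The kernel source, variable names, and launch configuration are illustrative, and exact API details may differ between Jitify versions.

    // Hedged sketch of Jitify-style runtime compilation; names and details are illustrative.
    #include <cuda_runtime.h>
    #include "jitify.hpp"

    const char* const program_source =
        "my_program\n"                    // first line names the program
        "template<typename T>\n"
        "__global__ void scale(T* data, T factor, int n) {\n"
        "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "  if (i < n) data[i] *= factor;\n"
        "}\n";

    int main() {
      const int n = 1024;
      float* d_data = nullptr;
      cudaMalloc(&d_data, n * sizeof(float));

      static jitify::JitCache kernel_cache;                  // caches compiled kernels across calls
      jitify::Program program = kernel_cache.program(program_source);

      dim3 block(256), grid((n + block.x - 1) / block.x);
      program.kernel("scale")
          .instantiate(jitify::reflection::Type<float>())    // type reflection supplies the template argument
          .configure(grid, block)
          .launch(d_data, 2.0f, n);

      cudaFree(d_data);
      return 0;
    }

As the abstract notes, the same unannotated source can also be dispatched to the host at runtime via Jitify's parallel_for wrapper, which is what makes incremental porting of existing loops attractive.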

 
Topics:
Tools & Libraries, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7716
 
Abstract:
We'll present Bifrost, a lightweight new framework designed to ease the development and deployment of pipeline applications that demand sustained peak utilization of network, CPU, and GPU resources under soft real-time constraints. Such applications are common in experimental science and computer vision, where processing must keep up with acquisition systems to avoid data loss. Bifrost enables operations to be wrapped in a simple task container with metadata-rich inputs and outputs. By connecting tasks together, complex branching pipelines can be constructed, with asynchronous communication handled by efficient ring buffers in host or device memory. We'll demonstrate Bifrost using a high-performance radio astronomy application that has been deployed as part of the LEDA project.
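
As a rough illustration of the ring-buffer idea described above (not Bifrost's actual API), the sketch below shows a minimal host-side single-producer/single-consumer ring that lets one pipeline task hand blocks of data to the next asynchronously; all names are hypothetical.

    // Illustrative sketch only (not Bifrost's API): a fixed-capacity ring buffer
    // for passing data blocks asynchronously between pipeline tasks.
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    template <typename Block>
    class RingBuffer {
     public:
      explicit RingBuffer(std::size_t capacity) : slots_(capacity) {}

      void push(Block block) {               // producer task: waits while the ring is full
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return count_ < slots_.size(); });
        slots_[head_] = std::move(block);
        head_ = (head_ + 1) % slots_.size();
        ++count_;
        not_empty_.notify_one();
      }

      Block pop() {                          // consumer task: waits while the ring is empty
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return count_ > 0; });
        Block block = std::move(slots_[tail_]);
        tail_ = (tail_ + 1) % slots_.size();
        --count_;
        not_full_.notify_one();
        return block;
      }

     private:
      std::vector<Block> slots_;
      std::size_t head_ = 0, tail_ = 0, count_ = 0;
      std::mutex m_;
      std::condition_variable not_empty_, not_full_;
    };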
 
Topics:
Astronomy & Astrophysics, Tools & Libraries, Signal and Audio Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6627
 
Abstract:
We will present the results of an investigation into speeding up and improving the power efficiency of dense matrix multiplications in CUDA. These techniques give an effective compute rate greater than the peak performance of a GPU, allowing us to approach 10 TFLOPS sustained in matrix multiplication on a single GPU. Techniques applied include exploiting Gauss's complex multiplication algorithm and implementing a Strassen-like algorithm to reduce the computational cost below the naive O(n^3). We will discuss how the power efficiency of these dense linear-algebra computations can be improved through the choice of tile size and input word size. Results from the Tesla K80 will show that improving power efficiency amounts to improving absolute performance.
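
To make the first of those techniques concrete, Gauss's trick forms a complex product with three real multiplications instead of four, trading multiplications for additions. The helper below (an illustrative sketch, not code from the talk) shows the identity.

    // Illustrative sketch: Gauss's 3-multiplication complex product (a + bi)(c + di).
    #include <cuComplex.h>

    __host__ __device__ inline cuFloatComplex gauss_cmul(cuFloatComplex x, cuFloatComplex y) {
      float a = cuCrealf(x), b = cuCimagf(x);
      float c = cuCrealf(y), d = cuCimagf(y);
      float t1 = c * (a + b);                 // multiplication 1
      float t2 = a * (d - c);                 // multiplication 2
      float t3 = b * (c + d);                 // multiplication 3
      return make_cuFloatComplex(t1 - t3,     // real part: ac - bd
                                 t1 + t2);    // imaginary part: ad + bc
    }

Applied inside a complex matrix multiplication, this three-multiplication form is what allows the effective compute rate to exceed the nominal multiply-add peak of the GPU.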
 
Topics:
Developer - Algorithms, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2015
Session ID:
S5601
 
Abstract:
How do you cross-correlate 10,000 signals 100 million times per second? This is an example of the type of compute-bound problem facing modern radio astronomy, which, paralleling the paradigm shift in computing architectures, has transitioned from monolithic single-dish telescopes to massive arrays of smaller antennas. In this session we will describe how general-purpose HPC installations can be used to scale a cross-correlation pipeline to petascale with all the flexibility of a purely software implementation. Optimisations we will discuss include tuning of the GPU cross-correlation kernel, maximising concurrency between compute and network operations, and minimising bandwidth bottlenecks in a streaming application. GPUs are already powering the world's biggest radio telescope arrays, and this work paves the way for entirely off-the-shelf correlators for the future exascale generation of instruments.
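
For readers unfamiliar with the computation, the untuned kernel below sketches what cross-correlation means here: every antenna pair accumulates x_i(t) * conj(x_j(t)) over time. It is for exposition only, not the optimised kernel discussed in the session; the data layout and names are assumptions.

    // Naive cross-correlation ("X-engine") sketch; illustration only, not the tuned kernel.
    #include <cuComplex.h>

    __global__ void xcorr_naive(const cuFloatComplex* __restrict__ x,  // [ntime][nant] samples
                                cuFloatComplex* __restrict__ R,        // [nant][nant] visibilities
                                int nant, int ntime) {
      int i = blockIdx.y * blockDim.y + threadIdx.y;   // antenna i
      int j = blockIdx.x * blockDim.x + threadIdx.x;   // antenna j
      if (i >= nant || j > i) return;                  // R is Hermitian: lower triangle only
      cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
      for (int t = 0; t < ntime; ++t) {
        acc = cuCaddf(acc, cuCmulf(x[t * nant + i], cuConjf(x[t * nant + j])));
      }
      R[i * nant + j] = acc;
    }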
 
Topics:
Astronomy & Astrophysics, Signal and Audio Processing, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4511
 
Abstract:

Radio astronomy is a real-time signal processing application that requires extreme supercomputing. While today's radio telescopes require 10-100 Tflops of computational power, by the end of the decade this will increase into the Exaflops regime, driven by the Hydrogen Epoch of Reionization Array (HERA) and the Square Kilometer Array (SKA). The most compute-intensive part of this problem is the so-called cross-correlation algorithm, which can be recast as a linear-algebra problem similar in spirit to DGEMM. In this session we describe the cross-correlation engine that powers the pathfinder LEDA radio telescope and has been (re)optimized for the Kepler GK110 architecture to achieve over 2.5 Tflops in sustained performance. This level of efficiency is critical to meeting strict power and space constraints imposed by the instrument's remote location.
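
The "similar in spirit to DGEMM" observation can be made concrete: with X holding the complex samples of nant antennas over ntime samples, the correlation matrix is R = X X^H, a Hermitian rank-k update. The sketch below expresses that mapping with cuBLAS's cublasCherk purely for illustration; it is not the optimized engine described in the session, and the wrapper name and data layout are assumptions.

    // Conceptual sketch: correlation as a Hermitian rank-k update, R += X * X^H.
    // Column-major X is nant x ntime; R is nant x nant. Illustration only.
    #include <cublas_v2.h>
    #include <cuComplex.h>

    void correlate_blas(cublasHandle_t handle, const cuComplex* d_X, cuComplex* d_R,
                        int nant, int ntime) {
      const float alpha = 1.0f;   // weight of the new block of samples
      const float beta  = 1.0f;   // keep previously accumulated visibilities
      cublasCherk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                  nant, ntime, &alpha, d_X, nant, &beta, d_R, nant);
    }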

 
Topics:
Astronomy & Astrophysics, Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S3497
 
 