GTC ON-DEMAND

 
Abstract:
We'll share what we learned by running the QUDA open source library for lattice quantum chromodynamics (LQCD) on modern GPU systems, which provide myriad hardware and software techniques to improve strong scaling. We ran QUDA, which provides GPU acceleration for LQCD applications such as MILC and Chroma, on six generations of GPUs with various network and node configurations, including IBM POWER9- and x86-based systems and NVLink- and NVSwitch-based systems. Based on those experiences, we'll discuss best practices for scaling to hundreds and even thousands of GPUs. We'll also cover peer-to-peer memory access, GPUDirect RDMA, and NVSHMEM, as well as techniques such as auto-tuning kernel launch configurations.
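Auto-tuning a kernel launch configuration typically means timing the kernel once per candidate configuration, keeping the fastest, and caching the result so later launches with the same key skip the search. The Python sketch below illustrates only that idea; it is not QUDA's tuner, and the names `autotune`, `time_launch`, and the cache layout are hypothetical. A real `run` callable would launch a GPU kernel with the given block size and synchronize the device before returning.

```python
import time
from typing import Callable, Dict, Hashable, Iterable, Optional

# Hypothetical tuning cache: maps a kernel "key" (for example kernel name,
# lattice volume, precision) to the best block size found for it.
_tune_cache: Dict[Hashable, int] = {}


def time_launch(run: Callable[[int], None], block_size: int, repeats: int = 3) -> float:
    """Best wall-clock time over `repeats` calls of run(block_size).

    For a real GPU kernel, `run` must synchronize the device before returning,
    otherwise only the asynchronous launch overhead is measured.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run(block_size)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(
    key: Hashable,
    candidates: Iterable[int],
    run: Callable[[int], None],
    measure: Callable[[Callable[[int], None], int], float] = time_launch,
) -> int:
    """Return the fastest block size for `key`, benchmarking only on the first call."""
    if key in _tune_cache:
        return _tune_cache[key]  # reuse the earlier result, no re-timing
    best_size: Optional[int] = None
    best_time = float("inf")
    for block_size in candidates:
        elapsed = measure(run, block_size)
        if elapsed < best_time:
            best_size, best_time = block_size, elapsed
    if best_size is None:
        raise ValueError("no candidate launch configurations given")
    _tune_cache[key] = best_size
    return best_size
```

Keying the cache on problem parameters rather than only the kernel name reflects the fact that the best launch configuration can change with problem size and precision.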
 
Topics:
HPC and Supercomputing, Performance Optimization, Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9708
 
Abstract:
In this session we explore how to analyze and optimize the performance of kernels running on the GPU. Working with a real-world example, we will walk through an analysis-driven process leading to a series of kernel-level optimizations, using NVIDIA's profiling tools as an example. Attendees will learn about the fundamental performance limiters (instruction throughput, memory throughput, and latency), and we will present strategies to identify and tackle each type of limiter.
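One common rule of thumb for telling these three limiters apart is to compare a kernel's achieved compute and memory throughput against the hardware peaks: whichever is closer to its peak is the likely limiter, and if neither is high the kernel is probably latency-bound. The sketch below assumes floating-point operations as a stand-in for instruction throughput and a 60% utilization threshold; both are illustrative assumptions, and the function name is hypothetical. In practice the utilization figures would come from a profiler rather than hand-counted operations.

```python
def classify_limiter(
    flops: float,
    bytes_moved: float,
    elapsed_s: float,
    peak_flops_per_s: float,
    peak_bytes_per_s: float,
    threshold: float = 0.6,  # assumed cutoff for "close to peak"; adjust per hardware
) -> str:
    """Rough guess at a kernel's main performance limiter from throughput utilization."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    # Fraction of peak achieved; flops stand in for instruction throughput here.
    compute_util = flops / elapsed_s / peak_flops_per_s
    memory_util = bytes_moved / elapsed_s / peak_bytes_per_s
    if compute_util < threshold and memory_util < threshold:
        # Neither unit is busy: the kernel is likely waiting on latency
        # (too little parallelism to hide memory or instruction latency).
        return "latency"
    return "instruction throughput" if compute_util >= memory_util else "memory throughput"
```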
 
Topics:
Performance Optimization, Tools & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8630
 
Abstract:

In this session we explore how to analyze and optimize the performance of GPU-accelerated applications. Working with a real-world example, attendees will learn how to analyze application performance by measuring data transfers, unified memory page migrations, and inter-GPU communication, and by performing critical path analysis. Using the example application and NVIDIA's profiling tools as an example tool set, we will walk through various optimizations and discuss their impact on the performance of the whole application. This session is accompanied by Session S7444, which considers performance optimization of GPU kernels.
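Critical path analysis treats an application's activities (copies, kernels, communication) as a dependency graph and finds the longest duration-weighted chain through it; only speeding up activities on that chain shortens the total run time. The sketch below is a simplified illustration of that idea, not how NVIDIA's tools implement it: the activity names and durations are hypothetical, and dependencies are given explicitly instead of being recovered from a trace.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional, Sequence, Tuple


def critical_path(
    durations: Dict[str, float], deps: Dict[str, Sequence[str]]
) -> Tuple[float, List[str]]:
    """Longest duration-weighted dependency chain through a DAG of activities.

    durations: activity name -> run time (every activity must appear here)
    deps: activity name -> activities that must finish before it can start
    """
    graph = {task: tuple(deps.get(task, ())) for task in durations}
    finish: Dict[str, float] = {}
    via: Dict[str, Optional[str]] = {}  # predecessor that determined the start time
    for task in TopologicalSorter(graph).static_order():
        start, pred_on_path = 0.0, None
        for pred in graph[task]:
            if finish[pred] > start:  # the latest-finishing predecessor gates the start
                start, pred_on_path = finish[pred], pred
        finish[task] = start + durations[task]
        via[task] = pred_on_path
    end = max(finish, key=finish.get)  # activity that finishes last
    path: List[str] = []
    node: Optional[str] = end
    while node is not None:  # walk back along the gating predecessors
        path.append(node)
        node = via[node]
    return finish[end], path[::-1]
```

For example, with a host-to-device copy (2 units) feeding `kernel_a` (5 units), an independent `kernel_b` (3 units), and a device-to-host copy (1 unit) that waits on both kernels, the critical path is the first copy, `kernel_a`, and the final copy, with length 8. Shortening `kernel_b` leaves that length unchanged because it is off the critical path.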

 
Topics:
Performance Optimization, Algorithms & Numerical Techniques, Tools & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7445
 
Abstract:

On the path to exascale, high performance computing adopts ever wider processors that need more parallelism. The energy required to move data and the available bandwidth pose significant challenges. See how an efficient implementation of iterative Krylov solvers can help deal with these issues. As an example, we use the block conjugate gradient solver in QUDA, a library for lattice quantum chromodynamics. We demonstrate how an efficient implementation can overcome scaling issues and achieve a 10X speedup compared to a regular conjugate gradient solver.
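Block CG solves A X = B for several right-hand sides at once, sharing one block of search directions and replacing CG's scalar step sizes with small s x s matrices. The NumPy sketch below follows the textbook block CG recurrences for a dense symmetric positive definite matrix; it is not QUDA's implementation and makes no claim about the speedup quoted above. The usual performance argument is that the product `A @ P` applies the operator to all s vectors together, so the operator's data is loaded once per iteration instead of once per right-hand side. The sketch also omits the deflation or re-orthogonalization that practical implementations use when residual columns become nearly linearly dependent.

```python
from typing import Optional

import numpy as np


def block_cg(
    A: np.ndarray, B: np.ndarray, tol: float = 1e-10, maxiter: Optional[int] = None
) -> np.ndarray:
    """Solve A X = B for all columns of B at once (A symmetric positive definite)."""
    n, s = B.shape
    maxiter = maxiter if maxiter is not None else 10 * n
    X = np.zeros((n, s))
    R = B - A @ X            # block of residuals, one column per right-hand side
    P = R.copy()             # block of search directions
    RtR = R.T @ R            # s x s Gram matrix of the residuals
    b_norm = np.linalg.norm(B, axis=0).max()
    for _ in range(maxiter):
        Q = A @ P                                # operator applied to all s directions at once
        alpha = np.linalg.solve(P.T @ Q, RtR)    # s x s solve replaces CG's scalar division
        X += P @ alpha
        R -= Q @ alpha
        if np.linalg.norm(R, axis=0).max() <= tol * b_norm:
            break                                # every column has converged
        RtR_new = R.T @ R
        beta = np.linalg.solve(RtR, RtR_new)     # keeps new directions A-conjugate to old ones
        P = R + P @ beta
        RtR = RtR_new
    return X
```

With s = 1 the recurrences reduce to ordinary CG, since alpha and beta become the familiar scalar ratios.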

 
Topics:
Computational Physics, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7387
 
Abstract:

We'll present a real CUDA application and use NVIDIA Nsight Eclipse Edition on Linux to optimize the performance of the code. Attendees will learn a method to analyze their codes and how to use the tools to apply those ideas.

 
Topics:
Performance Optimization, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6111
 
Abstract:

Accelerators have become a key ingredient in HPC. GPUs had a head start and are already widely used in HPC applications, but they now face competition from Intel's Xeon Phi accelerators. The latter promise comparable performance and easier portability, and they even feature higher memory bandwidth, which is key to good performance for a wide range of bandwidth-bound HPC applications. In this session we compare their performance using a Lattice QCD application as a case study. We give a short overview of the relevant features of the architectures and discuss some implementation details. Learn about the effort it takes to achieve great performance on both architectures. See which accelerator is more energy efficient and which one takes the performance crown at about 500 GFlop/s.

 
Topics:
Computational Physics, Performance Optimization, Data Center & Cloud Infrastructure
Type:
Talk
Event:
GTC Silicon Valley
Year:
2015
Session ID:
S5447
 
Abstract:
See how advances in GPU computing enable us to simulate Quantum Chromodynamics and learn about fundamental properties of strongly interacting matter, i.e., quarks and gluons, at finite temperatures. With the advances in hardware and algorithms, these simulations have reached a level that allows for a quantitative comparison with experimental data from heavy-ion colliders. Discover how the Kepler architecture helps us boost the performance of the simulations and reach a new level of precision. I will discuss selected optimizations for the Kepler K20 cards and modifications to prepare the code for the Titan supercomputer. Furthermore, I compare our in-house code with available libraries such as QUDA and discuss the pros and cons of each.
 
Topics:
Computational Physics, Numerical Algorithms & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4453
 
Abstract:

Discover how data from experiments at heavy-ion colliders (the Relativistic Heavy Ion Collider at Brookhaven National Lab and the Large Hadron Collider at CERN) can immediately be compared with first-principles simulations of Quantum Chromodynamics (QCD) to quantitatively probe the fundamental properties of strongly interacting matter, i.e., quarks and gluons at high temperature. The conditions realized in these experiments governed the early evolution of the universe. The high precision needed for these comparisons is obtained by performing our calculations entirely on the GPU. In doing so we simultaneously face a low flop/byte ratio and high register pressure. See how we deal with these complications and achieve high performance on the Bielefeld GPU cluster with 400 Fermi GPUs.
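To make the low flop/byte ratio concrete, the sketch below counts operations for a single SU(3) matrix times color-vector product, a core building block of lattice QCD operators. It assumes an uncompressed 3x3 complex matrix (18 reals), no cache reuse, and counts a complex multiply as 6 real flops; real Dirac operators combine several such products and may store the gauge links in compressed form, so their ratios differ. A simple roofline bound, min(peak, intensity * bandwidth), then shows why such kernels are limited by memory bandwidth. The function names are illustrative.

```python
def su3_matvec_flops() -> int:
    """Real flops in one 3x3 complex matrix times complex 3-vector product."""
    complex_mul = 6  # (a+bi)(c+di): 4 real multiplies + 2 real adds
    complex_add = 2
    # Each of the 3 output components sums 3 complex products with 2 complex additions.
    return 3 * (3 * complex_mul + 2 * complex_add)


def su3_matvec_bytes(bytes_per_real: int) -> int:
    """Memory traffic assuming no cache reuse and an uncompressed 18-real matrix."""
    reals_read = 18 + 6  # matrix + input vector
    reals_written = 6    # output vector
    return (reals_read + reals_written) * bytes_per_real


def arithmetic_intensity(bytes_per_real: int) -> float:
    """Flop/byte ratio of the product (4 bytes per real = single, 8 = double)."""
    return su3_matvec_flops() / su3_matvec_bytes(bytes_per_real)


def roofline_gflops(intensity: float, peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Attainable GFlop/s: capped by compute peak or by bandwidth times intensity."""
    return min(peak_gflops, intensity * bandwidth_gb_s)
```

This gives 66 flops per product and 0.55 flop/byte in single precision or 0.275 flop/byte in double precision. With hypothetical hardware numbers, a device with 500 GFlop/s peak and 150 GB/s of bandwidth would be capped at about 41 GFlop/s for this product in double precision, far below its compute peak.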

 
Topics:
Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S3153
 
 