GTC ON-DEMAND

 
Abstract:
Learn about the Kokkos C++ Performance Portability EcoSystem, a production-level solution for writing modern C++ applications in a hardware-agnostic way. The ecosystem is part of the U.S. Department of Energy's Exascale Project, a national effort to prepare the HPC community for the next generation of supercomputing platforms. We'll give an overview of what the Kokkos EcoSystem provides, including its programming model, math kernels library, tools, and training resources. We'll provide success stories for Kokkos adoption in large production applications on the leading supercomputing platforms in the U.S. We'll focus particularly on early results from two of the world's most powerful supercomputers, Summit and Sierra, both powered by NVIDIA Tesla V100 GPUs. We will also describe how the Kokkos EcoSystem anticipates the next generation of architectures and share early experiences of using NVSHMEM incorporated into Kokkos.
 
Topics:
Programming Languages, HPC and AI
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9662
 
Abstract:
The C++17 and Fortran 2018 language standards include parallel programming constructs well suited for GPU computing. The C++17 parallel STL (pSTL) was designed with the intent of supporting GPU parallel programming. The Fortran 2018 do concurrent construct, with its shared and private variable clauses, can be used to express loop-level parallelism across multiple array index ranges. We will share our experiences and results implementing support for these constructs in the PGI C++ and Fortran compilers for NVIDIA GPUs, and explain the capabilities and limitations they offer HPC programmers. You will learn how to use OpenACC as a bridge to GPU and parallel programming with standard C++ and Fortran, and we will present additional features we hope and expect will become a part of those standards.
 
Topics:
Programming Languages
Type:
Talk
Event:
Supercomputing
Year:
2018
Session ID:
SC1838
 
Abstract:
Kokkos is a programming model developed at Sandia National Laboratories for enabling application developers to achieve performance portability for C++ codes. It is now the primary programming model at Sandia to port production-level applications to modern architectures, including GPUs. We'll discuss the core abstractions of Kokkos for parallel execution as well as data management, and how they are used to provide a critically important set of capabilities for the efficient implementation of a wide range of HPC algorithms. We'll present performance evaluations on a range of platforms to demonstrate the state of the art of performance portability. This will include data from Intel KNL-based systems as well as IBM Power8 with NVLink-connected NVIDIA Tesla P100 GPUs. We'll also provide an overview of how Kokkos fits into the larger exascale project at the Department of Energy, and how it is used to advance the development of parallel programming support in the C++ language standa
 
Topics:
HPC and Supercomputing, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7344
 
Abstract:

Emerging heterogeneous systems are opening up many programming opportunities. This panel will discuss the latest developments in accelerator programming, where programmers have a choice among OpenMP, OpenACC, CUDA, and Kokkos for GPU programming. The panel will shed light on the primary objectives behind a choice of model: its availability across multiple platforms, its rich feature set, its applicability to a certain type of scientific code, compiler stability, or other factors. This will be an interactive Q&A session where participants can discuss their experiences with programming model experts and developers.

 
Topics:
HPC and Supercomputing, Programming Languages
Type:
Panel
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7564
 
Abstract:

We will present early results from IBM Power8 systems equipped with NVLink-connected NVIDIA P100 GPUs. We will show comparative results with previous NVIDIA GPU generations for a set of synthetic and application benchmarks, highlighting the advances in the memory subsystem of the P100. The talk will in particular demonstrate the impact of the new double-precision atomic add capabilities, and will discuss some early exploration of the behavior of NVLink between the Power8 CPUs and the P100 GPUs.

 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
Supercomputing
Year:
2016
Session ID:
SC6120
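
The SC6120 abstract above highlights the P100's double-precision atomic add. On earlier GPU generations, a double-precision atomic add was typically built in software from a compare-and-swap loop. The sketch below is a host-side C++ analogy of that loop, not CUDA code and not taken from the talk; the function name cas_atomic_add is hypothetical.

```cpp
#include <atomic>

// Hypothetical helper: adds `value` to `target` using only compare-and-swap,
// and returns the value that was stored before the addition.
// This mirrors the software fallback pattern; hardware with a native
// double-precision atomic add can skip the retry loop entirely.
inline double cas_atomic_add(std::atomic<double>& target, double value) {
    double expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `expected` with the current
    // contents of `target`, so each retry recomputes the sum from fresh data.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
        // Retry until no other thread has modified `target` in between.
    }
    return expected;
}
```

Under contention every failed exchange costs another round trip, which is one reason a native atomic add can matter for scatter-add style kernels.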
 
Abstract:
Learn about strategies to keep codes maintainable and performant in a diverse high-performance computing environment. Using the example of LAMMPS, we will demonstrate how the use of Kokkos can reduce code redundancy compared to reimplementing capabilities in hardware-specific variants, while delivering similar performance. We will show how new features supported by Kokkos are closing some of the remaining gaps to the native models, with a particular focus on overlapping hybrid execution on GPU and CPU. You will also learn how the Kokkos model provides built-in instrumentation for an application, which supports kernel-based analysis of applications across diverse architectures. Performance data will be shown for Intel Haswell, ARM, and OpenPower-based systems, with and without GPUs.
 
Topics:
Performance Optimization, Tools & Libraries, Programming Languages
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6449
 
Abstract:
The Kokkos library enables development of HPC scientific applications that are performance portable across disparate manycore devices such as NVIDIA Kepler and Intel Xeon Phi. Portable programming models such as OpenMP, OpenACC, OpenCL, and Thrust focus on parallel execution but fail to address memory access patterns critical for best performance. In contrast, Kokkos integrates compile-time polymorphic data layout with parallel execution policies to portably manage memory access patterns. This year we present recently added Kokkos concepts and capabilities with an emphasis on improvements to usability, such as C++11 lambda support now available in CUDA 7.0 for portable nested parallelism. We will also present a new application of Kokkos to Sandia's multithreaded graph library (MTGL).
 
Topics:
Tools & Libraries, Developer - Algorithms, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2015
Session ID:
S5166
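
The S5166 abstract above refers to compile-time polymorphic data layout. A minimal sketch of that idea in plain C++ follows, assuming a two-dimensional array whose index mapping is chosen by a template parameter. The class View2D and the function fill are hypothetical, and the policy names LayoutLeft and LayoutRight only echo the names Kokkos uses; this is not the Kokkos API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies: each maps (i, j) to a flat offset.
struct LayoutRight {  // row-major: the last index is contiguous
    static std::size_t index(std::size_t i, std::size_t j,
                             std::size_t /*n0*/, std::size_t n1) {
        return i * n1 + j;
    }
};
struct LayoutLeft {   // column-major: the first index is contiguous
    static std::size_t index(std::size_t i, std::size_t j,
                             std::size_t n0, std::size_t /*n1*/) {
        return i + j * n0;
    }
};

// Hypothetical 2D array whose memory layout is fixed at compile time.
template <class T, class Layout>
class View2D {
public:
    View2D(std::size_t n0, std::size_t n1)
        : n0_(n0), n1_(n1), data_(n0 * n1) {}

    T& operator()(std::size_t i, std::size_t j) {
        assert(i < n0_ && j < n1_);  // bounds check in debug builds
        return data_[Layout::index(i, j, n0_, n1_)];
    }
    std::size_t extent(int dim) const { return dim == 0 ? n0_ : n1_; }
    const T* data() const { return data_.data(); }

private:
    std::size_t n0_, n1_;
    std::vector<T> data_;
};

// A kernel written once against the (i, j) interface; it never sees the layout.
template <class View>
void fill(View& v) {
    for (std::size_t i = 0; i < v.extent(0); ++i)
        for (std::size_t j = 0; j < v.extent(1); ++j)
            v(i, j) = static_cast<double>(10 * i + j);
}
```

The same fill call works on View2D<double, LayoutRight> and View2D<double, LayoutLeft>; only the order of elements in memory changes, which is the property that lets one source adapt its access pattern to the device.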
 
Abstract:

In this talk we demonstrate how LAMMPS uses Kokkos, a performance portability library for many-core devices, to implement a single code base for CPUs, NVIDIA GPUs, and Intel Xeon Phi co-processors. This portable code base has equal or better performance compared to LAMMPS' current generation of hardware-specific add-on packages.

 
Topics:
HPC and Supercomputing
Type:
Talk
Event:
Supercomputing
Year:
2013
Session ID:
SC3103
 
Abstract:

The session will present implementation and optimization strategies for molecular dynamics many-body potentials. It will concentrate on the new SNAP potential for LAMMPS, which is based on the GAP bispectrum analysis of Bartok et al. [PRL 104, 136403 (2010)]. SNAP is fit to large amounts of quantum-based DFT data and is capable of reproducing the accuracy of DFT while still exhibiting linear scaling with the system size. By exploiting multiple parallelisation layers, it is possible to mitigate its high cost of 500,000 flops per interaction through excellent strong scaling behaviour down to 16 atoms per GPU. Thus the achievable time to solution on GPU clusters using SNAP is comparable to running simple Lennard-Jones simulations.

 
Topics:
Quantum Chemistry
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S3080
 
Abstract:

Performance on manycore devices depends on data access patterns, and different devices (NVIDIA, Intel Xeon Phi, NUMA) require different patterns. A performance-portable programming model does not force a false choice between arrays-of-structures and structures-of-arrays; instead, it defines abstractions to transparently adapt data structures to meet device requirements. The KokkosArray library implements this strategy through simple and intuitive multidimensional array abstractions. Usability and performance portability are demonstrated with proxy applications for finite element and molecular dynamics codes. MiniMD, a proxy application for the LAMMPS molecular dynamics code, has implementations in OpenMP, OpenCL, CUDA, and now KokkosArray. A comparison of miniMD's KokkosArray implementation with the previous three versions demonstrates the relative strengths and weaknesses of KokkosArray, and shows that the portable version retains about 95% of the performance of the "native" versions. Multiphysics applications with heterogeneous finite element discretizations have complex and highly irregular data structures. A KokkosArray-based prototype unstructured heterogeneous finite element mesh library and its support for heterogeneous manycore parallel computations will be presented.

 
Topics:
Tools & Libraries, Programming Languages, Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S3426