GTC ON-DEMAND

Abstract:
Data centers today benefit from highly optimized hardware architectures and performance metrics that enable efficient provisioning and tuning of compute resources. But these architectures and metrics, honed over decades, are sternly challenged by the rapid rise of AI applications and neural-net workloads, where the impact of memory metrics like bandwidth, capacity, and latency on overall performance is not yet well understood. Get the perspectives of AI HW/SW co-design experts from Google, Microsoft, Facebook, and Baidu, and technologists from NVIDIA and Samsung, as they evaluate the AI hardware challenges facing data centers and brainstorm current and necessary advances in architectures, with particular emphasis on memory's impact on both training and inference.
 
Topics:
Data Center & Cloud Infrastructure, Performance Optimization, Speech & Language Processing, HPC and AI, HPC and Supercomputing
Type:
Panel
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S91018
 
Abstract:
Get to know two different techniques for retrieving the parallelism hidden in general-purpose linear programs (LPs), which are broadly used in operations research, computer vision, and machine learning. With conventional solvers often restricted to serial computation, we'll show two ways of exposing the inherent parallelism, using: (1) parallel sparse linear algebra techniques with an interior-point method, and (2) a higher-level automatic LP decomposition. After a quick introduction to the topic, we'll present details and results for a diverse range of applications on the GPU.
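To make technique (1) concrete, here is a minimal CPU sketch of a primal-dual interior-point iteration, with SciPy standing in for the GPU sparse linear algebra; the toy LP, the fixed centering parameter, and all names are illustrative assumptions, not the presenters' code. The point is that nearly all per-iteration work is the sparse normal-equations solve, which is where parallel sparse linear algebra pays off.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Standard-form LP: minimize c^T x subject to A x = b, x >= 0.
# Toy problem (illustrative): max x1 + 2*x2 with two slack variables.
A = sp.csr_matrix(np.array([[1.0, 1.0, 1.0, 0.0],
                            [1.0, 3.0, 0.0, 1.0]]))
b = np.array([4.0, 6.0])
c = np.array([-1.0, -2.0, 0.0, 0.0])

x = np.ones(4); y = np.zeros(2); s = np.ones(4)   # interior start

for it in range(50):
    mu = x @ s / len(x)                  # duality measure
    if mu < 1e-9:
        break
    sigma = 0.1                          # fixed centering (a simplification)
    r_p = b - A @ x                      # primal residual
    r_d = c - A.T @ y - s                # dual residual
    r_c = sigma * mu - x * s             # centered complementarity
    # Eliminating (dx, ds) yields the normal equations
    #   (A D A^T) dy = r_p + A (D r_d - r_c / s),  D = diag(x / s).
    # This sparse SPD solve is the kernel that parallelizes on the GPU.
    d = x / s
    M = (A @ sp.diags(d) @ A.T).tocsc()
    dy = spla.spsolve(M, r_p + A @ (d * r_d - r_c / s))
    ds = r_d - A.T @ dy
    dx = (r_c - x * ds) / s
    alpha = 1.0                          # keep x, s strictly positive
    for v, dv in ((x, dx), (s, ds)):
        neg = dv < 0
        if neg.any():
            alpha = min(alpha, 0.995 * np.min(-v[neg] / dv[neg]))
    x += alpha * dx; y += alpha * dy; s += alpha * ds

print(x, c @ x)   # expected: x ≈ (3, 1, 0, 0), objective ≈ -5
```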
 
Topics:
Algorithms & Numerical Techniques, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7303
 
Abstract:
The Depth-First Search (DFS) algorithm is a fundamental building block used in many higher-level applications, such as topological sort and connectivity and planarity testing of graphs. We'll briefly review prior results and propose two novel variations of parallel DFS on DAGs. The first traverses the graph three times in a breadth-first-search-like fashion. The second assigns a weight to each edge, such that the shortest path from the root to a node corresponds to the DFS path. The parallel algorithm visits all nodes in the graph multiple times and as a result computes the DFS parent relationship and the pre-order (discovery) and post-order (finish) time for every node. In some cases, parallel DFS on the GPU can outperform sequential DFS on the CPU by up to 6x. However, the performance of the algorithm depends heavily on the structure of the graph, and is related to the length of the longest path and the degree of nodes in the graph.
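For reference, here is a hypothetical sequential baseline (my own sketch; the small DAG is illustrative) that computes exactly the three outputs named above — parent, discovery, and finish times — which the parallel variants must reproduce:

```python
# Sequential DFS on a DAG computing parent, discovery (pre) and
# finish (post) times with an explicit stack.
def dfs(adj, root):
    n = len(adj)
    parent = [-1] * n
    pre = [-1] * n          # discovery time
    post = [-1] * n         # finish time
    clock = 0
    stack = [(root, iter(adj[root]))]
    pre[root] = clock; clock += 1
    while stack:
        u, it = stack[-1]
        child = next(it, None)
        if child is None:           # all edges of u explored
            post[u] = clock; clock += 1
            stack.pop()
        elif pre[child] == -1:      # tree edge: first visit
            parent[child] = u
            pre[child] = clock; clock += 1
            stack.append((child, iter(adj[child])))
    return parent, pre, post

# DAG as adjacency lists: 0 -> {1, 2}, 1 -> {3}, 2 -> {3}.
adj = [[1, 2], [3], [3], []]
print(dfs(adj, 0))          # parent = [-1, 0, 0, 1]
```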
 
Topics:
Algorithms & Numerical Techniques, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7469
 
Abstract:
We'll explore techniques for expressing graph clustering as an eigenvalue problem. Attendees will learn how to express different metrics, including minimum balanced cut, modularity, and Jaccard, through associated matrices, and how to use their eigenvectors to find the clustering of the graph into multiple partitions. We'll also show how to take advantage of efficient GPU implementations of the Lanczos and LOBPCG eigenvalue solvers and the k-means algorithm to compute clusterings in our general framework. Finally, we'll highlight the performance and quality of our approach versus existing state-of-the-art techniques.
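A minimal CPU sketch of that pipeline — Laplacian eigenvectors followed by k-means — with SciPy's LOBPCG and k-means standing in for the GPU implementations; the two-clique test graph and all parameters are illustrative assumptions:

```python
import itertools
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg
from scipy.cluster.vq import kmeans2

# Two 6-node cliques joined by a single bridge edge (5, 6).
edges = list(itertools.combinations(range(6), 2)) \
      + list(itertools.combinations(range(6, 12), 2)) + [(5, 6)]
n, k = 12, 2
rows = [u for u, v in edges] + [v for u, v in edges]
cols = [v for u, v in edges] + [u for u, v in edges]
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

# Graph Laplacian L = D - A; its smallest eigenvectors encode the cut.
L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

# LOBPCG for the k smallest eigenpairs.
rng = np.random.default_rng(0)
X = rng.standard_normal((n, k))
vals, vecs = lobpcg(L.tocsr(), X, largest=False, tol=1e-8, maxiter=200)

# k-means on the rows of the eigenvector block yields the partition.
_, labels = kmeans2(vecs, k, minit='++', seed=0)
print(labels)   # e.g. [0 0 0 0 0 0 1 1 1 1 1 1], up to label permutation
```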
 
Topics:
Accelerated Data Science, Algorithms & Numerical Techniques, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7241
 
Abstract:

In this talk we will introduce the basic concepts behind the Simulation Program with Integrated Circuit Emphasis (SPICE) and discuss in detail the two most time-consuming parts of circuit simulation: the device model evaluation and the solution of large sparse linear systems. In particular, we focus on the evaluation of the basic models, such as the resistor, capacitor, and inductor, as well as the more complex transistor (BSIM4v7) model, on the GPU. We also discuss the sets of linear systems that are solved throughout the simulation. We take advantage of the fact that the coefficient matrices in these linear systems have the same sparsity pattern (and often end up with the same pivoting strategy) and show how to obtain their solution using a direct method on the GPU. Finally, we present numerical experiments and discuss future work. Co-authors: Francesco Lannutti, Sharanyan Chetlur, Lung Sheng Chien, Philippe Vandermersch.

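As a toy illustration of the model-evaluation step (my own sketch, not the talk's code; the circuit and names are made up), stamping resistor conductances into a modified-nodal-analysis matrix produces exactly the kind of sparse system whose pattern stays fixed across time steps:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Toy resistive circuit: 1 A current source into node 1,
# R1 = 1k between nodes 1-2, R2 = 2k from node 2 to ground (node 0).
resistors = [(1, 2, 1e3), (2, 0, 2e3)]
n = 2                              # non-ground nodes
G = sp.lil_matrix((n, n))
i = np.zeros(n)

def stamp_resistor(G, a, b, R):
    """Stamp conductance g = 1/R into the MNA matrix (0 = ground)."""
    g = 1.0 / R
    for p in (a, b):
        if p:                      # skip ground rows/columns
            G[p - 1, p - 1] += g
    if a and b:
        G[a - 1, b - 1] -= g
        G[b - 1, a - 1] -= g

for a, b, R in resistors:
    stamp_resistor(G, a, b, R)
i[0] = 1.0                         # 1 A injected at node 1

# In a transient simulation, the factorization of this fixed-pattern
# matrix is what gets reused; here a single sparse LU solve suffices.
lu = splu(G.tocsc())
v = lu.solve(i)
print("node voltages:", v)         # [3000. 2000.] volts
```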
 
Topics:
Electronic Design Automation, Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S3364
 
Abstract:

The libraries distributed in the CUDA SDK and offered by third parties provide a wealth of functions commonly encountered in a GPU acceleration project. Using these libraries can often significantly shorten the development time of a GPU project while leading to high-performance, high-quality software. In the CUDA 5.0 release, NVIDIA introduced enhancements across many libraries to improve performance and take advantage of new features available in the Kepler-series GPUs. In this tutorial, we will provide an overview of the libraries in the CUDA SDK, including cuBLAS, cuRAND, cuSPARSE, cuFFT, NPP, and Thrust, as well as libraries provided by third parties. The audience will not only learn about the strengths of the individual libraries, but also about the decision-making process for selecting the library best suited to their project.

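In today's terms, much of the same lineup is reachable from Python via CuPy, which dispatches to cuBLAS, cuRAND, cuSPARSE, and cuFFT under the hood. This hedged sketch (CuPy is not part of the tutorial, and a CUDA-capable GPU is assumed) touches each library once:

```python
import cupy as cp
import cupyx.scipy.sparse as csp

# cuRAND: random data generated on the device.
x = cp.random.rand(1024, dtype=cp.float32)

# cuBLAS: dense matrix-matrix multiply (GEMM).
A = cp.random.rand(256, 256, dtype=cp.float32)
C = A @ A.T

# cuFFT: fast Fourier transform.
X = cp.fft.fft(x)

# cuSPARSE: sparse matrix-vector multiply.
S = csp.random(1000, 1000, density=0.01, format='csr', dtype=cp.float32)
y = S @ cp.random.rand(1000, dtype=cp.float32)

print(float(C.sum()), float(cp.abs(X).max()), float(y.sum()))
```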
 
Topics:
Programming Languages
Type:
Tutorial
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S2629
 
Abstract:

A parallel algorithm for solving a sparse triangular linear system on the GPU is proposed. It implements the solution of the triangular system in two phases. The analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. The solve phase obtains the full solution by iterating sequentially across the constructed levels. The solution elements corresponding to each level are obtained in parallel. The numerical experiments are presented and it is shown that the incomplete-LU and Cholesky preconditioned iterative methods can achieve a 2x speedup on the GPU over their CPU implementation.

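A small CPU model of the two phases (my own sketch with an illustrative matrix): the analysis pass assigns each row the depth of its longest dependency chain, and the solve pass walks the levels sequentially while every row inside a level could run in parallel:

```python
import numpy as np
import scipy.sparse as sp

def analyze_levels(L):
    """Analysis phase: depth of each row in the dependency DAG of a
    lower-triangular CSR matrix; rows of equal depth form one level."""
    n = L.shape[0]
    depth = np.zeros(n, dtype=int)
    for i in range(n):
        deps = L.indices[L.indptr[i]:L.indptr[i + 1]]
        deps = deps[deps < i]               # strictly lower entries
        if len(deps):
            depth[i] = depth[deps].max() + 1
    return [np.flatnonzero(depth == d) for d in range(depth.max() + 1)]

def solve_by_levels(L, b, levels):
    """Solve phase: sequential across levels, parallel within a level."""
    x = np.zeros_like(b)
    for rows in levels:                     # sequential across levels
        for i in rows:                      # independent -> GPU-parallel
            lo, hi = L.indptr[i], L.indptr[i + 1]
            cols, vals = L.indices[lo:hi], L.data[lo:hi]
            mask = cols < i
            x[i] = (b[i] - vals[mask] @ x[cols[mask]]) / L[i, i]
    return x

L = sp.csr_matrix(np.array([[2., 0, 0, 0],
                            [1., 2, 0, 0],
                            [0., 0, 2, 0],
                            [1., 1, 1, 2]]))
b = np.array([2., 4., 2., 8.])
levels = analyze_levels(L)
print(levels)                               # rows {0, 2}, then {1}, then {3}
print(solve_by_levels(L, b, levels))        # [1. 1.5 1. 2.25]
```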
 
Topics:
Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2012
Session ID:
S2149
 
Speakers:
Maxim Naumov, NVIDIA
Abstract:
The CUSPARSE library can impact and enable software solutions for computational science and engineering problems in fields such as energy exploration, physical simulation, and the life sciences, among many others. It provides sparse linear algebra primitives that can be used to implement iterative linear-system and eigenvalue solvers, and can also serve as a building block for state-of-the-art sparse direct solvers. The CUSPARSE library is implemented using the CUDA parallel programming model and provides sparse analogs to BLAS level-1, -2, and -3 operations, such as matrix-vector multiplication and triangular solve, as well as format-conversion routines.
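To illustrate how such primitives compose into an iterative solver, here is a conjugate-gradient sketch built only from SpMV, axpy, and dot products — the kinds of operations a cuSPARSE-style library exposes. SciPy stands in for the GPU library, and the 1-D Poisson test matrix is an illustrative assumption:

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-10, maxiter=200):
    """Conjugate gradient built only from SpMV, axpy, and dot."""
    x = np.zeros_like(b)
    r = b - A @ x                   # SpMV
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p                  # SpMV (sparse level-2 analog)
        alpha = rs / (p @ Ap)       # dot (level-1)
        x += alpha * p              # axpy
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# SPD test matrix: 1-D Poisson (tridiagonal [-1, 2, -1]).
n = 100
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
x = cg(A, b)
print("residual:", np.linalg.norm(b - A @ x))
```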
 
Topics:
Tools & Libraries, Developer - Algorithms, HPC and AI
Type:
Talk
Event:
GTC Silicon Valley
Year:
2010
Session ID:
S102070
 
 