GTC ON-DEMAND
Abstract:
Learn how to implement state-of-the-art preconditioners for iterative solvers of large-scale linear systems in CUDA. Previously, most preconditioners were set up on CPUs because the task was not considered suitable for fine-grained parallelization. We'll show how it's possible to implement efficient CUDA kernels for techniques like the adaptive factorized sparse approximate inverse by adopting an approach that dramatically reduces the amount of memory required to run in parallel. We'll describe how our GPU-only preconditioners and solvers can be used to solve real-world problems in science and engineering. We'll provide single- and multi-GPU implementations. Our method makes it possible to obtain about an order-of-magnitude speedup over high-end multi-core CPUs like the Intel Xeon Platinum 8176.
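Once the factors have been set up, applying a factorized sparse approximate inverse amounts to two sparse matrix-vector products, z = G^T (G r), which map naturally onto the GPU. Below is a minimal sketch of the kind of CSR SpMV kernel such an application could be built on; the row-per-thread mapping, the names, and the storage choices are illustrative assumptions, not the speakers' actual code, and the adaptive setup phase (the hard part) is not shown.

    // Minimal CSR sparse matrix-vector product, one thread per row.
    // Applying the factorized approximate inverse is two such products,
    // z = G^T (G r); storing both G and G^T in CSR keeps both row-parallel.
    __global__ void spmv_csr(int n, const int *rowPtr, const int *colIdx,
                             const double *val, const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            double sum = 0.0;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[colIdx[j]];
            y[row] = sum;
        }
    }

Inside, say, a preconditioned conjugate-gradient iteration, the preconditioner application then becomes two back-to-back launches: one with G to produce an intermediate vector, one with G^T to produce z.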
 
Topics:
Algorithms & Numerical Techniques, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9192
 
Abstract:

Learn how to use (multi-)GPU and CUDA to speed up the stitching of very large images (up to terabytes in size). Image stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image. It is widely used in many important fields, such as high-resolution photo mosaics for digital maps, satellite imagery, and medical imaging. Motivated by the need to combine images produced in studies of the brain, we developed and released the free TeraStitcher tool, which we recently enhanced with a CUDA plugin that delivers a dramatic speedup of the most compute-intensive part of the procedure. The code can easily be adapted to compute different kinds of convolution. We describe how we leverage shuffle operations to guarantee optimal load balancing among threads, and CUDA streams to hide the overhead of moving images back and forth between CPU and GPU when their size exceeds the available device memory. The speedup is such that jobs that used to take several hours now complete in a few minutes.
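The two techniques the abstract names, warp shuffles and streams, can be sketched independently of TeraStitcher's internals. The kernel, buffer, and function names below are hypothetical, and the per-tile dot product merely stands in for whatever similarity measure the real alignment step computes.

    #include <cuda_runtime.h>

    // Warp-level sum via shuffles: no shared memory, no block-wide sync.
    __inline__ __device__ float warpReduceSum(float v)
    {
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        return v;
    }

    // Hypothetical per-tile kernel: accumulate a similarity score between
    // two overlapping tiles (stand-in for the real alignment measure).
    __global__ void tileScore(const float *a, const float *b, int n,
                              float *score)
    {
        float partial = 0.f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            partial += a[i] * b[i];
        partial = warpReduceSum(partial);
        if ((threadIdx.x & (warpSize - 1)) == 0)   // one atomic per warp
            atomicAdd(score, partial);
    }

    // Host side: double-buffer over two streams so that while the GPU
    // scores chunk k, chunk k+1 is being uploaded. hChunk[] must point to
    // pinned host memory and dScore is assumed zero-initialized.
    void scoreAllChunks(float **hChunk, float *dBuf[2], const float *dRef,
                        float *dScore, cudaStream_t streams[2],
                        int numChunks, int n, size_t bytes)
    {
        for (int k = 0; k < numChunks; ++k) {
            cudaStream_t s = streams[k % 2];
            cudaMemcpyAsync(dBuf[k % 2], hChunk[k], bytes,
                            cudaMemcpyHostToDevice, s);
            tileScore<<<(n + 255) / 256, 256, 0, s>>>(dBuf[k % 2], dRef,
                                                      n, dScore);
        }
        cudaDeviceSynchronize();
    }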
 
Topics:
AI in Healthcare, Medical Imaging & Radiology
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8182
 
Abstract:
Learn how to use (multi-)GPU and CUDA to speed up the ranking of the importance of each node in a large-scale network. You will see how to solve a formidable challenge, the exact computation of betweenness centrality, by using as building blocks relatively simple algorithms, like breadth-first search, that have been highly tuned for latest-generation GPU cards. Our approach is fully scalable and overcomes the limit on the size of the graph that can be studied on a single GPU. We'll present results obtained on both synthetic and real-world graphs.
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6157
 
Abstract:

Learn how to use multi-GPU and CUDA to speed up text analysis, indexing, and searching of textual data. We present a new framework to index large sets of heterogeneous data. Our approach is based on a combination of HPC techniques aimed at improving the efficiency and reliability of the indexing process. The solution we propose is scalable and exploits in-memory computing to minimize I/O operations and enhance performance. Moreover, we describe the CUDA-based parallelization of the most compute-intensive tasks involved in the indexing process. The integration of the CUDA components within an architecture that is mostly Java-based led us to develop a technique for Java-CUDA interoperability that can be applied to other applications. Some visualization results will also be presented.
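A common route for Java-CUDA interoperability is a thin JNI layer: Java hands a primitive array to a native library, which moves it to the device, launches a kernel, and copies the result back. The sketch below shows only the native (CUDA) side of such a binding; the class and method names (org.example.Indexer.hashTerms) and the toy kernel are assumptions for illustration, since the speakers' actual technique isn't detailed in the abstract.

    #include <jni.h>
    #include <cuda_runtime.h>

    // Hypothetical kernel standing in for a compute-heavy indexing step.
    __global__ void hashTerms(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = (int)(in[i] * 2654435761u);  // toy hash
    }

    // Native side of the Java binding: pull the array out of the JVM,
    // round-trip it through device memory, and write the result back.
    extern "C" JNIEXPORT void JNICALL
    Java_org_example_Indexer_hashTerms(JNIEnv *env, jobject,
                                       jintArray jin, jintArray jout)
    {
        jsize n = env->GetArrayLength(jin);
        jint *hin = env->GetIntArrayElements(jin, nullptr);

        int *din, *dout;
        cudaMalloc(&din, n * sizeof(int));
        cudaMalloc(&dout, n * sizeof(int));
        cudaMemcpy(din, hin, n * sizeof(int), cudaMemcpyHostToDevice);

        hashTerms<<<(n + 255) / 256, 256>>>(din, dout, n);

        jint *hout = env->GetIntArrayElements(jout, nullptr);
        cudaMemcpy(hout, dout, n * sizeof(int), cudaMemcpyDeviceToHost);
        env->ReleaseIntArrayElements(jout, hout, 0);        // copy back
        env->ReleaseIntArrayElements(jin, hin, JNI_ABORT);  // no write-back
        cudaFree(din); cudaFree(dout);
    }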
 
Topics:
Big Data Analytics, Data Center & Cloud Infrastructure, Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2015
Session ID:
S5212
 
Abstract:
Learn how to use GPUs as batch processors to simulate thousands of independent systems that have complex dynamics but relatively limited computing requirements. By using an apparently naive approach, with a single CUDA thread simulating an entire system, it is possible to obtain excellent overall performance while minimizing differences in the results with respect to the original serial implementation of the same application. Crucial to the success of the porting is a proper choice of data structures, which must be designed so that the global memory of the GPU can be accessed effectively even though threads work on distinct problems. The application we present simulates the products of primary migration and the expulsion of hydrocarbons from source rock, but the idea can be applied to other fields. The final result in our case is a highly scalable code that runs transparently on multiple GPUs and that can be updated more easily when the underlying model changes.
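The data-layout point is the crux: with one thread per system, a structure-of-arrays layout keeps the same variable of consecutive systems adjacent in memory, so loads coalesce across a warp even though each thread integrates a different problem. A minimal sketch, with placeholder dynamics and hypothetical names:

    // One CUDA thread integrates one independent system. State is stored
    // as structure-of-arrays: state[v * nSystems + sys] keeps variable v
    // of all systems adjacent in memory, so per-step loads coalesce
    // across the warp even though each thread works on its own problem.
    __global__ void stepAll(float *state, int nSystems, int nVars,
                            float dt, int nSteps)
    {
        int sys = blockIdx.x * blockDim.x + threadIdx.x;
        if (sys >= nSystems) return;

        for (int step = 0; step < nSteps; ++step)
            for (int v = 0; v < nVars; ++v) {
                float x = state[v * nSystems + sys];  // coalesced load
                // Placeholder update; the real model (e.g., hydrocarbon
                // expulsion) would run here, one serial update per thread.
                state[v * nSystems + sys] = x + dt * (-x);
            }
    }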
 
Topics:
Seismic & Geosciences, Numerical Algorithms & Libraries, Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4316
 
Abstract:

Graphs with billions of edges do not fit within the device memory of a single GPU, so exploring large graphs requires multiple GPUs. Besides the techniques required to improve load balancing among threads, it is necessary to reduce the communication overhead among GPUs. To that purpose, we resort to a pruning procedure that eliminates redundant data, and to a new interconnection technology, called APEnet, which is the first non-NVIDIA device to exploit the possibilities offered by GPUDirect technology. Our results show that APEnet performs better than InfiniBand and may become a viable alternative for the connectivity of future GPU clusters.
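The point of GPUDirect-style interconnects is that communication operates directly on device buffers, with no staging copy through host memory. APEnet's own API isn't shown here; as a stand-in, the sketch below assumes a CUDA-aware MPI build (e.g., Open MPI with GPUDirect support), where a device pointer can be passed straight to the communication call.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Exchange (already pruned) frontier buffers with a neighbor rank
    // without staging through host memory. dSend and dRecv are device
    // pointers; a CUDA-aware MPI with GPUDirect moves them directly.
    void exchangeFrontier(int *dSend, int *dRecv, int count, int peer)
    {
        MPI_Sendrecv(dSend, count, MPI_INT, peer, /*sendtag=*/0,
                     dRecv, count, MPI_INT, peer, /*recvtag=*/0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }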
 
Topics:
Developer - Algorithms
Type:
Talk
Event:
GTC Silicon Valley
Year:
2013
Session ID:
S3089
 
Speakers:
Massimo Bernaschi
- Institute for Applied Computing, National Research Council (CNR) & Computer Science Dept., University La Sapienza
 
Topics:
HPC and AI
Type:
Talk
Event:
ISC
Year:
2011
Session ID:
ISC1106
 
Speakers:
Massimo Bernaschi
- Istituto Applicazioni del Calcolo - C.N.R.
Abstract:
Dive into implementations of the 3D Heisenberg spin glass model for GPUs. We will discuss results showing that fast shared memory gives better performance than slow global memory only under certain conditions, and cover the careful kernel tuning needed to achieve a significant speedup with respect to a state-of-the-art high-end multicore processor.
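The shared-versus-global trade-off can already be seen in a toy nearest-neighbor sweep (a 1D stand-in for the 3D Heisenberg update; the names and the update rule are illustrative). Staging through shared memory only pays off when each loaded value is reused enough times to amortize the extra copy and synchronization, in line with the "only under certain conditions" finding above.

    // Version 1: neighbors read straight from global memory.
    __global__ void sweepGlobal(const float *spin, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = 0.5f * (spin[i - 1] + spin[i + 1]);
    }

    // Version 2: the block stages a tile (plus a one-element halo on each
    // side) into shared memory, then reads neighbors from the tile.
    __global__ void sweepShared(const float *spin, float *out, int n)
    {
        extern __shared__ float tile[];          // blockDim.x + 2 floats
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;
        if (i < n) tile[t] = spin[i];
        if (threadIdx.x == 0 && i > 0)
            tile[0] = spin[i - 1];               // left halo
        if (threadIdx.x == blockDim.x - 1 && i < n - 1)
            tile[t + 1] = spin[i + 1];           // right halo
        __syncthreads();
        if (i > 0 && i < n - 1)
            out[i] = 0.5f * (tile[t - 1] + tile[t + 1]);
    }

sweepShared must be launched with (blockDim.x + 2) * sizeof(float) bytes of dynamic shared memory, e.g. sweepShared<<<grid, block, (block + 2) * sizeof(float)>>>(spin, out, n).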
 
Topics:
Physics Simulation
Type:
Talk
Event:
GTC Silicon Valley
Year:
2010
Session ID:
S102112
 
 