GTC ON-DEMAND

 
Abstract:
We will present the results of an investigation into speeding up and improving the power efficiency of dense matrix multiplications in CUDA. These techniques give an effective compute rate greater than the peak performance of a GPU, allowing us to approach 10 TFLOPS sustained in matrix multiplication on a single GPU. Techniques applied include exploiting Gauss's complex multiplication algorithm and implementing a Strassen-like algorithm to reduce the computational cost below the naive O(n^3). We will discuss how the power efficiency of these dense linear-algebra computations can be improved through the choice of tile size and input word size. Results from the Tesla K80 will show that improving power efficiency is the same as improving absolute performance.
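The two arithmetic-reduction techniques the abstract names can be sketched in scalar form. Below is a minimal, illustrative Python version of Gauss's 3-multiplication complex product and of one Strassen step on 2x2 blocks (seven multiplications instead of eight); these toy functions are our own illustration of the general techniques, not the speakers' CUDA kernels:

```python
def gauss_cmul(a, b, c, d):
    """(a + bi)(c + di) with 3 real multiplications instead of 4.

    Gauss's trick trades one multiplication for three extra additions,
    which pays off when multiplies dominate (as in large GEMMs).
    """
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return (k1 - k3, k1 + k2)  # (real, imaginary)


def strassen_2x2(A, B):
    """One Strassen step: 2x2 product with 7 multiplications instead of 8.

    Applied recursively to matrix blocks, this reduces the asymptotic
    cost of matrix multiplication from O(n^3) to O(n^2.807).
    """
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))
```

In a GPU GEMM these identities are applied to large tiles rather than scalars, so the extra additions are cheap relative to the multiplications saved, which is how the effective compute rate can exceed the device's nominal peak.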
 
Topics:
Developer - Algorithms, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2015
Session ID:
S5601
 
Abstract:
Graphics Processing Units (GPUs) are an increasingly popular platform on which to deploy lattice quantum chromodynamics calculations. While there has been much progress to date in developing solver algorithms that improve strong scaling on such platforms, there has been less focus on deploying 'mathematically optimal' algorithms. A good example is the class of hierarchical solver algorithms such as adaptive multigrid, which are known to solve the Dirac operator with optimal O(N) complexity. We describe progress to date in deploying adaptive multigrid solver algorithms on NVIDIA GPU architectures and discuss in general the suitability of heterogeneous architectures for hierarchical algorithms.
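The O(N) behavior the abstract refers to comes from the multigrid V-cycle: smooth on the fine grid, solve a restricted problem for the error on a coarser grid, then interpolate the correction back. A toy 1D Poisson V-cycle in NumPy illustrates the structure; this is a generic geometric multigrid sketch, not the adaptive (algebraic) multigrid for the Dirac operator discussed in the talk:

```python
import numpy as np

def smooth(u, f, h, iters=3):
    """Weighted Jacobi (omega = 2/3) for -u'' = f with Dirichlet boundaries."""
    for _ in range(iters):
        u[1:-1] = (2/3) * 0.5 * (u[:-2] + u[2:] + h*h*f[1:-1]) + (1/3) * u[1:-1]
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h*h)
    return r

def restrict(r):
    """Full-weighting restriction from a 2m-1 point grid to m points."""
    c = r[::2].copy()
    c[1:-1] = 0.25*r[1:-2:2] + 0.5*r[2:-1:2] + 0.25*r[3::2]
    return c

def prolong(e):
    """Linear interpolation from m coarse points to 2m-1 fine points."""
    fine = np.zeros(2*len(e) - 1)
    fine[::2] = e
    fine[1::2] = 0.5 * (e[:-1] + e[1:])
    return fine

def v_cycle(u, f, h):
    """One V-cycle; each level does O(points) work, so total cost is O(N)."""
    if len(u) <= 3:
        u[1] = 0.5 * (u[0] + u[2] + h*h*f[1])  # exact solve on coarsest grid
        return u
    u = smooth(u, f, h)                          # pre-smooth
    r = residual(u, f, h)
    e = v_cycle(np.zeros(len(u)//2 + 1), restrict(r), 2*h)  # coarse-grid correction
    u += prolong(e)
    return smooth(u, f, h)                       # post-smooth
```

Because the work per level shrinks geometrically, the whole cycle costs a constant multiple of the finest-grid work, which is the sense in which multigrid achieves optimal O(N) complexity; the adaptive variant in the talk builds its coarse operators from near-null-space vectors of the Dirac operator rather than from a fixed geometric hierarchy.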
 
Topics:
Computational Physics, Numerical Algorithms & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2014
Session ID:
S4327