GTC ON-DEMAND

Abstract:
Learn how to speed up a deep feedforward sequential memory network (FSMN) on Volta. We'll describe how to use Tensor Cores to speed up GEMM operations and explain how to optimize an FSMN kernel by improving its data locality and reducing its math workload. Although RNNs are a powerful tool for sequence-to-sequence problems, their recurrent structure increases computational complexity. As an alternative, an FSMN can effectively model long-term dependencies without any recurrent structure. We'll show how a GPU-friendly FSMN can outperform an RNN in both accuracy and speed. Our work is based on Alibaba's deep FSMN model.
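The core idea behind the talk is that the FSMN memory block replaces recurrence with a fixed-length, learned weighted sum over past (and optionally future) hidden states, so every time step can be computed independently and the layer maps onto dense matrix math, and hence Tensor Core GEMMs. Below is a minimal NumPy sketch of a scalar FSMN memory block for illustration only; it is not the session's implementation, and the shapes, coefficient names (a, c), and look-back/look-ahead orders are assumptions.

    import numpy as np

    def fsmn_memory_block(h, a, c):
        """Scalar FSMN memory block (illustrative sketch).

        h: (T, D) hidden states for T time steps
        a: (N1+1,) look-back coefficients; a[0] weights the current step
        c: (N2,)   look-ahead coefficients
        Returns a (T, D) memory output:
            p[t] = sum_i a[i] * h[t - i] + sum_j c[j] * h[t + j]
        Each time step is independent, so there is no sequential recurrence.
        """
        T, _ = h.shape
        p = np.zeros_like(h)
        for i, ai in enumerate(a):           # look back up to N1 steps
            p[i:] += ai * h[:T - i]
        for j, cj in enumerate(c, start=1):  # look ahead up to N2 steps
            p[:T - j] += cj * h[j:]
        return p

    if __name__ == "__main__":
        h = np.random.randn(8, 4).astype(np.float32)
        a = np.array([0.5, 0.3, 0.2], dtype=np.float32)  # N1 = 2
        c = np.array([0.1, 0.05], dtype=np.float32)      # N2 = 2
        print(fsmn_memory_block(h, a, c).shape)          # (8, 4)

Because there is no step-to-step dependency, the look-back and look-ahead sums can be batched across time steps into large GEMM or convolution calls, which is where Volta's Tensor Cores come into play for the speedups discussed in the talk.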
 
Topics:
Performance Optimization, AI Application, Deployment & Inference, Speech & Language Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9113
 
Abstract:
We'll discuss Alibaba's PAI tensor accelerator and optimizer (PAI-Tao), a carefully implemented and optimized AI engine for deep learning training and inference tasks. PAI-Tao is designed with a data-driven, compiler-oriented approach: it periodically collects online running statistics to provide insights for optimization and then uses the collected statistics to drive the actual optimization work. We'll outline how PAI-Tao's compiler-oriented design can better accommodate diverse and fast-changing AI workloads.
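As a loose illustration of the data-driven loop described above (periodically collect runtime statistics, then let those statistics drive optimization decisions), here is a minimal, hypothetical Python sketch; the class and method names (StatsDrivenOptimizer, record, hot_ops) and the threshold are invented for illustration and are not PAI-Tao's actual API.

    from collections import defaultdict

    class StatsDrivenOptimizer:
        """Hypothetical sketch of a data-driven optimization loop:
        accumulate per-op runtime statistics across runs, then use them
        to pick which ops or subgraphs are worth recompiling/specializing.
        """

        def __init__(self, hot_share=0.10):
            self.op_time = defaultdict(float)  # accumulated wall time per op
            self.hot_share = hot_share         # time share that marks an op "hot"

        def record(self, op_name, elapsed_s):
            self.op_time[op_name] += elapsed_s

        def hot_ops(self):
            total = sum(self.op_time.values()) or 1.0
            return [op for op, t in self.op_time.items() if t / total >= self.hot_share]

    if __name__ == "__main__":
        opt = StatsDrivenOptimizer()
        for op, t in [("matmul", 0.90), ("softmax", 0.05), ("add", 0.02)]:
            opt.record(op, t)
        print(opt.hot_ops())  # ['matmul'] -> candidate for specialization/fusion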
 
Topics:
AI & Deep Learning Research, Accelerated Data Science
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9280
 
Abstract:
We'll share experiences of end-to-end deep learning optimization on Alibaba's Platform of Artificial Intelligence (PAI), covering both offline training and online inference. For offline training, dedicated optimizations are made for local and distributed environments. For online inference, optimization is carried out from both the algorithm and system perspectives. We'll share both the methodology and benchmark numbers during this session, along with several business applications driven by these optimizations, to bridge the gap between low-level optimization and real business scenarios.
 
Topics:
Deep Learning & AI Frameworks, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2018
Session ID:
S8113
 
Abstract:
We'll introduce Pluto, a distributed heterogeneous deep learning framework developed by Alibaba, and share our practice of optimizing large-scale deep learning training. We'll present several distributed optimization strategies that have already been deployed in our production training pipeline and serve data scientists inside Alibaba. We'll cover technical details as well as some of our benchmark results.
 
Topics:
Artificial Intelligence and Deep Learning, Data Center & Cloud Infrastructure
Type:
Talk
Event:
GTC Silicon Valley
Year:
2017
Session ID:
S7650
 
Abstract:
In this talk, we'll share the PAI team's progress over the past year on Blade, a general-purpose inference optimization tool built around NVIDIA GPU hardware, covering model optimization, compiler optimization, and other system-level optimizations, and we'll demonstrate the results of this work with business case studies. We'll first present a model-structure-driven optimization methodology: for a given model, and taking the characteristics of the underlying GPU hardware into account, we choose a suitable optimization strategy for each compute building block, which makes a more efficient performance mapping between the business model above and the hardware below possible. Based on the chosen strategy, we carry out the concrete optimization work: for high-frequency compute patterns, we push hand-tuned kernels and low-level compute libraries to their performance limits; for long-tail patterns, we rely on compiler-based optimization for broad coverage; and for selected compute-hot subgraphs, we use model compression to reduce the theoretical compute requirement, paired with system optimization to keep the achieved speedup aligned with the theoretical one. Beyond the optimization techniques, we'll also share practical engineering lessons from developing and deploying Blade, because optimization without whole-system consideration can fail to deliver real business value over the last mile, and we'll present complete optimization results from real business cases.
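As a rough illustration of the strategy-selection step in the methodology above, the hypothetical Python sketch below routes compute patterns to different optimization paths based on profiled frequency and FLOPs share; the names (PatternProfile, choose_strategy) and the thresholds are invented for illustration and are not Blade's actual interface.

    from dataclasses import dataclass

    @dataclass
    class PatternProfile:
        name: str            # e.g. "batched_gemm", "layernorm_gelu"
        call_frequency: int  # how often the pattern appears in traced workloads
        flops_share: float   # fraction of total model FLOPs attributed to it

    def choose_strategy(p: PatternProfile) -> str:
        """Pick an optimization path per compute building block, mirroring
        the methodology in the abstract: hand-tuned kernels or vendor
        libraries for high-frequency patterns, compiler-generated kernels
        for the long tail, and model compression plus system tuning for
        compute-hot subgraphs. Thresholds here are made up.
        """
        if p.flops_share > 0.20:
            return "model compression + system tuning"
        if p.call_frequency > 1000:
            return "hand-tuned kernel / vendor library (e.g. cuBLAS, cuDNN)"
        return "compiler-generated fused kernel"

    if __name__ == "__main__":
        profiles = [
            PatternProfile("batched_gemm", 50000, 0.60),
            PatternProfile("layernorm_gelu", 5000, 0.05),
            PatternProfile("rare_custom_op", 12, 0.001),
        ]
        for p in profiles:
            print(f"{p.name:16s} -> {choose_strategy(p)}")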
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC China
Year:
2019
Session ID:
CN9992
 
 