GTC On-Demand

Abstract:
Learn how GPU Coder automatically produces high-performance CUDA code from a high-level algorithm description in MATLAB. Write your deep learning application with the expressive power of MATLAB, describing not just the use of your trained deep learning model in inference mode but also the data augmentation and post-processing of results needed for a complete, deployment-ready application. With MATLAB running on your host machine, communicate with and control peripheral devices on your Jetson Xavier and DRIVE Xavier platforms to bring in live sensor data for visualization and analysis. GPU Coder then generates optimized inference code for the whole application: the deep learning inference model is compiled down to TensorRT's inference engine, while the rest of the application logic is parallelized through generated CUDA kernels and calls to CUDA-optimized libraries such as cuBLAS and cuFFT. GPU Coder provides a clean, elegant path from algorithm to application deployment, unleashing the performance of CUDA, TensorRT, and the Xavier device architecture.
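The workflow described above can be sketched in MATLAB. The sketch below is illustrative only and is not taken from the session: the entry-point function classify_frame, the use of resnet50 as the trained network, and the input sizes are assumptions, and it presumes GPU Coder, Deep Learning Toolbox, and the hardware support package for NVIDIA Jetson platforms are installed.

% --- classify_frame.m: hypothetical entry-point function ---
function label = classify_frame(frame) %#codegen
    persistent net;
    if isempty(net)
        % 'resnet50' stands in for the user's trained network.
        net = coder.loadDeepLearningNetwork('resnet50');
    end
    img = single(imresize(frame, [224 224]));  % data preparation, generated as CUDA kernels
    scores = net.predict(img);                 % inference, compiled down to TensorRT
    [~, label] = max(scores);                  % post-processing of the result
end

% --- host-side script: generate TensorRT-based CUDA code for the whole application ---
cfg = coder.gpuConfig('lib');                                   % build a static library
cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');  % use the TensorRT inference engine
cfg.Hardware = coder.hardware('NVIDIA Jetson');                 % cross-compile for a Jetson board
codegen -config cfg classify_frame -args {ones(720,1280,3,'uint8')} -report

In this sketch the resize and argmax steps, not just the network, are part of the generated CUDA code, which is the point the abstract makes about compiling the complete application rather than the model alone.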
 
Topics:
AI Application Deployment and Inference, Deep Learning and AI Frameworks, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9281
 
Abstract:
Learn how to design a deep learning algorithm in MATLAB and deploy it to an embedded Tegra platform, including Jetson TK1, TX1, TX2, and DRIVE PX boards. The workflow starts with algorithm design in MATLAB, which enjoys universal appeal among engineers and scientists because of its expressive power and ease of use. The algorithms combine deep learning with traditional computer vision. Networks are then trained using NVIDIA GPUs and MATLAB's parallel computing support, either on the desktop, on a local compute cluster, or in the cloud. Finally, a compiler auto-generates portable, optimized CUDA code from the MATLAB algorithm, which is cross-compiled and deployed to the Tegra board. The generated code is highly optimized; we present benchmarks showing it runs about two-and-a-half times faster than MXNet, about five times faster than Caffe2, about seven times faster than TensorFlow, and on par with an optimized TensorRT implementation.
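As an illustration of the final code-generation and deployment step, the sketch below shows how targeting a Tegra board might look with today's GPU Coder workflow. It is not from the session: the board address and credentials are placeholders, detect_objects is a hypothetical entry-point function, the cuDNN deep learning target is just one possible choice, and the 2017 tooling may have used a slightly different API.

% --- hypothetical deployment script for a Tegra (Jetson) board ---
% Connect to the board; address and credentials are placeholders.
hwobj = jetson('192.168.1.15', 'ubuntu', 'ubuntu');

% Configure code generation: CUDA static library, cuDNN-based deep learning code,
% cross-compiled and built remotely on the board.
cfg = coder.gpuConfig('lib');
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
cfg.Hardware = coder.hardware('NVIDIA Jetson');
cfg.Hardware.BuildDir = '~/remoteBuildDir';

% 'detect_objects' stands in for the MATLAB entry-point function that combines
% the trained network with traditional computer vision pre- and post-processing.
codegen -config cfg detect_objects -args {ones(227,227,3,'single')} -report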
 
Topics:
Computer Vision and Machine Vision, Intelligent Machines and IoT, Deep Learning and AI
Type:
Talk
Event:
GTC Washington D.C.
Year:
2017
Session ID:
DC7151