In this talk, we will introduce the NVIDIA VisionWorks toolkit, a software development package for computer vision (CV) and image processing. VisionWorks(TM) implements and extends the Khronos OpenVX standard and is optimized for CUDA-capable GPUs and SoCs, enabling computer vision applications on a scalable and flexible platform. VisionWorks implements a thread-safe API and a framework for seamlessly adding user-defined primitives. The talk will give an overview of the VisionWorks toolkit, the OpenVX API and framework, VisionWorks-plus modules including the VisionWorks Structure From Motion and Object Tracker modules, and computer vision pipeline samples showing integration of the library API into a computer vision pipeline on Tegra platforms.
The new NVIDIA® CUDA® Toolkit 8 presents major improvements to the memory model and profiling tools, along with new libraries. These changes enable you to improve performance, simplify memory usage, and profile and debug your application more efficiently.
Learn how updates to the CUDA toolkit improve the performance of GPU-accelerated applications. Through benchmark results, we will review the impact of new libraries, updates to memory management and mixed precision programming. The session will cover performance of CUDA toolkit components including libraries and the compiler.
This talk will provide an overview of new debugging and profiling features added in the CUDA 8.0 Toolkit.
The new CUDA Toolkit 8 includes support for Pascal GPUs, up to 2 TB of Unified Memory, and new automated critical path analysis for effortless performance optimization. This is the most powerful and easiest-to-use version of the CUDA Toolkit to date.
New to CUDA? Join this free foundational webinar on Wednesday, June 8 to gain essential programming knowledge.
Even those with some CUDA experience can benefit by refreshing the key concepts required for future optimization tutorials.
The course begins with a brief overview of CUDA and data-parallelism before focusing on the GPU programming model, fundamentals of GPU kernels, host and device responsibilities, CUDA syntax and thread hierarchy.
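As a rough illustration of the thread hierarchy mentioned above, the sketch below emulates on the CPU, in plain Python, how a data-parallel kernel typically maps each (block, thread) pair to one array element. It is not CUDA code and not course material: the function names global_thread_index and launch_vector_add are hypothetical, and the nested loops only stand in for what a GPU would run in parallel.

```python
def global_thread_index(block_idx, block_dim, thread_idx):
    """Flat element index for one thread, mirroring the common CUDA
    expression blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def launch_vector_add(a, b, threads_per_block=4):
    """CPU emulation of a 1D kernel launch that adds two vectors.

    The "host" part picks the grid size and allocates the output; the
    body of the inner loop plays the role of the "device" kernel, run
    once per emulated thread.
    """
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    n = len(a)
    out = [0] * n
    # Round up so that every element is covered by some thread.
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):              # blocks in the grid
        for thread_idx in range(threads_per_block):  # threads in a block
            i = global_thread_index(block_idx, threads_per_block, thread_idx)
            if i < n:  # bounds check: the last block may be partly idle
                out[i] = a[i] + b[i]
    return out
```

With five elements and four threads per block, the emulated launch uses two blocks, and three threads of the second block fail the bounds check and do nothing. That guard is the reason kernels of this shape usually test the index against the array length before touching memory.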
In this session you will learn the main challenges that we have overcome at the BSC to successfully accelerate two large applications by using CUDA and NVIDIA GPUs: WARIS (a Volcanic Ash Transportation Model) and PELE (a Drug Molecule Interaction Simulator). We show that leveraging asynchronous execution is key to achieving a high utilization of the GPU resources (even for very small problem sizes) and to overlapping CPU and GPU execution. We also explain some techniques to introduce Unified Virtual Memory in your data structures for seamless CPU/GPU data sharing. Our results show an execution time improvement in WARIS of 8.6x for a 4-GPU node compared to a 16-core CPU node (using by-hand AVX vectorization and MPI). Preliminary experiments in PELE already show a 2x speedup.
Join this session to learn how to use GPUs and CUDA programming to achieve order-of-magnitude speedup even for large codes that are more complex than tutorial examples. We'll cover our multi-year effort on heterogeneous CPU-GPU acceleration of the GROMACS package for molecular dynamics simulations on a wide range of architectures. We'll introduce new results where CUDA has made it possible to accelerate the costly 3D image reconstruction used in single-particle cryo-electron microscopy (cryo-EM) by 20-200X. You'll learn how you can use these tools in your application work, and what strategies to pursue to accelerate difficult codes where neither libraries nor directives are useful, and even moving computational kernels to CUDA seems to fail.
We'll present a real CUDA application and use NVIDIA Nsight Eclipse Edition on Linux to optimize the performance of the code. Attendees will learn a method to analyze their codes and how to use the tools to apply those ideas.
We'll present a real CUDA application and use NVIDIA Nsight Visual Studio Edition on Windows to optimize the performance of the code. Attendees will learn a method to analyze their codes and how to use the tools to apply those ideas.
Learn how to deploy GPU clustering at scale by integrating Chelsio's 40GbE iWARP (RDMA/TCP) into your GPU applications. GPUs have demonstrated paradigm-shifting performance in a wide variety of applications. But there remain network infrastructure challenges to the adoption of GPUs operating at scale, especially in large-scale cloud environments. We present 40GbE iWARP as the no-risk path for Ethernet-based, large-scale GPU clustering: it leverages existing Ethernet infrastructure and requires no new protocols, interoperability, or long maturity period. The session provides a technical overview of 40GbE iWARP, including best practices and tuning for GPU applications.
This session will provide a step-by-step walkthrough of new features added in NVIDIA Visual Profiler and nvprof. It will show how these profiling tools can be used to identify optimization opportunities at the application, kernel, and source-line levels.