Introducing the 3rd Edition of "Programming Massively Parallel Processors – a Hands-on Approach". This new edition is the result of a collaboration between GPU computing experts and covers the CUDA computing platform, parallel patterns, case studies, and other programming models. Brand-new chapters cover deep learning, graph search, sparse matrix computation, histogram, and merge sort.
The tightly coupled GPU Teaching Kit contains everything needed to teach university courses and labs with GPUs.
As the performance and functionality requirements of interdisciplinary computing applications rise, so does industry demand for new graduates familiar with accelerated computing on GPUs. This webinar introduces a comprehensive set of academic labs and university teaching materials for courses that cover introductory and advanced parallel programming concepts. The teaching materials start with the basics, focus on programming GPUs with CUDA, and go on to advanced topics such as optimization, advanced architectural enhancements, and integration of a variety of programming languages.
I will present two synergistic systems that enable productive development of scalable, efficient data-parallel code. Triolet is a functional programming system with Python-based syntax in which library implementers direct the compiler to perform parallelization and deep optimization. Tangram is an algorithm framework that supports effective parallelization of linear recurrence computation.
Attend this session to learn new techniques for building a scalable and numerically stable tridiagonal solver for GPUs. Numerical stability appears to be missing from all existing GPU-based tridiagonal solvers. This work presents a scalable, numerically stable, high-performance tridiagonal solver. The solver provides stable solutions of a quality comparable to Intel MKL and Matlab, at speeds comparable to the GPU tridiagonal solvers in existing packages like CUSPARSE. We present and analyze two key optimization strategies for our solver: a high-throughput data layout transformation for memory efficiency, and a dynamic tiling approach for reducing the memory access footprint caused by branch divergence. Several applications are shown to benefit substantially from this solver. As a case study, Empirical Mode Decomposition, a critical method in time-frequency analysis, is used to demonstrate the usability of our solver.
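To make the stability issue concrete, here is a minimal CPU sketch in Python. It is not the solver from this session. The abstract does not say which pivoting scheme the GPU solver uses, so row partial pivoting is assumed here purely as a familiar stand-in, and the function names `thomas_no_pivot` and `solve_partial_pivot` are hypothetical. The sketch contrasts the classic pivot-free elimination, which breaks down on a zero pivot, with a pivoting variant that solves the same system.

```python
def thomas_no_pivot(lower, diag, upper, rhs):
    """Pivot-free tridiagonal elimination (Thomas algorithm).

    lower[i] is the entry at (i+1, i), upper[i] the entry at (i, i+1).
    Fails or loses accuracy when a pivot is zero or tiny.
    """
    n = len(diag)
    d = [float(v) for v in diag]
    b = [float(v) for v in rhs]
    for i in range(n - 1):
        fact = lower[i] / d[i]  # raises ZeroDivisionError on a zero pivot
        d[i + 1] -= fact * upper[i]
        b[i + 1] -= fact * b[i]
    x = [0.0] * n
    x[n - 1] = b[n - 1] / d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x


def solve_partial_pivot(lower, diag, upper, rhs):
    """Tridiagonal elimination with row partial pivoting (assumed scheme).

    Swapping adjacent rows creates fill-in on a second super-diagonal,
    which is stored in du2.
    """
    n = len(diag)
    dl = [float(v) for v in lower]
    d = [float(v) for v in diag]
    du = [float(v) for v in upper]
    du2 = [0.0] * max(n - 2, 0)
    b = [float(v) for v in rhs]
    for i in range(n - 1):
        if abs(d[i]) >= abs(dl[i]):
            # Current diagonal entry is the larger candidate: no swap.
            if d[i] == 0.0:
                raise ValueError("matrix is singular")
            fact = dl[i] / d[i]
            d[i + 1] -= fact * du[i]
            b[i + 1] -= fact * b[i]
        else:
            # Swap rows i and i+1, then eliminate below the new pivot.
            fact = d[i] / dl[i]
            d[i] = dl[i]
            d[i + 1], du[i] = du[i] - fact * d[i + 1], d[i + 1]
            if i < n - 2:
                du2[i] = du[i + 1]
                du[i + 1] = -fact * du[i + 1]
            b[i], b[i + 1] = b[i + 1], b[i] - fact * b[i + 1]
    if d[n - 1] == 0.0:
        raise ValueError("matrix is singular")
    # Back substitution over the diagonal and two super-diagonals.
    x = [0.0] * n
    x[n - 1] = b[n - 1] / d[n - 1]
    if n > 1:
        x[n - 2] = (b[n - 2] - du[n - 2] * x[n - 1]) / d[n - 2]
    for i in range(n - 3, -1, -1):
        x[i] = (b[i] - du[i] * x[i + 1] - du2[i] * x[i + 2]) / d[i]
    return x


# A nonsingular system whose first diagonal entry is zero.
lower, diag, upper = [1, 1], [0, 0, 1], [1, 1]
rhs = [2, 4, 5]  # chosen so that the exact solution is [1, 2, 3]
solution = solve_partial_pivot(lower, diag, upper, rhs)
```

On this input the pivot-free routine divides by zero at the first step, while the pivoting routine recovers the solution. The price of pivoting is a data-dependent branch at every row, which is the kind of divergence a GPU implementation has to manage.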