Uncertainty in locomotion and sensing is one of the primary challenges in the robotics domain. GPUs are emerging as powerful new tools for uncertainty quantification through their ability to perform real-time Monte Carlo simulation as part of a closed-loop control system. By coupling GPU-based uncertainty propagation with optimal control laws, robotic vehicles can "hedge their bets" in unknown environments and protect themselves from unexpected disturbances. Examples of GPU-based stochastic controllers will be discussed for several robotic systems of interest, including simulated and experimental results demonstrating unique improvements in obstacle avoidance and accuracy. The theoretical concepts behind GPU-based control will be described, allowing application of these control laws to a wide array of robotic systems.
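As a rough illustration of the underlying idea (not the presenters' actual controller), a GPU can propagate thousands of state hypotheses through noisy dynamics in a single kernel launch; the double-integrator dynamics and noise level SIGMA below are placeholder assumptions:

    #include <curand_kernel.h>

    // One thread per Monte Carlo sample: propagate a 2D state (position,
    // velocity) one step under control input u with additive process noise.
    // The double-integrator dynamics and noise scale SIGMA are placeholders.
    __global__ void propagate_samples(float2* states, unsigned long long seed,
                                      float u, float dt, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        curandState rng;
        curand_init(seed, i, 0, &rng);

        float2 s = states[i];
        const float SIGMA = 0.05f;                    // assumed noise level
        s.x += s.y * dt;                              // position update
        s.y += u * dt + SIGMA * curand_normal(&rng);  // velocity + disturbance
        states[i] = s;
    }

The resulting ensemble approximates the propagated state distribution, which a controller can then score against obstacles or targets in the same closed loop.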
In this presentation we describe an efficient multi-level parallel implementation of a most-significant-bit (MSB) radix-sort-based multi-select algorithm for k-nearest-neighbor (k-NN) search. Our implementation processes multiple queries within a single kernel call, with each thread block/warp simultaneously processing a different query. Our approach is incremental and reduces memory transactions through the use of bit operators, warp voting functions, and shared memory. Benchmarks show significant improvement over previous implementations of k-NN search on the GPU.
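An illustrative reconstruction (not the presented code) of the warp-voting trick the abstract alludes to: each warp counts the keys with the current radix bit set using a single ballot instead of shared-memory atomics.

    // Counting pass inside an MSB radix multi-select: one ballot per warp
    // replaces 32 atomic increments.
    __global__ void count_bit(const unsigned int* keys, int n, int bit,
                              unsigned int* warpCounts)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 31;

        bool pred = (i < n) && ((keys[i] >> bit) & 1u);
        unsigned int ballot = __ballot_sync(0xffffffff, pred); // 1 bit per lane

        if (lane == 0)
            warpCounts[i >> 5] = __popc(ballot);  // keys in warp with bit set
    }

A prefix sum over warpCounts then tells the select routine how many keys fall on each side of the current bit, so it can recurse into the partition containing the k-th element.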
Random Forests have become an extremely popular machine learning algorithm for making predictions from large and complicated data sets. The highest-performing implementations of Random Forests currently all run on the CPU. We implemented a Random Forest learner for the GPU (using PyCUDA and runtime code generation) which outperforms the currently preferred libraries (scikit-learn and wiseRF). The "obvious" parallelization strategy (using one thread block per tree) results in poor performance. Instead, we developed a more nuanced collection of kernels to handle various tradeoffs between the number of samples and the number of features.
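To make the sample/feature tradeoff concrete, here is a hypothetical sketch of one point in that kernel design space (the authors' actual kernels are more sophisticated): one thread per candidate threshold, scanning a single feature column, with binary labels assumed.

    // Gini impurity of a binary node.
    __device__ float gini(int pos, int cnt)
    {
        if (cnt == 0) return 0.0f;
        float p = (float)cnt ? (float)pos / cnt : 0.0f;
        return 2.0f * p * (1.0f - p);
    }

    // One thread scores one candidate threshold on one feature column;
    // a reduction over 'scores' then picks the best split.
    __global__ void score_splits(const float* feature, const int* labels,
                                 int n, const float* thresholds, int t,
                                 float* scores)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= t) return;

        float thr = thresholds[j];
        int nl = 0, pl = 0, nr = 0, pr = 0;  // left/right counts and positives
        for (int i = 0; i < n; ++i) {
            if (feature[i] <= thr) { ++nl; pl += labels[i]; }
            else                   { ++nr; pr += labels[i]; }
        }
        scores[j] = (nl * gini(pl, nl) + nr * gini(pr, nr)) / n;
    }

With many samples and few features the roles invert: threads cooperate on one split while blocks span features, which is exactly the tradeoff the session explores.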
The rise of the internet, especially the mobile internet, has accelerated the data explosion - a driving force behind the great success of deep learning in recent years. Behind the scenes, heterogeneous high-performance computing is another key enabler of that success. In this talk, we will share some of the work we did at Baidu. We will highlight how big data, deep analytics, and high-performance heterogeneous computing can work together with great success.
Speeding up machine learning algorithms has often meant tedious, bug-ridden programs tuned to specific architectures, all written by parallel programming amateurs. But machine learning experts can leverage libraries such as cuBLAS to greatly ease the burden of development and make fast code widely available. We present a case study in parallelizing Kernel Support Vector Machines, powerful machine-learned classifiers which are very slow to train on large data. In contrast to previous work, which relied on hand-coded exact methods, we demonstrate that a recent approximate method can be compelling for its remarkably simple implementation, portability, and unprecedented speedup on GPUs.
See how a cluster of GPUs has enabled our research group to train Artificial Neural Networks with more than 10 billion connections. "Deep learning" algorithms, driven by bigger datasets and the ability to train larger networks, have led to advancements in diverse applications including computer vision, speech recognition, and natural language processing. After a brief introduction to deep learning, we will show how neural network training fits into our GPU computing environment and how this enables us to duplicate deep learning results that previously required thousands of CPU cores.
In this talk, we compare the implementation of deep learning networks on traditional x86 processors with the implementation on NVIDIA Tesla K20 GPU Accelerators for the purposes of training Restricted Boltzmann Machines and for deep network back propagation in a large-vocabulary speech recognition task (automatic transcription of TED talks). Two GPU implementations are compared: 1) a high-level implementation using Theano and 2) a native implementation using low-level CUDA BLAS libraries. We describe the scaling properties of these implementations in comparison to a baseline batched-x86 implementation as a function of training data size. We also explore the development time tradeoffs for each of the implementations.
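As a minimal illustration of the low-level route (not the speakers' actual code), the RBM up-pass reduces to one SGEMM plus an element-wise sigmoid; column-major single-precision layout is assumed and biases are omitted.

    #include <cublas_v2.h>

    __global__ void sigmoid(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = 1.0f / (1.0f + expf(-x[i]));
    }

    // Hidden-unit activations H = sigmoid(V * W) for a mini-batch:
    // V is batch x nv, W is nv x nh, H is batch x nh (all column-major).
    void rbm_up(cublasHandle_t handle, const float* V, const float* W,
                float* H, int batch, int nv, int nh)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    batch, nh, nv, &alpha, V, batch, W, nv, &beta, H, batch);
        int n = batch * nh;
        sigmoid<<<(n + 255) / 256, 256>>>(H, n);
    }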
Machine learning is a powerful tool for processing large amounts of data. Learning to rank plays a key role in many information retrieval problems and constructs a ranking model from training data. Ensemble methods allow us to make a trade-off between the quality of the obtained model and the computational time of the learning process, and many of these algorithms lend themselves to parallel processing of data. We describe the task of machine-learned ranking and consider the MatrixNet algorithm, which is based on decision tree boosting. We present a GPU-optimized implementation of this method, which runs more than 20 times faster than the CPU-based version while retaining the same ranking quality.
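MatrixNet's internals are not public; as a generic illustration of the building block GPU tree boosting spends its time on, the sketch below accumulates per-bin gradient histograms for one feature, assuming pre-binned features and a fixed NBINS.

    #define NBINS 64

    // Shared-memory histogram per block, flushed once to global memory.
    __global__ void grad_histogram(const unsigned char* bins, // binned feature
                                   const float* gradients,    // per-sample grads
                                   int n, float* hist)        // NBINS outputs
    {
        __shared__ float sh[NBINS];
        for (int b = threadIdx.x; b < NBINS; b += blockDim.x) sh[b] = 0.0f;
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&sh[bins[i]], gradients[i]);  // cheap shared atomics
        __syncthreads();

        for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
            atomicAdd(&hist[b], sh[b]);             // one global flush per block
    }

The best split per tree node then falls out of a small scan over the NBINS histogram entries.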
This talk will describe recent progress in object recognition using deep convolutional networks. Over the last 18 months, these have demonstrated significant gains over traditional computer vision approaches and are now widely used in industry (e.g. Google, Facebook, Microsoft, Baidu). Rob Fergus will outline how these models work, and describe architectures that produce state-of-the-art results on the leading recognition benchmarks. GPUs are an essential component in training these models. The talk will conclude with a live demo.
Significant advances have recently been made in the fields of machine learning and image recognition, impacted greatly by the use of NVIDIA GPUs. Leading performance is harnessed from deep neural networks trained on millions of images to predict thousands of categories of objects. Our expertise at Clarifai in deep neural networks helped us achieve the world's best published image labeling results [ImageNet 2013]. We use NVIDIA GPUs to train large neural networks within practical time constraints and are creating a developer API to enable the next generation of applications in a variety of fields. This talk will describe what these neural networks learn from natural images and how they can be applied to auto-tagging new images, searching large untagged photo collections, and detecting near-duplicates. A live demo of our state-of-the-art system will showcase these capabilities and allow audience interaction.
Many modern 3D range sensors generate on the order of one million data points per second and form the foundation of many modern applications in robotic perception. For real-time performance, it is beneficial to leverage parallel hardware when possible. This poster details work to quickly compress a raw point cloud into a set of parametric surfaces using a GPU-accelerated form of Expectation Maximization. We find that our algorithm is over an order of magnitude faster than the serial C version, while the segmentation provides several orders of magnitude savings in memory while still preserving the geometric properties of the data.
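A minimal sketch of the E-step under the simplifying assumption of isotropic Gaussian components (our illustration, not the poster's code; the constant factor (2*pi)^(-3/2) cancels in the normalization and is omitted):

    // One thread computes the responsibilities of K isotropic Gaussians for
    // one 3D point, then normalizes so they sum to 1.
    __global__ void e_step(const float3* pts, int n,
                           const float3* mu, const float* var,
                           const float* weight, int K, float* resp)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float3 p = pts[i];
        float sum = 0.0f;
        for (int k = 0; k < K; ++k) {
            float dx = p.x - mu[k].x, dy = p.y - mu[k].y, dz = p.z - mu[k].z;
            float d2 = dx * dx + dy * dy + dz * dz;
            float r = weight[k] * expf(-0.5f * d2 / var[k])
                    / powf(var[k], 1.5f);        // isotropic normalization
            resp[i * K + k] = r;
            sum += r;
        }
        for (int k = 0; k < K; ++k)
            resp[i * K + k] /= sum;              // responsibilities sum to 1
    }

The M-step is then a set of per-component weighted reductions over these responsibilities, which also parallelizes well.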
Our goal was the development of efficient ensemble learning methods that use multiple models to obtain better predictive performance than could be obtained from any of the constituent models. We present the integration of CUDA with Python and R for accelerating the machine learning process on multiple GPUs. Python GPU acceleration is based on PyCUDA and the CUDA math packages used in machine learning environments such as Theano and CudaTree. R acceleration is based on the integration of the PLASMA and MAGMA GPU-accelerated math libraries. In our presentation we integrate GPU-accelerated random forests and GPU-accelerated support vector machines into an ensemble learning method that runs on multiple GPUs simultaneously. An example of wrapper code for support vector machines is presented, along with some benchmark results.
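The presented wrapper code is not reproduced here; as an illustration of the general pattern (names and the RBF decision function are assumptions), the sketch below exposes an extern "C" entry point that Python can load via ctypes, or R via .C, after compiling with nvcc -shared:

    // RBF decision-function kernel: accumulates sum_i alpha_i*y_i*K(sv_i, x).
    __global__ void svm_decision(const float* sv, const float* alpha_y,
                                 int n_sv, int dim, float gamma,
                                 const float* x, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_sv) return;
        float d2 = 0.0f;
        for (int j = 0; j < dim; ++j) {
            float diff = sv[i * dim + j] - x[j];
            d2 += diff * diff;
        }
        atomicAdd(out, alpha_y[i] * expf(-gamma * d2));
    }

    // C-linkage wrapper so scripting languages can call the launch directly.
    extern "C" void svm_predict(const float* d_sv, const float* d_alpha_y,
                                int n_sv, int dim, float gamma,
                                const float* d_x, float* d_out)
    {
        cudaMemset(d_out, 0, sizeof(float));
        svm_decision<<<(n_sv + 255) / 256, 256>>>(d_sv, d_alpha_y, n_sv, dim,
                                                  gamma, d_x, d_out);
        // Host code subtracts the bias rho and takes the sign of the result.
    }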
MALDI imaging is a label-free bioanalytical technique which can capture the spatial distribution of hundreds of molecular compounds in a single measurement while maintaining the sample's molecular integrity. Development of statistical methods for MALDI involves preprocessing, segmentation, generating prototype molecular images using PCA, and spectra classification. The large amounts of data and complex machine learning algorithms call for GPU acceleration. In the EU-funded 3D Massomics project, SagivTech aims at developing a GPU-based library for the analysis of MALDI imaging.
This approach aims at aligning, unifying, and expanding the set of sentiment lexicons which are available on the web in order to increase their robustness of coverage. Our USL approach computes the unified strength of polarity of each lexical entry based on the Pearson correlation coefficient, which measures how correlated lexical entries are with a value between 1 and -1 (1 indicates that the lexical entries are perfectly correlated, 0 indicates no correlation, and -1 means they are perfectly inversely correlated), and on the UnifiedMetrics procedure, implemented for both CPU and GPU.
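Our illustration of how the per-pair Pearson computation might map to the GPU (the actual UnifiedMetrics data layout is an assumption): one thread computes the correlation between two lexical entries' polarity vectors of length n.

    __global__ void pearson(const float* x, const float* y, int n,
                            float* r, int pairs, int stride)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= pairs) return;
        const float* xv = x + p * stride;
        const float* yv = y + p * stride;

        float sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; ++i) {
            sx += xv[i]; sy += yv[i];
            sxx += xv[i] * xv[i]; syy += yv[i] * yv[i]; sxy += xv[i] * yv[i];
        }
        float cov = sxy - sx * sy / n;   // n * covariance
        float vx  = sxx - sx * sx / n;   // n * variance of x
        float vy  = syy - sy * sy / n;   // n * variance of y
        r[p] = cov / sqrtf(vx * vy);     // Pearson r in [-1, 1]
    }

For long vectors, a block-per-pair layout with a shared-memory reduction would replace the serial inner loop.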
In this poster, we present preliminary results on our rewriting of the source code of NEURON, the de facto standard general neural network simulator in the field of computational neuroscience, using CUDA. In this study, we adopted a strategy of solving in parallel the ordinary differential equations for mechanisms, which describe the electrical and chemical properties attached to a small area of cell membrane. The rewriting was rather straightforward, and we were able to achieve up to a 27% speedup of a typical benchmark simulation. These results suggest that CUDA provides a simple means to accelerate general neural simulation.
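As an example of the kind of mechanism ODE solved per thread, the sketch below advances a Hodgkin-Huxley-style gating variable with the exponential Euler method; the specific rate functions are textbook values, not necessarily those of the benchmark model.

    // dm/dt = (m_inf(V) - m) / tau(V); exponential Euler is exact for a
    // fixed V over the step and unconditionally stable.
    __global__ void gate_step(float* m, const float* v, int n, float dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float V = v[i];
        float alpha = 0.1f * (V + 40.0f)
                    / (1.0f - expf(-(V + 40.0f) / 10.0f));  // opening rate
        float beta  = 4.0f * expf(-(V + 65.0f) / 18.0f);    // closing rate
        float tau   = 1.0f / (alpha + beta);
        float minf  = alpha * tau;

        m[i] = minf + (m[i] - minf) * expf(-dt / tau);
    }

Because every mechanism instance updates independently of the others within a time step, this maps naturally onto one CUDA thread per instance.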
We introduce an algorithm for determining optimal transition paths between given configurations. The solution is obtained by solving variational equations for Freidlin-Wentzell action functionals. One application of the presented method is a system controlling motion and redeployment between unit formations. The efficiency of the algorithm has been evaluated in a simple sandbox environment implemented with the use of NVIDIA CUDA technology.
Multi-robot coalition formation is an NP-hard combinatorial optimization problem. This work models the multi-robot coalition formation problem as a multi-objective optimization problem. Evolutionary approaches have been the preferred choice for finding the set of Pareto-optimal solutions for any multi-objective optimization problem, but due to their high computational complexity, these approaches fail to deliver results in time-critical scenarios like coalition formation. This work introduces a novel parallel multi-point Pareto Archived Evolution Strategy (PAES) algorithm to solve the multi-robot coalition formation problem. The results show that the proposed algorithm is scalable and produces better approximations of the Pareto-optimal set than the other approaches investigated.
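A hypothetical sketch of the parallel evaluation step at the heart of a multi-point PAES (the two objectives below, total cost and load balance, are illustrative choices, not the paper's exact formulation):

    // One thread scores one candidate task assignment on two objectives,
    // so many search points are evaluated per kernel call.
    __global__ void eval_coalitions(const int* assign,  // pop x n_robots tasks
                                    const float* cost,  // n_robots x n_tasks
                                    int pop, int n_robots, int n_tasks,
                                    float2* objectives)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= pop) return;

        const int* a = assign + p * n_robots;
        float total = 0.0f, maxTask = 0.0f;
        for (int t = 0; t < n_tasks; ++t) {
            float taskCost = 0.0f;
            for (int r = 0; r < n_robots; ++r)
                if (a[r] == t) taskCost += cost[r * n_tasks + t];
            total += taskCost;
            maxTask = fmaxf(maxTask, taskCost);
        }
        objectives[p] = make_float2(total, maxTask);  // both minimized
    }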
This work presents the implementation of parallel evaluation of coefficients for a navigation system for autonomous mobile robots. The idea is to make use of parallel computation to improve performance and accelerate the evaluation of the coefficients in the system.
We are implementing a fully GPU-based imager for radio interferometric imaging, targeting high-sensitivity near-real-time imaging. Modern interferometric radio telescopes generate many terabytes of data per observation, which need to be imaged in near-real time. Imaging software running on conventional computers currently takes many orders of magnitude longer. In this presentation, we will briefly describe the algorithms and describe in more detail their adaptation for GPUs in particular and for heterogeneous computing in general. We will discuss the resulting run-time performance on the GPU using real data from existing radio telescopes. Tests with our current implementation show a speedup of up to 100x compared to the CPU implementation in the critical parts of processing, enabling us to reduce the memory footprint by replacing compute-and-cache with on-demand computing on the GPU. For scientific use cases requiring high-resolution, high-sensitivity imaging, such a GPU-based imager represents an enabling technology.
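Gridding typically dominates interferometric imaging; a minimal sketch of a scatter-style gridding kernel (the support size, taper table, and pre-clamped uv coordinates are assumptions, not our production code) looks like this:

    #define SUPPORT 3   // half-width of the gridding convolution function

    // One thread scatters one visibility onto the uv-grid through a small
    // convolution window, using atomics to resolve collisions.
    __global__ void grid_visibilities(const float2* vis, const float* u,
                                      const float* v, int nvis,
                                      const float* taper, // (2*SUPPORT+1)^2
                                      float2* grid, int gridSize)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nvis) return;

        int cu = (int)roundf(u[i]), cv = (int)roundf(v[i]);
        for (int dy = -SUPPORT; dy <= SUPPORT; ++dy)
            for (int dx = -SUPPORT; dx <= SUPPORT; ++dx) {
                float w = taper[(dy + SUPPORT) * (2 * SUPPORT + 1)
                                + dx + SUPPORT];
                int idx = (cv + dy) * gridSize + (cu + dx);
                atomicAdd(&grid[idx].x, w * vis[i].x);
                atomicAdd(&grid[idx].y, w * vis[i].y);
            }
    }

Computing the taper weights on demand, rather than caching large per-baseline convolution functions, is one way the memory footprint shrinks on the GPU.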
Radio frequency interference (RFI) is the primary enemy of sensitive multi-element radio instruments like the Giant Metrewave Radio Telescope (GMRT, India). Signals from radio receivers are corrupted by RFI from power lines, satellite signals, etc. Seen in the form of spikes and bursts in raw voltage data, RFI appears statistically as outliers in a Gaussian distribution. We present an approach to tackle the problem of RFI, in real time, using a robust scale estimator, the Median Absolute Deviation (MAD). Given the large data rate from each of the 30 antennas, sampled every 16 ns, it is necessary for the filter to work well within real-time limits. To accomplish this, the algorithm has been ported to GPUs to work within the GMRT pipeline. Presently, the RFI rejection pipeline runs in real time for 0.3-0.7 sec long data chunks. The GMRT will soon be upgraded to work at 10 times the current data rate. We are now working on improving the algorithm further so as to have the RFI rejection pipeline ready for the upgraded GMRT.
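For illustration, the MAD statistic at the heart of the filter can be computed with Thrust as below; this sort-based version is a simple sketch of the math, not the optimized estimator used in the GMRT pipeline.

    #include <cmath>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/transform.h>

    struct abs_dev {
        float med;
        abs_dev(float m) : med(m) {}
        __host__ __device__ float operator()(float x) const {
            return fabsf(x - med);
        }
    };

    // Median of the data, then median of absolute deviations from it.
    float compute_mad(const thrust::device_vector<float>& d, float* median_out)
    {
        thrust::device_vector<float> tmp = d;   // work on a copy
        thrust::sort(tmp.begin(), tmp.end());
        float med = tmp[tmp.size() / 2];        // median (odd-length shortcut)

        thrust::transform(tmp.begin(), tmp.end(), tmp.begin(), abs_dev(med));
        thrust::sort(tmp.begin(), tmp.end());
        *median_out = med;
        return tmp[tmp.size() / 2];             // MAD
    }

A voltage sample x is then flagged as RFI when fabsf(x - median) > k * 1.4826f * MAD, where 1.4826 scales MAD to a Gaussian standard deviation and k ~ 3 is a typical threshold.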
A continuation of Part 1, this is a hands-on, interactive demonstration of content creation using UI Composer. The audience will be guided through the steps to build a data-driven virtual automotive gauge. In order to actively participate in this session, attendees are asked to bring their own Windows laptop with UI Composer installed. UI Composer is available for free from http://uicomposer.nvidia.com/
In this session we present a real-time simulation of electromagnetic wave propagation using OptiX GPU ray tracing. This simulation is used in virtual test drives to allow testing of Advanced Driver Assistance Systems that will be based on wireless car-to-car communication. Learn how ray tracing performance can be improved to achieve real-time simulations and how the ray tracing results are post-processed to perform the electromagnetic calculations on the GPU using the Thrust library.
Discover how mobile GPUs enable modern driver assistance features in a power-efficient and standardized way, by providing the fundamental building blocks of computer vision to the higher-level reasoning functions that enable the car to detect lanes, park automatically, avoid obstacles, etc. We explain the challenges of having to fit into a given time budget, and how low-level machine vision such as corner detection and feature tracking, and even more advanced functionality such as 3D reconstruction of the surroundings, is achieved in the context of the car's systems and its outside environment.
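As an example of one such low-level building block, a minimal corner-response kernel (Harris response from central differences; the usual windowed smoothing of the structure tensor is omitted for brevity) might look like this:

    // One thread computes the Harris corner response for one pixel.
    __global__ void harris(const float* img, float* response,
                           int w, int h, float k)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;

        float ix = 0.5f * (img[y * w + x + 1] - img[y * w + x - 1]);
        float iy = 0.5f * (img[(y + 1) * w + x] - img[(y - 1) * w + x]);

        float a = ix * ix, b = ix * iy, c = iy * iy; // structure tensor
        float det = a * c - b * b;
        float tr  = a + c;
        response[y * w + x] = det - k * tr * tr;     // k ~ 0.04 typically
    }

On an embedded time budget, kernels like this are fused and tiled aggressively so the vision front end leaves headroom for the reasoning functions.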
People want cars that are not only cost-friendly, trouble-free, and energy-efficient, but also safe. Today's technology provides Advanced Emergency Braking Systems that can detect pedestrians and automatically brake just before an otherwise unavoidable collision. We have a vision that future Advanced Driver Assistance Systems will not just detect pedestrians but recognize their behavior and understand the level of danger, in order to avoid emergency situations altogether. We claim deep Convolutional Neural Networks (CNNs) are the right tools for these highly non-trivial tasks, and Tegra is the best partner. We demonstrate a real-time deep CNN running on Tegra.
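For a sense of the arithmetic a deep CNN spends its time on, here is a deliberately naive single-output-map convolution kernel with fused ReLU (the filter size K and untiled structure are illustrative assumptions; production code tiles inputs into shared memory):

    #define K 5   // filter size (assumption)

    // One thread per output pixel, looping over C input channels and a
    // KxK window; 'valid' convolution, so the output is (H-K+1) x (W-K+1).
    __global__ void conv_layer(const float* in, int C, int H, int W,
                               const float* filt,  // C x K x K weights
                               float* out)
    {
        int ox = blockIdx.x * blockDim.x + threadIdx.x;
        int oy = blockIdx.y * blockDim.y + threadIdx.y;
        int OW = W - K + 1, OH = H - K + 1;
        if (ox >= OW || oy >= OH) return;

        float acc = 0.0f;
        for (int c = 0; c < C; ++c)
            for (int fy = 0; fy < K; ++fy)
                for (int fx = 0; fx < K; ++fx)
                    acc += in[(c * H + oy + fy) * W + ox + fx]
                         * filt[(c * K + fy) * K + fx];
        out[oy * OW + ox] = fmaxf(acc, 0.0f);   // fused ReLU
    }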
Learn about two Delphi projects that are pushing the concept of a personalized in-vehicle experience. As drivers bring more of their personal content and personal style into the car, opportunities are emerging for car makers and platform providers to differentiate their offerings. We will explore the infotainment architecture of the future - enabling feature upgrades at the same rate as mobile devices. We will also explore how GPU technology enables "months-to-minutes" user interfaces, and greater flexibility in end-user personalization.
In this session, we will present the contents of the Vision Toolkit, discuss its performance advantages, and demonstrate real-time applications enabled by this library. The Vision Toolkit is a product of NVIDIA, designed to enable real-life computer vision applications. It leverages state-of-the-art computer vision research and offers a variety of functions to developers, initially targeting Advanced Driver Assistance Systems (ADAS) and Augmented Reality (AR) applications. The toolkit will be highly GPU-accelerated on mobile platforms, offering significant speedup and reducing the engineering effort needed to design real-time vision applications. The toolkit includes open source samples and offers a flexible framework that enables users to extend and contribute new functionality. It will be deployed on different operating systems, including Android and Linux on ARM, to registered developers and partners through NVIDIA's web site.
With recent advances in low-cost high-performance LiDARs (laser-based Light Detection and Ranging sensors) and GPUs, ultra-accurate GPS-free navigation based on SLAM (Simultaneous Localization and Mapping) is becoming a reality. Learn how the latest 360° field-of-view long-range 3D mapping LiDARs capable of generating data streams at gigasample-per-second (GSPS) sampling rates are used with 192-CUDA-core GPUs based on the Kepler architecture to run artificial intelligence software and deliver advanced vehicular safety and navigation systems capable of real-time object detection, tracking, identification and classification, as well as offline full-availability jam-proof centimeter-accurate navigation.
The Tegra K1 is a powerful SOC that will be leveraged across many industries. It is based on the same Kepler architecture as the world's fastest gaming systems and most efficient supercomputers and brings supercomputing power to mobile and embedded. Jesse Clayton from NVIDIA will articulate the embedded development process for Tegra K1. The talk will cover the platform, programming paradigm, and development tools, and provide details on the Tegra K1 architecture relevant to embedded applications.
Fast on-road object detection is an important feature of advanced driver assistance systems (ADAS). We propose a CUDA implementation of a soft-cascade detector that allows real-time object detection on the Tegra K1 platform, applicable to pedestrian and vehicle detection.
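A minimal sketch of the soft-cascade idea (our illustration; the stage layout and feature indexing are assumptions): each thread walks the weak classifiers for one detection window and rejects early as soon as the running score falls below the per-stage trace threshold, which is what makes soft cascades fast enough for real time.

    struct Stage { int featureId; float weight; float threshold; };

    __global__ void soft_cascade(const float* features, // nWindows x nFeat
                                 const Stage* stages, int nStages,
                                 int nWindows, int nFeat, int* accepted)
    {
        int w = blockIdx.x * blockDim.x + threadIdx.x;
        if (w >= nWindows) return;

        float score = 0.0f;
        for (int s = 0; s < nStages; ++s) {
            score += stages[s].weight
                   * features[w * nFeat + stages[s].featureId];
            if (score < stages[s].threshold) {   // early rejection
                accepted[w] = 0;
                return;
            }
        }
        accepted[w] = 1;                         // survived all stages
    }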
An index of web documents provides a base for search and decision making. Traditionally, GPUs are used to run applications having a lot of parallelism and a small degree of divergence. We show that GPUs are also able to outperform CPUs for an application that has a large degree of parallelism but medium divergence. Specifically, we concentrate on the text processing used to index web documents. We present indexing algorithms for both GPU and CPU and show that the GPU outperforms the CPU on two common workloads. We argue that a medium-sized GPU-enabled cluster would be able to index all internet documents in one day. Indexing of web documents on the GPU opens a new area for GPU computing. Companies that provide search services spend a lot of cycles on indexing. Faster and more energy-efficient indexing on the GPU may provide a valuable alternative to the CPU-only clusters used today.
BIDMach is an open-source library for GPU-accelerated machine learning. BIDMach on a single GPU node exceeds the performance of all other tools (including cluster systems on hundreds of nodes) for the most common machine learning tasks. BIDMach is an easy-to-use, interactive environment similar to SciPy/Matlab, but with qualitatively higher performance. The session will discuss: Performance - BIDMach follows a "LAPACK" philosophy of building high-level algorithms on fast low-level routines (like BLAS), exploiting the unique hardware features of GPUs to provide more than order-of-magnitude gains over alternatives. Accuracy - Markov chain Monte Carlo (MCMC) methods are the most general way to derive models, but are slow; we have developed a new approach to MCMC which provides two orders of magnitude of speedup beyond the hardware gains, and our "cooled" MCMC is fast and improves model accuracy. Interactivity - we are developing interactive modeling/visualization capabilities in BIDMach to allow analysts to guide, correct, and improve models in real time.
The Research Computing and Cyberinfrastructure (RCC) Unit at The Pennsylvania State University (PSU) has a strong commitment to GPU enabled research, and is currently a CUDA Research Center. The main GPU cluster Lion-GA consists of Tesla M2070 and M2090 GPU cards, and newer devices including the K20 are available for interactive use. Lion-GA itself is capable of delivering above thirty Teraflops during peak usage, and delivered almost twenty GPU-years in 2012. This presentation will detail experiences in establishing GPU enriched teaching and research at Penn State, covering a broad range of topics including benchmarking, administration, and high level code development.
In today's fast changing business environment, companies are looking for ways to deliver better designs faster and cheaper while creating high quality products across an ecosystem of partners. To succeed, a company must transform its design processes by converting engineering silos into shared engineering clouds that improve collaboration, standardize processes and create a secure environment for sharing designs across operations and organizations including partners and suppliers. The 3D Engineering Cloud Solution is a high performance visual computing environment for organizations that have large 3D intensive graphics requirements and want to improve collaboration while protecting their assets and reducing costs. The 3D Engineering Cloud Solution is made possible due to a partnership between IBM, Citrix, and NVIDIA. This combination creates a unique 3D engineering environment in the Cloud.
Learn how Mechdyne leverages video compression and streaming to create remote collaboration solutions, connecting CAVEs, Powerwalls, and other ultra-resolution displays to enable multi-site, multi-display sharing and decision making. We will explore multiple customer use cases: immersive-to-immersive, desktop-to-immersive, immersive-to-desktop, monoscopic, and stereoscopic.
With GPUs, large-scale plasma simulations can run at frames-per-second speeds. We present interactive, in-GPU rendering of large-scale particle-in-cell simulations running on GPU clusters. The user can choose which data is visualized and change the direction of view while the simulation is running. A remote visualization client can connect to the running simulation, allowing for live visualization even when bandwidth is limited.
CFD calculations in an industrial context prioritize fast turn-around times - a requirement that can be addressed by porting parts of the CFD calculation to the GPU, leading to a hybrid CPU/GPU approach. In a first step, the GPU library Culises has been developed, allowing the GPU-based solution of large-scale linear systems of equations that are in turn set up by MPI-parallelized CFD codes (e.g. OpenFOAM) on the CPU. In this session we will address a second step, which consists of porting the construction of the linear system to the GPU as well, while pre- and post-processing remain on the CPU. Aiming at industrial applications in the automotive sector, the approach is aligned with the simpleFOAM solver of OpenFOAM. As the setup of the linear system consumes up to 40-50% of the computational time in typical automotive cases, this approach can further increase the acceleration of CFD computations.
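As a rough illustration of what porting the construction of the linear system can mean for a face-addressed finite-volume matrix such as OpenFOAM's lduMatrix (the coefficient formula and naming below are assumptions, not Culises code):

    // One thread per interior face adds its diffusive-flux contribution to
    // the owner and neighbour rows; several faces touch each cell, so the
    // diagonal is accumulated atomically.
    __global__ void assemble_faces(const int* owner, const int* neigh,
                                   const float* faceCoeff, int nFaces,
                                   float* diag,   // one entry per cell
                                   float* lower, float* upper) // per face
    {
        int f = blockIdx.x * blockDim.x + threadIdx.x;
        if (f >= nFaces) return;

        float a = faceCoeff[f];        // e.g. diffusivity * area / distance
        upper[f] = -a;                 // off-diagonal, owner row
        lower[f] = -a;                 // off-diagonal, neighbour row
        atomicAdd(&diag[owner[f]], a);
        atomicAdd(&diag[neigh[f]], a);
    }

Keeping assembly on the GPU also avoids shipping the matrix across PCIe each outer iteration, which is where much of the additional speedup comes from.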
There is a growing need in internal combustion (IC) engine design to resolve complicated combustion kinetics in simulations. Without more predictive simulation tools in the design cycle, development costs will overwhelm new concepts as it becomes harder to meet the performance and emission targets of the future. The combustion kinetics of real transportation fuels involve thousands of components, each of which can react through thousands of intermediate species and tens of thousands of reaction paths. GPUs show promise for delivering more physical accuracy (per dollar) to the IC engine design process. Specifically, a GPU acceleration of nearly a factor of ten is demonstrated for the integration of multiple chemical source terms in a reacting fluid dynamics simulation. This speedup is achieved by reorganizing the thermodynamics and chemical reaction functions and by updating the sparse matrix functions using NVIDIA's latest GLU library.
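The key structural point is that chemistry is local to each cell, so cells integrate independently; the sketch below uses naive explicit sub-stepping and a placeholder one-reaction rate function purely for illustration (real stiff kinetics require implicit solvers, which is where the sparse matrix work enters):

    #define NSPEC 8

    __device__ void reaction_rates(const float* y, float T, float* wdot)
    {
        // Placeholder: one Arrhenius reaction A -> B between species 0 and 1.
        float k = 1.0e3f * expf(-2000.0f / T);
        for (int s = 0; s < NSPEC; ++s) wdot[s] = 0.0f;
        wdot[0] = -k * y[0];
        wdot[1] =  k * y[0];
    }

    // One thread integrates one cell's species over the flow time step.
    __global__ void integrate_chemistry(float* Y,       // nCells x NSPEC
                                        const float* T, // cell temperatures
                                        int nCells, float dt, int nSub)
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= nCells) return;

        float y[NSPEC], wdot[NSPEC];
        for (int s = 0; s < NSPEC; ++s) y[s] = Y[c * NSPEC + s];

        float h = dt / nSub;
        for (int step = 0; step < nSub; ++step) {
            reaction_rates(y, T[c], wdot);  // species production rates
            for (int s = 0; s < NSPEC; ++s) y[s] += h * wdot[s];
        }
        for (int s = 0; s < NSPEC; ++s) Y[c * NSPEC + s] = y[s];
    }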
Multi-scale molecular dynamics of systems of nanomagnets is investigated by numerical simulation using parallel algorithms. The Fortran code Magnetodynamics-F supports the following types of research: studying the possibility of regulating the switching time of the magnetic moment of a nanostructure; estimating the role of nanocrystal geometry in the super-radiance of 1-, 2-, and 3-dimensional objects; studying the magnetodynamics of nanodots inductively coupled to a passive resonator; and studying the dependence of the solution on the initial orientation of the magnetic moment in order to find the configurations for which super-radiance and radiative damping are maximal. The parallel programs were created using the OpenMP and OpenACC application programming interfaces. Estimates of the speedup and efficiency of the implemented algorithms in comparison with sequential algorithms have been obtained. It is shown that the use of NVIDIA Tesla GPUs accelerates simulations for the study of magnetic dynamics in systems which include thousands of magnetic nanoparticles.