GTC ON-DEMAND

Aerospace and Defense
Abstract:
We present two algorithms specifically designed to detect geospatial objects in geospatial images accurately. Combining these two algorithms with deep learning, we have achieved detection accuracy over 99% for vehicles, positional accuracy within 6 pixels, orientation accuracy within 10 degrees, and a false positive rate of 0.001% on 7.5cm GSD aerial images. In essence, our algorithms bring the learning capability of deep learning into template image matching for geospatial intelligence. They reduce the false positive rate by an order of magnitude over a softmax classifier. With over 99% accuracy, we believe this may be a game changer in the geospatial intelligence domain.
 
Topics:
Aerospace and Defense, Big Data Analytics, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6260
 
Abstract:
Cyberspace is a critical domain for government and commercial organizations. It is about networks, devices, and how they interact; graphs model nodes, links, and how they are connected. Defending critical networks in cyberspace requires processing and analyzing extremely large quantities of graph data in near-real time. Key cyber analytics and data sets, from topological vulnerability analysis and traffic flow analysis to network attack graphs, are graphs. This session will discuss how Blazegraph GPU meets this challenge by delivering near-real-time performance at very large data scales, using a flexible and updatable graph representation to support complex analytics, and supporting existing graph frameworks (RDF, TinkerPop) and query languages (SPARQL).
 
Topics:
Aerospace and Defense, Big Data Analytics, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6337
 
Abstract:
The Air Force Research Laboratory Information Directorate Advanced Computing and Communications Division is developing a new GPU-based computing architecture, designed to provide a high-performance embedded computing (HPEC) pod solution for the real-time processing demands of operational and tactical intelligence, surveillance, and reconnaissance (ISR) missions. This newly designed system, Agile Condor, is a scalable HPEC system based on open industry standards that will increase computational capability far beyond the current state of the art, within the restrictive size, weight, and power constraints of unmanned aircraft systems' external "pod" payloads.
 
Topics:
Aerospace and Defense, Intelligent Machines, IoT & Robotics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6292
Algorithms & Numerical Techniques
Abstract:
Attendees will learn how to use a low-rank update in a linear solver during a nonlinear process -- for example, in linear programming, structural mechanics, and circuit simulation. A GPU-friendly version is proposed, based mainly on BLAS2 operations. Compared to traditional approaches, BLAS2 operations let us hide instruction latency well and achieve the full bandwidth of a many-core processor. In this talk, we describe the basic idea of the low-rank update and show up to 5x speedup, supported by complexity analysis.
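For reference, low-rank updates of this kind typically rest on the Sherman-Morrison-Woodbury identity (our reading of the setup; the talk's exact formulation may differ). For a rank-$k$ update $A + UV^T$ with $U, V \in \mathbb{R}^{n \times k}$:

    $(A + UV^T)^{-1}b = A^{-1}b - A^{-1}U\,(I_k + V^T A^{-1} U)^{-1}\,V^T A^{-1}b$

For small $k$, applying this requires only matrix-vector products against the existing factorization plus one tiny $k \times k$ solve, which is why the GPU version can be built almost entirely from BLAS2 operations.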
 
Topics:
Algorithms & Numerical Techniques, Computer-Aided Engineering
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6129
 
Abstract:
In this talk, we report algorithmic and instruction-level optimizations used in uDeviceX, a CUDA particle simulator for biomedical microfluidic devices. First, an FMA-intense random number generator (RNG) is proposed that exploits the chaotic logistic map. This RNG takes advantage of the higher FP-to-integer instruction throughput ratio of CUDA GPUs to generate a large number of high-quality random streams in situ. Second, warp votes and shared memory are used to consolidate workload from diverging warps. Last, inline PTX is used to emulate 24-bit integer arithmetic with its floating-point counterparts to increase throughput. An implementation using C++ templates ensures that no type-casting overhead is incurred and guards the technique from unintentional usage.
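As a flavor of the RNG idea, here is a minimal sketch of a logistic-map step in the fully chaotic regime (r = 4); this is our illustration, and the talk's actual generator, constants, and stream decorrelation are more involved:

    // One chaotic logistic-map step, x <- 4x(1 - x), with x kept in (0, 1).
    // The core x - x^2 maps onto a single fused multiply-add.
    __device__ float logistic_step(float &x) {
        float t = __fmaf_rn(-x, x, x);   // t = x - x^2
        x = 4.0f * t;                    // x = 4x(1 - x)
        return x;
    }

Each thread carries its own state, so streams are generated in situ with no table lookups, trading integer work for the GPU's higher floating-point throughput.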
 
Topics:
Algorithms & Numerical Techniques, Computational Biology & Chemistry, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6140
 
Abstract:
Learn how to use (multiple) GPUs and CUDA to speed up ranking the importance of each node in a large-scale network. You'll see how to solve an extraordinary challenge, the exact computation of betweenness centrality, using relatively simple building blocks, like breadth-first search, that have been highly tuned for latest-generation GPU cards. Our approach is fully scalable and overcomes the limit on the size of graph that can be studied on a single GPU. We'll present results obtained on both synthetic and real-world graphs.
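For context, the breadth-first search building block can be sketched as a level-synchronous kernel over a CSR graph (an illustrative baseline, not the tuned version the talk presents):

    // Expand all vertices whose distance equals the current level.
    __global__ void bfs_level(const int *rowPtr, const int *colIdx,
                              int *dist, int level, int n, int *changed) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n || dist[v] != level) return;
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            if (dist[u] == -1) {       // unvisited
                dist[u] = level + 1;   // benign race: all writers store the same value
                *changed = 1;
            }
        }
    }

The host initializes dist to -1 (0 at the source) and relaunches the kernel with level = 0, 1, 2, ... until no vertex changes.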
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6157
 
Abstract:
Contemporary microprocessors use relaxed memory consistency models to allow aggressive optimizations in hardware. This enhancement in performance comes at the cost of design complexity and verification effort. In particular, verifying an execution of a program against its system's memory consistency model is an NP-complete problem. This session improves upon existing work by introducing an algorithm that not only reduces the time complexity of the verification process, but also facilitates the development of parallel algorithms for solving these problems. For large tests of interest, our GPU implementation achieves an average application speedup of 26x over existing techniques in use at NVIDIA.
 
Topics:
Algorithms & Numerical Techniques, Big Data Analytics, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6180
 
Abstract:
There is more to atomicCAS than the double-precision atomicAdd loop in the programming guide, which is just one instance of the universal atomic-operation loop it represents. We'll show how to build shared-memory hash-function loops that solve different counting and grouping problems at warp and block level. Variations of this loop can be used to count unique elements in a block, find threads sharing common data elements, or speed up histogram building for large numbers of bins. With atomic operations on shared memory now implemented natively on Maxwell, these functions can be significantly faster than algorithms optimized for other architectures.
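A minimal sketch of the pattern (our illustration, with a hypothetical table size and hash; the talk's warp- and block-level variants are more refined): an open-addressing table in shared memory that counts duplicate keys.

    #define TABLE_SIZE 1024u        // power of two; one table per block
    #define EMPTY 0xFFFFFFFFu       // sentinel; keys must never equal it

    // slots[] and counts[] live in __shared__ memory, initialized by the
    // block to EMPTY and 0 before any thread calls this.
    __device__ void hash_count(unsigned key, unsigned *slots, unsigned *counts) {
        unsigned h = (key * 2654435761u) & (TABLE_SIZE - 1u);
        for (;;) {
            unsigned prev = atomicCAS(&slots[h], EMPTY, key);
            if (prev == EMPTY || prev == key) {  // slot claimed, or key already there
                atomicAdd(&counts[h], 1u);
                return;
            }
            h = (h + 1u) & (TABLE_SIZE - 1u);    // linear probing on collision
        }
    }

On Maxwell, the atomicCAS and atomicAdd on shared memory execute natively, which is exactly where the speedup described above comes from.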
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6220
 
Abstract:
Learn about a new hierarchical matrix structure for fast linear algebra computations on GPUs! Recursion, tree traversal, hierarchical data layout, and batched kernel executions are some of the ingredients of a new HPC recipe for computing challenging linear algebra operations and solving large scientific problems (e.g., spatial statistics) on GPUs. By exploiting low-rank matrix representations, the original dense matrix of the problem can be approximated, saving memory footprint and reducing algorithmic complexity while still maintaining adequate solution accuracy. In addition, the talk showcases a new high-performance hierarchical symmetric eigensolver and SVD that draw the full horsepower out of multiple GPUs.
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6230
 
Abstract:
Markov decision processes (MDPs) have been used in real-world path planning, where environment information is incomplete or dynamic. The problem with the MDP formalism is that its state space grows exponentially with the number of domain variables, and its inference methods grow with the number of available actions. To overcome this, we formulate an MDP solver in terms of matrix multiplications, based on the value iteration algorithm; thus we can take advantage of GPUs to produce obstacle-free paths interactively, in the form of an optimal policy. We'll present a performance analysis of our technique on Jetson TK1, CPU, and GPU platforms. Our algorithm achieves a 90x speedup on GPUs and a 30x speedup on the Jetson TK1 compared with its multithreaded CPU version.
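To make the formulation concrete, here is a minimal value-iteration step in this matrix form (hypothetical data layouts; the actual solver expresses the inner loops as batched matrix multiplications):

    // Q(s,a) = R(s,a) + gamma * sum_s' P_a(s,s') V(s');  V'(s) = max_a Q(s,a).
    // P holds A row-major S x S matrices; the inner loop over sp is a
    // matrix-vector product, which is what maps onto GPU GEMV/GEMM.
    __global__ void value_iter_step(const float *P, const float *R,
                                    const float *V, float *Vnew,
                                    int S, int A, float gamma) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= S) return;
        float best = -1.0e30f;   // effectively -infinity for bounded rewards
        for (int a = 0; a < A; ++a) {
            const float *row = P + ((size_t)a * S + s) * S;
            float ev = 0.0f;
            for (int sp = 0; sp < S; ++sp) ev += row[sp] * V[sp];
            best = fmaxf(best, R[s * A + a] + gamma * ev);
        }
        Vnew[s] = best;
    }

Iterating until the largest change in V falls below a tolerance yields the optimal policy by taking the argmax action per state.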
 
Topics:
Algorithms & Numerical Techniques, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6268
 
Abstract:
We'll present an overview of the internals of the XMP multiple-precision library, take a detailed look at the low-level algorithms used for modular squaring and modular multiplication on Kepler, and present novel algorithms for Maxwell. Modular multiplication is a performance-critical primitive, widely used in cryptographic algorithms from prime testing and factorization to public/private key algorithms such as RSA, Diffie-Hellman, and digital signatures.
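For context, the performance-critical inner step in such libraries is usually Montgomery multiplication. A single-word sketch (illustrative only; XMP works on multi-word operands with very different register-level code) for an odd modulus n < 2^31 and nprime = -n^{-1} mod 2^32, with operands already in Montgomery form:

    // Returns a * b * 2^-32 mod n (the Montgomery product).
    __device__ unsigned montmul(unsigned a, unsigned b,
                                unsigned n, unsigned nprime) {
        unsigned long long t = (unsigned long long)a * b;
        unsigned m = (unsigned)t * nprime;              // m = t * n' mod 2^32
        unsigned long long u = (t + (unsigned long long)m * n) >> 32;
        return (unsigned)(u >= n ? u - n : u);          // one conditional subtract
    }

The n < 2^31 bound keeps the 64-bit accumulation from overflowing; real multiple-precision code tracks carries explicitly instead.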
 
Topics:
Algorithms & Numerical Techniques, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6349
 
Abstract:
Learn how the world's most powerful quantum computers are simulated and benchmarked using GPU-based Monte Carlo algorithms. We'll introduce D-Wave's quantum annealing platform, describe several Monte Carlo algorithms for simulating it, and compare CPU- and GPU-based implementations of these algorithms. In particular, we'll focus on considerations of memory layout and fast mathematical functions to maximize speed. Finally, we'll present benchmarking results covering CPU-based algorithms, GPU-based algorithms, and D-Wave's latest-generation quantum annealers.
 
Topics:
Algorithms & Numerical Techniques, Computational Physics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6380
 
Abstract:
Sparse matrix factorization is a fundamental tool in scientific computing and has been shown to be well accelerated using GPUs. Yet applying the full capability of the GPU to the factorization operation remains a challenge. This talk covers the latest GPU optimizations that have been applied to the Cholesky factorization algorithm within the well-known SuiteSparse/CHOLMOD linear solver. These optimizations include new NVIDIA CUDA versions of BLAS and LAPACK routines to accelerate operations on batches of small, non-uniformly sized matrices, hybrid computing enhancements, support for multi-GPU acceleration, and further avoidance of PCIe communication through refinements to the sub-tree algorithm.
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6387
 
Abstract:
We'll present several methods for detecting pairs of vectors that are at Hamming distance 1, an important subproblem in cell graph construction for motion planning in a space with obstacles. We'll begin with a naive square-time solution that simply compares pairs of vectors, proceed through dedicated search trees, and move toward an optimal linear algorithm. Sequential linear-time algorithms for the problem were already known, but due to the high constants hidden in their complexity functions, they turn out to be inefficient for real-life data. Our GPU-based massively parallel solution promises acceptable execution times, opening dynamic cell graph construction to real-time applications like robotics and optimal path searching.
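The naive square-time starting point is easy to state; a baseline sketch for 64-bit binary vectors:

    // Count pairs of vectors at Hamming distance exactly 1.
    __global__ void hamming1_pairs(const unsigned long long *vecs, int n,
                                   int *pairCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        for (int j = i + 1; j < n; ++j)
            if (__popcll(vecs[i] ^ vecs[j]) == 1)   // differ in exactly one bit
                atomicAdd(pairCount, 1);
    }

The tree-based and linear-time methods in the talk avoid the O(n^2) comparisons this kernel performs.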
 
Topics:
Algorithms & Numerical Techniques, Tools & Libraries, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6402
 
Abstract:
Matching is a fundamental graph problem with numerous applications in science and engineering. This talk discusses the efficient implementation of half-approximate weighted matching on GPUs. We start by describing the Suitor algorithm, currently considered the best algorithm for this problem, and identifying its key implementation challenges. In its basic formulation, the Suitor algorithm appears poorly suited to GPUs, due to its irregular memory accesses and use of locks. We proceed by introducing four variants of the algorithm that progressively address these challenges by exploiting Kepler's hardware features. We demonstrate that the final implementation outperforms the previous best GPU matching algorithms, and the Suitor algorithm on CPUs, by several times.
 
Topics:
Algorithms & Numerical Techniques, Big Data Analytics, Aerospace and Defense
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6423
 
Abstract:
We'll present graphs as powerful tools for analyzing complex relationships between entities, and share how many structures commonly found in computer science, like social networks, computer networks, and the World Wide Web, can be modeled as graphs. Since many real graphs are very large and complex, the associated analysis algorithms must be very efficient and highly parallel. We present two implementations of a key graph-based analysis, triangle enumeration, for two different parallel paradigms: GPU programming and Apache Spark. We'll compare the performance of the two implementations as the characteristics of the graph change.
 
Topics:
Algorithms & Numerical Techniques, Tools & Libraries, Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6424
 
Abstract:
We'll present a sparse direct method, a multifrontal QR factorization designed specifically for GPU accelerators. Our approach relies on a bucket scheduler that exploits irregular parallelism both at coarse grain, among a set of fronts with different characteristics, and at fine grain, through the staircase shape of these fronts. The scheduler then relies on dense GPU kernels whose design and implementation target recent GPU architectures.
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6439
 
Abstract:
Most GPU data structures must be rebuilt (often on the CPU) any time they are modified. We'll examine the challenges of building and maintaining mutable data structures on the GPU, and will present our solution for one particular data structure: the quotient filter. A quotient filter is used for performing fast database queries, similar to a Bloom filter. We describe our search for an efficient parallelization of construction, insertion, and query operations on the quotient filter data structure. We show that this data structure can outperform a Bloom filter for database lookups and insertions, while also providing much greater flexibility.
 
Topics:
Algorithms & Numerical Techniques, Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6464
 
Abstract:
We'll present a CUDA implementation of an algorithm to test the chordality of graphs, which uses parallel partition refinement with pivots. A graph is chordal if each cycle of size greater than three has a chord, that is, an edge between two non-adjacent vertices on the cycle. In total, the algorithm takes O(N) time on an N-thread grid and performs O(N+M) work for graphs of N vertices and M edges. We'll compare the performance achieved by the CUDA implementation on an NVIDIA GeForce GTX TITAN X against a sequential implementation on a four-core (eight-thread) CPU, presenting test results for cliques, sparse graphs, dense graphs, and random chordal graphs.
 
Topics:
Algorithms & Numerical Techniques, Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6489
 
Abstract:
Learn techniques for efficient batched computations on GPUs, where small and independent computations must be grouped and executed together to obtain high performance. These problems occur very frequently in scientific applications like machine learning, data mining, dense and sparse solvers, high-order FEM, astrophysics, and more. We'll consider the development of batched computations for these applications, stressing innovative GPU techniques and algorithms for uniform as well as variable-size batches, tensor contractions, batched BLAS, and more. Batched computations can fill up the GPU with work and remove scheduling overheads and costly CPU-GPU communications, often accelerating the computation by an order of magnitude compared to non-batched approaches.
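For uniform-size batches, cuBLAS already exposes this pattern directly; a minimal sketch of one batched call (the talk goes further, covering variable sizes and tensor contractions):

    #include <cublas_v2.h>

    // C[i] = A[i] * B[i] for batchCount small m x n x k problems in one launch.
    // dA, dB, dC are device arrays of device pointers; storage is column-major.
    void batched_gemm(cublasHandle_t handle, const float **dA, const float **dB,
                      float **dC, int m, int n, int k, int batchCount) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                           &alpha, dA, m, dB, k, &beta, dC, m, batchCount);
    }

A single launch over the whole batch is what keeps the GPU full when each individual matrix is far too small to saturate it.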
 
Topics:
Algorithms & Numerical Techniques, Tools & Libraries, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6509
 
Abstract:
For Sierra, a pre-exascale CORAL supercomputer arriving at Lawrence Livermore National Lab in 2017, neutral-particle transport codes will be a primary application, and ensuring their peak performance on this system (multiple IBM POWER9 CPUs + multiple Volta GPUs per node) is important. In preparation, transport mini-apps like Kripke are being optimized on today's hybrid CPU-GPU clusters using different programming models. This talk discusses performance issues encountered by Kripke on these systems and their solutions. Specifically, we'll focus on: a) a novel implementation of the sweep algorithm; b) techniques useful for modeling physical problems whose memory footprint exceeds the aggregated GPU memory; and c) porting Kripke using OpenMP 4.
 
Topics:
Algorithms & Numerical Techniques, Computational Physics, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6513
 
Abstract:
Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often use a sort instead. However, sort does more work than necessary to implement multisplit, and is thus inefficient. In this work, we provide a parallel model and multiple implementations for the multisplit problem, with a focus on a small number of buckets. In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations. We use warp-synchronous programming models as well as hierarchical reordering of input elements to achieve better performance.
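As a flavor of the warp-synchronous building blocks involved, here is a two-bucket, warp-level split offset computed with ballot and popcount (our sketch, using the pre-Volta __ballot intrinsic current in 2016; the paper's multisplit generalizes to more buckets and adds hierarchical reordering):

    // Returns this lane's position after packing bucket-0 elements first,
    // then bucket-1 elements, each group keeping lane order.
    __device__ int warp_split_offset(bool inBucket1) {
        int lane = threadIdx.x & 31;
        unsigned mask = __ballot(inBucket1);              // one predicate bit per lane
        int onesBefore = __popc(mask & ((1u << lane) - 1u));
        int zerosTotal = 32 - __popc(mask);
        return inBucket1 ? zerosTotal + onesBefore        // after all bucket-0 items
                         : lane - onesBefore;             // among bucket-0 items
    }

Because everything stays in registers, a warp reorders its 32 elements without touching shared or global memory.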
 
Topics:
Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6517
 
Abstract:
We present a parallel Monte Carlo (MCX) algorithm accelerated by GPUs for modeling time-resolved photon migration in 3-D turbid media. We'll present optimizations that benefit execution on a single GPU as well as multiple GPUs. By leveraging persistent threads, our single-GPU implementation provides a high-performance parallel simulation of MCX when run on an NVIDIA GPU, and is automatically tuned to leverage persistent threads for different GPU architectures. We achieved improvements of over 25% on Kepler and 12% on Maxwell compared with a heuristic approach. In addition, we propose a linear programming approach based on predictive modeling to optimize MCX execution across multiple devices.
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization, Rendering & Ray Tracing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6635
 
Abstract:
Reducing training time allows us to learn from our experiments more quickly and make new innovations based on what we've learned. Using fewer than the standard 32 bits to represent a number can help reduce training times. We'll talk about how to use 16-bit floating point, which is starting to have wide hardware support with the release of Pascal. Unfortunately, naively converting all datatypes from 32 to 16 bits doesn't work, as training stability and accuracy are compromised. We'll discuss the reasons for these difficulties and their solutions. Finally, we'll show performance and scalability improvements from using reduced precision.
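One common remedy in this area (our illustration under that assumption, not necessarily the talk's full recipe) is to keep FP16 for storage while accumulating in FP32:

    #include <cuda_fp16.h>

    // y = a*x + y with half-precision storage and single-precision math:
    // convert up, accumulate in FP32, round back to FP16 exactly once.
    __global__ void axpy_fp16(int n, float a, const __half *x, __half *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float acc = fmaf(a, __half2float(x[i]), __half2float(y[i]));
            y[i] = __float2half(acc);
        }
    }

Storing in 16 bits halves memory traffic and footprint, while the FP32 accumulation avoids much of the precision loss that breaks naive all-FP16 training.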
 
Topics:
Algorithms & Numerical Techniques, Artificial Intelligence and Deep Learning, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6661
 
Abstract:
We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model for machine learning, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute, making better use of coalesced memory accesses. From this table, complete partial sums are computed on the fly during a binary search. Measurements using an NVIDIA TITAN Black GPU show that for a sufficiently large number of clusters or topics (K > 200), this technique alone more than doubles the speed of a latent Dirichlet allocation (LDA) application already highly tuned for GPU execution.
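For contrast, the conventional approach that the butterfly-patterned table improves on is a binary search over a completed table of inclusive partial sums (a baseline sketch):

    // Given prefix[0..K-1], the inclusive prefix sums of relative probabilities,
    // and u uniform in [0, prefix[K-1]), return the sampled index.
    __device__ int sample_discrete(const float *prefix, int K, float u) {
        int lo = 0, hi = K - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (prefix[mid] > u) hi = mid;   // answer is at mid or earlier
            else lo = mid + 1;
        }
        return lo;
    }

The paper's contribution is to avoid materializing prefix[] in full, completing the needed partial sums on the fly from the butterfly-patterned table during the search.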
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6665
 
Abstract:
We describe two new classes of algorithm for a "splittable" pseudorandom number generator (PRNG) that is quite fast: either 9 or 11 64-bit arithmetic/logical operations per 64 bits generated. A splittable PRNG provides a "split" operation that creates a new PRNG that is computationally and statistically independent of its creator and therefore may be used in parallel. Splittable PRNG objects make it easy to organize the use of pseudorandom numbers in multithreaded programs where the number of threads may vary dynamically, but also have sufficient speed and quality to be useful when the number of threads is fixed. It is faster than MRG32k3a and of higher quality than XORWOW. No locking or synchronization is required, and the algorithm is quite suitable for SIMD or GPU implementation.
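The flavor of a splittable generator can be sketched in a SplitMix64-like style (public-domain mixing constants; an illustration of the concept, not necessarily either of the talk's two algorithm classes):

    // Each generator is (seed, gamma) with gamma odd; next() mixes the
    // advancing seed, and split() derives an independent child generator.
    struct SplitMix64 {
        unsigned long long seed, gamma;

        __host__ __device__ unsigned long long next() {
            unsigned long long z = (seed += gamma);
            z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
            z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
            return z ^ (z >> 31);
        }
        __host__ __device__ SplitMix64 split() {
            unsigned long long s = next();
            unsigned long long g = next() | 1ULL;   // child gamma must be odd
            return SplitMix64{s, g};
        }
    };

Because split() needs no locks or shared state, each new thread can simply be handed its own generator, which is what makes the scheme attractive for SIMD and GPU code.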
 
Topics:
Algorithms & Numerical Techniques, Tools & Libraries, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6666
 
Abstract:
In this work, we show the connection between two problems: halo finding and heavy hitters. Finding haloes, dense clumps of matter, in the output of cosmological simulations is crucial for verifying theoretical models against observation. Current algorithms require loading the full dataset into memory, making the computation infeasible on a desktop machine. We reduce the halo-finding problem to finding the most frequent items (heavy hitters) in streaming data, and apply two algorithms: Pick-and-Drop and Count Sketch. These algorithms can find the top 1,000 largest haloes with logarithmic memory usage, but their time performance is poor. GPU acceleration makes it possible to make several passes in reasonable time, helping to find more haloes in the future.
 
Topics:
Algorithms & Numerical Techniques, Astronomy & Astrophysics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6671
 
Abstract:
We study and implement the Augmented Block Cimmino Distributed (ABCD) algorithm on the GPU. Because of the special structure of tridiagonal matrices, we investigate a boundary padding technique to eliminate execution branches on the GPU for better performance. In addition, our implementation incorporates various performance optimization techniques, such as memory coalescing, to further enhance performance.
 
Topics:
Algorithms & Numerical Techniques, HPC and Supercomputing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6120
 
Abstract:
In many areas, from physics to economics to the social sciences, there are problems that can be mapped to stochastic cellular automata (SCA). In combination with machine learning techniques, cellular automata with learned rules can be used to efficiently predict real-world systems. In physics, they are used to study atomistically the size and shape evolution of micro- and nanostructures, providing insights into processes of self-organization crucial to today's nanotechnology. We present an extremely efficient SCA implementation of a surface growth model using bit-vectorization enhanced by non-local encoding on the GPU. The employed technique and non-local encoding can be transferred to other applications.
 
Topics:
Algorithms & Numerical Techniques, Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6124
 
Abstract:
LZW is a popular lossless compression method used in the UNIX file compression utility "compress" and in the GIF/TIFF image formats. However, it is very hard to parallelize, because it builds a dictionary sequentially while reading the input data item by item. The main contribution of this work is a fully parallelized LZW decompression, which assigns each thread to an input compressed code and converts it into the corresponding original input string. We have implemented our fully parallelized LZW decompression using CUDA. The experimental results show that our CUDA implementation on a GeForce GTX 980 attains a 40x speedup over a sequential implementation on an Intel Core i7-4790. We also show that our LZW decompression is useful for big data and deep learning applications.
 
Topics:
Algorithms & Numerical Techniques, Video & Image Processing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6128
 
Abstract:
We show how to accelerate sparse matrix-vector multiplication (SpMV) on the GPU by greatly reducing memory traffic. SpMV is a dominant kernel in many sparse algorithms; its performance is limited by memory bandwidth, and the low locality of memory accesses to the input vector causes performance degradation. We propose a new sparse matrix format that alleviates these memory-bound problems through adaptive multi-level blocking and compression of the matrix indices. Performance evaluations of SpMV on 40 matrix datasets show speedups of up to 2.91x, and 1.81x on average, compared with NVIDIA's cuSPARSE library. We also find that the memory traffic in SpMV can be estimated, and that SpMV performance strongly depends on it.
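For reference, the scalar CSR baseline that such formats are measured against makes both bottlenecks visible:

    // One thread per row: val/colIdx are streamed (bandwidth-bound), while
    // x[colIdx[e]] is a gather whose poor locality blocked formats target.
    __global__ void spmv_csr(const int *rowPtr, const int *colIdx,
                             const float *val, const float *x, float *y, int n) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;
        float sum = 0.0f;
        for (int e = rowPtr[row]; e < rowPtr[row + 1]; ++e)
            sum += val[e] * x[colIdx[e]];
        y[row] = sum;
    }

Blocking improves reuse of x, and index compression shrinks the colIdx stream itself, which is presumably why the traffic estimate tracks the observed speedup.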
 
Topics:
Algorithms & Numerical Techniques, HPC and Supercomputing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6132
 
Abstract:
We present a high performance hierarchical matrix vector multiplication using hardware accelerators. By properly mapping the tree structures to the GPU and overlapping the phases of the computation using streams, we greatly outperform the CPU implementations and achieve up to 80% of the sustained bandwidth of the GPU.
 
Topics:
Algorithms & Numerical Techniques, HPC and Supercomputing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6140
 
Abstract:
Algorithms for isosurface extraction from volumetric data have become crucial in the petroleum industry, medicine, and many other fields in recent years. They are computationally intensive, especially for large, high-resolution domains. Our GPU implementation of the Marching Tetrahedra algorithm is not only extremely fast but also allows us to split the domain across multiple GPUs. Processing of large domains is now a matter of seconds; for smaller domains, the algorithm computes the isosurface in milliseconds and the resulting model is visualized in real time.
 
Topics:
Algorithms & Numerical Techniques, Medical Imaging & Radiology
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6141
 
Abstract:
We describe work done at the Oxford e-Research Centre (OeRC) at Oxford University toward accelerating one of the most demanding computational tasks of the real-time pulsar signal processing pipeline of the world's largest next-generation radio telescope, the Square Kilometre Array (SKA). We introduce the problem of pulsar acceleration searches and a Fourier-domain computational method used for detecting signals from accelerated pulsars. A GPU implementation and optimization results are presented in the context of the SKA timing requirements. This work is part of Astro-Accelerate, a real-time time-domain data processing library currently under development at the OeRC.
 
Topics:
Algorithms & Numerical Techniques, Astronomy & Astrophysics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6227
 
Abstract:
We present an accelerated implementation of the Faddeev-Leverrier algorithm (FLA) for the eigenvalue problem. The algorithm, being recursive in nature, cannot be directly extended to a parallel implementation. Instead, a hybrid model is implemented to harness the combined computing power of the CPU and GPU more effectively.
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6230
 
Abstract:
This work focuses on input processing of big data streams in a GPU-accelerated in-memory OLAP (MOLAP) database by Jedox. We present a solution that supports fast insertion of high data volumes by avoiding the compute-expensive task of multidimensional sorting during the actual insertion phase. The main processing step achieves a significant speedup over the existing CPU-only version.
 
Topics:
Algorithms & Numerical Techniques, Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6256
 
Abstract:
The Fast Fourier Transform (FFT) is one of the most important numerical tools, widely used in many scientific and engineering applications. The algorithm performs O(N log N) operations on N input data points, even when only a small number k of output coefficients are large and the remaining N-k are zero or negligibly small. The algorithm is clearly inefficient when N input points lead to only k << N non-zero coefficients in the transformed domain. The sparse FFT (sFFT) algorithm provides a solution to this problem. In this poster, we present a parallel sFFT algorithm on GPUs using CUDA. Our CUDA-based sFFT, cusFFT, performs over 10x faster than the state-of-the-art cuFFT library on GPUs and over 28x faster than parallel FFTW on multicore CPUs.
 
Topics:
Algorithms & Numerical Techniques
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6261
 
Abstract:
We focus on a performance-portable C++ implementation of a scientific algorithm, using a single code base that is easily executable on both CPU and GPU. For that purpose, we present our core algorithm -- the fast multipole method -- embedded in a stack of abstraction layers, allowing us to achieve portability without maintaining separate kernels for each architecture. In addition, we'll review common implementation pitfalls that might help other developers aiming at a unified code base; in particular, memory allocation, memory access, and the abstraction of SIMT for complex user-defined data structures are investigated. Finally, we present results and comparisons of the performance on a CPU and GPU.
 
Topics:
Algorithms & Numerical Techniques, HPC and Supercomputing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6265
 
Abstract:
We propose a new parallel algorithm for solving the all-pairs shortest paths (APSP) problem. The algorithm is based on Floyd-Warshall and therefore inherits some of its advantages, such as predictable performance regardless of the underlying graph structure. It was efficiently implemented on a machine with a many-core GPU, which is less expensive than a cluster of computers. The tests were performed on a Tesla C2075 graphics card. The implementation identified the shortest paths among all pairs of vertices of randomly generated graphs (each with at most 8,192 vertices) in less than 15 seconds, a 150x speedup over the sequential Floyd-Warshall algorithm.
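The parallelization follows from the structure of Floyd-Warshall itself: with the pivot k fixed, all n^2 relaxations are independent (a minimal sketch for non-negative weights; tiled variants exploit shared memory further):

    // One phase of Floyd-Warshall for a fixed pivot k over an n x n
    // row-major distance matrix; the host loops k = 0 .. n-1.
    __global__ void fw_phase(float *dist, int n, int k) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && j < n) {
            float alt = dist[i * n + k] + dist[k * n + j];
            if (alt < dist[i * n + j]) dist[i * n + j] = alt;
        }
    }

Row k and column k cannot improve during phase k, so the concurrent reads and writes within one launch are safe.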
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6272
 
Abstract:
Graphs play a very important role in science and technology for finding shortest distances. Large graphs are common in scientific and engineering applications, with operations over millions of vertices and edges; for faster execution of such operations, parallel computation is essential. GPUs provide high computational power at a low price, and CUDA is becoming a standard programming approach for GPGPU, allowing many threads to run in parallel on the GPU. We demonstrate a comparison between serial and parallel implementations of the BFS and Dijkstra algorithms.
 
Topics:
Algorithms & Numerical Techniques, Performance Optimization
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6285
 
Abstract:
In evolutionary methods, many processes of the same type can be executed in parallel, each connected to different source and target datasets. For this reason, these methods are well suited to SIMD architectures. This poster shows an evolutionary framework in which evolutionary algorithms can be developed for GPUs and CPUs. The "Implemented Method" section of this poster is the foundation of this methodology and allows for the creation of more advanced forecasting.
 
Topics:
Algorithms & Numerical Techniques, Artificial Intelligence and Deep Learning
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6286
 
Abstract:
GPUs have been applied successfully in many scientific computing realms and have great potential in power system applications. N-1 static security analysis (SSA) is one candidate application, in which massive numbers of alternating current power flow (ACPF) problems must be solved. However, when existing GPU-accelerated algorithms are applied to the N-1 SSA problem, the degree of parallelism is limited, because existing research has been devoted to accelerating the solution of a single ACPF. This work proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of parallelism. Compared with its CPU counterpart on a Xeon E5-2620, the GPU method solving SSA on a Tesla K20c achieves up to a 57.6x speedup.
 
Topics:
Algorithms & Numerical Techniques, Other
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6109
 
Abstract:
We present our implementation of cross-validated best subset selection for linear regression. This algorithm is the latest GPU-enabled feature made available in our statistical solution XLSTAT. It is based on the binary tree regressions first proposed by Furnival & Wilson and is implemented through a QR factorization and subsequent updates of the R matrix using the cuSOLVER library. The last step of our model selection is a leave-one-out cross-validation test.
 
Topics:
Algorithms & Numerical Techniques, Other
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6194
 
Abstract:
Propagating interfaces occur in a wide variety of fields, including fluid mechanics and computer graphics. The distance field from an interface can be calculated by solving the Eikonal equation at each node using the Fast Sweeping Method (FSM) [Zhao, 2004]. However, parallelization of FSM is not straightforward. We propose a parallel algorithm using Cuthill-McKee ordering that is suitable for massively threaded architectures. Here, we implement and compare different parallel algorithms for FSM using CUDA, OpenACC, and MPI. The best performance is achieved using CUDA with the parallel algorithm of Detrixhe et al., whereas a comparable speedup was achieved using OpenACC with just a few directives, substantially shortening the development cycle.
 
Topics:
Algorithms & Numerical Techniques, Other
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6257
 
Abstract:
This poster presents an efficient GPU implementation of four neighborhood operators that are commonly applied in the local search of many metaheuristics for permutation-based problems, such as the Traveling Salesman Problem and the Single Row Facility Layout Problem. Although many optimization problems have been solved through GPU parallelization in the last few years, the authors are not aware of a thorough analysis of the neighborhood moves themselves. Therefore, we evaluate the neighborhood operators rather than analyzing a specific metaheuristic. The parallel approach achieved good results compared with the CPU version, reaching speedups ranging from 14x to 68x.
 
Topics:
Algorithms & Numerical Techniques, Other
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6273
Artificial Intelligence and Deep Learning
Abstract:
Learn the latest deep learning techniques for semantic modeling of images, text, and knowledge graphs, all empowered by GPU computing and cloud services. We'll demonstrate how to build deep semantic models across different modalities, and how to apply these models to reach the best results in information retrieval, question answering, and image captioning benchmarks. In particular, facilitated by the recently announced Microsoft Azure GPU compute instances, we'll show how to use GPU clusters to extend the MSR image captioning system, which won first prize in the COCO Captioning Challenge at CVPR 2015, and to build a publicly available, large-scale, deep image understanding service that achieves state-of-the-art performance in generating novel captions for images.
 
Topics:
Artificial Intelligence and Deep Learning, Data Center & Cloud Infrastructure, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6116
 
Abstract:
We'll discuss, analyze, and improve the performance of deep neural network inference using GPUs. Unlike neural net training, an offline process where large batches of images are fed to the GPU to maximize computational throughput, inference focuses on small-batch, low-latency forward propagation through the network. We'll discuss how the different performance requirements of inference affect the way we implement it on GPUs and what performance optimizations are possible, and we'll show how GPUs, from the small Tegra X1 all the way to the powerful TITAN X, excel in performance and energy efficiency when performing inference for deep neural networks.
 
Topics:
Artificial Intelligence and Deep Learning, Autonomous Vehicles, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6136
 
Abstract:
By exposing parallelism between operations in a recurrent neural network, it is possible to achieve significant performance improvements when training. In this talk, a case study based on a Long Short-Term Memory (LSTM) recurrent network will be used to demonstrate a 5x speedup over a naive implementation for the forward pass of a single layer. A further 2x speedup (totaling 10x) will be shown when considering multiple layers. Results will also be presented for the backward pass.
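One concrete example of exposing such parallelism (a sketch of a widely used optimization of this kind, not necessarily the talk's exact steps) is stacking the four gate weight matrices so a single large GEMM computes all gate pre-activations for the whole minibatch at once:

    #include <cublas_v2.h>

    // W4 stacks the i, f, o, g gate weights into one [4H x I] matrix
    // (column-major); x is [I x N] for batch size N; gates is [4H x N].
    void lstm_gate_gemm(cublasHandle_t handle, const float *W4, const float *x,
                        float *gates, int H, int I, int N) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    4 * H, N, I, &alpha, W4, 4 * H, x, I, &beta, gates, 4 * H);
    }

One big GEMM keeps the GPU far busier than four small ones, and the remaining pointwise gate math (sigmoids, tanh, elementwise products) can then be fused into a single kernel.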
 
Topics:
Artificial Intelligence and Deep Learning, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6165
 
Abstract:
Learn some advanced skills for performance optimization on Kepler GPUs. NVIDIA provides many powerful tools to analyze and improve the efficiency of CUDA kernels. However, in many specific cases, developers need to do more detailed tuning to get the expected performance. In this session, a native assembler for the Kepler architecture used at Alibaba will be introduced, along with tuning experiences from CNN and GEMM implementations built with this assembler. If you are interested in assembly-level optimization on the Kepler architecture, you shouldn't miss this session!
 
Topics:
Artificial Intelligence and Deep Learning, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6173
 
Abstract:
The goal of this session is to explain how to design a parallel Caffe framework on a GPU cluster platform to handle big data -- including three different MPI parallelization mechanisms -- and how to optimize data reading, network communication, and multi-GPU parallel efficiency.
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6242
 
Abstract:
We introduce a novel approach to fault localization in oil and gas exploration based on automated feature detection with deep learning algorithms running on GPUs. Faults are key geological structures that can serve as boundaries for hydrocarbon reservoirs. Most current techniques that tackle this problem rely on seismic images, which are the outcome of expensive computing with substantial human intervention. We'll present the latest results from a joint project by MIT and Shell International E&P Inc. on using deep learning to bypass that expensive processing and perform fault detection on the raw seismic traces. We built the system in Julia/Mocha.jl and cuDNN to solve this challenging structured output prediction problem, and show promising preliminary results.
 
Topics:
Artificial Intelligence and Deep Learning, Seismic & Geosciences
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6266
Download:
Share:
 
Abstract:
This talk will focus on the pragmatic use of high performance computing using NVIDIA GPUs and deep learning algorithms in visual, crowd, and behavioral analytics projects at Cisco. We will highlight use cases in IoT, SmartCities, retail, event analytics, and transportation. We'll also highlight our approach, architecture, and deployment models leveraging NVIDIA-docker, swarm, etc. Computing on endpoint devices, at the edge, and in the cloud in a distributed heterogeneous model in support of the applications above will also be discussed. This talk is targeted primarily towards practitioners either actively engaged in product development or seriously contemplating it. However, we will also discuss advanced use cases that may be of interest to researchers. We will not be covering core deep learning algorithms (generative/discriminative/Boltzmann/autoencoders/RNN/LSTM/...), although we will highlight our use of these algorithms.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics, Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6273
Streaming:
Download:
Share:
 
Abstract:

At CSIRO Data61, we're building the next generation of science platforms that exploit GPU computing to dramatically accelerate the time to discovery and the pace of innovation in science and industry. Scientific applications routinely generate huge amounts of data. In response to these trends, we've developed and deployed a new breed of GPU-accelerated big data technologies, earth system modeling tools, and machine learning capabilities. We'll present examples of our work in big data analytics, earth system modeling, and deep learning that clearly demonstrate the value that GPU computing can deliver to research organisations and industry. CSIRO has been at the forefront of GPU computing since 2009 and was one of the first NVIDIA CUDA Research Centers.

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Big Data Analytics, Climate, Weather & Ocean Modeling
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6281
Streaming:
Download:
Share:
 
Abstract:
Font is one of the core elements in design. In this talk we will present two technologies: one for recognizing the font in an image and another for suggesting fonts based on visual similarity. Both technologies are built upon improvements to the state of the art in deep learning. Our recognition system is trained with millions of images on NVIDIA GPUs. It is able to recognize over 7,500 fonts, achieves top-5 accuracy above 80%, and produces a good font similarity measure for font selection and suggestion. The technologies presented are the foundation of the new font similarity feature in Adobe Photoshop.  Back
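For readers who want to reproduce the quoted top-5 metric on their own models, a minimal NumPy sketch (random scores stand in for real classifier outputs):

    import numpy as np

    def top_k_accuracy(scores, labels, k=5):
        # scores: (n_samples, n_classes) classifier outputs; labels: true class ids
        topk = np.argsort(scores, axis=1)[:, -k:]       # k highest-scoring classes
        return (topk == labels[:, None]).any(axis=1).mean()

    rng = np.random.default_rng(0)
    scores = rng.standard_normal((1000, 7500))          # e.g., 7,500 font classes
    labels = rng.integers(0, 7500, size=1000)
    print(top_k_accuracy(scores, labels))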
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6293
Streaming:
Download:
Share:
 
Abstract:
Deep learning research and applications have seen numerous successes in the fields of image processing and speech recognition. In the field of natural language processing, however, it is still underutilized. This session will share the relevant technology and the development process of an intelligent customer service robot, as well as related machine learning, deep learning, and natural language processing technology. We'll also discuss the application of deep learning to natural language processing and automatic question answering systems, the role it plays in business, and how it enhances the ability to answer customer questions and boost customer satisfaction.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6321
Streaming:
Share:
 
Abstract:

Learn how deep learning can address some of the most critical problems of computational drug discovery. Historically, the field has been strongly focused on the development of drugs intended to act against one specific target with high potency and selectivity. It is now recognized that these concepts are too simplistic. At the same time, there was an unprecedented growth of chemical databases incorporating hundreds of billions of useful chemical records. Deep learning is well suited to address both of these challenges. GPU computing is the central hardware technology that allows deep learning to scale.

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Computational Biology & Chemistry, Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6348
Streaming:
Download:
Share:
 
Abstract:
Object detection in real video is more challenging than in curated image datasets. We'll present CNN-based object detection research on IQIYI's large collection of images and videos, used for content-based ad recommendation.
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6362
Streaming:
Download:
Share:
 
Abstract:
End-to-end speech recognition systems, which directly transcribe audio data to text without requiring an intermediate phonetic representation, are based on a recurrent neural network (RNN) combined with connectionist temporal classification (CTC). CTC automatically learns the alignments between speech frames and the label sequence of the transcript. In this work, we focus on optimizing CTC training, especially the forward-backward algorithm, on the GPU. First, opportunities to save computation and memory accesses in the CTC forward-backward algorithm were quantitatively analyzed and exploited, yielding a speedup of ~1.28x. Second, by reusing data among frames and transferring data between frames through the register file and shared memory, we achieve a speedup of ~1.80x.  Back
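For reference, the forward (alpha) half of the CTC forward-backward recursion being optimized can be written directly from its definition; the unoptimized NumPy version below (probability space for clarity; real implementations work in log space) is the baseline such GPU kernels restructure for data reuse:

    import numpy as np

    def ctc_forward(probs, labels, blank=0):
        # probs: (T, n_classes) per-frame softmax outputs; labels: ids, no blanks.
        # Interleave blanks: l' = [blank, l1, blank, l2, ..., blank]
        ext = [blank]
        for l in labels:
            ext += [l, blank]
        T, S = probs.shape[0], len(ext)
        alpha = np.zeros((T, S))
        alpha[0, 0] = probs[0, blank]
        alpha[0, 1] = probs[0, ext[1]]
        for t in range(1, T):                 # the serial recursion GPU kernels tile
            for s in range(S):
                a = alpha[t - 1, s]
                if s > 0:
                    a += alpha[t - 1, s - 1]
                if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                    a += alpha[t - 1, s - 2]
                alpha[t, s] = a * probs[t, ext[s]]
        return alpha[-1, -1] + alpha[-1, -2]  # end on final label or trailing blank

    rng = np.random.default_rng(0)
    y = rng.random((20, 5)); y /= y.sum(axis=1, keepdims=True)
    print(ctc_forward(y, [1, 2, 1]))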
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6383
Streaming:
Download:
Share:
 
Abstract:
Learn how deep learning can be applied to object detection, localization, and tracking problems in remote sensing. We'll present a technical case study showing how a convolutional neural network (CNN) trained in the data center using DIGITS can be deployed to an embedded GPU system to carry out low-latency object detection, classification, and tracking in high-resolution aerial imagery. We'll compare different approaches to detection and localization tasks. An example will be given of integrating the Caffe deep learning framework for GPU-accelerated CNN inference with an OpenCV-based image and video processing pipeline. We'll also show how transfer learning can be accomplished using DIGITS to train a CNN when only a small task specific training dataset is available.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6389
Streaming:
Download:
Share:
 
Abstract:
We'll describe how we created an iterative labeling process to perform data science on 100 million+ images using a GPU-powered workflow with convolutional neural networks. Recently, deep learning techniques such as deep convolutional neural networks (ConvNets) have achieved state-of-the-art results in many computer vision tasks. The data-driven nature of deep learning normally requires a large number of labeled examples to achieve high accuracies. Unfortunately, much of the publicly available data on the web is not labeled, thus requiring human labelers for large datasets or unsupervised machine learning techniques. Our labeling process allows weak labels and a small number of strong labels to be used to create classifiers for very large datasets.  Back
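The labeling loop is described only at a high level; one hedged sketch of the general pattern, assuming hypothetical train and predict_proba helpers, bootstraps from the strong labels and promotes only confident predictions each round:

    import numpy as np

    def iterative_labeling(X_strong, y_strong, X_pool, train, predict_proba,
                           threshold=0.95, rounds=5):
        # Bootstrap from a small strongly-labeled set, then repeatedly promote
        # high-confidence predictions on unlabeled data to weak labels.
        X, y, pool = X_strong.copy(), y_strong.copy(), X_pool.copy()
        model = train(X, y)
        for _ in range(rounds):
            if pool.shape[0] == 0:
                break
            proba = predict_proba(model, pool)       # (n_pool, n_classes)
            conf, pred = proba.max(axis=1), proba.argmax(axis=1)
            keep = conf >= threshold
            if not keep.any():
                break
            X = np.concatenate([X, pool[keep]])
            y = np.concatenate([y, pred[keep]])
            pool = pool[~keep]                       # shrink the unlabeled pool
            model = train(X, y)                      # retrain on the enlarged set
        return model, X, y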
 
Topics:
Artificial Intelligence and Deep Learning, Big Data Analytics, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6391
Streaming:
Download:
Share:
 
Abstract:
DeepSpark is a scalable deep learning framework for Spark-based distributed environments. It employs multiple independent Caffe-based workers that asynchronously generate model updates and a distributed parameter server that maintains a global model. The framework is designed to overcome network delays, and uses Spark's RDDs to provide fast and fault-tolerant access to the parameter server.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Data Center & Cloud Infrastructure, Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6405
Streaming:
Download:
Share:
 
Abstract:
One of the largest barriers to industrial adoption of deep learning is the time required to train models; it can take a week or more to train a high-quality deep neural network on a GPU workstation. We present FireCaffe, which trains state-of-the-art deep neural networks on a cluster of 32 GPUs with a 23x speedup over a single GPU.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Big Data Analytics, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6417
Streaming:
Share:
 
Abstract:
This presentation will describe the design of a multi-resolution 3D convolutional neural network for drivers' hand gesture recognition. The talk will include task-specific data augmentation strategies that help to achieve state-of-the-art performance on a publicly available dataset. Several aspects of multi-sensor fusion with deep neural networks will be discussed in detail.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Video & Image Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6432
Streaming:
Download:
Share:
 
Abstract:

Running deep learning inference tasks on embedded platforms often requires deployment of pretrained models. Finding the best hyper-parameters and training are usually performed on a workstation or large-scale system to obtain the best model. In this talk, we'll show through examples using frameworks how to train models on a workstation and deploy models on embedded platforms such as the NVIDIA® Jetson™ TX1 or NVIDIA Drive™ PX. We'll also show dedicated tools and how to monitor performance and debug issues on embedded platforms for easy demo setup. This talk will include a live demo session.

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics, Aerospace and Defense
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6474
Streaming:
Download:
Share:
 
Abstract:
This session goes over many of the techniques we use at Nervana in GPU programming to achieve state-of-the-art performance for deep learning networks. The main focus will be on the customization of dense linear algebra kernels: Winograd 3x3 convolution, direct convolution, and small-tile GEMM (matrix multiply). In particular, we'll look at how we achieve high utilization at very small mini-batch sizes, which is important for multi-GPU scaling and inference. In addition, we'll talk about where and how you can effectively leverage lower and mixed precision to further increase performance without loss of accuracy.  Back
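As background for the Winograd kernels mentioned, the 1-D F(2,3) transform, which computes two convolution outputs of a 3-tap filter with four multiplies instead of six, can be checked in a few lines of NumPy (transform matrices from Lavin and Gray's formulation):

    import numpy as np

    # Winograd F(2,3) transform matrices
    BT = np.array([[1, 0, -1, 0],
                   [0, 1,  1, 0],
                   [0, -1, 1, 0],
                   [0, 1,  0, -1]], dtype=float)
    G  = np.array([[1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0.0, 0.0, 1.0]])
    AT = np.array([[1, 1, 1, 0],
                   [0, 1, -1, -1]], dtype=float)

    def winograd_f23(d, g):
        # Two outputs of a valid 1-D correlation of a 4-sample tile d with a
        # 3-tap filter g, using 4 elementwise multiplies instead of 6.
        return AT @ ((G @ g) * (BT @ d))

    d = np.array([1.0, 2.0, 3.0, 4.0]); g = np.array([1.0, 0.5, -1.0])
    print(winograd_f23(d, g))                      # matches direct correlation
    print(np.correlate(d, g, mode="valid"))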
 
Topics:
Artificial Intelligence and Deep Learning, Performance Optimization, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6485
Streaming:
Download:
Share:
 
Abstract:
With differentiable forms of attention being integrated into neural networks, end-to-end training with backpropagation is possible. We adopt the recently proposed attention mechanism in spatial transformer networks (STNs) into a recurrent architecture to perform object tracking. We show that this attention mechanism has significant overlap with the mechanism in deep recurrent attentive writer (DRAW) networks, which have been successfully used to create generative models of images. We present an end-to-end trainable recurrent attention model for tracking a variety of objects in video recorded by cameras mounted on an automobile.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6497
Streaming:
Download:
Share:
 
Abstract:
Most recently, Listen, Attend and Spell (LAS) was presented to directly transcribe speech utterances to characters. Unlike traditional DNN-HMM models, these models learn all the components of a speech recognizer jointly. The LAS model has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. We'll describe a distributed asynchronous training platform for training such a model on an array of GPUs.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Signal and Audio Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6515
Streaming:
Share:
 
Abstract:
We'll outline the development of a state-of-the-art medical imaging system using novel deep architectures that harness GPUs for accelerated training. Trained using data from the Stanford Byers Eye Institute and the Palo Alto VA Hospital, our model grades the severity of eye diseases and localizes lesions to help screen eye patients in primary care. At the heart of this system lies our hybrid approach to deep learning for high-resolution images: a large convnet with millions of parameters trained with downsized images, fused with a net trained on selected tiles of the high-resolution image. This approach uses transfer learning, data augmentation, and multi-GPU systems to identify small-scale features that are critical to detecting eye diseases.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Medical Imaging & Radiology, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6516
Streaming:
Download:
Share:
 
Abstract:
A CUDA-based framework is key to applying deep learning technologies. We introduce Chainer, a Python-based, standalone, open-source framework. Attendees will learn how Chainer works and how it enables new kinds of deep learning applications. Following the success of Caffe, Torch, and Theano, the power of deep learning continues to expand beyond traditional pattern recognition tasks such as image recognition. However, the gap is rapidly widening between the complexity of newly proposed neural network models and the capabilities of existing frameworks, which have mainly been used for convolutional neural networks. Chainer enables users to intuitively implement many other kinds of models, including recurrent neural networks, with great flexibility and comparable GPU performance.  Back
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6523
Streaming:
Download:
Share:
 
Abstract:
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "Deep Compression", which reduces the number of connections of deep neural networks by an order of magnitude and the total size of the networks by 35-49x without affecting their accuracy. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x, from more than half a gigabyte to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. It also makes it possible to fit DNNs into mobile apps given their size limits.  Back
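A toy NumPy illustration of the two stages named above, magnitude pruning followed by k-means weight sharing (the sparsity and cluster count here are arbitrary placeholders, not the paper's settings):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def prune_and_quantize(W, sparsity=0.9, n_clusters=16):
        # Stage 1: magnitude pruning -- zero out the smallest |w| until
        # `sparsity` of the entries are gone. Stage 2: k-means weight sharing --
        # each survivor is replaced by one of n_clusters shared values, so it
        # can be stored as a small codebook index instead of a float.
        thresh = np.quantile(np.abs(W), sparsity)
        mask = np.abs(W) > thresh
        codebook, idx = kmeans2(W[mask].reshape(-1, 1), n_clusters, seed=0)
        Wq = np.zeros_like(W)
        Wq[mask] = codebook[idx, 0]
        return Wq, mask, codebook

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256))
    Wq, mask, codebook = prune_and_quantize(W)
    print(mask.mean(), codebook.ravel()[:4])   # ~10% of weights kept, shared values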
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6561
Streaming:
Download:
Share:
 
Abstract:
Training deep neural networks can be time consuming when searching a large hyper-parameter space. Using the Rescale optimization platform, we present a simple interface for doing parallelized hyper-parameter grid search for deep learning models from a number of different machine learning packages of a user's choice. Offered packages are pre-configured to take advantage of NVIDIA cuDNN accelerated training, allowing the user to tradeoff cost vs. training time vs. model performance. We will demo the Rescale DNN optimization system and give performance results when trained on NVIDIA GPU hardware available on Rescale. Benchmarking will be done against the MNIST and CIFAR datasets.  Back
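The grid search itself is the standard Cartesian-product pattern; a minimal sketch of what such a platform automates, with a hypothetical train_and_score placeholder standing in for an actual cuDNN-accelerated training run:

    import itertools
    from multiprocessing import Pool

    GRID = {
        "learning_rate": [0.1, 0.01, 0.001],
        "batch_size": [64, 128, 256],
        "hidden_units": [256, 512],
    }

    def expand(grid):
        # One dict per point in the Cartesian product of the grid
        keys = sorted(grid)
        for values in itertools.product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))

    def train_and_score(params):
        # Hypothetical placeholder: a real version would train a model with
        # these hyper-parameters and return its validation accuracy.
        return 1.0 / (1.0 + params["learning_rate"] * params["batch_size"])

    if __name__ == "__main__":
        points = list(expand(GRID))
        with Pool(4) as pool:                  # each worker could own one GPU
            scores = pool.map(train_and_score, points)
        best_score, best_params = max(zip(scores, points), key=lambda t: t[0])
        print(best_score, best_params)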
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6634
Streaming:
Download:
Share:
 
Abstract:
We'll demonstrate some of the design choices required to provide a distributed, in-memory, GPU-accelerated, parallel mathematics library, distributed mathematics (dMath). The library considers some of the most common functionality required for effective scaling of deep learning pipelines for a variety of recognition and understanding tasks. The core of the problem is efficient implementations of common basic linear algebra subprograms (BLAS) and specific abstractions for learning at scale.  Back
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6669
Streaming:
Download:
Share:
 
Abstract:
Training and deploying deep neural networks for speech recognition is very computationally intensive. I will discuss how we have made our training process scale efficiently to many GPUs while training, as well as how we use GPUs to take our deep neural networks to users at scale through Batch Dispatch.  Back
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6672
Streaming:
Download:
Share:
 
Abstract:
Learn a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over time, and communication between thread blocks using a deadlock-free global barrier. Our initial implementation sustains 3 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, allows us to train models with 12x more parameters on the same hardware, and allows us to strongly scale RNN training to 32 GPUs.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Performance Optimization, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6673
Streaming:
Download:
Share:
 
Abstract:
Generative adversarial networks (GANs) provide a way to generate realistic images with deep neural networks. Compared to other approaches to generative modeling, GANs are able to learn the cost function. This allows them to learn to capture important details that a fixed, manually designed cost function, such as mean squared error, would ignore. Compared to maximum likelihood estimation (MLE), GANs are specialized for the task of generating realistic samples. Both MLE and GANs are consistent statistical estimators, but have different priorities. MLE aims to assign high likelihood to all of the data, but may also assign high likelihood to other points and thus generate unrealistic samples. GANs aim to always generate realistic samples.  Back
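The contrast drawn above rests on the GAN objective; in the standard non-saturating formulation, the two losses can be written as follows (NumPy sketch over hypothetical discriminator probabilities):

    import numpy as np

    def d_loss(d_real, d_fake, eps=1e-8):
        # Discriminator loss: push D(x) -> 1 on real samples, D(G(z)) -> 0 on
        # fakes. d_real/d_fake are discriminator probabilities in (0, 1).
        return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

    def g_loss(d_fake, eps=1e-8):
        # Non-saturating generator loss: push D(G(z)) -> 1
        return -np.mean(np.log(d_fake + eps))

    rng = np.random.default_rng(0)
    print(d_loss(rng.uniform(0.6, 0.9, 64), rng.uniform(0.1, 0.4, 64)))
    print(g_loss(rng.uniform(0.1, 0.4, 64)))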
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6695
Streaming:
Download:
Share:
 
Abstract:
At Baidu, we are increasingly using deep learning algorithms to improve the quality of our services. For services deployed online to millions and millions of users, the critical factors are speed of response, reliability and cost. Traditionally, such services have been based on the CPU, which has limitations. We have successfully utilized GPUs for deep learning training and are now using GPUs for online deployment as well.  Back
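Batch Dispatch is named but not specified here; a generic sketch of the underlying idea, gathering concurrent requests into one batch before a single forward pass (run_model is a hypothetical placeholder), might look like:

    import queue
    import threading

    requests = queue.Queue()       # items are (input, reply_queue) pairs

    def run_model(batch):
        # Hypothetical placeholder for one batched forward pass on the GPU
        return [len(x) for x in batch]

    def dispatcher(max_batch=32, timeout=0.01):
        # Drain up to max_batch queued requests, waiting at most `timeout`
        # for stragglers, then answer them all with a single forward pass.
        while True:
            batch = [requests.get()]               # block for the first request
            try:
                while len(batch) < max_batch:
                    batch.append(requests.get(timeout=timeout))
            except queue.Empty:
                pass
            inputs, replies = zip(*batch)
            for reply, out in zip(replies, run_model(list(inputs))):
                reply.put(out)

    threading.Thread(target=dispatcher, daemon=True).start()
    reply = queue.Queue()
    requests.put(("hello", reply))
    print(reply.get())                             # -> 5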
 
Topics:
Artificial Intelligence and Deep Learning, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6696
Streaming:
Download:
Share:
 
Abstract:
Don't miss GTC's opening keynote address from NVIDIA CEO and co-founder Jensen Huang.
 
Topics:
Artificial Intelligence and Deep Learning, Autonomous Vehicles, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6699
Streaming:
Share:
 
Abstract:
With fewer than 500 North Atlantic right whales left in the world's oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction. To interest the data science community, NOAA Fisheries organized a competition hosted on Kaggle.com. The challenge was to automate the right whale recognition process, currently a painstaking, lengthy, manual process, using a dataset of aerial photographs of individual whales. In this session, I will outline the winning solution, based on deep learning and convolutional neural networks.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6723
Streaming:
Download:
Share:
 
Abstract:
We'll share our experience accelerating medical image processing pipelines on GPU architectures. The accelerated topics include both image processing algorithms and machine learning algorithms. We've developed a GPU-based platform for medical imaging big data applications.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Computational Biology & Chemistry, Medical Imaging & Radiology
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6731
Streaming:
Download:
Share:
 
Abstract:
The second annual Data Science Bowl was an online data science contest that took place in early 2016 and was hosted on the Kaggle platform. The objective of the contest was to develop an algorithm that could accurately estimate the volume of the left ventricle of a human heart at the points of maximum and minimum volume from a time series of multiple cross-sectional Magnetic Resonance Imaging (MRI) images of the heart. The contest provided thousands of MRI images to train an algorithm. The challenge was a natural fit for GPU-accelerated deep learning (DL). Some of the winning teams will describe their approaches. The complexities of working with sometimes messy clinical data will be discussed, and we will hear how deep learning can be applied to a time series of 3D images.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Medical Imaging & Radiology, Video & Image Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6740
Streaming:
Download:
Share:
 
Abstract:
Deep convolutional neural networks and other machine learning algorithms are both performance-critical and difficult to optimize. Our project, Cog ex Machina, provides a domain-specific embedded language (DSL) tuned for machine learning and data analysis applications, a compiler, and a runtime. The DSL, hosted on the Scala language, simplifies the process of writing accelerated applications while preserving the information the compiler needs to emit efficient accelerator code. Our compiler performs kernel fusion and common subexpression elimination, among other optimizations. Our runtime provides a simple control interface and reuses buffers in order to reduce the application's GPU global memory footprint. This session by Hewlett Packard Enterprise will outline the design and implementation.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Big Data Analytics, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6772
Streaming:
Download:
Share:
 
Abstract:
Recent advances in deep learning have all but solved speech recognition and image processing. The next frontier is natural language. Maluuba's vision is intelligent machines that can think, reason, and communicate, and we believe that language and intelligence are inextricable. We'll describe steps taken toward next-generation dialogue systems. Such systems should include the capacity to retain memory across dialogue turns and from past conversations, to clarify user intents through dynamic, back-and-forth speech, and to acquire new knowledge through interaction with humans. By taking a deep-learning approach, we have achieved state-of-the-art performance in natural language understanding and dialogue state tracking, tasks vital for any goal-driven dialogue system.  Back
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6781
Streaming:
Download:
Share:
 
Abstract:
We'll discuss Torch from a high-level perspective, discussing its usage style across the industry among deep learning giants such as Google DeepMind, Facebook AI Research, Twitter Cortex. We present the current state of Torch as a research and production framework for deep learning models, and finally we present our long term vision.  Back
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6798
Streaming:
Download:
Share:
 
Abstract:

Deep learning has enabled significant advances in supervised learning problems such as speech recognition and visual recognition. Reinforcement learning provides only a weaker supervisory signal, posing additional challenges in the form of temporal credit assignment and exploration. Nevertheless, deep reinforcement learning has already enabled learning to play Atari games from raw pixels (without access to the underlying game state) and learning certain types of visuomotor manipulation primitives. I will discuss major challenges for, as well as some preliminary promising results towards, making deep reinforcement learning applicable to real robotic problems.

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6812
Streaming:
Download:
Share:
 
Abstract:

This talk provides an overview of how Microsoft uses its open-source, distributed deep learning toolkit, CNTK, to make our products and services better. We'll show how you can use CNTK to train deep learning models of almost any topology and scale out to many GPUs. We'll review some of the challenges arising in scaling out deep learning workloads and CNTK's way of solving them.

  Back
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6843
Streaming:
Download:
Share:
 
Abstract:

Theano is an extremely popular framework for machine learning, providing a generalized toolset for machine learning tasks. In this session, we'll discuss the Why and the What of Theano, leaving the How for the Theano Hands-on Lab. We'll cover the high-level motivations and general philosophy behind Theano, including future direction and goals. Please join the Theano developers to learn where this tool fits into your machine learning efforts.

  Back
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6845
Streaming:
Download:
Share:
 
Abstract:

Deep Learning is delivering the future today, enabling computers to perform tasks once thought possible only in science fiction. Innovations such as autonomous vehicles, speech recognition and advances in medical imaging will transform the world as we know it. GPUs are at the core of this transformation, providing the engines that power Deep Learning. In this session, we'll discuss the software tools NVIDIA provides to unlock the power of Deep Learning on GPUs. We'll provide an overview of NVIDIA's Deep Learning Software, including cuDNN and DIGITS, and pointers to maximize your experience with Deep Learning at GTC.

  Back
 
Topics:
Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6847
Streaming:
Download:
Share:
 
Abstract:

Cray Cluster Systems have long been used to support supercomputing and scientific applications. In this talk we'll demonstrate how these same systems can be easily configured to support Docker and, subsequently, various machine learning software packages, including NVIDIA's DIGITS software. Additionally, these systems can be configured so that their Docker containers pull data from Cray's Sonexion scale-out Lustre storage system. With this configuration, our systems offer maximum application flexibility through Docker while simultaneously supporting the high-performance storage requirements of many types of machine learning workloads through a connection to our Lustre ecosystem.

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Tools & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6848
Streaming:
Share:
 
Abstract:

This talk will describe how to develop and deploy deep learning applications efficiently and easily using MXNet. MXNet is a new deep learning framework developed by collaborators from over 10 institutes. It is designed for both flexibility and optimized performance, with easy-to-use interfaces currently in seven programming languages, including Python, Scala, and R. We will discuss the technologies to scale the framework out to distributed clouds, ranging from EC2, Azure, and GCE to Spark clusters, as well as memory optimizations to fit into embedded systems like mobile phones. Finally, we'll demonstrate deep learning applications in computer vision, natural language processing, and speech recognition.

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Programming Languages, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6853
Streaming:
Share:
 
Abstract:

You will learn how neural networks with memory and attention mechanisms allow for state-of-the-art question answering. Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering. We describe the dynamic memory network (DMN), which uses both of these mechanisms to achieve state-of-the-art performance on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset. We demonstrate how attention mechanisms allow for improved inspection of deep learning models, helping to understand the evidence behind specific decisions. The techniques discussed are applicable to a wide range of tasks, helping to improve both the accuracy and interpretability of the resulting models.

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6861
Streaming:
Download:
Share:
 
Abstract:

Caffe is an open framework for deep learning that equips researchers and engineers with state-of-the-art tools and models. Caffe and its community provide an open source library, reference models, and do-it-yourself examples. We'll highlight scientific and industrial usage of Caffe, talk about recent changes in the latest roast, and discuss future directions. At present the framework has 150+ contributors, 1,000+ citations, and 5,000+ forks.

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6869
Streaming:
Download:
Share:
 
Abstract:
This talk provides a brief overview of deep learning research and the challenges involved in scaling it up across multi-GPU and multi-machine clusters while providing software that is flexible enough for research settings. We discuss the clear trends that are emerging in deep learning from an HPC perspective and cover several examples from our work at Facebook AI Research.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Press-Suggested Sessions: AI & Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6227
Streaming:
Download:
Share:
 
Abstract:
We'll discuss DNN applications for determining the main facial skin biomarkers from a face photo. While many other factors make it possible to determine human age with high accuracy, the most obvious factor is how your face looks. Tracking face wrinkles enables us to track not only the skin ageing process as such, but also the results and efficiency of the treatment used. By following the dynamics of wrinkle appearance, it is possible to find out which treatment is more suitable for a particular face or skin type and hence provide recommendations.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Press-Suggested Sessions: AI & Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6272
Streaming:
Download:
Share:
 
Abstract:
We are surrounded by sounds and acoustics that describe the world we live in. This audio information is reflected in the content of videos, providing unique characteristics as well as complementary cues to images and text. We'll present our ongoing research on deriving information from environmental sound recordings and audio from city-location web videos utilizing GPU-based recurrent neural networks. These methods and findings can be applied to multimedia content analysis, robotics, and IoT.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Big Data Analytics, Press-Suggested Sessions: AI & Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6443
Streaming:
Download:
Share:
 
Abstract:
Twitter is a unique source of real-time information, offering amazing opportunities for automatic content understanding. The format of this content is diverse (tweets, photos, videos, music, hyperlinks, follow graph, ...), the distribution of topics is ever-changing (on a weekly, daily, or sometimes hourly basis), and the volume is ever-growing, making it very challenging to automatically and continuously expose relevant content. Particularly exciting is the rise of live streaming video, and the constraints that come with it. In this talk we present the work done at Twitter Cortex to tackle live video classification and rich media representation, and the technological choices we made to bridge the gap between deep learning research and fast-paced product development.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Press-Suggested Sessions: AI & Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6482
Streaming:
Download:
Share:
 
Abstract:
We'll introduce how deep learning helps realize a real-world visual search and recognition system. This topic has been studied for decades and became very hot again in recent years, mainly due to the rapid development of deep learning and large-scale search techniques. Many preliminary visual search and recognition products are available to the public. However, have we solved all the big technical and non-technical challenges? Has ImageNet solved the recognition problem? What are the key factors in realizing a real-world visual recognition/search system? Are semantic gaps still there? Which direction is visual search/recognition heading? What is still missing? We'll discuss all of these questions based on a real-world, deep learning-based visual search and recognition system.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Press-Suggested Sessions: AI & Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6570
Streaming:
Share:
 
Abstract:
Deep learning is driving significant advances in what computers can achieve, and this talk describes Google's efforts at scaling it up. The scaling is happening in two directions: better software that can leverage the power of many fast processors to make advances in machine learning, and making machine learning part of every product we use to make it smarter. We'll talk about TensorFlow, the platform behind our efforts at Google, and how, as an open source project, it brings this same power to everyone.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Data Center & Cloud Infrastructure, Press-Suggested Sessions: AI & Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6805
Streaming:
Share:
 
Abstract:

Yahoo's Hadoop clusters consist of 35,000 servers and store hundreds of petabytes of structured and non-structured data. Recently, we introduced a new capability -- deep learning -- into Hadoop clusters and developed software solutions, like CaffeOnSpark, to conduct distributed deep learning easily. We also expanded those clusters with GPU nodes and InfiniBand connectivity, which brought 10x faster connectivity. Yahoo's Flickr teams, for example, have since made significant improvements to image recognition accuracy by training the framework with millions of photos. CaffeOnSpark was recently open sourced at github.com/yahoo/CaffeOnSpark under the Apache 2.0 License. Built upon the deep learning framework Caffe and the big data framework Apache Spark, CaffeOnSpark supports neural network model training, testing, and feature extraction on a cluster of GPU and CPU servers. Caffe users can use their existing LMDB data files and network configurations. CaffeOnSpark's DataFrame-style API enables deep learning to be invoked along with non-deep-learning and SQL analysis in a single program. In this talk, we will share Yahoo's experience with distributed deep learning and provide a technical overview of CaffeOnSpark (including a demo on AWS EC2).

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Press-Suggested Sessions: AI & Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6836
Streaming:
Download:
Share:
 
Abstract:

Highlighting the key role GPUs will play in creating systems that understand data in human-like ways, Rob High, IBM Watson's Chief Technology Officer, will discuss how cognitive computing helps doctors, lawyers, marketers and others glean key insights by analyzing large volumes of data.

  Back
 
Topics:
Artificial Intelligence and Deep Learning, Press-Suggested Sessions: General Interest
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6901
Streaming:
Download:
Share:
 
Abstract:
Construct a model to generate treatment predictions to optimize patient outcomes by using the information gleaned from over 10,000 patients who passed through the Pediatric Intensive Care Unit at Children's Hospital Los Angeles over more than 10 years. This is accomplished by converting unstructured, non-uniformly sampled patient information into a structured data representation that resembles an image -- here referred to as a "patient snapshot." These patient snapshots elegantly enable convolutional neural networks to efficiently generate a basis.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Medical Imaging & Radiology
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6102
Download:
Share:
 
Abstract:
Performing multiple tasks like recognition and regression normally requires multiple networks, and the computational cost grows in proportion to the number of tasks. Although heterogeneous learning can perform multiple tasks in a single network, the performance on each task is worse than when the tasks are trained individually. We propose a new heterogeneous learning method with a weighted loss function. We apply the method to facial analysis, which contains five heterogeneous tasks (gender estimation, race detection, facial point detection, age estimation, and smile degree estimation). Even with a single network, the performance is comparable to networks trained for a single task. The computation takes 22 ms on a GTX 980, five times faster than using five separate single-task networks.  Back
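The weighted loss function is not spelled out in the poster text; a generic weighted multi-task combination, with hypothetical task names, loss values, and weights, is simply:

    def weighted_multitask_loss(losses, weights):
        # Combine per-task losses into one scalar so a single shared network
        # can be trained on heterogeneous tasks at once
        return sum(weights[t] * losses[t] for t in losses)

    # hypothetical loss values and weights for the five facial-analysis tasks
    losses  = {"gender": 0.35, "race": 0.42, "points": 1.10, "age": 2.30, "smile": 0.88}
    weights = {"gender": 1.0, "race": 1.0, "points": 0.5, "age": 0.25, "smile": 1.0}
    print(weighted_multitask_loss(losses, weights))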
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6139
Download:
Share:
 
Abstract:
Avito.ru is the biggest online classified advertising platform in Russia. The more buyers we attract, the more attractive the site becomes for swindlers uploading prohibited content. With all this growth we've met many challenges related to user content and its quality. At a certain point, scaling up manual validation of all incoming items became unrealistic. Having a tremendous amount of data, we started implementing machine learning approaches, at first with text models only. We then found that deep learning and GPU computing let us use both image and text models for quality improvement.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6162
Download:
Share:
 
Abstract:
The LINE mobile messenger sticker shop accepts new stickers designed by independent artists on a daily basis. To assist users in browsing the stickers, a list of similar stickers is recommended by collaborative filtering based on user purchase history. A major drawback of this approach, known as the cold start problem, is that new items cannot be recommended because they have no purchase records. To address this issue, we recommend stickers based on image content by learning the visual similarity between stickers using deep learning on GPUs. We trained a convolutional neural network to learn semantic features, which we use for recommending visually similar stickers. We measure the relevance of different recommendation schemes and verify the effectiveness of the proposed approach.  Back
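A minimal version of the content-based step, assuming CNN feature vectors have already been extracted (one row per sticker), ranks candidates by cosine similarity:

    import numpy as np

    def most_similar(features, query_idx, k=5):
        # Rank items by cosine similarity of their CNN feature vectors
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        sims = f @ f[query_idx]
        order = np.argsort(-sims)
        return order[order != query_idx][:k]   # drop the query itself

    rng = np.random.default_rng(0)
    feats = rng.standard_normal((1000, 256))   # e.g., one fc-layer vector per sticker
    print(most_similar(feats, query_idx=42))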
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6244
Download:
Share:
 
Abstract:
Biological neural networks are complex, plastic, and unique to every individual. These qualities pose great challenges in classifying highly dynamic patterns across neural activity for time-sensitive medical applications. In this project, we use deep learning to identify oscillatory activation markers that can differentiate between two working memory tasks. Training on multiple NVIDIA GeForce GTX TITAN GPUs enables us to overcome computational challenges for use in clinical and medical research applications. This poster presents our first step towards classifying deep-temporal whole-brain neural network activation patterns using GPU-accelerated deep learning systems with convolutional neural networks.  Back
 
Topics:
Artificial Intelligence and Deep Learning, Medical Imaging & Radiology
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6245
Download:
Share:
 
Abstract:
Deep neural networks (DNNs) have become a popular foundation for state-of-the-art automatic speech recognition systems. We have collected and transcribed more than 2,000 hours of speech data and used it to train DNNs for acoustic models and voice activity detector models. Implementation techniques for efficient speech DNN training on GPUs are explained, and several evaluation results and training times are shown for different amounts of training data and different DNN sizes. The trained DNNs are deployed in our ASR services in Japan.
 
Topics:
Artificial Intelligence and Deep Learning, Signal and Audio Processing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6295
 
Abstract:
Deep learning is an evolving area of research in machine learning, and UTC has adopted it to solve various problems in aerospace and building systems. The use cases highlighted include sensor diagnostics from onboard sensors on aircraft engines, as well as energy estimation and health monitoring of building systems. GPUs provide the computational horsepower to tackle the huge amount of data generated by these sensors. Existing methods for extracting relevant information have been largely replaced by deep learning techniques that map the problem onto large neural networks.
 
Topics:
Artificial Intelligence and Deep Learning, Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6315
 
Abstract:
We consider the problem of super-resolution using convolutional networks. Previous work has shown the advantages of using convolutional networks to improve the quality of image upscaling. Unlike previous solutions, our method incorporates the image upsampling within the network structure. To achieve this goal, we propose a so-called Muxout layer that increases the size of image features by combining them in groups. The system structure is motivated by an interpretation of convolutional networks as adaptive filters and by classic interpolation theory. We use this interpretation to propose specialized initialization methods that are convenient for training deep structures. Our tests show state-of-the-art quality, high performance, and the ability for unsupervised learning of text images.
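The Muxout layer is only described at a high level here; as a hedged illustration (an interpretation, not the authors' definition), a layer that grows feature maps by combining channels in groups can be sketched as a depth-to-space shuffle in which each group of r*r channels becomes an r-times-larger spatial tile:

    // Illustrative grouped upsampling: rearrange [C*r*r, H, W] feature maps
    // into [C, H*r, W*r], so channel groups supply the sub-pixel detail.
    __global__ void depth_to_space(const float* in, float* out,
                                   int c, int h, int w, int r)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= c * h * r * w * r) return;
        int x  = idx % (w * r);
        int y  = (idx / (w * r)) % (h * r);
        int ch = idx / (w * r * h * r);
        // the source channel encodes the sub-pixel position (y % r, x % r)
        int src_c = ch * r * r + (y % r) * r + (x % r);
        out[idx] = in[(src_c * h + y / r) * w + x / r];
    }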
 
Topics:
Artificial Intelligence and Deep Learning, Video & Image Processing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6324
 
Abstract:
With fewer than 500 North Atlantic right whales left in the world's oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction. To interest the data science community, NOAA Fisheries has organized a competition hosted on Kaggle.com. The challenge was to automate the right whale recognition process, currently a painstaking and lengthy manual process, using a dataset of aerial photographs of individual whales. In the poster, we outline the winning solution, which is based on deep learning and convolutional neural networks.
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6325
 
Abstract:
Deep convolutional neural networks are widely accepted as the state-of-the-art solution for various computer vision problems. They commonly face a trade-off between network complexity and over-fitting, which can be addressed by increasing the number of training examples, at the cost of a lengthy training process; moreover, more training examples may not even be available. Recent research suggests that this hurdle can be surmounted by taking pre-trained complex networks and fine-tuning them to fit specific datasets. We show that this approach allows for record-breaking performance on tasks ranging from natural image classification to handwritten character recognition. This is made possible by using high-performance NVIDIA GPUs in conjunction with the NVIDIA DIGITS training system.
 
Topics:
Artificial Intelligence and Deep Learning, Computer Vision
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6327
 
Abstract:
Eyeriss is an energy-efficient deep convolutional neural network (CNN) accelerator that supports state-of-the-art CNNs, which have many layers, millions of filter weights, and varying shapes (filter sizes, number of filters, and channels). The test chip features a spatial array of 168 processing elements (PE) fed by a reconfigurable multicast on-chip network that handles many shapes and minimizes data movement by exploiting data reuse. Data gating and compression are used to reduce energy consumption. The chip has been fully integrated with the Caffe deep learning framework, and can run the convolutions in AlexNet at 35 fps with 278 mW power consumption.
 
Topics:
Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6354
Astronomy & Astrophysics
Abstract:
Have you heard about the world's biggest eye? Learn how GPUs help design major, multimillion-dollar optical instruments for the European Extremely Large Telescope. Starting from the mathematical model up to the high-performance implementation on distributed-memory systems with hardware accelerators, we'll explain how the resulting dense linear algebra operations, combined with an efficient task-based programming model, help design the next generation of telescope instruments.
 
Topics:
Astronomy & Astrophysics, Algorithms & Numerical Techniques, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6229
 
Abstract:
Learn how GPUs are used to shape the light on extreme-diameter telescopes. By providing the means to process, in real time, large-scale images from wavefront sensors, GPUs are revolutionizing adaptive optics, an instrumental technique used to compensate fast-evolving aberrations in optical systems. We'll show how GPUs are used to power the real-time controllers of these systems, providing millions of commands per second to deformable mirrors so as to stabilize the image quality at the output of a large telescope. The first results of the Green Flash project, a large-scale European initiative aimed at prototyping real-time controllers for the European Extremely Large Telescope, will be presented and illustrated with preliminary data obtained in the lab.
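The controllers described here must turn each wavefront-sensor frame into deformable-mirror commands within a fraction of a millisecond, and the core of many such real-time controllers is one large matrix-vector multiply. As a hedged sketch (the matrix names and control law are illustrative assumptions, not the Green Flash design), cuBLAS can carry that step:

    #include <cublas_v2.h>

    // Illustrative AO step: commands = G * slopes, with G the precomputed
    // control matrix (nAct x nSlopes, column-major) resident on the GPU.
    void ao_step(cublasHandle_t h, const float* d_G, const float* d_slopes,
                 float* d_commands, int nAct, int nSlopes)
    {
        const float one = 1.f, zero = 0.f;
        cublasSgemv(h, CUBLAS_OP_N, nAct, nSlopes,
                    &one, d_G, nAct, d_slopes, 1, &zero, d_commands, 1);
    }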
 
Topics:
Astronomy & Astrophysics, Signal and Audio Processing, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6236
 
Abstract:
We've developed the first three-dimensional, self-consistent kinetic plasma model that runs on NVIDIA GPUs using CUDA. The model self-consistently solves the motion of charged particles and their associated electromagnetic fields. We use this model to explore the microphysics of plasma interactions with solar system objects, to understand fundamental kinetic processes of plasma, and to meet NASA's requirements for planetary and space exploration.
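The abstract doesn't detail the numerics; as a minimal hedged sketch of the particle-push step at the heart of any kinetic plasma code, assuming fields have already been gathered to the particle positions (production codes typically use a Boris rotation rather than this plain explicit update):

    // Illustrative Lorentz-force push: dv/dt = (q/m)(E + v x B),
    // explicit Euler for brevity; one thread per particle.
    __global__ void push_particles(float3* pos, float3* vel,
                                   const float3* E, const float3* B,
                                   float qm, float dt, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float3 v = vel[i];
        float3 a;
        a.x = qm * (E[i].x + v.y * B[i].z - v.z * B[i].y);
        a.y = qm * (E[i].y + v.z * B[i].x - v.x * B[i].z);
        a.z = qm * (E[i].z + v.x * B[i].y - v.y * B[i].x);
        v.x += a.x * dt; v.y += a.y * dt; v.z += a.z * dt;
        vel[i] = v;
        pos[i].x += v.x * dt; pos[i].y += v.y * dt; pos[i].z += v.z * dt;
    }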
 
Topics:
Astronomy & Astrophysics, Algorithms & Numerical Techniques, Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6265
 
Abstract:
We'll describe how we can accelerate one of the most demanding computational tasks of the real-time pulsar signal processing pipeline of the world's largest next-generation radio telescope, the Square Kilometre Array (SKA). We'll explain the scientific goals and importance of pulsar searches, along with the technical challenges facing pulsar signal processing on the SKA. Pulsar acceleration searches will be introduced, and an overview of a Fourier-domain method for recovering signal power from binary accelerated pulsars will be given. We'll then present our GPU implementation of this method, discuss techniques used for optimisation, show comparative computational performance results, and consider performance projections with future GPU technology.
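A Fourier-domain acceleration search is essentially a bank of matched filters applied in the frequency domain; as an illustrative sketch (not the SKA pipeline's code), the per-template step is a conjugate complex multiply followed by an inverse FFT with cuFFT:

    #include <cufft.h>

    // Multiply the spectrum of the dedispersed time series by the conjugate
    // of one acceleration template (a frequency-domain matched filter).
    __global__ void mul_conj(const cufftComplex* spec, const cufftComplex* tmpl,
                             cufftComplex* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        cufftComplex a = spec[i], b = tmpl[i];
        out[i].x = a.x * b.x + a.y * b.y;   // a * conj(b)
        out[i].y = a.y * b.x - a.x * b.y;
    }
    // Host side, per template (plan from cufftPlan1d(&plan, n, CUFFT_C2C, 1)):
    //   mul_conj<<<blocks, threads>>>(d_spec, d_tmpl, d_work, n);
    //   cufftExecC2C(plan, d_work, d_work, CUFFT_INVERSE);  // recover power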
 
Topics:
Astronomy & Astrophysics, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6412
 
Abstract:
We'll present Bifrost, a lightweight new framework designed to ease the development and deployment of pipeline applications that demand sustained peak utilization of network, CPU, and GPU resources under soft real-time constraints. Such applications are common in experimental science and computer vision, where processing must keep up with acquisition systems to avoid data loss. Bifrost enables operations to be wrapped in a simple task container with metadata-rich inputs and outputs. By connecting tasks together, complex branching pipelines can be constructed, with asynchronous communication handled by efficient ring buffers in host or device memory. We'll demonstrate Bifrost using a high-performance radio astronomy application that has been deployed as part of the LEDA project.
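Bifrost's actual ring-buffer API isn't spelled out here; as a generic hedged sketch of the underlying pattern (all names are illustrative), pinned host slots plus an asynchronous stream let data transport overlap GPU work:

    #include <cuda_runtime.h>
    #define SLOTS 4

    // Generic double-buffered ring (not Bifrost's API): a producer fills
    // pinned slots while cudaMemcpyAsync feeds the device on a stream, so
    // the copy of block i+1 can overlap the processing of block i.
    void stream_blocks(float* d_buf, size_t bytes, int nblocks)
    {
        float* ring[SLOTS];
        cudaStream_t s;
        cudaStreamCreate(&s);
        for (int i = 0; i < SLOTS; ++i)
            cudaHostAlloc((void**)&ring[i], bytes, cudaHostAllocDefault);
        for (int b = 0; b < nblocks; ++b) {
            float* slot = ring[b % SLOTS];  // producer fills this slot
            cudaMemcpyAsync(d_buf, slot, bytes, cudaMemcpyHostToDevice, s);
            // launch the processing kernel on stream s here; copies and
            // kernels on one stream serialize, so a real ring tracks
            // read/write cursors with events to overlap fully
        }
        cudaStreamSynchronize(s);
        for (int i = 0; i < SLOTS; ++i) cudaFreeHost(ring[i]);
        cudaStreamDestroy(s);
    }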
 
Topics:
Astronomy & Astrophysics, Tools & Libraries, Signal and Audio Processing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6627
 
Abstract:
This talk will present designs and performance results for a highly parallel Tegra X1-based compute platform being developed as part of a next-generation radio telescope. The MeerKAT radio telescope is currently under construction in the semi-desert Karoo region of Southern Africa. We'll present ongoing work on novel computing technologies that deliver a large-scale computational platform within the strict confines of power, space, and emission in force at this remote site. Using the Tegra X1 as a building block, a rugged, oil-cooled platform has been developed that will power the imager lying at the heart of the compute challenge. This is a follow-on to an initial exploration presented in 2015.
 
Topics:
Astronomy & Astrophysics, Intelligent Machines, IoT & Robotics, Press-Suggested Sessions: HPC & Science
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6692
 
Abstract:
The photometry measured by spacecraft during space missions provides important information about planetary surface composition and properties, such as the roughness that influences the photometry. The model by B. Hapke has been one of the most widely used models for fitting photometric data, but it presents drawbacks. We present a GPU-accelerated technique that simulates the photometry produced on large-scale rough surfaces as the interaction of millions of light rays. Reflectance values measured in the laboratory from real samples are used in the simulation. To prove the validity of the approach, a comparison with the Hapke model is proposed. This is a first step toward relating real laboratory measurements to the photometry of solar system surfaces observed by past and future missions.
 
Topics:
Astronomy & Astrophysics, Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6134
 
Abstract:
Over 70% of the stars in our galaxy are in binary systems. Because of their interaction, the masses of these stars can be found using Newton's and Kepler's laws, which allows astronomers to use these systems to study the properties and processes of stars and galaxies. Among the many types of binary stars observed, contact systems are the most interesting because they exhibit mass transfer, which changes the evolution of both stars. But due to the lack of precise observational data and the large time scale of this process, the mass transfer is only partially understood. In this work, a model was built to give astronomers a method for gaining deeper knowledge of, and visual intuition for, how mass transfer between binary stars takes place.
 
Topics:
Astronomy & Astrophysics, Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6197
 
Abstract:
Our Moon is no ordinary satellite! It is too large to be a captured asteroid. Could it be a twin planet formed alongside Earth as our solar system was being created? Or perhaps a captured rocky planet, forced to light our night and give lovers inspiration? Though this is romantic, the true answer is thought to be much more violent. We believe the Moon was born from a violent encounter between two young proto-planets. This giant impact hypothesis (GIH) is the leading theory for the formation of our Moon, but it has recently been questioned because simulations of the GIH leave the Earth-Moon system with excess angular momentum. In this work, we show how to remove the excess angular momentum from giant impact simulations while preserving all the desired results from previous giant impact studies.
 
Topics:
Astronomy & Astrophysics, Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6200
 
Abstract:
A mini-array of ASTRI SST-2M Cherenkov telescopes will soon be deployed at a remote site, far away from human activity, to achieve optimal observation conditions for gamma-ray astronomy. In such a scenario, the capability of each telescope to process its own data before sending it to a central acquisition system provides a key advantage. We implemented the complete analysis chain required by a single telescope on a Jetson TK1 development board, exceeding the required real-time processing speed by more than a factor of two while staying within a very small power budget.
 
Topics:
Astronomy & Astrophysics, Intelligent Machines, IoT & Robotics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6233
 
Abstract:
We show the results of implementing OpenACC in a non-uniform diffusion time-integration Fortran code. The code's application is to smooth observation-based radial magnetic field maps of the solar surface for use as inner boundary conditions of global magnetohydrodynamic simulations of the corona and heliosphere. The code uses an RKL2 super-time-stepping algorithm to allow time steps that far exceed the standard explicit stability limit. The algorithm remains explicit, making the code a prime target for OpenACC acceleration. The OpenACC implementation is discussed and speedup results are shown. The newly released OpenACC x86 feature in the PGI compiler is also tested and shown to produce multicore CPU code from the OpenACC directives that can outperform our OpenMP implementation.
 
Topics:
Astronomy & Astrophysics, Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6259
 
Abstract:
We present our implementation of a polyphase filter for real-time data processing in radio astronomy. The polyphase filter is a standard tool in digital signal processing and, as such, a well-established algorithm. We have implemented it on three generations of NVIDIA GPUs (Fermi, Kepler, Maxwell), on Intel Xeon CPUs, and on the Xeon Phi (Knights Corner) platform. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of the L1/texture cache, the second uses shared memory. We present our results in terms of the sample rate that can be processed per second.
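As a hedged sketch of the FIR front end of such a filter bank (illustrative, not the authors' code), each output channel is a dot product of taps with input samples strided by the channel count; the data reuse mentioned above comes from adjacent output samples sharing most of these inputs:

    // One block per output sample m, one thread per channel c (blockDim = C).
    __global__ void ppf_fir(const float* x,  // input samples
                            const float* h,  // T*C filter coefficients
                            float* y,        // C channels per output sample
                            int C, int T)
    {
        int c = threadIdx.x;
        int m = blockIdx.x;
        float acc = 0.f;
        // Adjacent output samples reuse T-1 of these inputs; the shared-
        // memory variant stages the overlapping window on-chip, while the
        // other variant leans on the L1/texture cache.
        for (int t = 0; t < T; ++t)
            acc += h[t * C + c] * x[(m + t) * C + c];
        y[m * C + c] = acc;  // an FFT across c completes the filter bank
    }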
 
Topics:
Astronomy & Astrophysics, Signal and Audio Processing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6281
Autonomous Vehicles
Abstract:
We'll present an innovative approach to efficiently mapping a popular pedestrian detection algorithm (HOG) onto an NVIDIA Tegra GPU. Attendees will learn new techniques for optimizing a real computer vision application on Tegra X1, as well as several new architectural features of the Tegra X1 GPU.
 
Topics:
Autonomous Vehicles, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6108
 
Abstract:
Most of today's IVI solutions try to replicate the smartphone interaction model in the car. Adopting an approach similar to smartphones will not result in differentiated solutions with a sustainable competitive advantage. More importantly, the immersive experiences typical of smartphone interaction are not suitable in a driving environment. CloudCar is proposing a new approach to delivering connected services to the car, one that brings about a new interaction model suited to driving.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6179
 
Abstract:
Modern vehicle functions like advanced driver assistance systems (ADAS) and fully autonomous driving functions have a rapidly growing demand for high-performance computing power. To fulfill the fail-operational requirements of autonomous driving functions, the next generation of the vehicle infrastructure platform has to ensure the execution of safety-critical functions with high reliability. In addition, the "always connected" feature needed for autonomous driving should be protected by powerful security mechanisms. We'll show how the requirements of ADAS can be fulfilled efficiently, on both the system and software architecture levels, using the example of automated valet parking from Elektrobit.
 
Topics:
Autonomous Vehicles
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6252
 
Abstract:
Learn how to use NVIDIA performance tools to optimize your scene graph and rendering pipeline for use in automotive software. We'll demonstrate the capabilities of these tools using simple Qt-based examples, and we'll look at some of the more common mistakes in writing efficient software and how to avoid them.
 
Topics:
Autonomous Vehicles, Performance Optimization, Real-Time Graphics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6341
 
Abstract:
A solution for vehicle integration targeting the NVIDIA Tegra Jetson Pro and DRIVE CX platforms will be presented. Communication with the vehicle via the automotive CAN bus is managed by a system that runs separately from other functions, in its own execution environment and backed by its own real-time operating system, all based on the industry-standard AUTOSAR (AUTomotive Open System ARchitecture). Learn about the various benefits this design offers versus handling CAN directly in systems like Linux, Android, or QNX.
 
Topics:
Autonomous Vehicles
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6342
 
Abstract:
Get an overview of the techniques used for Audi's Tegra 3-powered virtual cockpit, focusing on (1) reduction of start-up time, (2) driving the instrument display at 60 fps, and (3) synchronization with the infotainment main unit. Additionally, get to know the overall software structure and see how the graphical effects were implemented. The virtual cockpit is available in single-display and dual-display configurations. The single-display configuration is used for sport models, like the TT and R8, where the output of the infotainment main unit is integrated into the instrument cluster. In contrast, the dual-display configuration additionally features a "standard" main unit display.
 
Topics:
Autonomous Vehicles, Intelligent Machines, IoT & Robotics, Real-Time Graphics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6377
 
Abstract:
Learn how realistic virtual worlds can be used to train vision-based classifiers that operate in the real world, i.e., avoiding the cumbersome task of collecting ground truth by manual annotation. Many vision-based applications rely on classifiers trained with annotated data. We avoid manual annotation by using realistic computer graphics (e.g., video games). However, the accuracy of the classifiers drops because the virtual (training) and real (operation) worlds are different. We overcome the problem using domain adaptation (DA) techniques. In the context of vision-based driver assistance and autonomous driving, we present our DA experiences using classifiers based on both handcrafted features and CNNs. We show how GPUs are used in all the stages of our training and operation paradigm.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6467
 
Abstract:
This talk will describe how a single forward propagation of a neural network can give us the locations of objects of interest in an image frame, with no proposal-generation steps before running the neural network and no post-processing steps after it. The speaker will describe a fully neural detection system, implemented by Panasonic's deep learning research teams, that achieves real-time speed and state-of-the-art performance. The talk also includes a live demonstration of the system on a laptop PC with an NVIDIA GeForce GTX 970M and on a tablet with an NVIDIA Tegra K1 GPU.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6470
 
Abstract:
D(r)ive deep into crash prediction in future automotive systems that allow the tracking of dozens of objects in real time by utilizing the processing power of embedded GPUs. We'll describe (1) the new possibilities for crash prediction systems in embedded systems that are only possible by taking advantage of recent developments in embedded GPUs, and (2) the implementation and optimization of such a system on the Tegra K1 utilizing AnyDSL, a framework for rapid prototyping of domain-specific libraries that targets NVVM and CUDA.
 
Topics:
Autonomous Vehicles, Intelligent Machines, IoT & Robotics, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6490
 
Abstract:
The car presents a particular challenge for creators of learning systems: it is incredibly rich in data and context, its hardware and software environments are heterogeneous and fragmented, and drivers expect incredible precision from its interactions. CloudMade has pioneered an approach to machine learning in the automotive context that leverages the richness of car data, the emerging computational power of the car, and the existing computational power of the cloud to deliver an automotive-grade machine learning toolset. With CloudMade's solutions, automotive OEMs can deliver personalized experiences to customers that together create a self-learning car, one that anticipates the needs and desires of the user.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6565
 
Abstract:
A universal, real-time-capable implementation of a trajectory generator for highly automated vehicles, based on a nonlinear model predictive controller (NMPC), is presented. Its main target is to serve as the central instance for all high-level ADAS and automated vehicle functions, thereby abstracting vehicle-dependent kinematics and dynamics. The trajectory planner is capable of the combined optimization of lateral and longitudinal dynamics in urban, rural, and highway scenarios. One of the major challenges, besides a stable system layout, is the fast solution of the embedded optimal control problem. For this, a bespoke GPU-optimized implementation was developed; apart from the planner itself, details of this implementation will be presented.
 
Topics:
Autonomous Vehicles
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6572
 
Abstract:
This tutorial covers, for the first time, the technology, operation, and application of Quanergy's solid-state LiDAR, which is making 3D sensing ubiquitous with its low price point, no moving parts, small form factor, light weight, low power consumption, long range, high resolution, high accuracy, long lifetime, and ability to operate in various environmental conditions. GPUs are used to perform, in real time, (1) LiDAR/video data fusion for modeling and recognizing the environment around a vehicle, (2) object detection, classification, identification, and tracking, (3) scenario analysis and path planning based on deep learning, and (4) actuation of vehicle controls.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6726
 
Abstract:
At CES 2016, NVIDIA launched DRIVE PX 2 as the world's first AI supercomputer designed for autonomous vehicles. DRIVE PX 2 is a lot more than that: it is an incredible development platform for developers writing autonomous car applications, and a reference design that Tier 1s and OEMs can reuse for safety-critical ECUs targeting Level 3/4/5 autonomy (as defined by SAE International). This talk will present the under-the-hood details of what makes it an AI supercomputer, a development platform, and a reference platform for autonomous cars.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6733
 
Abstract:
Attendees will walk away with an appreciation for how modern computing power and GPUs are enabling a whole new world of map design potential for the car. Vector-based maps can render data on the fly at 60 fps, taking in-car map design to a more video-game-like state. The driving experience can be seamless across devices and tailored to exactly what a user needs for any specific use case.
 
Topics:
Autonomous Vehicles, Intelligent Machines, IoT & Robotics, Real-Time Graphics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6762
 
Abstract:
We'll present how Nauto uses deep learning in its distributed, vehicle-based compute and sensor network, along with our learnings to date. Topics will include the performance of deep learning algorithms for computer vision in embedded systems, strategies for distributing compute across networks of embedded systems and in the cloud, and collecting and labeling data to maximize the performance of the system. Nauto's system is a dual-camera, windshield-mounted dashcam with GPS, IMU, a wireless/cellular connection, and an SoC capable of running small CNNs in real time.
 
Topics:
Autonomous Vehicles, Intelligent Machines, IoT & Robotics, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6806
 
Abstract:
Robots. Supercomputers. Cars. They're all coming together. Come hear Gill Pratt, one of the world's leading figures in artificial intelligence and CEO of the Toyota Research Institute, deliver what is sure to be an enlightening presentation.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6831
 
Abstract:
In this presentation, we discuss Ford's autonomous vehicle technology, including an overview of the tasks of sensing, sensor fusion, localization and mapping, object detection, and object classification. We examine the impact of GPU hardware in achieving significant improvements in the computational efficiency of our parallelized algorithms for vehicle localization, which combine a synthetic aperture camera (derived from lidar data) with a Gaussian mixture 3D map approach. We also give an overview of preliminary results from our deep learning research in the novel area of lidar-based methods for vehicle localization and object classification.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6832
 
Abstract:
Hear the latest thinking on the maps that autonomous cars will use for highly accurate positioning. Autonomous cars need maps to function, and the most critical use of maps is centimeter-level positioning. TomTom solves this with highly accurate lane information and lateral depth maps, which we call RoadDNA. Autonomous driving and map creation have incredible synergy: mobile mapping cars go through the exact same process as autonomous cars, namely sensor perception, sensor data processing, and comparison with a stored version of reality. We process the sensor data with GPUs for fast creation of deep neural networks (DNNs) that can recognize traffic signs and other road attributes, both in-car and in the cloud. These DNNs, RoadDNA, and the sensors in the car together enable autonomous cars.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning, Big Data Analytics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6849
 
Abstract:
To fulfill the EuroNCAP requirements, an autonomous braking system has to be developed. The emergency braking system is designed to brake for pedestrians as well as in car-to-car scenarios. We'll explain how the functional logic is developed and what has to be done to reach a zero-false-positive goal with excellent field performance. Audi was the first OEM to fulfill this goal with a single 3D mono-vision camera, developing the first ASIL-B camera with our supplier Kostal; the architecture of the 3D camera is explained as well.
 
Topics:
Autonomous Vehicles
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6856
 
Abstract:
The Electronics Research Laboratory (ERL) is part of the global research and development network that supports the Volkswagen Group brands, which include Audi, Bentley, Bugatti, Lamborghini, Porsche, and Volkswagen. Located in Silicon Valley, we draw upon its spirit of innovation to build new concepts and technologies for our future vehicles. Deep learning is at the center of our work in the fast evolution of piloted driving. As part of our research into this technology, our mission is to explore deep neural network architectures and bridge the gap between concept and series-development application. In this talk, we'll present our current work on a variety of deep learning projects, as well as insights into how this technology could affect the future of piloted driving.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6857
 
Abstract:
We'll review current connected and automated driving initiatives with the goal of identifying progress and impediments. We'll look at market and thought leaders, tests, implementations, partnerships, and the latest developments, some of which will reflect presentations and announcements taking place at GTC. We'll also share forecast specifics and perspectives on the timing of partial and full autonomy and the expansion of vehicle connectivity.
 
Topics:
Autonomous Vehicles, Big Data Analytics, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6858
 
Abstract:
ROBORACE is a global race series for full-size driverless electric cars. The championship will provide a showcase platform for the autonomous driving solutions now being developed by many large industrial automotive and technology players, as well as top tech universities. As a competition of intelligence and technology, ROBORACE fuses AI with automotive engineering in extreme conditions. Bringing together motorsports and gaming in this battle of algorithms, the teams will compete on racing tracks in major cities across the world. During the talk, we'll share the technical vision of our competition and explain the selection criteria for the racing teams. Join us to discuss, and be the first to hear some exciting news about ROBORACE!
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6866
 
Abstract:
The WePod is the first self-driving vehicle on public roads without a steering wheel or pedals. To achieve driving in such a complex environment and guarantee safety, multiple sensors covering 360 degrees around the vehicle have been used. Sensor fusion, road-user detection, classification, and tracking have been implemented on NVIDIA's DRIVE PX platform. This session will give an overview of the system's architecture and implementation, and preliminary results from driving on public roads will be presented.
 
Topics:
Autonomous Vehicles, Artificial Intelligence and Deep Learning, Press-Suggested Sessions: AI & Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6830
 
Abstract:
Pedestrian detection for autonomous driving has gained a lot of prominence during the last few years. Besides being one of the hardest tasks within computer vision, it involves huge computational costs. Obtaining acceptable real-time performance, measured in frames per second (fps), for the most advanced algorithms is a difficult challenge. We propose a CUDA implementation of a well-known pedestrian detection system (Random Forest of Local Experts). It includes LBP and HOG as feature descriptors and SVM and Random Forest as classifiers. We introduce significant algorithmic adjustments and optimizations to adapt the problem to the NVIDIA GPU architecture. The aim is to deploy a real-time system that provides reliable results.
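To give a flavor of why this workload maps well onto the GPU (a hedged sketch, not the authors' implementation), the first HOG stage computes an independent gradient magnitude and orientation at every pixel:

    // Per-pixel gradients by central differences; every pixel is
    // independent, so one thread per pixel saturates the GPU nicely.
    __global__ void hog_gradients(const unsigned char* img, float* mag,
                                  float* ang, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;
        float gx = (float)img[y * w + x + 1] - img[y * w + x - 1];
        float gy = (float)img[(y + 1) * w + x] - img[(y - 1) * w + x];
        mag[y * w + x] = sqrtf(gx * gx + gy * gy);
        ang[y * w + x] = atan2f(gy, gx);  // later binned into cell histograms
    }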
 
Topics:
Autonomous Vehicles, Computer Vision
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6181
Big Data Analytics
Abstract:
Enterprises "assume breach": someone, somewhere, has already compromised them. Analysts sift through a GB/min (or more!) of attack logs from hundreds of thousands of systems. For every identified incident, they then map out the entire breach by backtracking through months of alerts. This talk shares how Graphistry and Accenture tackled the visual analytics problem: how do we explore big graphs? We'll drill into two of our GPU technologies for visualizing graphs: (1) StreamGL, our distributed real-time renderer for delivering buttery interactions, smart designs, and responsive analytics on standard web devices; and (2) Node-OpenCL and our CLJS client, open source JavaScript libraries for server-side GPU scripting.
 
Topics:
Big Data Analytics, Aerospace and Defense, Professional Visualisation
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6114
 
Abstract:
You'll learn technical solutions for accelerating R with CUDA. DNNs in R have become a very popular approach in statistical analysis. Even though there are several DNN packages in R, they are rarely used for big data and deep neural networks because the single-core performance of R is limited and the current design of R's DNN packages is not GPU-friendly. First, we'll introduce how we map specific patterns, such as general matrix multiplication (GEMM), onto DNNs in R; GEMM is a GPU-friendly pattern that can be easily accelerated by cuBLAS. Second, we'll show the tradeoff between performance and memory usage in R for DNNs. Finally, we'll package all of these CUDA approaches into an R package and publish it to CRAN, so that anyone can install it in R quickly and get a significant performance improvement from NVIDIA GPUs.
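The GEMM pattern described above is exactly what cuBLAS exposes; as a minimal sketch (function and variable names are illustrative), a fully connected layer's forward pass over a whole mini-batch is a single SGEMM call:

    #include <cublas_v2.h>

    // Y (m x n) = W (m x k) * X (k x n): m outputs, k inputs, n samples,
    // all matrices column-major and already resident on the GPU.
    void dense_forward(cublasHandle_t h, const float* d_W, const float* d_X,
                       float* d_Y, int m, int n, int k)
    {
        const float one = 1.f, zero = 0.f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &one, d_W, m, d_X, k, &zero, d_Y, m);
    }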
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6156
 
Abstract:
We present cuMF, a highly optimized matrix factorization system on GPUs. Matrix factorization (MF) is a key algorithm in recommender systems. On a single GPU, we introduce a memory-optimized alternating least squares (ALS) method; it alleviates discontiguous memory access and aggressively uses registers so as to reduce memory latency. On multiple GPUs, we combine data parallelism with model parallelism and introduce a topology-aware parallel reduction method so as to scale ALS to multiple GPUs. Using only one machine with four NVIDIA GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, as state-of-the-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem ever reported.
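As a hedged sketch of the per-user work inside ALS (illustrative, not cuMF's optimized kernels), each user's factors come from normal equations whose F x F Gramian is accumulated over the items that user rated; this accumulation is where contiguous access and on-chip reuse pay off:

    #define F 64  // factor dimension (illustrative)

    // One block per user: accumulate A_u = sum_i theta_i * theta_i^T + lambda*I
    // over the user's rated items (CSR-style lists); solving A_u x = b_u
    // for the user's factors is omitted here.
    __global__ void als_gramian(const float* theta, const int* itemIdx,
                                const int* rowPtr, float* A,
                                float lambda, int nUsers)
    {
        int u = blockIdx.x;
        if (u >= nUsers) return;
        __shared__ float sA[F * F];
        for (int i = threadIdx.x; i < F * F; i += blockDim.x) sA[i] = 0.f;
        __syncthreads();
        for (int p = rowPtr[u]; p < rowPtr[u + 1]; ++p) {
            const float* t = &theta[itemIdx[p] * F];
            for (int i = threadIdx.x; i < F * F; i += blockDim.x)
                sA[i] += t[i / F] * t[i % F];  // each thread owns fixed entries
        }
        __syncthreads();
        for (int i = threadIdx.x; i < F * F; i += blockDim.x)
            A[u * F * F + i] = sA[i] + (i / F == i % F ? lambda : 0.f);
    }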
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6211
 
Abstract:
Writing fast, efficient data analytics for graph and machine learning on GPUs can be hard due to the complexities of CUDA and of achieving effective parallelism. DASL and SPARQL are high-level languages for graph and machine learning algorithms (DASL) and graph pattern matching (SPARQL) that provide speedups of up to 1,000x over native Spark and up to 300x over leading graph databases when executed on the Blazegraph platform. These high-level languages are translated into task graphs that expose the available parallelism. The MapGraph runtime evaluates the task graphs and provides a scalable architecture on GPUs and GPU clusters. This presentation discusses the concepts behind graph algorithms and queries, the MapGraph architecture, and how algorithms are evaluated on a GPU cluster.
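Casting graph operations as linear algebra means a traversal step becomes a sparse matrix-vector product over a boolean semiring; a hedged sketch of one such edge-parallel step on a CSR adjacency structure (illustrative, not the DASL translator's actual output):

    // One BFS level as y = A^T x over the boolean semiring: a vertex joins
    // the next frontier if an in-frontier neighbor reaches it first.
    __global__ void bfs_level(const int* rowPtr, const int* colIdx,
                              const int* frontier, int* next, int* level,
                              int depth, int nVerts)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= nVerts || !frontier[v]) return;
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            // the semiring multiply-add degenerates to a visited check + set
            if (atomicCAS(&level[u], -1, depth) == -1)
                next[u] = 1;
        }
    }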
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning, Aerospace and Defense
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6267
 
Abstract:
The Apache Spark engine is increasingly used for implementing large-scale distributed analytics workloads. These workloads cover a wide array of analytics models, including predictive analytics, optimization, and graph analytics. We'll discuss opportunities for exploiting GPUs to accelerate different Spark components such as MLlib. The talk will first give an overview of the Spark programming and execution model and then describe key issues in integrating GPUs into the Spark infrastructure. We then describe our approach for enabling Spark to use multiple GPUs in a distributed manner and provide details of accelerating key MLlib kernels without changing the source Spark program.
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6280
 
Abstract:
The potential information buried within datasets is immense, though extracting this information is difficult when the data is large, noisy, unlabeled, and unstructured. We present the use of GPGPU-powered unsupervised deep learning to identify the anomalies within such datasets. These anomalies can then be analyzed to determine which are "pertinent" and which are "benign." Once the significance of an anomaly has been determined, it becomes a label, which is added to the data. Repeating this process turns unlabeled data into labeled data. This newly labeled data can be used to train a supervised deep learning system to identify new instances of that stereotype. We demonstrate how GPGPUs can be used to enable real-time anomaly detection and stereotyping.
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6340
 
Abstract:
We show how our interactive, integrated analytics solution allows a new class of users to perform machine-assisted visual sensemaking. Until now, machine learning techniques such as predictive analytics and deep learning have mostly been used as part of a complex tool chain that serves as an endpoint in the decision-making process. We combine the strengths of human decision making and GPU-driven machine learning in a multi-coordinated visual analytics solution. This enables the discovery of actionable insights by bridging the gap between data scientist and business user.
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning, Autonomous Vehicles
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6356
 
Abstract:
Large-scale graph analytics frameworks provide a convenient and highly scalable platform for developing algorithms to analyze large datasets. Although conceptually scalable, these techniques exhibit poor performance on modern computational hardware. We're developing an implementation of the high-level functions supported by these APIs in terms of linear algebra operations, which are parallel over each pair of vertices connected by an edge. This technology can reduce the number of nodes required and maps well to computational accelerators such as GPUs, enabling users to perform more complex analysis with less hardware at lower cost. We'll detail our latest work on this project, including challenges, the specifics of our approach, and preliminary results.
 
Topics:
Big Data Analytics, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6360
 
Abstract:
Learn how to perform data analysis over software repositories on GPUs with the Dominoes tool. We'll give an overview and introduction of the tool and its capabilities; it provides a unified view of the underlying computational resources. Dominoes allows anyone to explore large software repositories at any grain (files, methods, or classes) without using any programming language. Thanks to its highly parallel GPU architecture, results are produced in real time. Attendees will learn the strategy Dominoes uses to process big data on the GPU.
 
Topics:
Big Data Analytics, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6372
 
Abstract:
We present Gunrock, a multi-GPU graph processing library that enables easy graph algorithm implementation and extension onto multiple GPUs, for scalable performance on large graphs with billions of edges. Attendees will learn how to (1) solve large-scale graph problems with high-performance GPU computing primitives and optimization strategies, using our high-level, data-centric abstraction that focuses on vertex or edge frontier operations; and (2) utilize multi-GPU computing power by writing just a few algorithm-dependent blocks, using our multi-GPU framework that handles most multi-GPU implementation details and memory allocation. We'll also share experience on the library's design and implementation that helps it achieve the best performance among programmable GPU graph libraries.
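As a hedged sketch of the data-centric frontier abstraction (illustrative, not Gunrock source), an advance step expands every vertex in the input frontier along its edges and compacts the newly visited neighbors into the output frontier:

    // Frontier advance over a CSR graph: visited vertices are claimed with
    // an atomic exchange and appended to the new frontier queue.
    __global__ void advance(const int* rowPtr, const int* colIdx,
                            const int* inFront, int inLen,
                            int* outFront, int* outLen, int* visited)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= inLen) return;
        int v = inFront[i];
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            if (atomicExch(&visited[u], 1) == 0)     // first visit wins
                outFront[atomicAdd(outLen, 1)] = u;  // append to new frontier
        }
    }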
 
Topics:
Big Data Analytics, Tools & Libraries, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6374
Streaming:
Download:
Share:
 
Abstract:
Blazegraph GPU provides 300X acceleration for SPARQL graph query and graph database management with acceleration for existing RDF/SPARQL and Property Graph (Tinkerpop) applications. Multi-GPU configurations can effectively manage billion+ edge graphs on single-node machines with 4 or 8 K80 GPU accelerators. This is a cost-effective way to deliver high performance for graphs, but many end-users and applications do not have existing multi-GPU systems; current cloud offerings at this scale are not generally available. Cirrascale has developed a cloud-based solution for provisioning multi-GPU Tesla systems using its switch riser technology. This session details the Blazegraph GPU cloud offering on Cirrascale, demonstrates how to quickly deploy it in the cloud, and shows graph benchmarks on cloud systems.  Back
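As a hedged illustration of the workload class (not material from the talk), a SPARQL query can be issued to a Blazegraph endpoint from Python with the SPARQLWrapper library; the endpoint URL and the triple pattern here are assumptions for the sketch.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Illustrative endpoint URL; an actual Blazegraph deployment may differ.
    sparql = SPARQLWrapper("http://localhost:9999/blazegraph/sparql")
    sparql.setQuery("""
        SELECT ?device ?link WHERE {
            ?device <http://example.org/connectedTo> ?link .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["device"]["value"], "->", row["link"]["value"])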
 
Topics:
Big Data Analytics, Data Center & Cloud Infrastructure, Aerospace and Defense
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6395
Streaming:
Download:
Share:
 
Abstract:
Learn how in-GPU-memory databases optimize complex manufacturing processes by enabling real-time data input into big datasets, in-line decision making, and predictive maintenance. Manufacturing processes today produce large amounts of data on the process itself, workpieces, machine sensors, parts delivered by external vendors, and more. In the Production Intelligence project, our goal is to turn this unspecific data into "smart data" to gain better insight into the manufacturing process, e.g., to prevent machine shutdowns or reduce the number of scrap parts. We'll present our solutions for streaming input data vectors into big datasets, analyzing incoming data in real time, and predicting production or system errors with the help of deep learning algorithms.  Back
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning, Computer Vision
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6426
Streaming:
Download:
Share:
 
Abstract:
This session showcases how to leverage GPUs to accelerate influence spread estimation in large social networks. Estimating the spread of an opinion or product across members of a graph-modelled social network is a hard problem requiring compute-intensive approximation algorithms. The complexity of the problem rises further in the continuous-time domain, where influence transmission rates on network edges are derived from stochastic distributions. Spread estimation algorithms that operate on stochastic transmission rates, such as naive sampling and neighbourhood size estimation, require a plethora of samples to achieve convergence. By exploiting the inherent independence across multiple sampling iterations of these algorithms, we achieve up to 11x improvement in runtime using GPUs.  Back
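A hedged sketch of naive sampling under an independent-cascade-style model (our simplification; the talk works with continuous-time transmission rates): each Monte Carlo sample is independent of all others, which is the property that maps cleanly onto parallel GPU threads.

    import random

    def sample_spread(edges, seeds, rng):
        # One Monte Carlo sample: keep each edge with its transmission
        # probability, then count vertices reachable from the seed set.
        live = {(u, v) for (u, v, p) in edges if rng.random() < p}
        reached, frontier = set(seeds), set(seeds)
        while frontier:
            frontier = {v for (u, v) in live if u in frontier and v not in reached}
            reached |= frontier
        return len(reached)

    def estimate_influence(edges, seeds, n_samples=10000, seed=0):
        # Samples are mutually independent -- the axis of parallelism
        # exploited on the GPU.
        rng = random.Random(seed)
        return sum(sample_spread(edges, seeds, rng) for _ in range(n_samples)) / n_samples

    edges = [(0, 1, 0.4), (0, 2, 0.4), (1, 3, 0.3), (2, 3, 0.3)]
    print(estimate_influence(edges, seeds={0}))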
 
Topics:
Big Data Analytics, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6471
Streaming:
Download:
Share:
 
Abstract:
We'll explain why GPU-powered in-memory databases and analytics platforms are the logical successor to CPU in-memory systems, largely due to recent increases in onboard memory available on GPUs. With sufficient memory, GPUs possess numerous advantages over CPUs, including much greater compute and memory bandwidth and a native graphics pipeline. We'll demo how MapD is able to leverage multiple GPUs per server to extract orders-of-magnitude performance increases over CPU-based systems, bringing interactive querying and visualization to multi-billion row datasets.  Back
 
Topics:
Big Data Analytics, Performance Optimization
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6472
Streaming:
Download:
Share:
 
Abstract:
In this presentation, you will learn about the use of GPUs in data science applications using the R language, as well as a general method, Software Alchemy, for parallelizing statistical applications. The talk will provide an overview of R libraries available for interfacing with GPUs and a discussion of issues involved in writing such libraries, before showing you how to use Software Alchemy (with or without R) to overcome GPU memory limitations in statistical applications.  Back
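A minimal sketch of the Software Alchemy idea as we understand it: chunk the data, run the estimator independently on each chunk (e.g., one GPU-sized batch at a time), then average the per-chunk estimates. The estimator and data below are illustrative assumptions, not from the talk.

    import numpy as np

    def software_alchemy(data, estimator, n_chunks):
        # Split the data, estimate on each chunk independently, average.
        # Under standard asymptotics the averaged estimate is statistically
        # equivalent to estimating on the full data set, while each chunk
        # fits in limited (GPU) memory.
        chunks = np.array_split(data, n_chunks)
        return np.mean([estimator(chunk) for chunk in chunks], axis=0)

    # Illustrative use: estimating a mean vector on a large data set.
    rng = np.random.default_rng(0)
    data = rng.normal(loc=[1.0, 2.0], scale=1.0, size=(1_000_000, 2))
    print(software_alchemy(data, lambda c: c.mean(axis=0), n_chunks=8))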
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6708
Streaming:
Download:
Share:
 
Abstract:
Applications of deep learning in sensor data analysis have not been studied as extensively as in speech and vision. However, sensor data have properties similar to those of images and audio: they are multidimensional, with intrinsic dependencies and correlations, and hard to analyze with conventional approaches. Our results show that deep learning has better generalization capabilities than conventional methods on sensor data and high potential in sensor data analytics. We also address scalability issues of the training process for the models best suited to sensor data: training these models does not scale out beyond a certain number of nodes.  Back
 
Topics:
Big Data Analytics, Artificial Intelligence and Deep Learning, Intelligent Machines, IoT & Robotics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6773
Streaming:
Download:
Share:
 
Abstract:

The growth of the OpenPOWER Foundation has been phenomenal. Why, you might ask? In less than two years, OpenPOWER has grown from five members to over 180, with membership across all tiers of hardware, software, and end users themselves. The Foundation provides a compelling and rapidly growing open approach to infrastructure and software for rapidly changing workloads and evolving IT consumption models. This is a revolution that is making a profound difference in the price/performance criteria of end users, as well as accelerating compelling development for performance to drive business advantage. OpenPOWER members are co-creating their approach to technology as innovators, producers, and consumers utilizing IBM's Power Architecture.

  Back
 
Topics:
Big Data Analytics, Data Center & Cloud Infrastructure, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6825
Streaming:
Download:
Share:
 
Abstract:
Understanding and modeling the human brain continues to be one of the biggest challenges in research. The Human Brain Project is a European flagship project that is creating a research infrastructure to facilitate this research. Many research topics in this field require scalable compute resources or the ability to process extreme-scale data volumes (in some cases both). Examples include approaches to simulating the network of a human brain in its full complexity and efforts to create high-resolution brain atlases. GPUs already play an important role in realizing the necessary computational capabilities. We'll give an overview of the efforts to build a high-performance analytics and computing platform for brain research.  Back
 
Topics:
Big Data Analytics, Press-Suggested Sessions: HPC & Science, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6655
Streaming:
Download:
Share:
Climate, Weather & Ocean Modeling
Presentation
Media
Abstract:

In an era defined by increasing diversity in computing architectures, performance portability is a key requirement for weather and climate applications that require massive computing resources. In this talk, you will learn how we developed and achieved performance on CPU, GPU, and MIC architectures using industry-standard OpenACC and OpenMP directives. Performance results from the NIM weather model will be shown for a number of device, node, and multi-node system configurations. Further, communications optimizations will be highlighted that deliver more than a 40% improvement in runtime when scaling to thousands of GPUs.

  Back
 
Topics:
Climate, Weather & Ocean Modeling, Programming Languages, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6117
Streaming:
Download:
Share:
 
Abstract:

Explore a GPU-based efficient algorithm for chemical ODEs, the core and most costly part of the atmosphere chemistry model in the CAS-ESM project. Chemical ODE systems are numerically difficult because of their stiffness, nonlinearity, and nonnegativity constraints. Traditional solvers, such as LSODE, are hard to parallelize because of their complicated control flow and coupling. In our experiments, we obtained a 3-5X GPU speedup when the same input is set on each node, which eliminates divergence in the kernel, while performance with real input was even worse than the serial code. So we developed a new solver, Modified Backward Euler (MBE). In our numerical experiments, MBE is shown to be faster and more precise than LSODE, and it is easy to parallelize, so we can expect a significant speedup on GPUs.

  Back
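The abstract does not define MBE, but a plausible reading of a Modified Backward Euler step for a stiff chemical system $y' = f(y)$ is the implicit update below, iterated to convergence with a clip that enforces the nonnegativity the abstract mentions (the clip is our assumption, not a detail from the talk):

\[
y^{n+1} = y^{n} + h\,f\!\left(y^{n+1}\right),
\qquad
y^{n+1}_{(k+1)} = \max\!\left(0,\; y^{n} + h\,f\!\left(y^{n+1}_{(k)}\right)\right),
\quad
y^{n+1}_{(0)} = y^{n}.
\]

Because the iteration involves only evaluations of $f$, one such solve can run per grid cell in a GPU thread, unlike LSODE's branching control flow.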
 
Topics:
Climate, Weather & Ocean Modeling, Algorithms & Numerical Techniques, Computational Fluid Dynamics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6215
Streaming:
Download:
Share:
 
Abstract:

As with many complex scientific computing applications, NASA's GEOS-5 climate modeling tool is computationally intense and can benefit from modern accelerated co-processing hardware. However, the burden of utilizing these new devices and achieving optimal results is placed on the same scientists responsible for developing the core algorithms and applying them to applications of interest. We'll present a task-based programming approach coupled with a dynamic scheduler. This allows the science of the software to be divorced from its implementation, both reducing the burden on the programmer and allowing the code to adapt to changes in hardware architecture. In collaboration with NASA's Goddard Space Flight Center, we show our results in applying this technique to GEOS-5.

  Back
 
Topics:
Climate, Weather & Ocean Modeling, Tools & Libraries, Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6343
Streaming:
Download:
Share:
 
Abstract:

We'll demonstrate our efforts in developing highly efficient solvers for atmospheric dynamics on GPU platforms. Besides general optimizations for GPU-based scientific computing applications, we apply optimization strategies specifically customized for atmospheric dynamics solvers. We'll show that by combining algorithmic and architectural considerations, our optimization improves computational efficiency from the original 2.24% to around 16% of peak, with a sustained double-precision performance of 1.04 Tflops within one CPU-GPU node. We think this work demonstrates a huge potential for performing more efficient climate modeling on GPU platforms.

  Back
 
Topics:
Climate, Weather & Ocean Modeling, Performance Optimization, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6354
Streaming:
Download:
Share:
 
Abstract:

Porting applications to GPUs still requires compromises between time-to-solution, GPU performance, and CPU performance. This often leads to major challenges for large, Fortran-based applications like weather and climate models. We'll focus on two of these challenges, whose significance is shown using real-world code examples and performance results: the differing requirements on parallel task granularity and on storage order between the two architectures. A proposed solution is a flexible preprocessor framework called "Hybrid Fortran," which has been used to port both the dynamics and physics of ASUCA, one of the Japan Meteorological Agency's current operational weather models. Finally, an even more hands-off solution to GPU portability is proposed in the shape of a black-box solution.

  Back
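To make the storage-order challenge concrete (a generic numpy illustration of the issue, not Hybrid Fortran output): the same logical field can be held column-major, as Fortran CPU code expects, or row-major, so that GPU threads indexing the fastest-varying axis read contiguous, coalescable memory.

    import numpy as np

    # The same logical 3D field in the two layouts the talk contrasts.
    field_cpu = np.zeros((64, 64, 90), order='F')   # column-major (Fortran)
    field_gpu = np.asarray(field_cpu, order='C')    # row-major copy

    # The strides show which axis is contiguous in memory -- the axis a
    # kernel should map to adjacent threads for coalesced access.
    print(field_cpu.strides)   # smallest stride on the first axis
    print(field_gpu.strides)   # smallest stride on the last axis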
 
Topics:
Climate, Weather & Ocean Modeling, Tools & Libraries, Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6621
Streaming:
Download:
Share:
 
Abstract:

In partnership with scientists from the Space Science and Engineering Center (SSEC), Tempo Quest Inc. is embarking on a quest to complete AceCAST, a proprietary version of the Weather Research and Forecasting Model (WRF), a mesoscale and global model designed for both operational forecasters and atmospheric researchers and widely used by commercial, government, and institutional users. We'll also discuss state-of-the-art acceleration of low-throughput, low-energy-consumption, error-resilient satellite remote sensing data compression suitable for data, image, and video transmission and archiving.

  Back
 
Topics:
Climate, Weather & Ocean Modeling
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6724
Streaming:
Download:
Share:
 
Abstract:

The European Centre for Medium-Range Weather Forecasts has been at the cutting edge of Numerical Weather Prediction for the past 40 years, and is making sure it will remain so as HPC heads for the exascale. To this end, ECMWF is leading the EU H2020 ESCAPE project, which promises to address the many requirements necessary for achieving exascale NWP. After talking about the general strategy that ECMWF currently envisages for accelerator usage, we'll look at GPGPU work being carried out for the ESCAPE project, focusing on two important components of the ECMWF weather model, the cloud physics routine and spectral transforms.

  Back
 
Topics:
Climate, Weather & Ocean Modeling, OpenACC, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6855
Streaming:
Download:
Share:
 
Abstract:

We'll discuss the hardware-software co-design project behind the most cost- and energy-efficient system for numerical weather prediction: an appliance based on the Cray CS-Storm system architecture that is loaded with NVIDIA K80 GPUs and has been operated on behalf of MeteoSwiss by CSCS since October 2015.

  Back
 
Topics:
Climate, Weather & Ocean Modeling, Press-Suggested Sessions: HPC & Science
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6628
Streaming:
Download:
Share:
 
Abstract:

We describe the implementation of a simple numerical scheme for solving the shallow water equations on a GPU, which will be used in the further development of a massive ensemble prediction system running on GPUs. The numerical scheme has previously been used in operational forecasting, and benchmarks comparing the Fortran CPU version with the new GPU version have been performed. The results show that the GPU implementation gives a speedup over the CPU of slightly more than 200X. This is highly promising for the possibility of running a large number of ensembles cost-effectively on one computer, thereby increasing the usefulness of short-term ocean current forecasts and drift trajectory predictions.

  Back
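For reference, the conservative 2D shallow water equations such a scheme discretizes take the standard form below, with $h$ the water depth, $(u, v)$ the depth-averaged velocities, and $g$ gravity; the poster does not specify source terms, so none are shown:

\[
\begin{aligned}
&\frac{\partial h}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} = 0,\\
&\frac{\partial (hu)}{\partial t} + \frac{\partial}{\partial x}\left(hu^{2} + \tfrac{1}{2}gh^{2}\right) + \frac{\partial (huv)}{\partial y} = 0,\\
&\frac{\partial (hv)}{\partial t} + \frac{\partial (huv)}{\partial x} + \frac{\partial}{\partial y}\left(hv^{2} + \tfrac{1}{2}gh^{2}\right) = 0.
\end{aligned}
\]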
 
Topics:
Climate, Weather & Ocean Modeling, Computational Physics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6193
Download:
Share:
 
Abstract:

Generation of huge amounts of spatial data has increased demand for applications that are capable of handling large-scale and high-resolution terrain data. A novel example of this would be the Iowa Flood Information System, which is a web-based, one-stop platform for accessing flood-related data. One of the most challenging tasks for terrain analysis is the delineation of watersheds. Although traditional methods for watershed analysis give high-accuracy results, it becomes more burdensome as the data resolution increases, and there is no client-side analysis tool for watershed delineation. In this project, we developed a client-side GPGPU algorithm to analyze high-resolution terrain data for watershed delineation, which allows parallelization using GPUs.

  Back
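The poster doesn't name its algorithm, but a common, GPU-friendly building block for watershed delineation is the D8 flow-direction step, in which every cell independently picks its steepest downslope neighbor. A plain-Python sketch (distance weighting of diagonal neighbors is omitted for brevity):

    import numpy as np

    def d8_flow_direction(dem):
        # For each interior cell, return the index (0-7) of the steepest
        # downslope neighbor, or -1 for pits. Every cell is computed
        # independently, which is what makes the step embarrassingly
        # parallel -- one GPU thread per cell.
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]
        rows, cols = dem.shape
        direction = np.full((rows, cols), -1, dtype=np.int8)
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                drops = [dem[r, c] - dem[r + dr, c + dc] for dr, dc in offsets]
                best = int(np.argmax(drops))
                if drops[best] > 0:
                    direction[r, c] = best
        return direction

    dem = np.array([[5, 5, 5], [5, 4, 3], [5, 3, 1]], dtype=float)
    print(d8_flow_direction(dem))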
 
Topics:
Climate, Weather & Ocean Modeling, Big Data Analytics
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6226
Download:
Share:
 
Abstract:

AceCAST is a proprietary version of WRF, a mesoscale and global weather research and forecasting model designed for both operational forecasters and atmospheric researchers that is widely used by commercial, government, and institutional users in more than 150 countries. WRF is suitable for a broad spectrum of applications across domain scales ranging from meters to hundreds of kilometers. AceCAST's increased computational power enables time-critical, weather-sensitive industry and commerce to achieve (1) high-resolution accuracy and cost performance, (2) strong scaling, and (3) greatly improved profits. AceCAST is already one-third complete, and the first commercial product is only ~12 months away.

  Back
 
Topics:
Climate, Weather & Ocean Modeling, HPC and Supercomputing
Type:
Poster
Event:
GTC Silicon Valley
Year:
2016
Session ID:
P6338
Download:
Share:
Computational Biology & Chemistry
Presentation
Media
Abstract:
Learn how to efficiently parallelize gene set enrichment analysis (GSEA) using CUDA. GSEA is an important bioinformatics method that determines whether given sets of genes are statistically overrepresented between two phenotypes. The GSEA software from the Broad Institute is the most popular tool for such studies, with several thousand users. NGS technologies are gradually replacing microarrays for high-throughput gene expression studies. The size and availability of input data sets are increasing, leading to long runtimes for the desktop GSEA application. We present an efficient CUDA parallelization of the core GSEA algorithm. By using a combination of parallelization techniques, we achieve speedups of around two orders of magnitude on a single GPU.  Back
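As a hedged sketch of the core computation being parallelized, here is a simplified, unweighted form of the GSEA running-sum statistic with a permutation test (the real tool weights hits by correlation). Each permutation is independent work, which is the axis a CUDA version can parallelize.

    import numpy as np

    def enrichment_score(ranked_genes, gene_set):
        # Simplified (unweighted) running sum: step up on set members,
        # down otherwise; the ES is the maximum deviation from zero.
        in_set = np.isin(ranked_genes, list(gene_set))
        n, n_hit = len(ranked_genes), in_set.sum()
        step_hit, step_miss = 1.0 / n_hit, 1.0 / (n - n_hit)
        running = np.cumsum(np.where(in_set, step_hit, -step_miss))
        return running[np.argmax(np.abs(running))]

    def permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
        # Each permutation is independent -- natural GPU parallelism.
        rng = np.random.default_rng(seed)
        observed = enrichment_score(ranked_genes, gene_set)
        null = np.array([enrichment_score(rng.permutation(ranked_genes), gene_set)
                         for _ in range(n_perm)])
        return (np.abs(null) >= abs(observed)).mean()

    genes = np.array([f"g{i}" for i in range(100)])
    print(permutation_pvalue(genes, gene_set={"g0", "g1", "g2", "g5"}))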
 
Topics:
Computational Biology & Chemistry
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6164
Streaming:
Download:
Share:
 
Abstract:

Come and see how to use HOOMD-blue, a flexible particle simulation tool. HOOMD-blue runs hard particle Monte Carlo, molecular dynamics, DPD, and other types of particle simulations, all on GPUs. It runs on everything from single-GPU workstations up to thousands of GPUs on supercomputers. Use Python scripts to configure jobs with custom initialization, complex flow control, and in-situ analysis of data. This talk introduces HOOMD-blue features and describes how to use them, focusing on the newest capabilities. It demonstrates job scripts for common usage patterns and shows examples of efficient workflows.

  Back
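A minimal job script, assuming the HOOMD-blue 2.x-era Python API that was current around the time of this talk (the API has since been redesigned, so treat this purely as a sketch):

    # Minimal Lennard-Jones run, assuming the HOOMD-blue 2.x Python API.
    import hoomd
    import hoomd.md

    hoomd.context.initialize("")                        # select GPU(s) or CPU
    hoomd.init.create_lattice(                          # simple cubic start
        unitcell=hoomd.lattice.sc(a=2.0), n=10)

    nl = hoomd.md.nlist.cell()                          # neighbor list
    lj = hoomd.md.pair.lj(r_cut=2.5, nlist=nl)          # Lennard-Jones pairs
    lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

    hoomd.md.integrate.mode_standard(dt=0.005)
    hoomd.md.integrate.langevin(group=hoomd.group.all(), kT=1.0, seed=42)

    hoomd.run(10000)                                    # in-situ analysis can
                                                        # be attached via callbacks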
 
Topics:
Computational Biology & Chemistry, Computational Physics
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6256
Streaming:
Download:
Share:
 
Abstract:

The AMBER molecular dynamics (MD) package is one of the fastest MD packages on commodity hardware and was one of the first widely used packages to exploit GPUs. We'll discuss the history of AMBER on NVIDIA GPUs and then highlight some of the newest advances in MD simulation that feature in the latest version 16 of AMBER. This includes extremely high-throughput thermodynamic integration free energy methods, explicit solvent constant pH simulations, advanced umbrella sampling restraints, multi-dimensional replica exchange methods, and asymmetric boundary conditions. We'll also discuss the development and validation of our latest precision model, SPXP, which is focused on maximizing the performance achievable from Maxwell-generation hardware without sacrificing accuracy.

  Back
 
Topics:
Computational Biology & Chemistry, HPC and Supercomputing
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6278
Streaming:
Download:
Share:
 
Abstract:

This talk presents the first GPU-enabled code for ReaxFF MD, GMD-Reax, and its applications in simulating large-scale reactive molecular systems for challenging problems in energy applications, including reaction mechanism investigation of coal and biomass pyrolysis and the combustion of jet fuels. GMD-Reax allows for efficient simulations of large models of ~10,000 atoms. Combined with VARxMD, the first code we created for ReaxFF MD reaction analysis, the coal pyrolysis simulations can predict the overall spectrum evolution trend of products and uncover important reaction pathways and radical behaviour. What we obtained in simulations of coal and biomass pyrolysis and fuel combustion is hardly accessible experimentally or by other computational approaches.

  Back
 
Topics:
Computational Biology & Chemistry, Algorithms & Numerical Techniques
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6319
Streaming:
Download:
Share:
 
Abstract:
Lift is a thin abstraction layer that hides some of the complexity of parallel programming. It provides a set of primitives analogous to (and entirely compatible with) NVIDIA's Thrust library, but designed around drastically simpler code, suitable for inclusion in large, complex projects which target NVIDIA GPUs or Intel CPUs. Lift is an open-source project under active development at Genia. It is the foundation for our primary analysis pipeline for DNA sequencing, as well as the foundation for Firepony, an open-source base quality score recalibrator for DNA sequencing data. We'll cover the motivation for Lift and the applications we're developing it for, and then explain how it works, what problems it solves, and what lessons we learned from prior experience with similar libraries.  Back
 
Topics:
Computational Biology & Chemistry, Tools & Libraries
Type:
Talk
Event:
GTC Silicon Valley
Year:
2016
Session ID:
S6473
Streaming:
Download:
Share: