GTC On-Demand

Aerospace & Defense
Image Registration for Real-Time Database Extraction
Randall Miles (Propulsion Science and Technology)
We present GPU-enabled, real-time, high-fidelity image interpolation software implementing a non-rigid image registration method, i.e., morphing. Morphing is a mathematical method that modifies input images to create a smoothly evolving image set with minimal image degradation. Morphing eliminates jitter in extracted database images and can also decrease the database size. Tests using simulated thermal images (128x256 pixels) of high-speed jet flow show image extraction speeds of over 500 Hz (~80x over serial code). Applied to HWIL and scene simulation, the method can provide accurate target inputs with a much smaller database footprint.
 
Keywords: Aerospace & Defense, Video & Image Processing, GTC 2016 - ID P6186
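The poster's morphing is a full non-rigid registration method; as a much simpler, hedged illustration of generating intermediate frames between two stored database images, here is a pure-Python cross-dissolve sketch (the function name and data are invented for illustration):

```python
def interpolate_frames(img_a, img_b, t):
    """Linearly blend two equally sized grayscale frames at parameter t.

    This is a plain cross-dissolve, NOT the non-rigid morphing the
    poster describes; it only illustrates synthesizing an intermediate
    frame between two stored database images.
    """
    if len(img_a) != len(img_b):
        raise ValueError("frames must have the same size")
    return [(1.0 - t) * a + t * b for a, b in zip(img_a, img_b)]

# Two tiny 4-pixel "frames"; t = 0.5 yields the midpoint frame.
mid = interpolate_frames([0.0, 10.0, 20.0, 30.0], [10.0, 10.0, 0.0, 50.0], 0.5)
```

A real morph would first warp both frames along correspondence vectors before blending, which is what suppresses the jitter the abstract mentions.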
 
Agile Condor: Scalable High Performance Embedded Computing Architecture
Mark Barnell (Air Force Research Laboratory), Christopher Capraro (SRC)
The Air Force Research Laboratory Information Directorate, Advanced Computing and Communications Division, is developing a new GPU-based computing architecture designed to provide a high-performance embedded computing (HPEC) pod solution for operational and tactical real-time processing in intelligence, surveillance, and reconnaissance (ISR) missions. The new system, Agile Condor, is a scalable HPEC system based on open industry standards that will push computational capability far beyond the current state of the art within the restrictive size, weight, and power constraints of unmanned aircraft systems' external "pod" payloads.
 
Keywords: Aerospace & Defense, Embedded, GTC 2016 - ID P6292
Algorithms
ABCD Algorithm for Tridiagonal Solver
Erh-Chung Chen (National Tsing Hua University)
We study and implement the Augmented Block Cimmino Distributed (ABCD) algorithm on the GPU. Exploiting the special structure of tridiagonal matrices, we investigate a boundary padding technique that eliminates execution branches on the GPU for better performance. Our implementation also incorporates further optimization techniques, such as memory coalescing, to enhance performance.
 
Keywords: Algorithms, Supercomputing & HPC, GTC 2016 - ID P6120
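For context, the classic serial Thomas algorithm below solves a tridiagonal system in O(n); its strict front-to-back dependency is what GPU tridiagonal schemes such as ABCD reorganize. This is a generic textbook sketch, not the poster's implementation:

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c, and right-hand side d (serial Thomas algorithm).
    a[0] and c[-1] are unused. O(n), but each step depends on the
    previous one, which is why GPU solvers use other decompositions."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Example: the 3x3 system [[2,1,0],[1,2,1],[0,1,2]] x = [4,8,8] has x = [1,2,3].
x = thomas_solve([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [4.0, 8.0, 8.0])
```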
 
Non-Local Lattice Encoding for Bit-Vectorized Cellular Automata GPU Implementations
Jeffrey Kelling (Helmholtz-Zentrum Dresden-Rossendorf)
In many areas, from physics to economics to the social sciences, problems can be mapped to stochastic cellular automata (SCA). Combined with machine learning techniques, cellular automata with learned rules can be used to efficiently predict real-world systems. In physics, they are used to study, at the atomistic level, the size and shape evolution of micro- and nanostructures, providing insights into the processes of self-organization crucial to today's nanotechnology. We present an extremely efficient SCA implementation of a surface growth model using bit-vectorization enhanced by non-local encoding on the GPU. The employed technique and the non-local encoding can be transferred to other applications.
 
Keywords: Algorithms, Computational Physics, GTC 2016 - ID P6124
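To illustrate bit-vectorization in general terms (this is not the authors' surface growth model), the sketch below packs 64 cells of the elementary cellular automaton rule 90 into a single integer, so one XOR updates all sites at once:

```python
MASK = (1 << 64) - 1  # 64 cells packed into one machine word

def rol(x, r):
    """Rotate a 64-bit word left by r bits (periodic boundary)."""
    return ((x << r) | (x >> (64 - r))) & MASK

def step_rule90(word):
    """One update of 64 cells of the elementary CA rule 90
    (new cell = left neighbour XOR right neighbour), all packed into
    one integer with periodic boundaries. A single XOR updates all
    64 sites at once, which is the essence of bit-vectorization."""
    return rol(word, 1) ^ rol(word, 63)

# A single seeded cell splits into its two neighbours after one step.
state = step_rule90(1 << 10)
```

A GPU kernel applies the same idea per word across a large lattice; the poster's non-local encoding additionally rearranges which lattice sites share a word.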
 
Fully Parallelized Lossless LZW Decompression for CUDA® Enabled GPUs
Koji Nakano (Hiroshima University)
LZW is a popular lossless compression method used in the UNIX file compression utility "compress" and in the GIF/TIFF image formats. It is, however, very hard to parallelize, because it builds its dictionary sequentially, reading the input data one symbol at a time. The main contribution of this work is a fully parallelized LZW decompression that assigns one thread to each compressed input code and converts it into the corresponding original string. We implemented this fully parallelized LZW decompression in CUDA. Experimental results show that our implementation on a GeForce GTX 980 attains a 40x speedup over a sequential implementation on an Intel Core i7-4790. We also show that our LZW decompression is useful for big data and deep learning applications.
 
Keywords: Algorithms, Video & Image Processing, GTC 2016 - ID P6128
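A minimal serial LZW decoder makes the sequential dictionary dependency concrete; the poster's contribution is removing this dependency so that each compressed code can be decoded by its own thread. A generic sketch, not the authors' CUDA code:

```python
def lzw_decompress(codes):
    """Serial LZW decoding: the dictionary is rebuilt code by code,
    which is exactly the sequential dependency a fully parallel GPU
    decoder must remove. Initial dictionary: all 256 single bytes."""
    table = {i: bytes([i]) for i in range(256)}
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):       # the "cScSc" special case
            entry = prev + prev[:1]
        else:
            raise ValueError("bad LZW code")
        out.append(entry)
        table[len(table)] = prev + entry[:1]  # grow dictionary
        prev = entry
    return b"".join(out)

# "ABABABA" compresses to [65, 66, 256, 258] under byte-wise LZW.
plain = lzw_decompress([65, 66, 256, 258])
```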
 
Fast Sparse Matrix Vector Multiplication with Highly-Compressed Sparse Format
Yusuke Nagasaka (Tokyo Institute of Technology)
We accelerate sparse matrix-vector multiplication (SpMV) on the GPU by sharply reducing memory traffic. SpMV is the dominant kernel in many sparse algorithms; its performance is limited by memory bandwidth and by the poor locality of accesses to the input vector. We propose a new sparse matrix format that alleviates these memory bottlenecks through adaptive multi-level blocking and by compressing the index of the given matrix. Performance evaluations of SpMV over 40 matrix datasets show speedups over NVIDIA's cuSPARSE library of up to 2.91x, and 1.81x on average. We also find that the memory traffic of SpMV can be estimated, and that SpMV performance depends strongly on it.
 
Keywords: Algorithms, Supercomputing & HPC, GTC 2016 - ID P6132
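As background, a plain CSR sparse matrix-vector product looks like the sketch below; the indexed reads of `x` are the poorly localized accesses the proposed format targets. A generic illustration, not the poster's compressed format:

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix stored in CSR form. The outer row loop
    is what GPU kernels parallelize; the indirect loads x[col_idx[k]]
    are the bandwidth-bound, poorly localized accesses that custom
    sparse formats try to reduce."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

# The 2x3 matrix [[1,0,2],[0,3,0]] times x = [1,1,1] gives [3, 3].
y = spmv_csr([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```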
 
High Performance Hierarchical Matrix-Vector Multiplication using Hardware Accelerators
Hatem Ltaief (KAUST)
We present a high-performance hierarchical matrix-vector multiplication using hardware accelerators. By properly mapping the tree structures to the GPU and overlapping the phases of the computation using streams, we greatly outperform CPU implementations and achieve up to 80% of the sustained bandwidth of the GPU.
 
Keywords: Algorithms, Supercomputing & HPC, GTC 2016 - ID P6140
 
GPU-Accelerated Isosurface Extraction
Marcin Adamski (Poznan Supercomputing and Networking Center), Michal Kierzynka (Poznan Supercomputing and Networking Center)
Algorithms for isosurface extraction from volumetric data have become crucial in the petroleum industry, medicine, and many other fields in recent years. They are computationally intensive, especially for large, high-resolution domains. Our GPU implementation of the Marching Tetrahedra algorithm is not only very fast but also allows the domain to be split across multiple GPUs. Processing large domains is now a matter of seconds; for smaller domains, the algorithm computes the isosurface in milliseconds and the resulting model is visualized in real time.
 
Keywords: Algorithms, Medical Imaging, GTC 2016 - ID P6141
 
Fourier Domain Pulsar Acceleration Searches on GPUs for the Square Kilometre Array
Sofia Dimoudi (University of Oxford)
We describe work done at the Oxford e-Research Centre (OeRC) toward accelerating one of the most demanding computational tasks of the real-time pulsar signal processing pipeline of the world's largest next-generation radio telescope, the Square Kilometre Array (SKA). We introduce the problem of pulsar acceleration searches and a Fourier-domain computational method for detecting signals from accelerated pulsars. A GPU implementation and optimization results are presented in the context of the SKA timing requirements. This work is part of Astro-Accelerate, a real-time time-domain data processing library currently under development at the OeRC.
 
Keywords: Algorithms, Astronomy & Astrophysics, GTC 2016 - ID P6227
 
A Highly Parallel Implementation of the Faddeev-Leverrier Algorithm
Rahul Chandrashekhar (Trinity College, Hartford - CT)
We present an accelerated implementation of the Faddeev-Leverrier algorithm (FLA) for the eigenvalue problem. Being recursive in nature, the algorithm cannot be directly extended to a parallel implementation. Instead, a hybrid model is implemented to harness the combined computing power of the CPU and GPU more effectively.
 
Keywords: Algorithms, Performance Optimization, GTC 2016 - ID P6230
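The recurrence itself is short; the sketch below computes the characteristic polynomial coefficients with the textbook Faddeev-Leverrier iteration, making the step-to-step dependency visible. A serial illustration, not the authors' hybrid CPU/GPU code:

```python
def matmul(A, B):
    """Dense n x n matrix product (pure Python, for illustration)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def faddeev_leverrier(A):
    """Coefficients [1, c1, ..., cn] of det(lambda*I - A) via the
    Faddeev-Leverrier recurrence M_k = A (M_{k-1} + c_{k-1} I),
    c_k = -trace(M_k) / k. Each step needs the previous one, which is
    the recursion a hybrid parallel scheme must work around."""
    n = len(A)
    coeffs = [1.0]
    M = [[0.0] * n for _ in range(n)]
    c = 1.0
    for k in range(1, n + 1):
        for i in range(n):
            M[i][i] += c          # M_{k-1} + c_{k-1} I
        M = matmul(A, M)
        c = -sum(M[i][i] for i in range(n)) / k
        coeffs.append(c)
    return coeffs

# diag(2, 3) has characteristic polynomial l^2 - 5 l + 6.
coeffs = faddeev_leverrier([[2.0, 0.0], [0.0, 3.0]])
```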
 
Fast Parallel Bulk Insertion in GPU MOLAP Databases
Steffen Wittmer (Jedox AG)
This work focuses on input processing of big data streams in Jedox's GPU-accelerated in-memory OLAP (MOLAP) database. We present a solution that supports fast insertion of high data volumes by avoiding the compute-expensive task of multidimensional sorting during the actual insertion phase. The main processing step achieves a significant speedup over the existing CPU-only version.
 
Keywords: Algorithms, Big Data Analytics, GTC 2016 - ID P6256
 
cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs
Cheng Wang (University of Houston)
The Fast Fourier Transform (FFT) is one of the most important numerical tools in scientific and engineering applications. The algorithm performs O(N log N) operations on N input data points even when only a small number k of output coefficients are large, while the remaining N - k are zero or negligibly small. The FFT is clearly inefficient when N input points lead to only k << N non-zero coefficients in the transformed domain. The sparse FFT (sFFT) algorithm addresses this problem. In this poster, we present a parallel sFFT algorithm on GPUs using CUDA. Our CUDA-based sFFT, cusFFT, performs over 10x faster than the state-of-the-art cuFFT library on GPUs and over 28x faster than parallel FFTW on multicore CPUs.
 
Keywords: Algorithms, GTC 2016 - ID P6261
 
Radar Signal Processing on GPUs and Performance Comparison with Vector Processors
Peter Joseph Basil Morris (Defense Research & Development Organisation)
We investigate the computing capabilities of GPUs for radar signal processing through the realization of a radar signal processor on a GPU, leveraging the inherent parallelism of radar signal processing algorithms and the extensive computing capability of the GPU.
 
Keywords: Algorithms, Signal & Audio Processing, GTC 2016 - ID P6264
 
One Kernel to Rule Them All: Performance-Portable FMM for CPUs and GPUs
Ivo Kabadshow (Juelich Supercomputing Centre)
We focus on a performance-portable C++ implementation of a scientific algorithm using a single code base that runs on both CPUs and GPUs. We present our core algorithm -- the fast multipole method -- embedded in a stack of abstraction layers, allowing us to achieve portability without maintaining separate kernels for each architecture. In addition, we review common implementation pitfalls that may help other developers aiming at a unified code base; memory allocation, memory access, and the abstraction of SIMT for complex user-defined data structures are investigated in particular. Finally, we present performance results and comparisons on a CPU and a GPU.
 
Keywords: Algorithms, Supercomputing & HPC, GTC 2016 - ID P6265
 
A Parallel Floyd-Warshall Algorithm on GPU
Roussian Gaioso (Universidade Federal de Sao Carlos)
We propose a new parallel algorithm for solving the all-pairs shortest paths (APSP) problem. The algorithm is based on Floyd-Warshall and therefore inherits some of its advantages, such as predictable performance regardless of the underlying graph structure. It was efficiently implemented on a machine with a many-core GPU, which is less expensive than a cluster of computers. Tests were performed on a Tesla C2075 graphics card. The implementation identified the shortest paths among all pairs of vertices of randomly generated graphs (each with at most 8,192 vertices) in less than 15 seconds, a speedup of 150x over the sequential Floyd-Warshall algorithm.
 
Keywords: Algorithms, Performance Optimization, GTC 2016 - ID P6272
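For reference, the serial Floyd-Warshall kernel is the triple loop below; for a fixed k, all (i, j) updates are independent, which is the parallelism a GPU version exploits while the k iterations stay sequential. A generic sketch:

```python
def floyd_warshall(dist):
    """In-place all-pairs shortest paths on an adjacency matrix
    (float('inf') marks a missing edge). For each fixed k, every
    (i, j) relaxation is independent of the others, so a GPU can run
    the inner two loops as one parallel kernel launch per k."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Tiny 3-vertex digraph: edges 0->1 (5), 1->2 (2), 2->0 (1).
INF = float("inf")
paths = floyd_warshall([[0.0, 5.0, INF], [INF, 0.0, 2.0], [1.0, INF, 0.0]])
```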
 
Parallelization of Graph Algorithms on GPU Using CUDA®
Chetan Pise (Yeshwantrao Chavan College of Engineering Nagpur, India)
Graphs are central to science and engineering, for instance in shortest-path computation. Large graphs with millions of vertices and edges are common in scientific and engineering applications, and parallel computation is essential for operating on them quickly. GPUs offer high computational power at a low price, and CUDA has become a mainstream programming approach for GPGPU; a multithreaded CUDA device runs many threads in parallel on the GPU. We demonstrate a comparison between serial and parallel implementations of the BFS and Dijkstra algorithms.
 
Keywords: Algorithms, Performance Optimization, GTC 2016 - ID P6285
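A level-synchronous BFS, sketched below in plain Python, mirrors the usual GPU formulation (one thread per frontier vertex per level). This is a generic sketch, not the poster's CUDA code:

```python
def bfs_levels(adj, src):
    """Level-synchronous BFS on an adjacency-list graph. Each outer
    iteration expands the entire frontier at once, mirroring how a
    CUDA kernel assigns one thread per frontier vertex per level.
    Returns the hop distance of every vertex (-1 if unreachable)."""
    dist = [-1] * len(adj)
    dist[src] = 0
    frontier = [src]
    level = 0
    while frontier:
        level += 1
        nxt = []
        for u in frontier:           # on a GPU: one thread per u
            for v in adj[u]:
                if dist[v] == -1:    # on a GPU: atomic visited check
                    dist[v] = level
                    nxt.append(v)
        frontier = nxt
    return dist

# Diamond graph 0 -> {1, 2} -> 3.
levels = bfs_levels([[1, 2], [3], [3], []], 0)
```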
 
Evolutionary Methodology Framework for GPUs
Mihaly Retek (Corvinus University of Budapest)
In evolutionary methods, many processes of the same type can run in parallel, each connected to different source and target datasets. For this reason, these methods map well onto SIMD architectures. This poster shows an evolutionary framework in which evolutionary algorithms can be developed for GPUs and CPUs. The "Implemented Method" section of this poster is the foundation of this methodology and allows for the creation of more advanced forecasting.
 
Keywords: Algorithms, Deep Learning & Artificial Intelligence, GTC 2016 - ID P6286
 
A Rasterization Based Line Segment Intersection Algorithm for Urban Mobility Simulations
Benjamin Hernandez (Oak Ridge National Laboratory)
Road network data is an important component used to model city mobility. However, in volunteered geographic information such as OpenStreetMap, road intersections are often incomplete or invalid. A line segment intersection algorithm can correct this issue, but the naive algorithm has O(N^2) complexity, and one of the best solutions, the Bentley-Ottmann algorithm, O(N log N). We propose a GPGPU alternative that uses OpenGL 4 rasterization, per-pixel linked lists, and almost-zero-driver-overhead functions. Results show our method offers a speedup of 87x over these algorithms.
 
Keywords: Algorithms, Performance Optimization, GTC 2016 - ID P6345
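The per-pair test underlying any segment intersection algorithm is an orientation predicate; a minimal sketch follows (proper intersections only; shared endpoints and collinear overlaps are deliberately not handled). This illustrates the O(N^2) pairwise test, not the poster's rasterization method:

```python
def orient(p, q, r):
    """Sign of the 2D cross product (q - p) x (r - p):
    +1 for a left turn, -1 for a right turn, 0 for collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_intersect(a, b, c, d):
    """Proper-intersection test for segments ab and cd: the segments
    cross iff each one separates the other's endpoints. Applying this
    to all N^2 pairs is the brute-force baseline the rasterization
    approach on the poster avoids."""
    return (orient(a, b, c) != orient(a, b, d) and
            orient(c, d, a) != orient(c, d, b))

# The two diagonals of a unit square cross; two parallel edges do not.
crossing = segments_intersect((0, 0), (2, 2), (0, 2), (2, 0))
```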
 
GPU-Accelerated Molecular Dynamics Simulations for Systems with Lennard-Jones Type Potential
Jose Maria Zamora (Lufac Computacion S.A. de C.V.)
This work shows an implementation of a basic algorithm for studying molecular systems interacting through a Lennard-Jones type potential. We present a parallelization strategy using CUDA to accelerate the computations on a GPU. Reviewing the results of simulations with a large number of particles (about 1 million), different equilibrium states are observed depending on the initial arrangement of particles. The cause is that the initial arrangements have different total energies due to pressure differences, which depend on the initial geometric configuration of particles in the cubic simulation box. These differences become more pronounced when the number of particles exceeds 10^5.
 
Keywords: Algorithms, Computational Physics, GTC 2016 - ID P6351
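For reference, the Lennard-Jones pair potential evaluated for every interacting pair is the standard expression below; this is a generic evaluation in reduced units, not the poster's CUDA kernel:

```python
def lj_potential(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair potential
        V(r) = 4 * eps * ((sigma/r)**12 - (sigma/r)**6).
    Evaluating this (and its derivative, the force) for every particle
    pair is the O(N^2) inner loop such MD codes offload to the GPU."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

# The potential vanishes at r = sigma and reaches its minimum -epsilon
# at r = 2**(1/6) * sigma.
v_min = lj_potential(2.0 ** (1.0 / 6.0))
```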
 
GPU-Accelerated Batch-ACPF Solution for N-1 Static Security Analysis
Gan Zhou (Southeast University)
GPUs have been applied successfully in many scientific computing realms and have great potential in power system applications. N-1 static security analysis (SSA) is a candidate application in which massive numbers of alternating current power flow (ACPF) problems must be solved. However, when applying existing GPU-accelerated algorithms to the N-1 SSA problem, the degree of parallelism is limited because existing research has focused on accelerating the solution of a single ACPF. This work proposes a GPU-accelerated solution that adds a layer of parallelism across batches of ACPFs and consequently achieves a much higher level of parallelism. Solving SSA on a Tesla K20c, the GPU method achieves up to a 57.6x speedup over its CPU counterpart on a Xeon E5-2620.
 
Keywords: Algorithms, Other, GTC 2016 - ID P6109
 
CUDA Accelerated Cross Validated Best Subset Selection with XLSTAT
Arnaud Belletoile (Addinsoft)
We present our implementation of cross-validated best subset selection for linear regression, the latest GPU-enabled feature in our statistical solution XLSTAT. It is based on the binary tree regressions first proposed by Furnival & Wilson and is implemented through a QR factorization and subsequent updates of the R matrix using the cuSOLVER library. The final model selection step is a leave-one-out cross-validation test.
 
Keywords: Algorithms, Other, GTC 2016 - ID P6194
 
GPU Parallelization of a Distance Field Solver
Anup Shrestha (Boise State University)
Propagating interfaces occur in a wide variety of fields, including fluid mechanics and computer graphics. The distance field from an interface can be calculated by solving the Eikonal equation at each node using the Fast Sweeping Method (FSM) [Zhao, 2004]. However, parallelizing FSM is not straightforward. We propose a parallel algorithm using Cuthill-McKee ordering that is suitable for massively threaded architectures. We implement and compare parallel FSM algorithms using CUDA, OpenACC, and MPI. The best performance is achieved using CUDA and the parallel algorithm of Detrixhe et al., while a comparable speedup is achieved with OpenACC using only a few directives, substantially shortening the development cycle.
 
Keywords: Algorithms, Other, GTC 2016 - ID P6257
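In one dimension the fast sweeping idea reduces to two alternating passes, sketched below; the dependency of each node on its already-updated neighbour within a sweep is what makes naive GPU parallelization hard. A generic sketch, not the Cuthill-McKee-ordered algorithm of the poster:

```python
def fsm_distance_1d(frozen, h=1.0, sweeps=2):
    """1D fast-sweeping solve of |du/dx| = 1: distance to the nearest
    frozen (interface) node on a uniform grid of spacing h.
    Alternating sweep directions propagate information both ways; in
    1D two passes suffice. Within a sweep, each update reads the
    neighbour just written, the serial dependency that parallel FSM
    variants reorder (e.g., by level sets of a node ordering)."""
    INF = float("inf")
    n = len(frozen)
    u = [0.0 if f else INF for f in frozen]
    for _ in range(sweeps):
        for i in range(1, n):            # left-to-right sweep
            u[i] = min(u[i], u[i - 1] + h)
        for i in range(n - 2, -1, -1):   # right-to-left sweep
            u[i] = min(u[i], u[i + 1] + h)
    return u

# One frozen node at index 2 on a 4-node grid.
dist = fsm_distance_1d([False, False, True, False])
```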
 
GPU-Accelerated Neighborhood Operators for Permutation-Based Problems
Victor Machado (Fluminense Federal University)
This poster presents an efficient GPU implementation of four neighborhood operators commonly applied in the local search of many metaheuristics for permutation-based problems such as the Traveling Salesman Problem and the Single Row Facility Layout Problem. Although many optimization problems have been solved through GPU parallelization in recent years, the authors are not aware of a thorough analysis of the neighborhood moves themselves. We therefore evaluate the neighborhood operators rather than a specific metaheuristic. The parallel approach achieved good results compared to the CPU version, with speedups ranging from 14x to 68x.
 
Keywords: Algorithms, Other, GTC 2016 - ID P6273
Astronomy & Astrophysics
Photometry of Fractal Meshes for Applications to Large-Scale Rough Planetary Surfaces
Antonio Gracia Berna (University of Bern)
The photometry measured by spacecraft during space missions provides important information about planetary surface composition and properties, such as roughness, which influences the photometry. The model by B. Hapke has been one of the most widely used for fitting photometric data, but it has drawbacks. We present a GPU-accelerated technique that simulates the photometry produced on large-scale rough surfaces as the interaction of millions of light rays. Reflectance values measured in the laboratory from real samples are used in the simulation. To prove the validity of the approach, a comparison with the Hapke model is proposed. This is a first step toward relating real laboratory measurements to the photometry of solar system surfaces observed by past and future missions.
 
Keywords: Astronomy & Astrophysics, Computational Physics, GTC 2016 - ID P6134
 
N-Body Simulation of Binary Star Mass Transfer Using NVIDIA GPUs
Baylor Fain (Tarleton State University), Taylor Hutyra (Tarleton State University), Edward Smith (Tarleton State University)
Over 70% of the stars in our galaxy are in binary systems. Because of their interaction, the masses of these stars can be found using Newton's and Kepler's laws, which allows astronomers to use these systems to study the properties and processes of stars and galaxies. Among the many types of binary stars observed, contact systems are the most interesting because they exhibit mass transfer, changing the evolution of both stars. But due to the lack of precise observational data and the large time scale of the process, mass transfer remains poorly understood. In this work, a model was built to give astronomers a method for gaining deeper knowledge and visual intuition of how mass transfer between binary stars takes place.
 
Keywords: Astronomy & Astrophysics, Computational Physics, GTC 2016 - ID P6197
 
Angular Momentum of Late Lunar Forming Impacts Using NVIDIA GPUs
Jonathan Petz (Tarleton State University), William Sumpter (Tarleton State University), Ty Turner (Tarleton State University)
Our Moon is no ordinary satellite! It is too large to be a captured asteroid. Could it be a twin planet formed alongside Earth as our solar system was being created? Or perhaps a captured rocky planet, forced to light our night and give lovers inspiration? Though this is romantic, the true answer is thought to be much more violent. The Moon is believed to have been born in a violent encounter between two young proto-planets. This giant impact hypothesis (GIH) is the leading theory for the formation of our Moon, but it has been questioned recently because simulations of the GIH leave the Earth-Moon system with excess angular momentum. In this work, we show how to remove the excess angular momentum from giant impact simulations while preserving the desired results of previous giant impact studies.
 
Keywords: Astronomy & Astrophysics, Computational Physics, GTC 2016 - ID P6200
 
Data Reduction for Cherenkov Gamma-Ray Astronomy on Jetson TK1
Alberto Madonna (Italian National Institute for Astrophysics (INAF))
A mini-array of ASTRI SST-2M Cherenkov telescopes will soon be deployed at a remote site, far from human activity, to achieve optimal observing conditions for gamma-ray astronomy. In such a scenario, the capability of each telescope to process its own data before sending it to a central acquisition system provides a key advantage. We implemented the complete analysis chain required by a single telescope on a Jetson TK1 development board, exceeding the required real-time processing speed by more than a factor of two while staying within a very small power budget.
 
Keywords: Astronomy & Astrophysics, Embedded, GTC 2016 - ID P6233
 
Non-Uniform Diffusion of the Solar Surface Magnetic Field: Code Acceleration Using OpenACC for both GPUs and x86
Ronald Caplan (Predictive Science Inc.)
We show the results of implementing OpenACC in a Fortran code for non-uniform diffusion time integration. The code's application is to smooth observation-based radial magnetic field maps of the solar surface for use as inner boundary conditions of global magnetohydrodynamic simulations of the corona and heliosphere. The code uses an RKL2 super-time-stepping algorithm that allows time steps far exceeding the standard explicit stability limit. The algorithm remains explicit, making the code a prime target for OpenACC acceleration. We discuss the OpenACC implementation and show speedup results. The newly released OpenACC x86 feature in the PGI compiler is also tested and shown to produce multicore CPU code from the OpenACC directives that can outperform our OpenMP implementation.
 
Keywords: Astronomy & Astrophysics, Computational Physics, GTC 2016 - ID P6259
 
Implementation of a Real-Time Polyphase Filter in Radio Astronomy
Karel Adamek (University of Oxford)
We present our implementation of a polyphase filter for real-time data processing in radio astronomy. The polyphase filter is a standard tool in digital signal processing and, as such, a well-established algorithm. We have implemented it on three generations of NVIDIA GPUs (Fermi, Kepler, Maxwell), as well as on Intel Xeon CPUs and the Xeon Phi (Knights Corner) platform. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first uses the L1/texture cache, the second uses shared memory. We present our results in terms of the sample rate that can be processed per second.
 
Keywords:
Astronomy & Astrophysics, Signal & Audio Processing, GTC 2016 - ID P6281
Download:
Computational Biology
Presentation
Media
GPU-Enabled Monte Carlo Simulation Makes Study of Cardiac Arrhythmia Possible
Mohsin Jafri (George Mason University)
Heart disease is the leading cause of death in the developed world. Many of these deaths occur through fatal arrhythmia. Multi-scale computational models are required to integrate data to understand the complex dynamics of the myriad components that comprise the heart. Stochastic simulations that integrate the function of individual proteins called ion channels, which number in the millions in the cardiac muscle cell, are needed to understand arrhythmia. Simulations of such computational complexity are now possible due to our patented Ultra Fast Monte Carlo Algorithm and its implementation on GPUs.
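To give a flavor of stochastic ion-channel simulation (this is an illustrative toy, not the poster's patented algorithm), the sketch below runs a fixed-step Monte Carlo over a population of hypothetical two-state (closed/open) channels; real cardiac models use multi-state Markov schemes, but the per-channel independence that makes GPUs attractive is the same.

```python
import random

def simulate_channels(n_channels, steps, dt, k_open, k_close, seed=0):
    """Fixed-step Monte Carlo of two-state (closed <-> open) ion channels.
    Each step, every channel flips state with probability rate * dt.
    Returns the fraction of open channels at the end."""
    rng = random.Random(seed)
    open_state = [False] * n_channels
    for _ in range(steps):
        for i in range(n_channels):
            if open_state[i]:
                if rng.random() < k_close * dt:
                    open_state[i] = False  # channel closes
            else:
                if rng.random() < k_open * dt:
                    open_state[i] = True   # channel opens
    return sum(open_state) / n_channels
```

With equal opening and closing rates, the open fraction relaxes toward the analytic steady state k_open / (k_open + k_close) = 0.5, which is a simple correctness check.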
 
Keywords:
Computational Biology, Algorithms, GTC 2016 - ID P6173
Download:
 
Modeling and Simulation of an Atrium Fiber Using GPUs
John Osorio (Universidad Tecnologica de Pereira)
The particle-in-cell (PIC) method is a computational method for solving theoretical models such as the kinetic description of plasma. Studying plasma requires understanding the behavior of the particles through differential equations such as the Vlasov-Poisson equations. Simulations with methods such as PIC consume considerable computational resources due to the number of particles involved. This work presents a 2D PIC code that simulates the two-stream instability in a cyclic, conservative simulation space, introducing a 2D PIC CUDA implementation that improves execution time by 17x for a grid of 64x64 cells with 800,000 particles and up to 13x for a grid of 517x517 cells with 800,000 particles.
 
Keywords:
Computational Biology, Supercomputing & HPC, GTC 2016 - ID P6182
Download:
 
Accelerating Protein Sequences and Classification Using GPU-HMMER Search
Pragati Dharmale (SNHU NH), Mahesh Khadtare (Pune University)
This poster presents the results of parallelizing HMMer, a widely used tool for protein sequence homology detection, functional annotation of homologous protein sequences, and protein family classification. The HMMer program is based on a Viterbi algorithm coded in C and is quite time consuming. We restructure the Viterbi algorithm to port it to GPGPUs.
 
Keywords:
Computational Biology, Astronomy & Astrophysics, GTC 2016 - ID P6218
Download:
 
GPU Implementation of Protein Morphing Algorithm
Chengbin Hu (University of South Florida)
Computational modeling of ligand-protein binding has become a popular initial approach to designing new drugs. Most current structure-based drug design focuses on the docking of static crystal protein structures. Under natural conditions, proteins are dynamic, adopting different conformations to perform vital functions. The purpose of a protein morphing algorithm is to estimate dynamic protein conformations computationally. In this research, we implement the morphing algorithm by linear interpolation and use massively parallel programming to calculate and adjust the intermediate protein poses. This algorithm can generate intermediate morphing poses at very high speed.
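The linear-interpolation step named in the abstract can be sketched in a few lines; each atom's coordinates are blended independently, which is exactly why the computation parallelizes so well on a GPU. The function name and coordinate layout below are illustrative assumptions.

```python
def morph_poses(start, end, n_frames):
    """Linearly interpolate between two conformations, given as lists of
    (x, y, z) atom coordinates, producing n_frames intermediate poses."""
    assert len(start) == len(end)
    poses = []
    for f in range(1, n_frames + 1):
        t = f / (n_frames + 1)  # interpolation parameter in (0, 1)
        poses.append([tuple((1 - t) * a + t * b for a, b in zip(p, q))
                      for p, q in zip(start, end)])
    return poses
```

A GPU version would assign one thread per atom (or per coordinate), since no interpolation depends on any other.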
 
Keywords:
Computational Biology, GTC 2016 - ID P6224
Download:
 
GPU-Accelerated Simulations of Evolution for Medical and Population Genetics
David Lawrie (N/A)
Learn how the analysis of whole-genome SNP data will be revolutionized by accelerating the Wright-Fisher (WF) algorithm on GPUs. The tools of population genetics are crucial for detecting adaptive and disease alleles, as well as tracking population changes over time. The forward WF simulation is powerful in its ability to model complex population histories and selection scenarios, but is limited in its practical applications by its slow execution on the CPU. The presented GPU Optimized WF simulation (GO Fish) keeps the full flexibility of its serial, CPU counterpart while running far faster. As other related, computationally intensive algorithms important in population genetics are likewise parallelizable, GO Fish serves as an exciting template for future research in the field.
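For readers unfamiliar with the forward Wright-Fisher model, a minimal serial sketch follows (an illustration, not GO Fish itself): each generation, the next allele frequency is drawn binomially from the current selection-weighted frequency. The binomial draw per site per generation is the work a GPU version parallelizes across many independent sites.

```python
import random

def wright_fisher(pop_size, p0, generations, s=0.0, seed=1):
    """Forward Wright-Fisher simulation of one biallelic site.
    Each generation, 2N gametes are drawn binomially with the allele's
    (selection-weighted) frequency; returns the frequency trajectory."""
    rng = random.Random(seed)
    freq = p0
    traj = [freq]
    for _ in range(generations):
        # selection shifts the sampling probability toward the favored allele
        w = freq * (1 + s) / (freq * (1 + s) + (1 - freq))
        # binomial draw as 2N Bernoulli trials (a GPU code vectorizes this)
        count = sum(rng.random() < w for _ in range(2 * pop_size))
        freq = count / (2 * pop_size)
        traj.append(freq)
        if freq in (0.0, 1.0):  # allele lost or fixed
            break
    return traj
```

Simulating thousands of independent SNPs means thousands of such trajectories, which is an embarrassingly parallel workload.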
 
Keywords:
Computational Biology, Algorithms, GTC 2016 - ID P6296
Download:
 
A GPU-Accelerated Statistical Method to Identify Differential Genetic Dependencies
Gil Speyer (The Translational Genomics Research Institute)
We have developed a GPU implementation of a statistical method to identify gene sets enriched with condition-specific genetic dependencies. The statistical rigor of the method incurs a substantial computational burden, motivating this effort. Starting with pairwise comparisons between each set of nodes in the network, edges across the distribution of networks are determined in parallel. After network information has been condensed, a unique list of networks is determined, and the computation is then decomposed across all unique network nodes to compute the divergence. Initial implementation showed more than two orders of magnitude acceleration.
 
Keywords:
Computational Biology, GTC 2016 - ID P6339
Download:
Computational Chemistry
Presentation
Media
A Highly Scalable Kernel Based Clustering Algorithm for Molecular Dynamics
Marco Jacopo Ferrarotti (Italian Institute of Technology)
Presented is a novel distributed clustering algorithm based on kernel k-means. The algorithm is carefully designed and crafted around the specific needs of molecular dynamics applications but is nevertheless general enough to be applied and verified against standard clustering datasets. Scalability and good performance on modern GPU-equipped HPC facilities are the key points of the design discussed in the poster. The computational burden is addressed by devising a smart distribution strategy and through an iterative mini-batch approach. GPU acceleration is introduced to speed up the expensive evaluation of the kernel matrix. A three-stage pipeline to hide PCIe latency is described, along with a producer-consumer pattern that allows further overlap between GPU and CPU computations.
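To make the kernel k-means core concrete (this sketch omits the poster's distribution and mini-batch machinery), note that the feature-space distance from a point to a cluster centroid needs only kernel evaluations, never explicit feature vectors. A minimal serial version, with a hypothetical RBF kernel and deterministic round-robin initialization:

```python
import math

def rbf(u, v, gamma=2.0):
    """Gaussian (RBF) kernel between two tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def kernel_kmeans(points, k, kernel, iters=50):
    """Kernel k-means: cluster in the implicit feature space using only
    kernel evaluations. The squared feature-space distance of point i to
    centroid c is K[i][i] - 2*mean_j(K[i][j]) + mean_{j,j'}(K[j][j'])
    over members j of cluster c."""
    n = len(points)
    # the kernel matrix -- the expensive part that the poster puts on GPU
    K = [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]
    labels = [i % k for i in range(n)]  # deterministic round-robin init
    for _ in range(iters):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        def dist(i, c):
            m = members[c]
            if not m:
                return float("inf")  # empty cluster: never chosen
            intra = sum(K[a][b] for a in m for b in m) / len(m) ** 2
            return K[i][i] - 2 * sum(K[i][j] for j in m) / len(m) + intra
        new = [min(range(k), key=lambda c: dist(i, c)) for i in range(n)]
        if new == labels:
            break
        labels = new
    return labels
```

The O(n^2) kernel matrix is exactly why the poster introduces GPU acceleration and mini-batching for molecular-dynamics-sized datasets.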
 
Keywords:
Computational Chemistry, Supercomputing & HPC, GTC 2016 - ID P6178
Download:
Computational Fluid Dynamics
Presentation
Media
User-Defined Drag Models on the GPU
Andrew Larson (CPFD Software LLC)
With CUDA, GPU processing is not relegated to matrix operations. Integrating GPU acceleration into commercial products requires both performance and flexibility. The speedup from GPU parallelization is often paramount, but with the flexibility to easily support more generalized usage, large performance gains need not be lost to a limited scope of application. We demonstrate the feasibility of GPU acceleration with end-user support for custom drag models supplied at runtime, greatly improving overall usability without sacrificing much performance.
 
Keywords:
Computational Fluid Dynamics, Tools & Libraries, GTC 2016 - ID P6107
Download:
 
Real-Time Simulation and Prognosis of Smoke Propagation in Underground Stations: Roadmap
Anne Severt (Forschungszentrum Julich GmbH)
Real-time simulations of smoke propagation during fires in complex geometries are challenging: accuracy sacrificed for the sake of computing time could impact rescue decisions. We present a roadmap toward real-time simulation and prognosis software for smoke propagation in underground stations. First, a fractional-step method using finite differences is implemented. Second, we evaluate the lattice Boltzmann method for its high parallelization capability and its handling of complex boundaries and grids. By including live data through sensor coupling, accuracy is maintained, and the prognosis is improved by ensemble simulation. Further acceleration is accomplished by dynamically extending the computational domain and by multi-GPU support.
 
Keywords:
Computational Fluid Dynamics, Real-Time Graphics, GTC 2016 - ID P6145
Download:
 
A GPU Parallel Solver for 3D Incompressible Navier-Stokes Equations Discretized by the SUPG/PSPG Stabilized Finite Element Formulation
Viet Huynh Quang Huy (Graduate School of Environmental and Life Science, Okayama University, Japan)
The discretization of the Navier-Stokes equations by the SUPG/PSPG stabilized finite element formulation leads to a large, sparse, nonsymmetric system of linear equations. Such nonsymmetric systems are often solved with iterative methods such as the biconjugate gradient stabilized method (Bi-CGStab). Among the variants of the Bi-CGStab algorithm proposed by various researchers, the GPBi-CG algorithm has been shown to have very good convergence behavior. In this poster, we propose an efficient GPU implementation of a parallel solver based on the GPBi-CG algorithm for the 3D Navier-Stokes equations discretized by the SUPG/PSPG stabilized finite element formulation.
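For orientation, here is a minimal dense Bi-CGStab sketch, the baseline method the abstract names (the poster's GPBi-CG is a related but distinct variant). Production solvers use sparse storage and offload the matrix-vector products and dot products, the dominant costs, to the GPU; the code below is purely illustrative.

```python
def bicgstab(A, b, tol=1e-10, max_iter=200):
    """Bi-CGStab for a dense nonsymmetric system Ax = b (lists of lists)."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    x = [0.0] * n
    r = [bi - axi for bi, axi in zip(b, matvec(x))]
    r_hat = r[:]                       # fixed shadow residual
    rho = alpha = omega = 1.0
    v = p = [0.0] * n
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(p)
        alpha = rho / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        t = matvec(s)
        tt = dot(t, t)
        if tt == 0.0:                  # s vanished: x + alpha*p is exact
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            break
        omega = dot(t, s) / tt
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if dot(r, r) ** 0.5 < tol:
            break
    return x
```

Each iteration costs two matrix-vector products and a handful of reductions; on a GPU these map onto sparse-matvec kernels and parallel reductions.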
 
Keywords:
Computational Fluid Dynamics, Computational Physics, GTC 2016 - ID P6160
Download:
 
Pore-Network Simulation of Fluid Flow and Transport in Porous Media on GPUs
Hassan Dashtian (University of Southern California)
Networks of interconnected resistors, springs and beams, or pores are standard models for studying scalar and vector transport processes in heterogeneous materials and media, such as fluid flow in porous media, and conduction, deformations, and electric and dielectric breakdown in heterogeneous solids. We developed an algorithm that uses the computational power of GPUs to speed up calculations with pore and resistor networks; the same algorithm can be used with networks of springs or beams. A mixed-precision algorithm, together with the conjugate-gradient method, has been implemented in a single-GPU solver. We achieve a speedup factor of 60X and can simulate very large networks with several million sites.
 
Keywords:
Computational Fluid Dynamics, Energy Exploration, GTC 2016 - ID P6209
Download:
 
Interactive Boussinesq-Type Simulation and Visualization of Water Wave Propagation on GPU
Sasan Tavakkol (University of Southern California)
A coastal wave simulation and visualization software is developed based on the Boussinesq equations. Both the simulation and its concurrent visualization are performed on the GPU using the DirectX API. The software provides faster-than-real-time, interactive modeling for the first time in coastal engineering. A model running faster than real time can be valuable in disaster forecasting, naval navigation, and any time-sensitive project. Interactivity lets scientists and engineers test different scenarios and see the results on the go.
 
Keywords:
Computational Fluid Dynamics, Virtual Reality & Augmented Reality, Real-Time Graphics, GTC 2016 - ID P6213
Download:
 
GPU Based Fluid Structure Interaction
Christopher Minar (Oregon State University)
Computational fluid dynamics requires solving a large system of equations, a massively parallel problem well suited to GPUs. We present work on a GPU-based computational fluid dynamics solver. The solver uses the immersed boundary method, which allows the Navier-Stokes equations to be solved on a structured Cartesian grid. This avoids remeshing or updating overset meshes every time an immersed body moves.
 
Keywords:
Computational Fluid Dynamics, GTC 2016 - ID P6241
Download:
 
Parallel Algorithms for Unsteady Navier-Stokes Flow on GPU Architectures
Bahareh Mostafazadeh Davani (University of California, Irvine)
We present HiPer, a high-performance algorithm for unsteady Navier-Stokes flow on GPU systems. Compared to other simulations, our approach has two distinct characteristics: (1) we achieve a high percentage of the machine's peak performance, and (2) we can adapt to current and future heterogeneous systems. We present results on 1 million grid points, achieving a 37X speedup over prior state-of-the-art software. We designed HiPer as a building block for next-generation CFD.
 
Keywords:
Computational Fluid Dynamics, Algorithms, GTC 2016 - ID P6266
Download:
 
An Off-Load Model for Computing on GPU for a Parallel CFD Solver HiFUN
Munikrishna Nagaram (S & I Engineering Solutions Pvt. Ltd.), Balakrishnan Narayanarao (Indian Institute of Science, Bangalore), Thejaswi Rao (NVIDIA), Nikhil Shende (S & I Engineering Solutions Pvt. Ltd.)
The present study deals with porting the computational fluid dynamics flow solver HiFUN, a proprietary software by S & I Engineering Solutions Pvt. Ltd., to a GPU-based accelerator platform using OpenACC directives. HiFUN is already parallelized on distributed-memory HPC platforms using MPI and exhibits excellent scalability; a recent study demonstrated scaling over 15,000 processor cores on a Cray XC40. The challenge at hand is to port the HiFUN solver to accelerator-based HPC clusters without compromising its scalability. The presentation includes details on the use of OpenACC directives, wherein compute-intensive tasks are offloaded to the GPU. The success of this strategy in realizing the objectives with minimal code change is also highlighted.
 
Keywords:
Computational Fluid Dynamics, Supercomputing & HPC, GTC 2016 - ID P6298
Download:
Computational Physics
Presentation
Media
Non-Equilibrium GPU Simulation of Entangled Polymer Melts by Extending HOOMD-Blue
Ludwig Schneider (University of Gottingen -- Institute for Theoretical Physics)
Molecular dynamics (MD) simulations for melts of multicomponent polymer systems with an experimental value of the invariant degree of polymerization are not tractable with conventional techniques and CPU implementations. For standard GPU-accelerated MD simulations, the HOOMD-blue software package provides an excellent solution. The simulation of highly coarse-grained polymer melts requires specialized techniques to investigate non-equilibrium properties. These techniques (the slip-spring model) are physically motivated; we discuss their algorithmic characteristics and their parallel implementation for GPUs. We evaluate the results with HOOMD-blue benchmarks extended by the new implementations.
 
Keywords:
Computational Physics, Computational Fluid Dynamics, GTC 2016 - ID P6144
Download:
 
Fourier-Stepping Implementation for Non-Linear Schrodinger Equations Using OCCA.
Andreas Mieritz (DTU)
We present initial results for the creation of a GPU-based split-step Fourier solver for the nonlinear Schrodinger equations. For this poster, the OCCA framework was chosen; it shows great promise for unifying the various parallel programming languages available today. These results are preliminary, and we fully expect to have a much better implementation of the Fourier transform operations before the conference.
 
Keywords:
Computational Physics, Programming Languages, GTC 2016 - ID P6171
Download:
 
Implementation of the Vlasov-Poisson Equations in 2D Using PIC on GPUs
John Osorio (Universidad Tecnologica de Pereira)
The particle-in-cell (PIC) method is a computational method for solving theoretical models such as the kinetic description of plasma. Studying plasma requires understanding the behavior of the particles through differential equations such as the Vlasov-Poisson equations. Simulations with methods such as PIC consume considerable computational resources due to the number of particles involved. This work presents a 2D PIC code that simulates the two-stream instability in a cyclic, conservative simulation space, introducing a 2D PIC CUDA implementation that improves execution time by 17x for a grid of 64x64 cells with 800,000 particles and up to 13x for a grid of 517x517 cells with 800,000 particles.
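One core kernel of any PIC step is charge deposition; as a hedged illustration (not the poster's code), the sketch below does 2D cloud-in-cell (bilinear) scatter onto a periodic grid. On a GPU this scatter is the tricky part, since particles in the same cell contend for the same grid node, which is typically handled with atomics or particle sorting; the field solve and particle push are omitted here.

```python
def deposit_charge(positions, grid_shape):
    """Cloud-in-cell (bilinear) charge deposition, the scatter step of a
    2D PIC code. Each unit-charge particle spreads its charge to the four
    surrounding grid nodes with bilinear weights; the grid is periodic."""
    nx, ny = grid_shape
    rho = [[0.0] * ny for _ in range(nx)]
    for x, y in positions:
        i, j = int(x), int(y)          # lower-left node
        fx, fy = x - i, y - j          # fractional offsets inside the cell
        for di, dj, w in ((0, 0, (1 - fx) * (1 - fy)),
                          (1, 0, fx * (1 - fy)),
                          (0, 1, (1 - fx) * fy),
                          (1, 1, fx * fy)):
            rho[(i + di) % nx][(j + dj) % ny] += w
    return rho
```

The bilinear weights sum to one per particle, so total charge is conserved exactly, a useful invariant to test against a CUDA port.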
 
Keywords:
Computational Physics, Supercomputing & HPC, GTC 2016 - ID P6183
Download:
 
GPU Accelerated Computation of the Beta Function of the SU(3) Gauge Theory with Ten Fundamental Fermions
Ting-Wai Chiu (National Taiwan University)
Recent experiments at the Large Hadron Collider at CERN have discovered the Higgs scalar at a mass of ~125 GeV. Even though the Higgs scalar is an elementary particle in the Standard Model, there is still a possibility that such a light scalar might arise as a composite particle in non-abelian gauge theories with many fermions, provided that the theory is not too far below the conformal window. This study focuses on the SU(3) gauge theory with 10 massless domain-wall fermions in the fundamental representation, using a GPU cluster at National Taiwan University, which is crucial for completing the dynamical simulations within a few months. Our result for the beta function suggests that this theory is conformal. In this poster, we present our algorithms and strategies for GPU-accelerated computations.
 
Keywords:
Computational Physics, GTC 2016 - ID P6258
Download:
 
NaNet: FPGA-Based NICs for GPU Accelerated Real-Time Systems
Alessandro Lonardo (Istituto Nazionale di Fisica Nucleare (INFN))
NaNet is a modular design of a family of FPGA-based PCIe Network Interface Cards specialized for low-latency real-time operations.
 
Keywords:
Computational Physics, GTC 2016 - ID P6262
Download:
 
EEE Event Reconstruction on GPUs
Richard Forster (Eotvos Lorand University), Orsolya Visnyei (Eotvos Lorand University)
The EEE project is a detector array of Multigap Resistive Plate Chambers located at selected sites on the Italian territory. Goals of the project include the study of the properties of the local muon flux and its dependence on the planetary and solar environment; the detection of high-energy extensive air showers created in the Earth's atmosphere, via time and orientation correlations between several telescopes; and the search for possible long-range correlations between far telescopes. We propose to use GPUs in the data reconstruction phase to increase the available computing capacity as the volume of data rapidly increases.
 
Keywords:
Computational Physics, Astronomy & Astrophysics, GTC 2016 - ID P6263
Download:
 
Computational Study of Magnetic Anisotropy Using Heisenberg Model and GPU-Accelerated Monte-Carlo Simulation
Soo Kyung Kim (Lawrence Livermore National Laboratory)
Although Monte Carlo simulation based on the Ising model is a widely used method to model the thermal fluctuation of magnetic spins and their dynamics combined with DFT calculations, it has two main drawbacks: (1) model accuracy and (2) performance. First, the Ising model is not accurate enough to represent the complicated picture of the real physics. Second, Monte Carlo itself is very slow, a critical issue when extending to many-atom systems: it repeats random sampling at each iteration step, and generating truly random numbers to choose the spins to flip in a large atomic system can be extremely slow. To resolve these problems, we have implemented a checkerboard algorithm on GPUs within the Heisenberg model, which better mimics the real physics.
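To illustrate the checkerboard idea (using the simpler Ising model with binary spins rather than the poster's Heisenberg model, purely to keep the sketch short), sites are split into two "colors" like a checkerboard; same-color sites share no neighbors, so all updates of one color are independent and can run as one parallel GPU pass. All names below are illustrative.

```python
import math
import random

def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep over a periodic 2D Ising lattice using the
    checkerboard decomposition: sites with (i + j) even, then odd.
    Same-color sites share no neighbors, so each half-sweep is
    embarrassingly parallel on a GPU."""
    n = len(spins)
    for color in (0, 1):
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != color:
                    continue
                nb = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j] +
                      spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
                dE = 2.0 * spins[i][j] * nb       # energy cost of flipping
                if dE <= 0 or rng.random() < math.exp(-beta * dE):
                    spins[i][j] = -spins[i][j]
    return spins
```

The Heisenberg version replaces the binary flip with a trial rotation of a 3D unit spin vector, but the same-color independence argument, and hence the GPU mapping, is unchanged.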
 
Keywords:
Computational Physics, Computational Chemistry, GTC 2016 - ID P6271
Download:
Computer Vision & Machine Vision
Presentation
Media
GPU Accelerated Hausdorff Distance Computation Using Mathematical Morphology
Esteban Clua (Unversidade Federal Fluminense), Erick Rodrigues (Universidade Federal Fluminense)
Hausdorff distance is a widely used metric in visual computing for comparing images, finding patterns, and performing registrations. We propose two parallel algorithms for computing the Hausdorff distance using morphological dilations on the GPU. The algorithms require block synchronization, and both CPU-based and GPU-based block synchronizations were evaluated. Furthermore, we compare the efficiency of the proposed algorithms to implementations on the CPU and GPU in distinct programming languages, including C++, Java, and the Aparapi library. Experimental results have shown that the CUDA GPU-based synchronized algorithm provided the best results and was approximately 26 and 6630 times faster than the CPU for large binary and grey images, respectively.
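The morphological connection can be sketched as follows (a serial illustration, not the poster's parallel algorithm): under the chessboard metric, the directed Hausdorff distance from set B to set A equals the number of 3x3 dilations of A needed before A covers every pixel of B. The symmetric Hausdorff distance is the maximum of the two directions.

```python
def dilate(img):
    """3x3 (chessboard) binary dilation of a 2D 0/1 image."""
    n, m = len(img), len(img[0])
    return [[int(any(img[a][b]
                     for a in range(max(0, i - 1), min(n, i + 2))
                     for b in range(max(0, j - 1), min(m, j + 2))))
             for j in range(m)] for i in range(n)]

def directed_hausdorff(A, B):
    """Directed Hausdorff distance from B to A in the chessboard metric:
    the number of dilations of A needed before A covers every pixel of B."""
    d = 0
    while any(B[i][j] and not A[i][j]
              for i in range(len(A)) for j in range(len(A[0]))):
        A = dilate(A)
        d += 1
    return d
```

Each dilation pass is a per-pixel stencil, which is what makes a GPU implementation with block synchronization between passes natural.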
 
Keywords:
Computer Vision & Machine Vision, Performance Optimization, GTC 2016 - ID P6123
Download:
 
Visual Tracking with Deep Networks: A Performance Study
David Concha (Universidad Rey Juan Carlos), Antonio S. Montemayor (Universidad Rey Juan Carlos), Juan Jose Pantrigo (Universidad Rey Juan Carlos)
We propose a tracking system using the power of a deep neural network for the object detection stage, and guiding the classification stage to relevant zones in future time steps. Moreover, we perform an important performance evaluation using different computing platforms such as a Tegra K1 board, a mobile GPU and CPU, and a desktop CPU and GPU.
 
Keywords:
Computer Vision & Machine Vision, Deep Learning & Artificial Intelligence, GTC 2016 - ID P6175
Download:
 
Dense Reconstruction with GPUs
Jason Mak (University of California, Davis)
We show how to obtain a dense 3D reconstruction of a scene given an initial sparse 3D reconstruction and images of that scene. GPUs are used to make the method computationally viable. 3D reconstruction is an increasingly important computer vision problem with a variety of applications, and dense reconstructions are more desirable for the level of detail they provide. The dense reconstruction method featured in this talk is simple to implement and relies on geometric and image consistency constraints. It updates a traditional approach with more modern segmentation and feature descriptors to improve accuracy. Details of the method's implementation on GPUs are also explained. The brute-force computation provided by GPUs allows for more dense and accurate reconstructions.
 
Keywords:
Computer Vision & Machine Vision, Video & Image Processing, GTC 2016 - ID P6239
Download:
 
A Parallel CPU/GPU Algorithm for 3D Pose Estimation Using CUDA and OpenMP
Kenia Picos (CITEDI-IPN)
Pose recognition is characterized by location and orientation parameters, which introduce high complexity due to the huge number of views a target can present within a scene. An effective algorithm is needed to analyze the physical phenomena involved in the visualization of a moving 3D object. This work presents a proposal for pose recognition with adaptive correlation filters, with a CPU/GPU implementation using OpenMP and CUDA to improve execution performance.
 
Keywords:
Computer Vision & Machine Vision, Video & Image Processing, GTC 2016 - ID P6274
Download:
 
Real-time 3D Reconstruction for Autonomous Driving through Semi-Global Matching
Antonio Espinosa (Universitat Autonoma de Barcelona)
Robust and dense computation of depth information from stereo-camera systems is a computationally demanding requirement for real-time autonomous driving. Semi-Global Matching (SGM) [1] approximates the results of heavy-computation global algorithms at lower computational complexity, making it a good candidate for a real-time implementation. SGM minimizes energy along several 1D paths across the image. The aim of this work is to provide a real-time system producing reliable results on energy-efficient hardware. Our design runs on an NVIDIA Titan X GPU at 104.62 FPS and on an NVIDIA Drive PX at 6.7 FPS, promising for real-time platforms.
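The 1D path minimization at the heart of SGM can be sketched for a single left-to-right path (an illustration only; full SGM aggregates costs over several path directions and sums them before the winner-take-all disparity selection):

```python
def sgm_aggregate(cost, p1, p2):
    """Aggregate matching costs along one 1D path (left to right), the core
    SGM recurrence: L(x,d) = C(x,d) + min(L(x-1,d), L(x-1,d-1)+P1,
    L(x-1,d+1)+P1, min_d' L(x-1,d') + P2) - min_d' L(x-1,d').
    `cost` is a list over pixels of per-disparity matching costs."""
    ndisp = len(cost[0])
    L = [cost[0][:]]                      # path cost at the first pixel
    for x in range(1, len(cost)):
        prev = L[-1]
        prev_min = min(prev)              # subtracted to keep costs bounded
        row = []
        for d in range(ndisp):
            best = min(prev[d],
                       (prev[d - 1] + p1) if d > 0 else float("inf"),
                       (prev[d + 1] + p1) if d < ndisp - 1 else float("inf"),
                       prev_min + p2)
            row.append(cost[x][d] + best - prev_min)
        L.append(row)
    return L
```

The small penalty p1 favors disparity changes of one pixel (slanted surfaces) while p2 penalizes larger jumps (depth discontinuities); GPU implementations run many image rows and path directions concurrently.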
 
Keywords:
Computer Vision & Machine Vision, Self-Driving Cars & Automotive, Automotive, GTC 2016 - ID P6289
Download:
 
Fast and Robust Feature Matching
Cristina Nader Vasconcelos (IC-UFF/Brazil), Ana Caroline Vargas (IC-UFF/Brazil)
Choosing informative, discriminating, and independent features is crucial for effective computer vision algorithms. The result of a feature-detection procedure over an image is a set of keypoints, commonly matched to sets extracted from other images. Traditionally, the matching is obtained using k-Nearest Neighbors (k-NN). However, such an approach does not model matching uniqueness restrictions; that is, it allows a keypoint to be matched more than once. We explore a parallel Bipartite Graph Matching (BGM) entirely on the GPU for fast and robust matching and present its comparison against k-NN.
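The uniqueness issue can be seen with a tiny sketch (a greedy stand-in for illustration; proper bipartite matching solves an optimal assignment problem, e.g., with the Hungarian algorithm, which is what a BGM formulation targets): given a distance matrix between two keypoint sets, repeatedly take the globally closest still-unmatched pair, so no keypoint is ever assigned twice.

```python
def unique_matches(dist):
    """Greedy one-to-one matching of two keypoint sets from a distance
    matrix dist[i][j]: repeatedly take the globally closest unmatched pair.
    Unlike plain k-NN matching, no keypoint is ever assigned twice."""
    pairs = sorted((dist[i][j], i, j)
                   for i in range(len(dist)) for j in range(len(dist[0])))
    used_i, used_j, matches = set(), set(), []
    for d, i, j in pairs:
        if i not in used_i and j not in used_j:
            matches.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return sorted(matches)
```

In the test below, plain k-NN would match both left keypoints to the same right keypoint (column 0); the uniqueness constraint forces the second keypoint onto its next-best candidate.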
 
Keywords:
Computer Vision & Machine Vision, Algorithms, GTC 2016 - ID P6293
Download:
 
GPU Accelerated Image Recognition Cloud Services: Internet UGC Images and Videos Recognition Overall Solution
Leonard Li (Tupu Technology Co.,Ltd.)
With the advent of the era of visual media, more and more information has been spread through images and videos. The demand for image analysis and recognition is growing fast. For companies that need to process lots of images but don't have the technology, Tupu has built an open platform to provide a way to censor, search, or mine images automatically and intelligently. Tuputech Cloud Platform is the largest image and video analysis cloud service provider in China and provides highly customizable services for its clients.
 
Keywords:
Computer Vision & Machine Vision, GTC 2016 - ID P6309
 
Deep Residual Networks - Ultra-Deep Neural Networks with 150+ Layers
Jian Sun (Microsoft)
Deeper neural networks are difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task.
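The residual reformulation can be sketched in a few lines. This toy fully connected block (the paper uses convolutional layers) shows why learning the residual eases optimization: with zero residual weights the block reduces to the identity, so stacking many such blocks cannot degrade the signal path.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x): the block learns the residual
    F(x) = W2 @ relu(W1 @ x) relative to its input, rather than an
    unreferenced mapping."""
    return relu(W2 @ relu(W1 @ x) + x)

# with zero weights the residual F(x) is zero, so for non-negative
# inputs the block is exactly the identity
x = np.array([1.0, 2.0, 3.0])
W1 = np.zeros((3, 3))
W2 = np.zeros((3, 3))
y = residual_block(x, W1, W2)
```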
 
Keywords:
Computer Vision & Machine Vision, Deep Learning & Artificial Intelligence, GTC 2016 - ID P6310
 
Oblique-View Computed Tomography for 3D IC Package Inspection Using CUDA
Kyung-Chan Jin (Korea Institute of Industrial Technology)
This study focuses on a CUDA implementation of an oblique-view CT (computed tomography) technique for non-destructive internal inspection of 3D IC chips. Using 400 images of a phantom projected while rotating in an oblique direction, we achieved 16 GUPS reconstructing a 512x512x512 phantom volume on an NVIDIA Quadro K6000 GPU, showing that the GPU performs the CT reconstruction 100 times faster than dual CPU processors.
 
Keywords:
Computer Vision & Machine Vision, Medical Imaging, GTC 2016 - ID P6336
 
Neural Attention for Object Tracking
Brian Cheung (UC Berkeley)
With differentiable forms of attention being integrated into neural networks, end-to-end training with backpropagation is possible. We adopt the recently proposed attention mechanism in Spatial Transformer Networks (STNs) into a recurrent architecture to perform object tracking. We also present several issues which arise when such recurrent attention models are scaled up to much larger and more complex images/videos. We present pretraining strategies to resolve some of these training issues.
 
Keywords:
Computer Vision & Machine Vision, Deep Learning & Artificial Intelligence, GTC 2016 - ID P6356
Deep Learning & Artificial Intelligence
DR. TED: Deep Learning Recommendation of Treatment from Electronic Data
Melissa Aczon (Children's Hospital Los Angeles), David Ledbetter (Children's Hospital Los Angeles)
We construct a model that generates treatment predictions to optimize patient outcomes, using information gleaned from over 10,000 patients who passed through the Pediatric Intensive Care Unit at Children's Hospital Los Angeles over more than 10 years. This is accomplished by converting unstructured, non-uniformly sampled patient information into a structured data representation that resembles an image -- here referred to as a "patient snapshot." These patient snapshots elegantly enable convolutional neural networks to efficiently generate a basis.
 
Keywords:
Deep Learning & Artificial Intelligence, Medical Imaging, GTC 2016 - ID P6102
 
Heterogeneous Learning for Multi-task Facial Analysis Using Single Deep Convolutional Network
Hironobu Fujiyoshi (Chubu University), Hiroshi Fukui (Chubu University), Yuu Kato (Chubu University), Takayoshi Yamashita (Chubu University), Yuji Yamauchi (Chubu University)
Performing multiple tasks such as recognition and regression typically requires multiple networks, and the computational cost grows in proportion to the number of tasks. Although heterogeneous learning can perform multiple tasks in a single network, each task performs worse than when trained individually. We propose a new heterogeneous learning method with a weighted loss function. We apply the method to facial analysis, which comprises five heterogeneous tasks (gender estimation, race detection, facial point detection, age estimation, and smile degree estimation). Even with a single network, the performance is comparable to networks trained for a single task. The computation takes 22 ms on a GTX 980, five times faster than using five single-task networks.
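A weighted multi-task loss of the kind described can be sketched as follows; the task names, loss forms, and weights below are illustrative, since the poster does not specify them:

```python
import numpy as np

def heterogeneous_loss(outputs, targets, weights, kinds):
    """Weighted sum of per-task losses over heterogeneous tasks:
    cross-entropy for classification tasks, mean squared error for
    regression tasks. The per-task weights balance losses of
    different scales in a single network."""
    total = 0.0
    for task, w in weights.items():
        pred, true = outputs[task], targets[task]
        if kinds[task] == "classification":
            total += w * -np.sum(true * np.log(pred + 1e-12))
        else:
            total += w * np.mean((pred - true) ** 2)
    return total

# two hypothetical facial-analysis tasks: smile (softmax) and age (scalar)
kinds = {"smile": "classification", "age": "regression"}
outputs = {"smile": np.array([0.9, 0.1]), "age": np.array([25.0])}
targets = {"smile": np.array([1.0, 0.0]), "age": np.array([30.0])}
loss = heterogeneous_loss(outputs, targets, {"smile": 1.0, "age": 0.1}, kinds)
```

Training minimizes this single scalar, so one backward pass updates the shared network for all tasks at once.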
 
Keywords:
Deep Learning & Artificial Intelligence, Computer Vision & Machine Vision, GTC 2016 - ID P6139
 
CNNs in Content Moderation for Online Classifieds: Scalable Image Classification Service
Alexandra Fenster (Avito)
Avito.ru is the biggest online classified advertising platform in Russia. The more buyers we attract, the more attractive the site becomes to swindlers who upload prohibited content, and with this growth we have met many challenges related to user content and its quality. At a certain point, validating all incoming items by scaling manual moderation became unrealistic. With a tremendous amount of data available, we first implemented machine learning approaches with text models only, but then found the power of deep learning and GPU computing, which lets us use both image and text models to improve quality.
 
Keywords:
Deep Learning & Artificial Intelligence, Computer Vision & Machine Vision, GTC 2016 - ID P6162
 
Image-Based Sticker Recommendation Using Deep Learning
Jiwon Kim (Naver Labs)
The LINE mobile messenger sticker shop accepts new stickers designed by independent artists on a daily basis. To assist users in browsing the stickers, a list of similar stickers is recommended by collaborative filtering based on user purchase history. A major drawback of this approach, known as the cold start problem, is that new items cannot be recommended because they have no purchase records. To address this issue, we recommend stickers based on image content by learning the visual similarity between stickers using deep learning on GPUs. We trained a convolutional neural network to learn semantic features, which we use to recommend visually similar stickers. We measure the relevance of different recommendation schemes and verify the effectiveness of the proposed approach.
 
Keywords:
Deep Learning & Artificial Intelligence, Computer Vision & Machine Vision, GTC 2016 - ID P6244
 
Neurophysiological Working Memory Task Classification from Magnetoencephalography Using Deep Learning
Zachary Harper (Medical College of Wisconsin), Charles Welzig (Medical College of Wisconsin)
Biological neural networks are complex, plastic, and unique to every individual. These qualities pose great challenges in classifying highly dynamic patterns across neural activity for time-sensitive medical applications. In this project, we use deep learning to identify oscillatory activation markers that can differentiate between two working memory tasks. Training on multiple NVIDIA GeForce GTX TITAN GPUs enables us to overcome computational challenges for use in clinical and medical research applications. This poster presents our first step towards classifying deep-temporal whole-brain neural network activation patterns using GPU-accelerated deep learning systems with convolutional neural networks.
 
Keywords:
Deep Learning & Artificial Intelligence, Medical Imaging, GTC 2016 - ID P6245
 
Automatic Speech Recognition Using Deep Learning
Takehiro Sekine (Yahoo! JAPAN corporation)
Deep neural networks (DNNs) have become a popular foundation for state-of-the-art automatic speech recognition systems. We have collected and transcribed more than 2,000 hours of speech data and used it to train DNNs for acoustic models and voice activity detector models. Implementation techniques for efficient speech DNN training on GPUs are explained, and several evaluation results and training times are shown using different amounts of training data and sizes of DNN. The trained DNNs are deployed in our ASR services in Japan.
 
Keywords:
Deep Learning & Artificial Intelligence, Signal & Audio Processing, GTC 2016 - ID P6295
 
Applying Deep Learning to Aerospace and Building System Applications at UTC
Kishore Reddy (United Technologies Research Center), Vivek Venugopalan (United Technologies Research Center)
Deep learning is an evolving area of machine learning research that UTC has adopted for solving various problems in aerospace and building systems. The use cases highlighted include sensor diagnostics from onboard aircraft engine sensors, and energy estimation and health monitoring of building systems. GPUs provide the computational horsepower to tackle the huge amount of data generated by these sensors. Existing methods for extracting relevant information have largely been replaced by deep learning techniques that map the problem to large neural networks.
 
Keywords:
Deep Learning & Artificial Intelligence, Big Data Analytics, GTC 2016 - ID P6315
 
Upscaling with Deep Convolutional Networks and Muxout Layers
Pablo Navarrete Michelini (BOE Technology Group Co., Ltd.)
We consider the problem of super-resolution using convolutional networks. Previous work has shown the advantages of using convolutional networks to improve the quality of image upscaling. Unlike previous solutions, our method incorporates the image upsampling within the network structure. To achieve this we propose a so-called Muxout layer that increases the size of image features by combining them in groups. The system structure is motivated by an interpretation of convolutional networks as adaptive filters and by classic interpolation theory. We use this interpretation to propose specialized initialization methods that are convenient for training deep structures. Our tests show state-of-the-art quality, high performance, and the ability for unsupervised learning on text images.
 
Keywords:
Deep Learning & Artificial Intelligence, Video & Image Processing, GTC 2016 - ID P6324
 
Which Whale Is It, Anyway? Face Recognition for Right Whales Using Deep Learning
Robert Bogucki (deepsense.io)
With fewer than 500 North Atlantic right whales left in the world's oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction. To interest the data science community, NOAA Fisheries organized a competition hosted on Kaggle.com. The challenge was to automate the right whale recognition process (currently a painstaking, lengthy, manual process) using a dataset of aerial photographs of individual whales. In the poster, we outline the winning solution, based on deep learning and convolutional neural networks.
 
Keywords:
Deep Learning & Artificial Intelligence, Computer Vision & Machine Vision, GTC 2016 - ID P6325
 
Fine-Tune For A Fortune: Transfer Learning Using DIGITS and GPUs
Valeriu Codreanu (SURFsara)
Deep convolutional neural networks are widely accepted as the state-of-the-art solution for various computer vision problems. These commonly lead to a trade-off between network complexity and over-fitting, addressable by increasing the number of training examples, thus resulting in a lengthy training process. Moreover, more training examples may not even be available. Recent research suggests that this hurdle can be surmounted by using pre-trained complex networks and then fine-tuning them to fit specific datasets. We show that this approach allows for record-breaking performance on tasks ranging from natural image classification to handwritten character recognition. This is made possible by using high-performance NVIDIA GPUs in conjunction with the NVIDIA DIGITS training system.
 
Keywords:
Deep Learning & Artificial Intelligence, Computer Vision & Machine Vision, GTC 2016 - ID P6327
 
GPU Boosted Deep Learning in Real-time Face Alignment
Binglong Xie (HiScene Information Technology Co,.Ltd)
For the task of real-time face alignment, we employ a GPU server cluster to train a convolutional neural network. By taking advantage of both deep learning and GPU computing, our algorithm outperforms all existing algorithms on the widely tested IBUG benchmark. In a photo editing application, our face alignment algorithm is integrated to locate precise facial key points, which provide the basis for further virtual facial makeup. Details of our algorithm are given in our poster, along with experimental results on the public benchmark.
 
Keywords:
Deep Learning & Artificial Intelligence, Algorithms, Computer Vision & Machine Vision, GTC 2016 - ID P6343
 
Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks
Yu-Hsin Chen (MIT)
Eyeriss is an energy-efficient deep convolutional neural network (CNN) accelerator that supports state-of-the-art CNNs, which have many layers, millions of filter weights, and varying shapes (filter sizes, number of filters and channels). The test chip features a spatial array of 168 processing elements (PE) fed by a reconfigurable multicast on-chip network that handles many shapes and minimizes data movement by exploiting data reuse. Data gating and compression are used to reduce energy consumption. The chip has been fully integrated with the Caffe deep learning framework. The chip can run the convolutions in AlexNet at 35 fps with 278 mW power consumption.
 
Keywords:
Deep Learning & Artificial Intelligence, Embedded, GTC 2016 - ID P6354
Earth System Modelling
Using the GPU to Predict Drift in the Ocean
Martin Lilleeng Satra (Norwegian Meteorological Institute)
We describe the implementation of a simple numerical scheme for solving the shallow water equations on a GPU, which will be used in the further development of a massive ensemble prediction system running on GPUs. The numerical scheme has previously been used in operational forecasting, and benchmarks comparing the FORTRAN CPU version with the new GPU version have been performed. The results show that the GPU implementation gives a speedup over the CPU of slightly more than 200X. This is highly promising for running a large number of ensembles cost-effectively on a computer, thereby increasing the usefulness of short-term ocean current forecasts and drift trajectory predictions.
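For a flavor of what a simple shallow-water scheme looks like, here is a 1D Lax-Friedrichs sketch on a periodic domain; it is a generic textbook scheme, not the operational scheme the poster describes:

```python
import numpy as np

def shallow_water_step(h, hu, dx, dt, g=9.81):
    """One Lax-Friedrichs step for the 1D shallow water equations, with
    conserved variables h (water depth) and hu (momentum), periodic
    boundaries via np.roll."""
    def flux(h, hu):
        u = hu / h
        return np.stack([hu, hu * u + 0.5 * g * h * h])

    q = np.stack([h, hu])
    f = flux(h, hu)
    # average of neighbors minus centered flux difference
    q_next = 0.5 * (np.roll(q, -1, axis=1) + np.roll(q, 1, axis=1)) \
        - dt / (2 * dx) * (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1))
    return q_next[0], q_next[1]

# small bump on a lake at rest; waves spread, total mass is conserved
n = 100
h = 1.0 + 0.1 * np.exp(-np.linspace(-5, 5, n) ** 2)
hu = np.zeros(n)
mass0 = h.sum()
for _ in range(50):
    h, hu = shallow_water_step(h, hu, dx=0.1, dt=0.01)
```

Each cell update depends only on its neighbors, which is exactly the data-parallel structure that maps well to one GPU thread per cell.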
 
Keywords:
Earth System Modelling, Computational Physics, GTC 2016 - ID P6193
 
Client-Side GPGPU Web Application for Catchment Delineation and Watershed Segmentation
Muhammed Yusuf Sermet (IIHR-Hydroscience & Engineering, University of Iowa)
Generation of huge amounts of spatial data has increased demand for applications that are capable of handling large-scale and high-resolution terrain data. A novel example of this would be the Iowa Flood Information System, which is a web-based, one-stop platform for accessing flood-related data. One of the most challenging tasks for terrain analysis is the delineation of watersheds. Although traditional methods for watershed analysis give high-accuracy results, it becomes more burdensome as the data resolution increases, and there is no client-side analysis tool for watershed delineation. In this project, we developed a client-side GPGPU algorithm to analyze high-resolution terrain data for watershed delineation, which allows parallelization using GPUs.
 
Keywords:
Earth System Modelling, Big Data Analytics, GTC 2016 - ID P6226
 
AceCAST - High Performance CUDA Based Weather Research Forecasting (WRF) Model
Allen Huang (Tempo Quest Inc.)
AceCAST is a proprietary version of WRF, a mesoscale and global weather research and forecasting model designed for both operational forecasters and atmospheric researchers, widely used by commercial, government, and institutional users in more than 150 countries. WRF is suitable for a broad spectrum of applications across domain scales ranging from meters to hundreds of kilometers. AceCAST's increased computational power enables time-critical, weather-sensitive industry and commerce to achieve (1) high-resolution accuracy and cost performance, (2) strong scaling, and (3) greatly improved profits. AceCAST is already one-third complete, and the first commercial product is only ~12 months away.
 
Keywords:
Earth System Modelling, Supercomputing & HPC, GTC 2016 - ID P6338
Embedded
Real-Time Face Detection in FHD Images Exploiting both Embedded CPU and GPU
Chanyoung Oh (University of Seoul), Youngmin Yi (University of Seoul), Saehanseul Yi (University of Seoul)
Face detection, along with face recognition, has become very popular for many reasons. Many detection systems, such as intelligent surveillance systems, require real-time processing of incoming video streams. Recently, the image resolution widely used for surveillance cameras has increased to full high definition (FHD, 1920x1080), which obviously increases the computation time for face detection. At the same time, CPU-GPU heterogeneous systems have become a mainstream platform in both server and embedded domains, with ever-increasing demand for powerful accelerators. We present parallelization techniques that exploit both the data and task parallelism of an LBP-based face detection algorithm on an embedded heterogeneous platform.
 
Keywords:
Embedded, IoT, Video & Image Processing, GTC 2016 - ID P6138
 
A Task-based Programming Model Using Industry Standards for Embedded Hardware
Sunita Chandrasekaran (University of Delaware)
The Multicore Association (MCA) is an industry association that defines and promotes open specifications to enable multicore product development. The main goal of MCA is to abstract hardware details and offer a portable software solution stack for embedded systems. One of the MCA APIs is the Multicore Task Management API (MTAPI), which leverages task parallelism on embedded multicore systems that are comprised of symmetric and asymmetric processors. We have developed a runtime library (RTL) based on MTAPI that allows scheduling and mapping of tasks to the heterogeneous cores of the given platform. Our RTL utilizes the Multicore Communication API (MCAPI) to communicate between cores. Our RTL is evaluated on the NVIDIA Jetson TK1 embedded processor comprising ARM and GPU cores.
 
Keywords:
Embedded, Programming Languages, IoT, GTC 2016 - ID P6287
 
A High-Precision Power Model for the Tegra K1 CPU, GPU and RAM
Kristoffer Robin Stokke (University of Oslo), Peter Yoon (Trinity College, Hartford - CT)
This poster session accompanies our talk on high-precision power modelling for the Tegra K1's GPU, CPU clusters, and memory. Power modelling is necessary to understand how software consumes energy and to optimise for power. However, existing power models are typically very coarse-grained and can mispredict by up to 70% depending on hardware state. This poster introduces our high-precision power model which, by taking into account operating frequencies, clock- and core-gating, rail voltages, and hardware utilisation, is shown to be over 98% accurate for GPU and CPU video processing workloads. The model not only predicts power usage very accurately, but can also be used to analyse and optimise the power usage of applications, for example by utilising non-coherent GPU caches.
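A frequency- and utilization-aware model of this kind can be sketched as a per-rail sum of a dynamic term (scaling with voltage squared, clock frequency, and utilization) and a voltage-dependent static term; the coefficients below are made up for illustration, not measured Tegra K1 values:

```python
def rail_power(voltage, freq_hz, utilization, c_dyn, p_static):
    """Dynamic CMOS power scales roughly with V^2 * f * activity;
    static leakage depends on the rail voltage state."""
    return c_dyn * voltage ** 2 * freq_hz * utilization + p_static * voltage

# hypothetical GPU rail at a fixed operating point
p_idle = rail_power(voltage=0.9, freq_hz=852e6, utilization=0.0,
                    c_dyn=1.5e-9, p_static=0.4)
p_busy = rail_power(voltage=0.9, freq_hz=852e6, utilization=0.9,
                    c_dyn=1.5e-9, p_static=0.4)
```

Fitting such coefficients per rail and per clock/gating state against measured power is what makes the model sensitive to hardware state, rather than a single average figure.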
 
Keywords:
Embedded, IoT, Energy Exploration, GTC 2016 - ID P6342
Game Development
GPU-Accelerated Motion Input Analysis for Motion Reconstruction
Rafael Rego Drumond (Universidade Federal Fluminense), Esteban Walter Gonzalez Clua (Universidade Federal Fluminense)
Games and real-life simulations use motion capture as an interactive control feature that allows humans to reproduce movements they want to see inside the game. However, reconstructing motion can be slow, especially when it is necessary to look up a huge pre-recorded database. This work presents a GPU-parallelized version of a previous animation reconstruction method that detects the motion being performed and reconstructs it with reduced delay, using a k-NN approach to compare human motion input against a pre-recorded database and select the animation to be reproduced.
 
Keywords:
Game Development, Performance Optimization, GTC 2016 - ID P6176
 
Using CUDA® to Accelerate an Adaptive Game Controller
Leonardo Torok (Federal Fluminense University)
The adaptive game controller is a novel concept that introduces new ways to interact with video games, allowing developers to design the joystick used to play through a simple API provided by our solution. A key component is the k-means algorithm, used to fine-tune the controller interface according to the user's touches during the gameplay session. Currently, our controller is implemented in Java for the Android operating system, and the machine learning routines are executed on the mobile device's CPU, limiting the number of touches that can be considered. With CUDA, we will introduce the next evolutionary step in our controller, using the GPU of the computer running the game to evaluate all touches in real time.
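The clustering step can be sketched with plain Lloyd's k-means over touch coordinates; the two button positions and the synthetic touch data below are illustrative, not taken from the poster:

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means: the clustering step that could adapt
    on-screen button positions to where the player's touches land."""
    # deterministic init: k points spread across the input order
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # assign each touch to its nearest center
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned touches
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# synthetic touches clustered around two hypothetical buttons
rng = np.random.default_rng(1)
touches = np.vstack([
    rng.normal([100.0, 100.0], 5.0, size=(50, 2)),
    rng.normal([300.0, 200.0], 5.0, size=(50, 2)),
])
centers, labels = kmeans(touches, k=2)
```

The distance computation is embarrassingly parallel over touches, which is the part a CUDA port would accelerate.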
 
Keywords:
Game Development, Deep Learning & Artificial Intelligence, GTC 2016 - ID P6350
 
Which GPU for How Many NPCs?
Stephane Cardon (Center of Research, French Military Academy of Saint-Cyr Coetquidan)
Today's game artificial intelligence runs on CPUs, not GPUs. With increasingly powerful GPUs and cloud gaming, our vision is that GPUs will become the best choice for running game AI, just as they now provide computing power for game physics (e.g., NVIDIA PhysX). In particular, we believe GPUs can be used to implement game AI planning, which computes the behaviors of non-player characters (NPCs) in games. Our objective is an efficient GPU-based game AI planning component that can be used not only for PC games but also for cloud gaming. We report here on our recent implementation, which is up to 150X faster than our CPU-based game AI planning component and handles up to 256 NPCs with plans of at most eight actions.
 
Keywords:
Game Development, Other, GTC 2016 - ID P6268
Intelligent Video Analytics (IVA)
Robust Moving Object Detection on Tegra K1
Cevahir Cigla (Aselsan Inc.), Burak Ozkalayci (Aselsan Inc.)
We present a novel background modeling-based moving object detection and segmentation approach with a real-time implementation on the recent NVIDIA Tegra K1 mobile GPU platform. The proposed solution introduces pixel-wise adaptive background learning rates as well as reinforced re-learning of the models. In this manner, dynamic backgrounds in particular are modeled robustly: the learning rate of regions with irrelevant motion is increased, reducing false alarms. Detection is followed by shadow removal and dual background modeling to detect abandoned objects with high precision. Each algorithmic step is implemented on the GPU, and real-time performance (detection, shadow removal, and abandoned object detection) is achieved on Jetson TK1 for 720x576 videos.
 
Keywords:
Intelligent Video Analytics (IVA), IoT, Video & Image Processing, GTC 2016 - ID P6147
Medical Imaging
White Matter Tractography and Human Brain Connections Using GPUs
Moises Hernandez Fernandez (Oxford Centre for Functional MRI of the Brain (FMRIB). University of Oxford)
We present a novel analysis tool for diffusion MRI (dMRI) data using NVIDIA GPUs for mapping connections in the human brain. We describe the potential of dMRI and how it allows the study of brain microstructure and 3D estimation of long-range brain connections, non-invasively and in-vivo (tractography). Due to the multidimensional nature of the data, modeling can be computationally demanding. We present a parallel framework for analysis of dMRI data that allows accelerations of up to two orders of magnitude when comparing GPU with CPU implementations. We highlight the tremendous benefit of these accelerations in very large recent studies such as the Human Connectome Project, where comprehensive maps of brain anatomical connectivity of unprecedented quality are being generated.
 
Keywords:
Medical Imaging, GTC 2016 - ID P6106
 
GPU Acceleration of Non-Iterative and Iterative Algorithms in Fluorescence Lifetime Imaging Microscopy
Gang Wu (University of Sussex)
Fluorescence lifetime imaging microscopy (FLIM) plays a significant role in biological sciences, chemistry, and medical research. We propose a GPU-based FLIM analysis tool suitable for high-speed and flexible FLIM applications. With a large number of ...Read More
Fluorescence lifetime imaging microscopy (FLIM) plays a significant role in biological sciences, chemistry, and medical research. We propose a GPU-based FLIM analysis tool suitable for high-speed and flexible FLIM applications. With a large number of parallel processors, GPUs can significantly speed up lifetime calculations compared to CPU-OpenMP (parallel computing with multiple CPU cores) based analysis. The implemented algorithms have been tested on both synthesized and experimental FLIM data. The results show that at the same precision the GPU analysis can be up to 24x faster than its CPU-OpenMP counterpart.  Back
 
Keywords:
Medical Imaging, Algorithms, GTC 2016 - ID P6114
Download:
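As one example of the non-iterative lifetime estimators the poster refers to, the classic rapid lifetime determination (RLD) formula recovers a single-exponential lifetime in closed form. This is an illustrative sketch of that standard method, not necessarily the authors' algorithm:

```python
import math

def rld_lifetime(i1, i2, dt):
    """Rapid lifetime determination: for a decay I(t) = A * exp(-t / tau),
    two gated intensity measurements dt apart satisfy i1 / i2 = exp(dt / tau),
    so tau = dt / ln(i1 / i2) -- one log and one divide per pixel."""
    return dt / math.log(i1 / i2)
```

The absence of any iteration or inter-pixel dependence is what makes such estimators attractive for per-pixel GPU parallelization.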
 
GPU-Based Parallel Fuzzy C-Mean Clustering Model Through Genetic Algorithm
Che Lun Hung (Providence University), Yuan Huai Wu (Providence University)
Detection of white matter changes in brain tissue using magnetic resonance imaging has been an increasingly active and challenging research area in computational neuroscience. A genetic algorithm based on a fuzzy c-mean clustering method (GAFCM) was ...Read More
Detection of white matter changes in brain tissue using magnetic resonance imaging has been an increasingly active and challenging research area in computational neuroscience. A genetic algorithm based on a fuzzy c-means clustering method (GAFCM) was applied to simulated images to separate foreground spot signal information from the background. However, GAFCM carries a heavy computational load. This study presents a new GPU-based parallel GAFCM algorithm for robust segmentation of brain MRIs. The experimental results indicate that the proposed algorithm achieves acceptable segmentation results and significantly reduces computational cost.  Back
 
Keywords:
Medical Imaging, Embedded, GTC 2016 - ID P6133
Download:
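The fuzzy c-means core that the genetic algorithm wraps alternates two data-parallel steps: update per-pixel memberships from distances to the cluster centers, then recompute the centers as membership-weighted means. A minimal NumPy sketch of one such iteration (the fuzzifier value and names are illustrative):

```python
import numpy as np

def fcm_step(X, centers, m=2.0):
    """One fuzzy c-means iteration.
    X: (n, d) data points; centers: (c, d) cluster centers; m: fuzzifier.
    Both the membership update and the center update are independent
    per point, which is what makes FCM a good fit for the GPU."""
    # Distances from every point to every center, (n, c).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)   # memberships sum to 1 per point
    Um = U ** m
    centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # weighted means
    return U, centers
```

In GAFCM, a genetic algorithm searches over initializations/centers to escape the local minima this alternation can get stuck in.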
 
High Accuracy High Performance 3D Cone Beam Reconstruction
Daniel Badali (Triple Ring Technologies), Tobias Funk (Triple Ring Technologies), Scott Hsieh (Triple Ring Technologies,Stanford University), Oleg Konings (Triple Ring Technologies)
We have developed a high performance high accuracy multi-GPU back projection application which generates 32-bit 3D volumes of size 1024^3 and 2048^3. Reconstructing a large 3D volume enables detailed examination of human anatomy or complex industrial ...Read More
We have developed a high-performance, high-accuracy multi-GPU back projection application which generates 32-bit 3D volumes of size 1024^3 and 2048^3. Reconstructing a large 3D volume enables detailed examination of human anatomy or complex industrial devices with numerous small features. This has been tested with both real-world CT data and data generated from CAD models used as inputs to x-ray Monte Carlo simulations. The current 1024^3 implementation was validated and benchmarked using the RabbitCT projection set; our results rank in second place, with the least reconstruction error (32-bit).  Back
 
Keywords:
Medical Imaging, Aerospace & Defense, GTC 2016 - ID P6150
Download:
 
GPU Accelerated Compressive Sensing in MRI
Jonas Vylder (Ghent University)
We introduce a new approach to get faster MRI acquisition. By reducing the number of data samples in combination with a new MRI reconstruction method, we're able to reduce the acquisition time by a factor of 20 without introducing disturbing artifac ...Read More
We introduce a new approach to get faster MRI acquisition. By reducing the number of data samples in combination with a new MRI reconstruction method, we're able to reduce the acquisition time by a factor of 20 without introducing disturbing artifacts. To reconstruct the image, we have to iteratively apply the non-uniform fast Fourier transform. This step turns out to be a major bottleneck. Therefore, we have accelerated the NUFFT by using the GPU. The speedup achieved by the GPU acceleration now enables new options for MRI research.  Back
 
Keywords:
Medical Imaging, Tools & Libraries, GTC 2016 - ID P6177
Download:
 
Improving Spinal Cord Segmentation and Cross-Sectional Area Calculation
Joo-won Kim (Icahn School of Medicine at Mount Sinai)
We propose two medical image analysis improvements to robustly quantify the cross-sectional areas of the human spinal cord from high resolution morphological spinal cord magnetic resonance (MR) images. First, we estimate a spinal cord inner line to i ...Read More
We propose two medical image analysis improvements to robustly quantify the cross-sectional areas of the human spinal cord from high-resolution morphological spinal cord magnetic resonance (MR) images. First, we estimate a spinal cord inner line to improve the segmentation of the spinal cord; second, we introduce a robust method to approximate the tangent vector to the cross section of the spinal cord. Both methods were implemented in CUDA, which greatly improved computation speed compared to the CPU implementation.  Back
 
Keywords:
Medical Imaging, GTC 2016 - ID P6188
Download:
 
Fast Parallel GPU Implementation for Clinical Helical CT Using Branchless DD
Ayan Mitra (Washington University in St. Louis)
We present a multi-GPU-based approach for iterative image reconstruction using Branchless Distance Driven (DD) Projection and Back Projection methods. Preliminary results showed that this implementation allowed approximately 5x decrease in reconstruc ...Read More
We present a multi-GPU-based approach for iterative image reconstruction using branchless Distance Driven (DD) projection and back projection methods. Preliminary results showed an approximately 5x decrease in reconstruction time for back projection and 2x for forward projection using three NVIDIA GeForce TITAN X GPUs in parallel, compared to an OpenMP CPU implementation using 16 threads. We expect better computational efficiency from a larger number of GPUs, a hybrid CPU-GPU method, and ordered subsets, which open the door to using iterative reconstruction algorithms in real time in clinical settings.  Back
 
Keywords:
Medical Imaging, Algorithms, GTC 2016 - ID P6234
Download:
 
Towards 3D Image Up-Scaling on the GPU
Srinivasan Ramesh (Indian Institute of Science)
Affordable medical imaging solutions are an important part of making healthcare accessible to everyone, especially in developing nations. High-resolution CT scanners are expensive instruments, and obtaining large high-resolution images may not always ...Read More
Affordable medical imaging solutions are an important part of making healthcare accessible to everyone, especially in developing nations. High-resolution CT scanners are expensive instruments, and obtaining large high-resolution images may not always be possible. We explore 3D image up-scaling as a possible software solution to this problem, by extending the ICBI 2D image up-scaling algorithm to operate in 3D. Based on the promising results obtained by parallelizing the 2D ICBI image algorithm on the GPU, we parallelize the 3D ICBI algorithm on the GPU and demonstrate the performance benefits of doing so.  Back
 
Keywords:
Medical Imaging, Video & Image Processing, GTC 2016 - ID P6235
Download:
 
MRICloud: An MRI Data Platform with Cloud Storage and Cloud Computing Service
Jian-Hua Huang (Chang Gung University)
The importance of brain researches using MRI is elevating nowadays, especially in the research fields of neurological disorders, mental illness and cognitive neuroscience. The aim of this study is to implement an MRICloud platform for MRI data storag ...Read More
Brain research using MRI is growing in importance, especially in the fields of neurological disorders, mental illness, and cognitive neuroscience. The aim of this study is to implement an MRICloud platform for MRI data storage, management, and sharing; graph-theoretical analysis based on GPU cloud computing; and MRI data visualization through a web interface. In our results, generating the brain structural connectivity matrix for 990 brain regions and 2.4 million fiber tracts from diffusion MRI takes 238.91 seconds on an NVIDIA Tesla C2075, a 39x speedup compared with a single CPU (i7-4770). Moreover, our platform provides visualization of all results on any device with a WebGL-enabled browser.  Back
 
Keywords:
Medical Imaging, GTC 2016 - ID P6253
Download:
 
Measuring and Modeling the Motion of Volumetrically Digitized Objects
Duane Storti (University of Washington-Seattle)
We present two results from CUDA-enabled processing of digitized objects derived from volumetric scans. (1) High-resolution, non-invasive measurement of foot bone motion during walking gait: We compute digitally reconstructed radiographs (DRRs) corre ...Read More
We present two results from CUDA-enabled processing of digitized objects derived from volumetric scans. (1) High-resolution, non-invasive measurement of foot bone motion during walking gait: We compute digitally reconstructed radiographs (DRRs) corresponding to projections of digitized bones and register with stereo fluoroscopy to obtain full 3D kinematics. Images shown include the first results obtained from full scans with multiple bones, overlaps in the projected views, and significant background noise. CUDA-powered algorithms play an essential role in speeding the DRR and registration computations to achieve rates that enable multi-patient studies. (2) Design of swept solids using a CUDA-powered image stack modeler: A multi-axis rotational sweep of a digitized talus is illustrated.  Back
 
Keywords:
Medical Imaging, GTC 2016 - ID P6270
Download:
 
Intraoperative GPU-Based Surgical Navigation for Needle Steering
Fangde Liu (Imperial College London)
Newly developed, robotically steered needles allow minimally invasive access and accurate guidance to deep-seated anatomical targets. They hope to improve efficacy of interventions such as deep brain stimulation and tumor management, while reducing p ...Read More
Newly developed, robotically steered needles allow minimally invasive access and accurate guidance to deep-seated anatomical targets. They hope to improve efficacy of interventions such as deep brain stimulation and tumor management, while reducing patient trauma. By using ultrasound to track both the tissue and needle deformation, the optimal insertion trajectory can be updated intraoperatively. The whole navigation process can be accelerated by using a GPU-implementation, which greatly reduces the navigation latency, making surgery safer and more accurate.  Back
 
Keywords:
Medical Imaging, Robotics & Autonomous Machines, GTC 2016 - ID P6279
Download:
 
GPU-Accelerated Sub-Sample Displacement Estimation Method for Real-Time Ultrasound Elastography
Bo Peng (Michigan Technological University)
Ultrasound elastography is a promising medical imaging modality that estimates mechanical properties of soft tissues. Because tissue elasticity is inferred from tissue displacements, a highly accurate displacement estimation method is critical. Recen ...Read More
Ultrasound elastography is a promising medical imaging modality that estimates mechanical properties of soft tissues. Because tissue elasticity is inferred from tissue displacements, a highly accurate displacement estimation method is critical. Recently, our group developed an improved sub-sample displacement estimation algorithm in which axial and lateral motion estimates are performed simultaneously to enhance tracking accuracy. The method computes sub-sample estimates within each region of interest with no inter-process communication, making it a natural candidate for GPU-based parallelization. In this study, the proposed method has been implemented in CUDA. It is about 60X faster than the CPU implementation while maintaining its advantages.  Back
 
Keywords:
Medical Imaging, GTC 2016 - ID P6283
Download:
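Sub-sample displacement estimators of this kind typically refine an integer-lag correlation peak by fitting a curve through neighboring samples. A minimal sketch of the classic one-dimensional parabolic fit (an illustration of the general idea, not the authors' simultaneous axial/lateral method):

```python
def subsample_peak(y_m1, y_0, y_p1):
    """Fit a parabola through three correlation values at lags -1, 0, +1
    around the integer peak; return the sub-sample offset in (-0.5, 0.5).
    Each peak refinement is independent, so thousands run in parallel."""
    denom = y_m1 - 2.0 * y_0 + y_p1
    if denom == 0.0:
        return 0.0
    return 0.5 * (y_m1 - y_p1) / denom
```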
 
GPU Implementation for Non-Cartesian Parallel MRI
Hassan Shahzad (COMSATS Institute of Information Technology, Islamabad)
Non-Cartesian trajectories provide faster MRI acquisition and efficient coverage of the k-space but they are computationally more demanding as they require gridding and de-gridding operations. CG-SENSE proposed by Pruessmann reconstructs fully sample ...Read More
Non-Cartesian trajectories provide faster MRI acquisition and efficient coverage of the k-space but they are computationally more demanding as they require gridding and de-gridding operations. CG-SENSE proposed by Pruessmann reconstructs fully sampled MR images from undersampled radial and spiral trajectories. CG-SENSE is iterative and it performs gridding and de-gridding operations in each iteration which consumes most of the CPU time; however these operations contain inherent parallelism. This work focuses on the implementation of gridding and de-gridding operations on GPU to reduce the CG-SENSE reconstruction time while maintaining the overall image quality in CG-SENSE. The results show that the GPU implementation is approximately 10 times faster than its CPU implementation.  Back
 
Keywords:
Medical Imaging, Supercomputing & HPC, GTC 2016 - ID P6317
Download:
 
Improving Automated Diabetic Retinopathy Detection with Deep Convolutional Neural Networks
Meindert Niemeijer (IDx LLC)
Diabetic Retinopathy is the leading cause of blindness in the working population of the western world. Blindness due to diabetic retinopathy can be prevented, but many people with diabetes do not receive the necessary annual screening. IDx has develo ...Read More
Diabetic Retinopathy is the leading cause of blindness in the working population of the western world. Blindness due to diabetic retinopathy can be prevented, but many people with diabetes do not receive the necessary annual screening. IDx has developed an EU approved medical device, IDx-DR, that automates diabetic retinopathy screening and can lower both the barriers to high quality screening and the costs of healthcare. Our poster demonstrates the significant device performance gains that we were able to achieve using GPUs and deep learning techniques by comparing the latest, GPU based, version of the device to the previous, CPU based, version on 5,000 diabetic retinopathy screening exams.  Back
 
Keywords:
Medical Imaging, Deep Learning & Artificial Intelligence, GTC 2016 - ID P6323
Download:
 
Patient-Specific Hyperelastic Biomechanical Models for Clinical DIR Confidence Quantification
John Neylon (UCLA Radiation Oncology)
The accuracy of clinical multi-modal deformable image registration (DIR) is difficult to quantify. A framework was previously developed to validate a deformable registration algorithm (DIR) by generating patient-specific, GPU-based biomechanical mode ...Read More
The accuracy of clinical multi-modal deformable image registration (DIR) is difficult to quantify. A framework was previously developed to validate a DIR algorithm by generating patient-specific, GPU-based biomechanical models from head-and-neck (HN) patient CT scans and creating clinically realistic ground-truth deformations [1]. We now aim to expand the model's applicability to quantify DIR confidence for clinical registrations between the planning CT and daily positioning images.  Back
 
Keywords:
Medical Imaging, GTC 2016 - ID P6353
Download:
Performance Optimization
Presentation
Media
Fast Parallel Skew and Prefix-Doubling Suffix Array Construction on the GPU
Leyuan Wang (University of California, Davis)
This poster presents the latest techniques for accelerating suffix array construction algorithms (SACAs) using CUDA. The suffix array (SA) data structure is used in a broad spectrum of applications, including data compression, bioinformatics, and tex ...Read More
This poster presents the latest techniques for accelerating suffix array construction algorithms (SACAs) using CUDA. The suffix array (SA) data structure is used in a broad spectrum of applications, including data compression, bioinformatics, and text indexing. The recent explosion in data sizes and the emergence of commodity data-parallel processors motivate efficient parallel implementations of SACAs. Because of the high arithmetic and memory throughput of many-core GPUs and multi-core CPUs, these processors are well-suited for data-intensive computing tasks such as SACAs. We address the problem by designing, implementing, and comparing two different formulations of SACAs on NVIDIA GPUs and achieve significant speedups compared with previous CPU/GPU state-of-the-art implementations.  Back
 
Keywords:
Performance Optimization, Algorithms, GTC 2016 - ID P6131
Download:
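The prefix-doubling formulation mentioned above ranks suffixes by their first k characters and doubles k each round; every round is a sort plus a rank scan, both of which parallelize well on GPUs. A compact sequential sketch of the same algorithm (illustrative, not the poster's CUDA implementation):

```python
def suffix_array(s):
    """Prefix-doubling suffix array construction.
    Each round sorts suffixes by the pair (rank of first k chars,
    rank of next k chars), then rescans to assign new ranks; on a GPU
    the sort and the scan are standard data-parallel primitives."""
    n = len(s)
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1
    while True:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new = [0] * n
        for j in range(1, n):
            new[sa[j]] = new[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new
        if rank[sa[-1]] == n - 1:   # all ranks distinct: fully sorted
            break
        k *= 2
    return sa
```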
 
Parallel Homotopy Method for the Symmetric Eigenvalue Problem
Peter Reheis (Trinity College), Peter Yoon (Trinity College)
The homotopy method can be applied to solve eigenvalue-eigenvector problems for symmetric tridiagonal matrices. It is often not of any interest to compute every eigenvalue of a matrix. Because the homotopy method possesses the order preserving proper ...Read More
The homotopy method can be applied to solve eigenvalue-eigenvector problems for symmetric tridiagonal matrices. Often it is not necessary to compute every eigenvalue of a matrix; because the homotopy method possesses the order-preserving property, it can compute any specific eigenvalue without computing any others, which also makes it a highly parallel algorithm. By using CUDA and the cuBLAS and MAGMA libraries, we achieved up to 27X speedup in computation time. Numerical results show our method is highly efficient, especially for graded matrices.  Back
 
Keywords:
Performance Optimization, Algorithms, GTC 2016 - ID P6196
Download:
 
GPU Implementation of Splitting-Up Conjugate Gradient Method
Akiyoshi Wakatani (Konan University)
We implemented a preconditioned conjugate gradient (CG) method on GPUs (GeForce GTX TITAN, Tesla K80). Our method utilizes Splitting-Up (SP) preconditioner, which is suitable for parallel processing because other dimensions except for one dimension a ...Read More
We implemented a preconditioned conjugate gradient (CG) method on GPUs (GeForce GTX TITAN, Tesla K80). Our method uses the Splitting-Up (SP) preconditioner, which is well suited to parallel processing because all dimensions except one are independent, whereas the well-known incomplete Cholesky preconditioner is hard to parallelize. A naive SPCG implementation cannot fully exploit coalesced memory accesses, so, to improve effective memory bandwidth, we perform a pseudo matrix transposition before and after solving each tridiagonal system. This strategy improves the speedup of our approach by up to 93%; in addition, the number of transpositions can be reduced by a rotation configuration.  Back
 
Keywords:
Performance Optimization, Supercomputing & HPC, GTC 2016 - ID P6276
Download:
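The overall iteration is standard preconditioned CG; only the preconditioner solve differs. A minimal NumPy sketch of the PCG skeleton, assuming a generic `precond(r)` that applies the inverse preconditioner (the SP preconditioner would replace it with a sequence of one-dimensional tridiagonal solves; the Jacobi example in the usage is purely illustrative):

```python
import numpy as np

def pcg(A, b, precond, tol=1e-10, maxiter=200):
    """Preconditioned conjugate gradients for SPD A.
    precond(r) returns M^{-1} r for the chosen preconditioner M."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

For example, a Jacobi (diagonal) preconditioner is `precond = lambda r: r / np.diag(A)`; every dot product, AXPY, and matrix-vector product above maps directly onto GPU kernels.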
 
A Dictionary Learning Approach in GPU for Image Denoising
Lizeth Joseline Fuentes Perez (Federal Fluminense University), Luciano Arnaldo Romero Calla (Federal Fluminense University)
Many image processing problems require image denoising as a preprocessing step. We address the problem of removing white Gaussian noise in images through dictionary learning, which is a technique that has been proved to better fit a signal than fixed ...Read More
Many image processing problems require image denoising as a preprocessing step. We address the problem of removing white Gaussian noise from images through dictionary learning, a technique that has been shown to fit a signal better than fixed-dictionary approaches. Learning an overcomplete dictionary for sparse representation involves a high computational cost. In this poster, we present an efficient parallel GPU algorithm that reduces training time.  Back
 
Keywords:
Performance Optimization, Video & Image Processing, GTC 2016 - ID P6294
Download:
 
Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes
Ahmad Abdelfattah (Innovative Computing Laboratory, University of Tennessee)
This work presents a high performance solution for Cholesky factorization on batches of relatively small matrices. We discuss both fixed-size and variable-size batched problems. In order to handle the irregularity associated with this type of workloa ...Read More
This work presents a high-performance solution for Cholesky factorization on batches of relatively small matrices. We discuss both fixed-size and variable-size batched problems. To handle the irregularity associated with this type of workload, we present new optimization techniques that maintain relatively high performance on such small matrix sizes. The proposed solution outperforms most existing state-of-the-art techniques for batched problems.  Back
 
Keywords:
Performance Optimization, Algorithms, Supercomputing & HPC, GTC 2016 - ID P6340
Download:
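The fixed-size batched idea can be sketched by vectorizing an unblocked right-looking Cholesky over the batch dimension, so every scalar step of the single-matrix algorithm is applied to all matrices at once. A NumPy illustration of that mapping (not the poster's optimized kernels):

```python
import numpy as np

def batched_cholesky(A):
    """Unblocked right-looking Cholesky over a batch.
    A: (batch, n, n) SPD matrices; returns the lower factors L.
    Each loop iteration touches all matrices simultaneously, mirroring
    the one-matrix-per-thread-block mapping typically used on GPUs."""
    L = A.astype(float).copy()
    n = L.shape[1]
    for j in range(n):
        L[:, j, j] = np.sqrt(L[:, j, j])          # diagonal step
        L[:, j + 1:, j] /= L[:, j, j][:, None]    # scale the column
        col = L[:, j + 1:, j]
        # rank-1 update of the trailing submatrix, batched
        L[:, j + 1:, j + 1:] -= col[:, :, None] * col[:, None, :]
    return np.tril(L)
```

Variable sizes break this uniformity, which is why the irregular case needs the dedicated scheduling techniques the poster describes.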
 
Using CLANG/LLVM Vectorization to Generate Mixed Precision Source Code
Regis Portalez (Altimesh)
At Supercomputing 2015, NVIDIA announced Jetson TX1. This platform is the first available to natively expose mixed precision instructions. However, this instruction set requires that operations on 16-bit precision floating points are done in pairs, r ...Read More
At Supercomputing 2015, NVIDIA announced Jetson TX1, the first platform to natively expose mixed-precision instructions. However, this instruction set requires that operations on 16-bit floating-point values be done in pairs, requiring use of the half2 type, which packs two values in a single register.  Back
 
Keywords:
Performance Optimization, Programming Languages, GTC 2016 - ID P6352
 
Energy-Efficient Fine-Grained DVFS
Ben Keller (University of California, Berkeley)
Fine-grained dynamic voltage and frequency scaling is a key technology for energy-efficient processors. This work demonstrates a RISC-V vector microprocessor implemented in 28nm FDSOI with fully-integrated simultaneous-switching switched-capacitor DC ...Read More
Fine-grained dynamic voltage and frequency scaling is a key technology for energy-efficient processors. This work demonstrates a RISC-V vector microprocessor implemented in 28nm FDSOI with fully-integrated simultaneous-switching switched-capacitor DC-DC converters and adaptive clocking. The converters achieve high efficiency at the system level by switching simultaneously to avoid charge sharing losses and by using an adaptive clock to maximize performance for the resulting voltage ripple. This system pushes the capabilities of dynamic voltage scaling by demonstrating fast transitions, simple packaging, and high energy efficiency.  Back
 
Keywords:
Performance Optimization, GTC 2016 - ID P6355
 
Exploring Dynamic Parallelism for Irregular Applications on GPUs
Jin Wang (Georgia Institute of Technology)
Jin Wang is a Ph.D. student whose research interests are in architecture and runtime techniques for heterogeneous computing. Her previous work includes optimizations for applications on integrated CPU-GPU systems and the development of an OpenCL runt ...Read More
Jin Wang is a Ph.D. student whose research interests are in architecture and runtime techniques for heterogeneous computing. Her previous work includes optimizations for applications on integrated CPU-GPU systems and the development of an OpenCL runtime frontend within GPU Ocelot dynamic compilation framework. More recently, she has been constructing a compilation and runtime system for executing JavaScript code on GPUs. Currently she is working on optimization of irregular applications for GPUs. Specifically, she is investigating new execution models for dynamic parallelism and their efficient compiler and microarchitecture support.  Back
 
Keywords:
Performance Optimization, Other, GTC 2016 - ID P6254
Download:
Product & Building Design
Presentation
Media
GPU Accelerated Practical Structural Analysis Using the Boundary Element Method
Ahmed Torky (Cairo University)
An implementation of GPU computing on a direct boundary element method (BEM) for Reissner's plate bending theory. The plate bending problem conducts several independent operations to obtain boundary and internal values, which is reprogrammed using C ...Read More
We present an implementation of GPU computing for a direct boundary element method (BEM) based on Reissner's plate bending theory. The plate bending problem involves several independent operations to obtain boundary and internal values, which are reprogrammed using CUDA for parallel processing. An academic and commercial software package (the PLPAK) uses BEM to solve for stress resultants and deflections of plates under bending (slabs, raft foundations, piled-raft foundations, etc.). The new PLPAK GPU code, written in CUDA Fortran, recasts the nested loops of three kernels as parallel routines: the calculation of influence matrices, the solution of the system of linear equations, and the computation of internal point stress resultants and deflections. Practical examples are shown and accuracy is maintained.  Back
 
Keywords:
Product & Building Design, Supercomputing & HPC, GTC 2016 - ID P6322
Download:
Programming Languages
Presentation
Media
Quasar: A Programming Framework for Rapid Prototyping
Bart Goossens (Ghent University)
We present a new programming language, Quasar, which mitigates the common drawbacks of GPU programming for rapid prototyping. Quasar is an easy-to-learn, high-level programming language that is hardware-independent, ideal for both rapid prototyping a ...Read More
We present a new programming language, Quasar, which mitigates the common drawbacks of GPU programming for rapid prototyping. Quasar is an easy-to-learn, high-level programming language that is hardware-independent, ideal for both rapid prototyping and full deployment on heterogeneous hardware. By using Quasar, a researcher can write compact code in a scripting language while getting high performance due to the use of GPU acceleration. In addition to the Quasar language, we present the development tools that facilitate performance optimization, e.g., profile analyzers and automated code feedback.  Back
 
Keywords:
Programming Languages, Tools & Libraries, GTC 2016 - ID P6172
Download:
 
Locality-Aware Memory Association for Pipelining and Multi-Device Worksharing
Thomas Scogland (Lawrence Livermore National Laboratory)
Advances in directive-based programming models have made GPU programming more accessible than ever. Even so, models like OpenMP 4.0 and OpenACC lack worksharing and memory management facilities for multi-GPU environments. We present a memory-associat ...Read More
Advances in directive-based programming models have made GPU programming more accessible than ever. Even so, models like OpenMP 4.0 and OpenACC lack worksharing and memory management facilities for multi-GPU environments. We present a memory-association interface for directive-based models that enables multi-device worksharing, automated pipelining for greater support of out-of-core workloads, as well as NUMA management all as a single extension. Our implementation, AffinityTSAR, scales well to multiple GPUs, GPUs and CPUs together, and even shows improvement in CPU-only performance.  Back
 
Keywords:
Programming Languages, GTC 2016 - ID P6236
Download:
 
Dynamical Analysis of Connected Neuronal Motifs with OpenAcc and OpenMPI
Krishna Pusuluri (Georgia State University)
Large-scale analysis of the dynamical behavior of central pattern generators (CPGs) formed by neuronal networks of even small sizes is computationally intensive and grows exponentially with network size. We have developed a suite of tools to exhausti ...Read More
Large-scale analysis of the dynamical behavior of central pattern generators (CPGs) formed by neuronal networks of even small sizes is computationally intensive and grows exponentially with network size. We have developed a suite of tools to exhaustively study the behavior of such networks on modern GPGPU accelerator clusters using OpenACC and OpenMPI. Directive-based approaches simplify the task of porting serial code onto GPUs without expertise in CUDA or OpenCL. Three-cell neuronal CPGs have been explored previously using various GPGPU tools. As motifs form the building blocks of larger networks, we have employed our framework to study four-cell CPGs and two connected three-cell motifs. We discuss the performance improvements achieved using this framework and present some of our results.  Back
 
Keywords:
Programming Languages, Deep Learning & Artificial Intelligence, GTC 2016 - ID P6269
Download:
 
Accelerating Java Applications Using GPGPUs
James Clarkson (University of Manchester)
Over the last few years we have been researching ways to exploit features of managed-languages, such as Java, to simplify programming GPGPUs, we'll present our state of the art prototype: Tornado. Tornado is a framework that allows Java programmers ...Read More
Over the last few years we have been researching ways to exploit features of managed languages, such as Java, to simplify programming GPGPUs. We'll present our state-of-the-art prototype: Tornado, a framework that allows Java programmers to write GPU-accelerated applications in 100% pure Java. It employs a task-based programming model, which makes it simple to compose complex processing pipelines that execute multiple kernels across multiple GPGPUs. A key outcome of Tornado is that, with minimal refactoring, an application can be ported onto a GPGPU. We'll demonstrate a real-time computer vision application, ported from CUDA into Java, that reconstructs a 3D scene from a stream of RGB-D data.  Back
 
Keywords:
Programming Languages, Computer Vision & Machine Vision, GTC 2016 - ID P6341
Download:
Real-Time Graphics
Presentation
Media
Fast Sorting in OpenGL Shaders for Order Independent Transparency
Pyarelal Knowles (RMIT University)
Sorting is a bottleneck when rendering large scenes with order independent transparency. The problem is to quickly sort millions of small lists of varying sizes, up to hundreds of items, on the GPU. Two techniques are shown to provide a compound perf ...Read More
Sorting is a bottleneck when rendering large scenes with order independent transparency. The problem is to quickly sort millions of small lists of varying sizes, up to hundreds of items, on the GPU. Two techniques are shown to provide a compound performance improvement of over a factor of 10. The first is backwards memory allocation (BMA), which groups similar threads and executes them in batches. This improves GPU occupancy and allows a strategy pattern approach to the sorting algorithm. The second is register-based block sort (RBS), which improves local memory use with careful and explicit use of registers and works well in combination with BMA. Their improvements are shown to increase with GPU generations.  Back
 
Keywords:
Real-Time Graphics, Performance Optimization, GTC 2016 - ID P6110
Download:
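Order-independent transparency requires depth-sorting each pixel's short fragment list before blending. As an illustration of the kind of small-list sort that register-based block sort optimizes, here is the insertion-sort baseline for one pixel's list (a sketch; fragment tuples and names are illustrative):

```python
def sort_fragments(frags):
    """Depth-sort one pixel's transparent fragment list, front to back.
    Per-pixel lists in OIT are short (up to a few hundred entries), so a
    simple insertion sort -- which keeps the working set in a fixed set
    of registers per GPU thread -- is a common baseline."""
    for i in range(1, len(frags)):
        key = frags[i]
        j = i - 1
        while j >= 0 and frags[j][0] > key[0]:
            frags[j + 1] = frags[j]
            j -= 1
        frags[j + 1] = key
    return frags
```

Backwards memory allocation then groups pixels with similarly sized lists so that threads in a batch run the same sorting strategy, which is what recovers GPU occupancy.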
Robotics & Autonomous Machines
Presentation
Media
Hercules: High-Performance Real-time Architectures for Low-Power Embedded Systems
Paolo Burgio (University of Modena and Reggio Emilia, Italy)
Many-core architectures are the key building block for the next generation of embedded systems, where power consumption will be the primary concern. Platforms such as NVIDIA Tegra X1 with a GPU and a multi-core host provide an unprecedented perf ...Read More

Many-core architectures are the key building block for the next generation of embedded systems, where power consumption will be the primary concern. Platforms such as NVIDIA Tegra X1 with a GPU and a multi-core host provide an unprecedented performance/watt trade-off, but they are not yet widely adopted in domains such as advanced driver assistance systems (ADAS), where safety-critical requirements and a tight interaction with the surrounding environment call for predictable performance. The Hercules project will develop an integrated framework to obtain predictable performance on top of cutting-edge heterogeneous COTS many-core platforms, with the final goal of obtaining an order-of-magnitude improvement in the cost and power consumption of next-generation real-time applications.  Back
 
Keywords:
Robotics & Autonomous Machines, Self-Driving Cars & Automotive, IoT, Automotive, GTC 2016 - ID P6167
Download:
 
Acceleration of a Pseudo-Bacterial Potential Field Algorithm for Path Planning
Ulises Orozco-Rosas (Instituto Politecnico Nacional)
Path planning of a mobile robot -- determining an optimal path from a universe of possible solutions -- is one of the most computationally intensive tasks and a challenge in dynamically changing environments. Using GPUs, it is possible to process data-intensive tasks efficiently. This work presents the acceleration of a Pseudo-Bacterial Potential Field (PBPF) algorithm for path planning. The Matlab-CUDA implementation of the PBPF algorithm shows how to find an optimal collision-free path for a mobile robot and how to speed up the path planning computation through the use of GPUs. The simulation results demonstrate the efficiency of the PBPF implementation in solving the path planning problem in both offline and online modes.

  Back
 
Keywords:
Robotics & Autonomous Machines, Self-Driving Cars & Automotive, IoT, Automotive, GTC 2016 - ID P6288
Download:
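The potential field underlying the PBPF method can be sketched on the CPU: an attractive potential pulls the robot toward the goal while a repulsive potential pushes it away from nearby obstacles, and the path is traced by gradient descent. The gains, step size, and scenario below are illustrative assumptions; the poster's pseudo-bacterial tuning of these parameters and its CUDA parallelization are not reproduced.

```python
# Illustrative CPU sketch of the artificial potential field that PBPF builds
# on: attractive pull toward the goal, repulsive push away from obstacles,
# path traced by gradient descent. All gains are assumed values.
import math

K_ATT, K_REP, D0 = 1.0, 0.5, 1.0  # attraction, repulsion, influence radius

def force(p, goal, obstacles):
    # Attractive force: negative gradient of (K_ATT/2)*||p - goal||^2.
    fx = -K_ATT * (p[0] - goal[0])
    fy = -K_ATT * (p[1] - goal[1])
    for ox, oy in obstacles:
        dx, dy = p[0] - ox, p[1] - oy
        d = math.hypot(dx, dy)
        if 1e-9 < d < D0:
            # Repulsive force from (K_REP/2)*(1/d - 1/D0)^2, pointing away
            # from the obstacle and growing sharply as d shrinks.
            c = K_REP * (1.0 / d - 1.0 / D0) / d ** 3
            fx += c * dx
            fy += c * dy
    return fx, fy

def plan(start, goal, obstacles, eta=0.05, steps=2000, tol=1e-2):
    # Follow the force field; each step yields one waypoint of the path.
    p, path = start, [start]
    for _ in range(steps):
        fx, fy = force(p, goal, obstacles)
        p = (p[0] + eta * fx, p[1] + eta * fy)
        path.append(p)
        if math.hypot(p[0] - goal[0], p[1] - goal[1]) < tol:
            break
    return path
```

Each waypoint's force evaluation is independent of the others only across candidate paths, which is where a GPU population-based method like PBPF gains its parallelism.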
 
Parallel computation of the value set of frequency response for uncertain systems with GPGPU
Paluri Nataraj (Indian Institute of Technology Bombay)
We'll describe an efficient algorithm to compute the value set of any parametric uncertain system that can be modeled as an input-output transfer function. This value set of the frequency response is obtained by calculating the magnitudes and phases of all possible uncertain transfer functions. In robotics and autonomous machines applications, such value sets are very useful for controller synthesis. These computations are independent of each other and can be executed in parallel. To show the effectiveness of the proposed method, we chose a 3-DOF longitudinal aircraft model with a large number of parameters.  Back
 
Keywords:
Robotics & Autonomous Machines, Algorithms, IoT, GTC 2016 - ID P6312
Download:
 
GPU Acceleration of Computational Electromagnetics Methods
Vivek Venugopalan (United Technologies Research Center)
High fidelity prediction of the link budget between a pair of transmitting and receiving antennas in dense and complex environments is computationally very intensive at high frequencies. Iterative physical optics (IPO) is a scalable solution for electromagnetic (EM) simulations with complex geometry. An efficient and robust solution is presented to predict the link budget between antennas in a dense environment for two use cases: (1) multi-objective path planning during autonomous navigation and (2) modeling the propagation of Wi-Fi signals inside aircraft cabins. Two NVIDIA GPUs with different numbers of cores and amounts of device memory were targeted for benchmarking the performance of the IPO algorithm.  Back
 
Keywords:
Robotics & Autonomous Machines, Aerospace & Defense, IoT, GTC 2016 - ID P6316
Download:
Self-Driving Cars & Automotive
Presentation
Media
GPU-Based Pedestrian Detection for Autonomous Driving
Victor Campmany (Computer Vision Center), Juan Carlos Moure (University Autonoma of Barcelona)
Pedestrian detection for autonomous driving has gained a lot of prominence during the last few years. Besides the fact that this is one of the hardest tasks within computer vision, it involves huge computational costs. Obtaining acceptable real-time performance, measured in frames per second (fps), for the most advanced algorithms is a difficult challenge. We propose a CUDA implementation of a well known pedestrian detection system (i.e., Random Forest of Local Experts). It includes LBP and HOG as feature descriptors and SVM and Random Forest as classifiers. We introduce significant algorithmic adjustments and optimizations to adapt the problem to the NVIDIA GPU architecture. The aim is to deploy a real-time system providing reliable results.

  Back
 
Keywords:
Self-Driving Cars & Automotive, Automotive, Computer Vision & Machine Vision, GTC 2016 - ID P6181
Download:
Signal & Audio Processing
Presentation
Media
Serving Multiple Concurrent Interactive Sessions with a GPU-based Speech Recognition System
Alexei V. Ivanov (Verbumware Inc.)
We explore the possibility of using GPUs in a cloud implementation of an interactive automated speech recognition server. Among the specific challenges for this system, we see rapid model adaptation, context switching, and scheduling of small incoming data chunks. Ultimately, we observe significant advantages with our GPU speech recognizer implementation, as it commits the total available computational resources to solving a single task at any given moment, allowing higher computational throughput under moderate computational load. We measure the performance of our system against the open-source reference implementation of Kaldi, using models obtained from the large, publicly available Librispeech collection.  Back
 
Keywords:
Signal & Audio Processing, Algorithms, GTC 2016 - ID P6242
Download:
 
CUDA-Accelerated Acquisition of Spread Spectrum Signal in Satellite Communication
Ying Liu (University of Chinese Academy of Sciences)
Spread spectrum communication is used in many satellite communication applications, e.g., GPS. Due to its high computational complexity, no real-time spread spectrum signal acquisition (a critical step in the processing flow of satellite communication) had been achieved on a CPU-based system without FPGAs or DSPs. Thus, we proposed using CUDA-enabled GPUs to accelerate it. The computational core, sliding correlation, was identified, and an efficient CUDA parallelization scheme was proposed. A CUDA-enabled acquisition algorithm was implemented. Experimental results on data from a real satellite spread spectrum system showed up to a 212X speedup on a Tesla K20 GPU over the execution time on a CPU with Intel IPP. Real-time acquisition was achieved in most cases and good scalability was observed.  Back
 
Keywords:
Signal & Audio Processing, Algorithms, GTC 2016 - ID P6267
Download:
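The sliding-correlation core named in the abstract can be sketched as follows: the receiver correlates the incoming samples against a local replica of the spreading code at every candidate code phase and picks the peak. In the CUDA version each thread would evaluate one offset; this pure-Python sketch, with an assumed toy ±1 code and delay, loops over offsets serially.

```python
# Illustrative sketch of the sliding-correlation core of spread spectrum
# acquisition. The toy +/-1 code, its length, and the 37-chip delay are
# assumptions; the real system and its CUDA mapping are not reproduced.
import random

def sliding_correlation(received, code):
    # Correlate the received samples with the code at every circular offset;
    # on the GPU, each thread would compute one offset independently.
    n = len(code)
    return [sum(received[(off + i) % len(received)] * code[i] for i in range(n))
            for off in range(n)]

def acquire(received, code):
    # The estimated code phase is the offset with the maximum correlation.
    peaks = sliding_correlation(received, code)
    return max(range(len(peaks)), key=lambda k: peaks[k])

# Toy signal: a random +/-1 spreading code, received with a 37-chip
# circular delay and no noise.
random.seed(0)
code = [random.choice((-1, 1)) for _ in range(127)]
delay = 37
received = code[-delay:] + code[:-delay]
```

Running `acquire(received, code)` recovers the delay; real acquisition also searches Doppler frequency bins, which this sketch omits.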
Supercomputing & HPC
Presentation
Media
Reducing Remote GPU Execution's Overhead with mrCUDA®
Pak Markthub (Tokyo Institute of Technology)
Remote GPU execution (e.g., rCUDA) has been proven useful in many situations, including increasing overall resource utilization in multi-GPU batch-queue systems. However, for applications that communicate intensively with GPUs, remote GPU execution overhead can become large, even when InfiniBand is used as the communication backend. Since using local GPUs is always better in terms of performance, we propose mrCUDA, a middleware for transparently migrating work from remote to local GPUs. We present how mrCUDA works, along with two case studies showing how much mrCUDA can improve LAMMPS's performance and how it solves the scattered idle-GPU problem compared with continuously using rCUDA.  Back
 
Keywords:
Supercomputing & HPC, Tools & Libraries, GTC 2016 - ID P6130
Download:
 
Strassen's Algorithm in a Heterogeneous CPU-GPU Distributed System
Liliana Ibeth Barbosa Santillan (University of Guadalajara)
We review the Strassen algorithm for matrix multiplication. The major difficulty for its implementation in hybrid/heterogeneous systems is the number of programming tools required for managing the hardware. Our objective is to take advantage of both task and data parallelism by using a framework that simplifies the implementation while improving performance. The problem is divided into many independent tasks mapped to computing devices, and then each task is executed using data parallelism to harness the advantages of GPUs and multicore CPUs. To verify the advantages of this task/data parallel approach, we performed several experiments varying the size of the matrix. Our results show that we can achieve up to a 3.4X speedup using the same application.  Back
 
Keywords:
Supercomputing & HPC, Algorithms, GTC 2016 - ID P6149
Download:
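For reference, Strassen's recursion replaces the eight block products of naive matrix multiplication with seven (M1..M7), which are mutually independent and therefore natural units for the task-parallel mapping the poster describes. A minimal pure-Python sketch for power-of-two sizes:

```python
# A minimal pure-Python sketch of Strassen's algorithm for n x n matrices
# with n a power of two. The seven products M1..M7 are independent, which is
# what a task-parallel mapping to multiple devices can exploit.

def madd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def msub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    # Seven independent half-size products instead of eight.
    M1 = strassen(madd(A11, A22), madd(B11, B22))
    M2 = strassen(madd(A21, A22), B11)
    M3 = strassen(A11, msub(B12, B22))
    M4 = strassen(A22, msub(B21, B11))
    M5 = strassen(madd(A11, A12), B22)
    M6 = strassen(msub(A21, A11), madd(B11, B12))
    M7 = strassen(msub(A12, A22), madd(B21, B22))
    # Recombine the quadrants of the result.
    C11 = madd(msub(madd(M1, M4), M5), M7)
    C12 = madd(M3, M5)
    C21 = madd(M2, M4)
    C22 = madd(madd(msub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

In the heterogeneous setting, each Mi can be dispatched as a task to a GPU or a multicore CPU, with the device performing the half-size multiplication data-parallel.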
 
Astro: A Low Cost Low Power Computing Cluster
Gavin Baker (California Polytechnic State University, San Luis Obispo), Chris Lupo (California Polytechnic State University, San Luis Obispo), Sean Sheen (California Polytechnic State University, San Luis Obispo)
On the path to exascale computing, the trend toward more expensive, power-hungry high performance computing (HPC) clusters presents a number of challenges to academic institutions with limited funding that wish to contribute to the study of scalable architectures and software resilience. Hybrid CPU-GPU systems have the potential to reduce the cost and power consumption of next-generation computing clusters. This poster demonstrates the efficacy of such systems within commodity computing clusters through our design, implementation, and benchmarking of a hybrid CPU-GPU cluster based on the NVIDIA Jetson TK1 board. Our new computing system has a theoretical peak floating-point performance of ~24 TFLOPS with a power consumption of 1 kW under full load, and was built at a cost of $13,000.  Back
 
Keywords:
Supercomputing & HPC, Performance Optimization, GTC 2016 - ID P6207
Download:
 
enerGyPU for Monitoring Performance and Power Consumption on Multi-GPUs
John Anderson Garcia Henao (Research Assistant, High Performance and Scientific Computing Unit, UIS)
enerGyPU is a tool for analyzing multiple tests under different combinations of parameters to observe the key factors that determine energy efficiency, in terms of "Energy per Computation," on clusters with multi-GPUs. enerGyPU is a monitor that centralizes and automates the capture of runtime data while executing in parallel with the scientific application; it displays this information through sequence plots, statistical tables, and bar graphs, and reports results in terms of energy efficiency.  Back
 
Keywords:
Supercomputing & HPC, GTC 2016 - ID P6210
Download:
 
GPU Effectiveness for DART
Ye Feng (University of Wyoming)
Data Assimilation Research Testbed (DART) is a framework that makes it easy for modelers, observational scientists, and geophysicists to explore a variety of data assimilation methods and confront different numerical models with observations. Approximately 25% of the Yellowstone supercomputer resources run DART applications. The scalability and efficiency of the DART system is of paramount concern in cases where forecasting and weather modeling is required. Over the past few years, general-purpose graphics processing units (GPGPUs) have emerged as an inexpensive way to accelerate scientific applications. In these projects, we focused on implementing new parallel versions of the targeted functions with CUDA Fortran on NVIDIA GPUs (K20x), which are available on the Yellowstone supercomputer.  Back
 
Keywords:
Supercomputing & HPC, GTC 2016 - ID P6221
Download:
 
HPC for Remote Visualization and Interaction with Scientific Applications
Deyberth Riano Nunez (Universidad Industrial de Santander)
We present a work in progress, integrating remote visualization, interaction and high performance computing for the development of scientific applications, taking advantage of ultra-high-resolution display walls as a collaborative environment.  Back
 
Keywords:
Supercomputing & HPC, Graphics Virtualization, GTC 2016 - ID P6290
Download:
 
Profiler Guided Manual Optimization for Accelerating the Cholesky Decomposition
Vinay Ramakrishnaiah (University of Wyoming/National Center for Atmospheric Research)
'fields' is a widely used R package for analyzing spatial data with spatial statistics. At the National Center for Atmospheric Research (NCAR) it is used by the IMAGe group. We made use of the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library to accelerate the Cholesky decomposition (CD) in 'fields'. Profiling provided the insight needed to accelerate the CD. We were able to optimize the code and environment to get a speedup greater than 75x for large matrices. We integrated our accelerated C functions with Julia and drew a performance comparison between R and Julia; Julia achieved a speedup of up to 4x for large matrices. We also found a potential way to improve MAGMA functions by replacing the intra-node inter-GPU communications with direct device-to-device calls.  Back
 
Keywords:
Supercomputing & HPC, Earth System Modelling, GTC 2016 - ID P6291
Download:
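For reference, the Cholesky decomposition being accelerated factors a symmetric positive-definite matrix A into L * L^T with L lower triangular. A minimal unblocked pure-Python sketch (real MAGMA/GPU versions work on blocks, which this does not show):

```python
# Unblocked pure-Python Cholesky decomposition: factor a symmetric
# positive-definite A into L * L^T with L lower triangular. MAGMA's GPU
# version is blocked; this sketch only shows the arithmetic and the
# column-to-column dependency that blocking must respect.
import math

def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract contributions of the earlier columns.
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            # Entries below the diagonal of column j.
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L
```

Column j depends on all columns before it, but the updates within a column are independent, which is the parallelism a GPU implementation exploits at block granularity.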
 
OpenACC Enabled Benchmark Suite on Intel Ivy Bridge
Joel Bricker (University of Delaware)
We explore the new OpenACC implementation to parallelize code for a multi-core processor. We use an OpenMP implementation on the same code base and compare the performance results obtained when running the code instrumented with both standards. The results are notable because the OpenMP standard has supported multi-core parallelism for some time, whereas the OpenACC standard only recently began supporting multi-core targets. As such, it is important for the OpenACC implementation's performance to match, or exceed, that of the existing OpenMP standard.  Back
 
Keywords:
Supercomputing & HPC, Tools & Libraries, GTC 2016 - ID P6307
Download:
 
Advanced High-Productivity Framework for Large-Scale GPU/CPU Stencil Computations
Takashi Shimokawabe (Tokyo Institute of Technology)
A high-productivity framework for multi-GPU and multi-CPU computation of stencil applications is proposed. Our framework is implemented in the C++ and CUDA languages. It automatically translates user-written stencil functions that update a grid point and generates both GPU and CPU codes. Programmers write user code in plain C++, and it can be executed on multiple GPUs with an auto-tuning mechanism and an overlapping method that hides communication cost behind computation. It can also be executed on multiple CPUs with OpenMP without any change to the code. In addition, our framework provides a data structure that supports element-wise computations, which allows us to write GPU kernel codes as inline codes. This poster presents our proposed framework and its performance evaluation.  Back
 
Keywords:
Supercomputing & HPC, Computational Physics, GTC 2016 - ID P6334
Download:
 
Testing Fine-Grained Parallelism for the ADMM on a Factor-Graph
Jose Bento (Boston College)
You'll learn how to use the popular ADMM (Alternating Direction Method of Multipliers) to perform optimization and how to accelerate it using GPUs in a way that avoids writing any parallel code. More specifically, you'll learn (1) how the ADMM works and how it reduces an optimization problem to an iterative scheme on a graph, (2) which computations in this scheme can be parallelized, (3) where automatic parallelism enters the picture, (4) when to expect speedups, and (5) speedup values in three different applications: combinatorial optimization, machine learning, and optimal control. Finally, you'll learn about parADMM, a tool we built to quickly prototype your own optimization solvers without having to write application-specific parallel code.  Back
 
Keywords:
Supercomputing & HPC, Algorithms, GTC 2016 - ID P6346
Download:
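The ADMM iteration the poster builds on alternates two proximal updates and a dual update. A one-variable sketch for f(x) = (x - a)^2 plus g(z) = lam*|z| (a scalar lasso, chosen here for illustration; it is not one of the poster's applications):

```python
# One-variable sketch of the ADMM iteration: minimize f(x) + g(z) subject to
# x = z, with f(x) = (x - a)^2 and g(z) = lam * |z| (a scalar lasso). The
# problem instance and the penalty rho are assumed for illustration.

def soft_threshold(v, t):
    # Proximal operator of t * |z|: shrink toward zero by t.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def admm_lasso(a, lam, rho=1.0, iters=200):
    x = z = u = 0.0
    for _ in range(iters):
        # x-update: argmin (x - a)^2 + (rho/2)(x - z + u)^2, in closed form.
        x = (2 * a + rho * (z - u)) / (2 + rho)
        # z-update: proximal step on lam * |z| at x + u.
        z = soft_threshold(x + u, lam / rho)
        # Dual update: accumulate the constraint residual x - z.
        u += x - z
    return z
```

For a = 3 and lam = 1 the iteration converges to 2.5, where the quadratic's gradient balances the l1 subgradient. On a factor-graph, many such local x- and z-updates are independent, which is what makes the scheme amenable to automatic GPU parallelization.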
Tools & Libraries
Presentation
Media
A Case Study on Programming Heterogeneous Multi-GPU with StarPU Library.
Esteban Clua (Universidade Federal Fluminense), Joao Gazolla (Universidade Federal Fluminense)
We present a case study of kernel executions that take advantage of the StarPU library to exploit heterogeneous multi-GPU systems. This third-party library, developed by INRIA in France, provides a unified view of the computational resources. This way, the programmer can concentrate more on programming than on handling low-level issues such as data transfers. It uses a task-based model and allows developers to implement custom scheduling policies. In this case study, thousands of matrix multiplications are performed with different scheduling policies and different numbers of processing units, resulting in different execution times, while low-level issues are handled by the StarPU library.  Back
 
Keywords:
Tools & Libraries, Supercomputing & HPC, GTC 2016 - ID P6104
Download:
 
Utilization and Expansion of ppOpen-AT for OpenACC
Satoshi Ohshima (The University of Tokyo)
OpenACC attracts attention as an easy and useful GPU programming environment. While OpenACC is not difficult to use, users have to spend time and energy optimizing OpenACC programs. To address this, we are developing an auto-tuning (AT) language named ppOpen-AT. We have shown that this language is useful for multi- and many-core parallel programming. We investigate the usability of ppOpen-AT for OpenACC programs and propose extensions to ppOpen-AT for further optimization of OpenACC. While ppOpen-AT for OpenACC is still in development, its effectiveness has been demonstrated, and we believe the next generation of ppOpen-AT will help with various optimization tasks for OpenACC programs.  Back
 
Keywords:
Tools & Libraries, Programming Languages, GTC 2016 - ID P6163
Download:
 
Benefits of Remote GPU Virtualization: The rCUDA Perspective
Federico Silla (Technical University of Valencia)
Many applications use GPUs to accelerate their execution. However, using GPUs presents several side effects, such as increased acquisition and maintenance costs and space requirements. Moreover, these increased costs may not be easily amortized because GPUs usually present very low utilization rates. In a similar way to virtual machines, the use of virtual GPUs may overcome the concerns associated with the use of real GPU devices. The remote GPU virtualization technique allows an application being executed in a computer not having a GPU to transparently make use of a GPU installed in another node of the cluster. Although the use of remote GPUs may seem to be a senseless idea, it provides several benefits as described in this poster by using the rCUDA (remote CUDA) middleware.  Back
 
Keywords:
Tools & Libraries, Data Center & Cloud Computing, GTC 2016 - ID P6164
Download:
 
CudaPAD - A Quick on-the-fly PTX/SASS Viewer
Ryan White (IT Consultant)
CudaPAD is Windows-based software that aids in optimizing and understanding NVIDIA CUDA kernels by displaying an on-the-fly view of the PTX/SASS that makes up the GPU kernel. CudaPAD simply shows the PTX/SASS output; however, it has several visual aids to help show how minor code tweaks or compiler options can affect the PTX/SASS. Just type or paste the kernel in the left panel, and the right panel will display the corresponding disassembly information. Visual aids like CUDA-to-PTX code-matching lines and WinDiff are built in to help identify PTX sections quickly. Other on-the-fly information is also given, like register counts, memory usage, and error information.  Back
 
Keywords:
Tools & Libraries, Performance Optimization, GTC 2016 - ID P6247
Download:
 
Performance Comparison of a CUDA® Interval Arithmetic Library with Standard Interval Library
Paluri Nataraj (Indian Institute of Technology, Bombay, India)
We developed a CUDA-based interval arithmetic library for GPU users. This library is based on the ideas in C-XSC interval library. We compare the performance of the developed CUDA interval library with that of the C-XSC interval library for different interval arithmetic operations. The CUDA interval library is found to be much faster than the standard C-XSC library. This CUDA interval arithmetic library allows advanced interval techniques, such as interval global optimization, to be performed in comparatively very little time on GPUs.  Back
 
Keywords:
Tools & Libraries, Algorithms, GTC 2016 - ID P6248
Download:
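Interval arithmetic of the kind the library implements can be sketched in a few lines: each operation returns an interval guaranteed to contain all possible results for operands drawn from the input intervals. This Python sketch tracks exact endpoints and omits the directed (outward) rounding that C-XSC and the CUDA library perform:

```python
# Minimal interval arithmetic sketch: each operation returns an interval
# guaranteed to enclose every possible result of the operation on its
# operands. Unlike C-XSC or the CUDA library, no directed (outward)
# rounding of the endpoints is performed here.

class Interval:
    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Subtracting an interval swaps its endpoints.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The product range is spanned by the four endpoint products.
        p = (self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi)
        return Interval(min(p), max(p))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"
```

Techniques such as interval global optimization evaluate huge numbers of such independent interval operations, which is why a GPU implementation pays off.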
 
Tuning Heterogeneous Computing Architectures Through Integrated Performance Tools
Robert Lim (University of Oregon)
Heterogeneous computing presents challenges in optimizing performance across diverse architectures, high-speed networks, and programming methods in these systems. Observing GPU hardware performance counters, collected either through instrumentation or sampling, elucidates kernel execution, but does not provide a means to correlate dense activity regions with source line information. GPUs specialize in executing SIMD (single instruction, multiple data) in lock-step, where threads that do not satisfy branching conditions are masked out. Control flow graphs represent program control flow and dependencies in a program. In deriving trip counts of the control flow graph, one can determine how an input of size N will perform on a GPU without having to compile or run the application.  Back
 
Keywords:
Tools & Libraries, Performance Optimization, GTC 2016 - ID P6282
Download:
Video & Image Processing
Presentation
Media
GPU Implementation and Optimization of Video Super-Resolution
Bo Xiao (Baidu Research SVAIL)
We applied SRCNN, a convolutional neural network for image super-resolution, to FHD-to-4K video super-resolution. SRCNN runs very slowly on the CPU, so we parallelized the convolution and implemented SRCNN on the GPU. We make use of the GPU memory hierarchy to optimize the algorithm. The GPU implementation and these optimizations significantly accelerate the video super-resolution process, achieving a speed of 1.2 s per frame, almost 300 times faster than the CPU implementation.  Back
 
Keywords:
Video & Image Processing, GTC 2016 - ID P6115
Download:
 
A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows
Timothy Blattner (NIST)
Designing scalable applications is key to improving performance in hybrid computing. Scheduling code to utilize parallelism is difficult, particularly when dealing with data dependencies, memory management, data motion, and processor occupancy. The Hybrid Task Graph Scheduler (HTGS) increases programmer productivity when implementing hybrid workflows that scale to multi-core and multi-GPU systems. HTGS manages dependencies between tasks, represents CPU and GPU memories independently, overlaps computations with disk I/O and memory transfers, uses multiple GPUs, and uses all available compute resources. We present an implementation of hybrid microscopy image stitching using HTGS that reduces code size by ~25% and shows favorable performance compared to an implementation without HTGS.  Back
 
Keywords:
Video & Image Processing, Supercomputing & HPC, GTC 2016 - ID P6198
Download:
 
G-Storm: GPU-Aware Scheduling in Storm
Cheng-Hao Huang (National Tsing Hua University)
As we shift toward a data-driven economy, the ability to efficiently analyze huge amounts of data in shorter amounts of time is the key to success. Many systems for big data processing have been developed. Storm is one of them, targeting stream data processing.  Back
 
Keywords:
Video & Image Processing, GTC 2016 - ID P6252
Download:
 
QuickEye: Highly Efficient Face Detection & Recognition in Large Video Using Hadoop GPU-cluster
Saehanseul Yi (Dasan Networks), Youngmin Yi (University of Seoul), Illo Yoon (University of Seoul)
Ignoring the debate about privacy, we cannot deny that CCTV has made a positive contribution to crime prevention. CCTV is everywhere, and more is coming: more and more video files are created, and they are getting bigger. It is difficult to handle these big video files with a single server, so let's try Hadoop! Hadoop is a big data framework that is easily distributed across a cluster environment. By default, Hadoop cannot utilize GPUs because it runs on the JVM, so we attached GPU code to Hadoop using JNI (Java Native Interface) and introduce a system called QuickEye. QuickEye decodes large video files and performs face detection and recognition using CUDA.  Back
 
Keywords:
Video & Image Processing, GTC 2016 - ID P6311
Download:
 
Large Time Series Single-Molecule Tracking Including Defocus and Motion Blur Control
Xu Xiaochun (National University of Singapore)
We'll present an operational tracking implementation for multi-channel microscopy time series from hundreds to tens of thousands of frames, depicting the dim traces of single fluorescent molecules moving over time. The characteristic shape of an optical point source is used to localize and trace thousands of molecules fast, accurately, and reliably over a timespan of several minutes.  Back
 
Keywords:
Video & Image Processing, Computational Biology, GTC 2016 - ID P6344
Download:
Virtual Reality & Augmented Reality
Presentation
Media
GPU Accelerated Method for Estimation of Light-Sources
Esteban Clua (Universidade Federal Fluminense), Bruno Augusto Dorta Marques (Universidade Federal Fluminense)
The estimation of the light source configuration of a real-world environment can benefit a wide range of applications. Intelligent applications can produce different behaviors based on the lighting present in the user's environment by, for example, adjusting the color scheme of the interface, changing the brightness, and controlling the white balance of a display. One area of particular interest is augmented reality and mixed reality, which require that both real-world and virtual elements have consistent appearance and lighting conditions. This study proposes a GPU-accelerated method to recognize some aspects of the light sources of a real environment. The method is built around a fast evaluation function that fits within the tight time constraints of a real-time application.  Back
 
Keywords:
Virtual Reality & Augmented Reality, Deep Learning & Artificial Intelligence, GTC 2016 - ID P6168
Download:
 
Accelerated Transport System Simulation Using CUDA
Peter Heywood (The University of Sheffield)
Discover how GPUs are being used to accelerate predictive simulations used in transport system planning and management to alleviate the global increase in transport demand. We highlight the role of predictive, high-performance micro-simulations in transport system management and provide insight into the development process and benchmark performance of agent-based transport models developed using FLAME GPU, including the creation of a populated virtual reality environment using an omnidirectional treadmill.  Back
 
Keywords:
Virtual Reality & Augmented Reality, GTC 2016 - ID P6203
Download:
 
 
NVIDIA - World Leader in Visual Computing Technologies
Copyright © 2017 NVIDIA Corporation