We explore two contending recognition network representations for speech inference engines: the linear lexical model (LLM) and the weighted finite state transducer (WFST), on NVIDIA GTX285 and GTX480 GPUs. We demonstrate that although an inference engine using the simpler LLM representation must evaluate 22x more transitions per second than one using the more advanced WFST representation, the simple structure of the LLM allows 4.7-6.4x faster evaluation and 53-65x faster operand gathering for each state transition. We show that the performance of a speech inference engine based on the LLM representation is competitive with the WFST representation on highly parallel GPUs.
Automatic speech recognition (ASR) technology is emerging as a critical component in data analytics for the wealth of media data being generated every day. ASR-based applications contain fine-grained concurrency that has great potential to be exploited on the GPU. However, the state-of-the-art ASR algorithm involves a highly parallel graph traversal over an irregular graph with millions of states and arcs, making efficient parallel implementations highly challenging. We present four generalizable techniques: dynamic data-gather buffers, find-unique, lock-free data structures using atomics, and hybrid global/local task queues. When used together, these techniques can effectively resolve ASR implementation challenges on an NVIDIA GPU.
With increasing amounts of data and user demands for fast query processing, the optimization of database operations continues to be a challenging task. A common optimization method is to leverage parallel hardware architectures. With the introduction of general-purpose GPU computing, massively parallel hardware has become available in commodity machines. To efficiently exploit this technology, we introduce the method of speculative query processing, which operates on index structures to efficiently support heavily used database operations. To show the benefits and opportunities of our approach, we present fine-grained and coarse-grained implementations for multidimensional queries.
We propose a mechanism to provide the benefits of a software-managed memory hierarchy on top of a hierarchy of hardware-managed caches. A virtual local store (VLS) is mapped into the virtual address space of a process and backed by physical main memory, but is stored in a partition of the hardware-managed cache when active. This reduces context switch cost, and allows VLSs to migrate with their process thread. The partition allocated to the VLS can be rapidly reconfigured without flushing the cache, allowing programmers to selectively use VLS in a library routine with low overhead.
Color space conversion, or color correction, is a widely used technique for adapting the color characteristics of video material to the display technology employed (e.g., CRT, LCD, projection) or for creating a certain artistic look. Because color correction is often an interactive task and colorists need a direct response, state-of-the-art real-time color correction systems for video have so far been based on expensive dedicated hardware. This work demonstrates the feasibility of replacing dedicated color correction systems with general-purpose GPUs. It is shown that a single Tesla C2050 GPU supports real-time color correction up to a resolution of 4096x2048 pixels.
Digital Holographic Microscopy (DHM) is based on the classical holographic principle invented by the Hungarian physicist Dennis Gabor. The holographic images are acquired by a CCD camera, and depth slices can be reconstructed using the Fourier transform. The numerical reconstruction and further image processing for object detection are done using General Purpose Graphics Processing Units (GPGPUs).