Time synchronous Viterbi search algorithm for automatic speech recognition is implemented using a counter-intuitive single CUDA block approach. Decoding of a single utterance is carried out on a single stream multiprocessor (SM) and multiple utterances are decoded simultaneously using CUDA streams. The single CUDA block approach is shown to be substantially more efficient and enables overlapping of CPU and GPU computation by merging ten thousands of separate CUDA kernel calls for each utterance. The proposed approach has the disadvantage of large GPU global memory requirement because of the simultaneous decoding feature. However, the latest GPU cards with up to 12GB of global memory fulfill this requirement and the full utilization of the GPU card is possible using all available SMs.