Learn how to use (multi) GPU and CUDA to speed up the process of stitching very large images (up to TeraBytes in size). Image stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image. Image stitching is widely used in many important fields, like high resolution photo mosaics in digital maps and satellite photos or medical images. Motivated by the need to combine images produced in the study of the brain, we developed and released for free the TeraStitcher tool that we recently enhanced with a CUDA plugin that allows an astonishing speedup of the most computing intensive part of the procedure. The code can be easily adapted to compute different kinds of convolution. We describe how we leverage shuffle operations to guarantee an optimal load balancing among the threads and CUDA streams to hide the overhead of moving back and forth images from the CPU to the GPU when their size exceeds the amount of available memory. The speedup we obtain is such that jobs that took several hours are now completed in a few minutes.