Less code, more performance! Runtime compilation with NVRTC offers many potential benefits to new and existing codes, but it also presents implementation challenges. To address these challenges, we've developed a small C++ library called "Jitify" that hides the complexities of runtime compilation behind a simple, high-level interface. Jitify takes care of issues like kernel caching, template instantiation, type reflection, and compilation of host code for the device. It also provides a convenient parallel_for function and lambda wrapper that enable dynamic runtime selection of host or device execution. Since source code passed to NVRTC does not require CUDA-specific annotations, porting a large C++ code to CUDA using Jitify can be as simple as replacing a for loop with Jitify's parallel_for construct. We'll present some examples of Jitify in action, demonstrating how it enables better code generation, faster compilation times, and rapid code porting.
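The kernel caching and template instantiation mentioned above can be illustrated with a toy, host-only analogue. This is not Jitify's API and does not use NVRTC: the names KernelCache, KERNEL_TEMPLATE, and axpy are hypothetical, and Python's built-in compile() and exec() stand in for the runtime compiler. It is only a sketch of the idea that each (kernel, type) specialization is compiled once and then reused.

```python
from typing import Callable, Dict, Tuple

# Hypothetical "kernel template": {T} is substituted at run time,
# loosely mimicking C++ template instantiation on an element type.
KERNEL_TEMPLATE = """
def axpy(a, x, y):
    return [{T}(a) * {T}(xi) + {T}(yi) for xi, yi in zip(x, y)]
"""


class KernelCache:
    """Toy cache: compiles each (kernel name, type) pair at most once."""

    def __init__(self, source_template: str) -> None:
        self._source_template = source_template
        self._cache: Dict[Tuple[str, str], Callable] = {}
        self.compile_count = 0  # number of real compilations performed

    def instantiate(self, name: str, type_name: str) -> Callable:
        key = (name, type_name)
        if key not in self._cache:
            # Cache miss: specialize the source text and compile it now.
            source = self._source_template.format(T=type_name)
            namespace: dict = {}
            code = compile(source, f"<{name}<{type_name}>>", "exec")
            exec(code, namespace)
            self._cache[key] = namespace[name]
            self.compile_count += 1
        return self._cache[key]


cache = KernelCache(KERNEL_TEMPLATE)
axpy_float = cache.instantiate("axpy", "float")
axpy_float_again = cache.instantiate("axpy", "float")  # served from the cache
axpy_int = cache.instantiate("axpy", "int")            # new specialization
```

In this sketch the second request for the float specialization costs no compilation, which is the property a runtime-compilation cache is meant to provide.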
Radio astronomy is a real-time signal processing application that requires extreme supercomputing. While today's radio telescopes require 10-100 Tflops of computational power, by the end of the decade this will increase into the Exaflops regime, driven by the Hydrogen Epoch of Reionization Array (HERA) and the Square Kilometer Array (SKA). The most compute-intensive part of this problem is the so-called cross-correlation algorithm, which can be recast as a linear-algebra problem similar in spirit to DGEMM. In this session we describe the cross-correlation engine that powers the pathfinder LEDA radio telescope and has been (re)optimized for the Kepler GK110 architecture to achieve over 2.5 Tflops in sustained performance. This level of efficiency is critical to meeting strict power and space constraints imposed by the instrument's remote location.
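To make the linear-algebra view of cross-correlation concrete, here is a minimal reference sketch in plain Python. It assumes the usual formulation in which every pair of antennas is correlated and accumulated over time, V[i][j] = sum over t of x_i(t) * conj(x_j(t)), which amounts to summing outer products of the per-time-step voltage vector with its own conjugate. The function name and data layout are illustrative assumptions, and this is not the optimized GPU engine described in the session.

```python
from typing import List


def cross_correlate(samples: List[List[complex]]) -> List[List[complex]]:
    """Accumulate the antenna-pair correlation (visibility) matrix.

    samples[t][i] is the complex voltage from antenna i at time step t.
    Returns V with V[i][j] = sum_t samples[t][i] * conj(samples[t][j]).
    """
    n_ant = len(samples[0])
    vis = [[0j] * n_ant for _ in range(n_ant)]
    for x in samples:  # one outer-product update per time step
        for i in range(n_ant):
            for j in range(n_ant):
                vis[i][j] += x[i] * x[j].conjugate()
    return vis


# Two antennas, two time steps (made-up values for illustration).
samples = [
    [1 + 1j, 2 + 0j],
    [0 + 1j, 1 - 1j],
]
vis = cross_correlate(samples)
```

The result is Hermitian (V[j][i] is the conjugate of V[i][j]), so a tuned implementation would typically compute only one triangle of the matrix; this sketch computes the full matrix for clarity.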