We'll discuss the GPU-accelerated Monte Carlo compute at JP Morgan, which was originally architected for C1060 cards and revamped several times as new architectures were released. The key features of the code are exclusive use of double precision, data caching, and a structure in which a significant amount of CPU pre-compute is followed by multiple GPU kernels. On the latest devices, memory per flop is a throughput-limiting factor for a class of our GPU-accelerated models. As the byte/flop ratio continues to fall from one GPU generation to the next, we are exploring ways to re-architect the Monte Carlo simulation code to decrease memory requirements and improve the TCO of GPU-enabled compute. The obvious next steps are to store less, recalculate more, and adopt unified memory.
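As a sketch of the "store less, recalculate more" direction, the CUDA kernel below prices a European call under geometric Brownian motion in double precision, regenerating normals per step from a counter-based Philox stream instead of caching a pre-generated random buffer in device memory. This is an illustrative toy under assumed parameters, not the production model; the function and variable names are hypothetical.

```cuda
// Sketch only: a "store less, recalculate more" Monte Carlo kernel.
// Normals are regenerated per step from a counter-based Philox stream,
// so no pre-generated random buffer is held in device memory.
#include <cstdio>
#include <curand_kernel.h>

__global__ void mc_gbm_call(double S0, double K, double r, double sigma,
                            double T, int nSteps, int nPaths,
                            unsigned long long seed, double* payoffSum)
{
    int path = blockIdx.x * blockDim.x + threadIdx.x;
    if (path >= nPaths) return;

    // Each thread's stream is a pure function of (seed, path):
    // nothing to precompute, nothing to store.
    curandStatePhilox4_32_10_t rng;
    curand_init(seed, path, 0, &rng);

    const double dt    = T / nSteps;
    const double drift = (r - 0.5 * sigma * sigma) * dt;
    const double vol   = sigma * sqrt(dt);

    double S = S0;
    for (int i = 0; i < nSteps; ++i)
        S *= exp(drift + vol * curand_normal_double(&rng));

    // Atomic accumulation keeps the sketch short; production code
    // would use a proper reduction. Double atomicAdd needs sm_60+.
    atomicAdd(payoffSum, exp(-r * T) * fmax(S - K, 0.0));
}

int main()
{
    const int nPaths = 1 << 20, nSteps = 252;
    double hSum = 0.0, *dSum;
    cudaMalloc(&dSum, sizeof(double));
    cudaMemcpy(dSum, &hSum, sizeof(double), cudaMemcpyHostToDevice);

    mc_gbm_call<<<(nPaths + 255) / 256, 256>>>(
        100.0, 100.0, 0.01, 0.2, 1.0, nSteps, nPaths, 1234ULL, dSum);

    cudaMemcpy(&hSum, dSum, sizeof(double), cudaMemcpyDeviceToHost);
    printf("MC call price: %f\n", hSum / nPaths);
    cudaFree(dSum);
    return 0;
}
```

Because the counter-based generator makes each thread's random stream a pure function of (seed, path), the simulation trades stored bytes for extra flops per step, which is exactly the trade a falling byte/flop ratio favors.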
The Pascal generation of GPUs brings increased compute density to data centers, and NVLink on IBM POWER8 CPUs makes that compute density ever more accessible to HPC applications. However, reduced memory-to-compute ratios present unique challenges for the cost of throughput-oriented compute. We'll present a case study of moving production Monte Carlo GPU codes to IBM's "Minsky" S822LC servers with NVIDIA Tesla P100 GPUs.
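Unified memory, mentioned above as an obvious next step, is one place where Pascal changes the calculus: managed allocations can exceed device memory, with pages migrating on demand. A minimal sketch follows, assuming Pascal-class page faulting (compile with -arch=sm_60); the buffer size and kernel are placeholders.

```cuda
// Sketch only: unified memory on Pascal, where managed allocations
// may exceed device memory and pages migrate on demand.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(double* x, size_t n, double a)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    // 2 GiB here; on Pascal the allocation may be sized beyond
    // device memory and is paged in as the kernel touches it.
    const size_t n = 1ull << 28;
    double* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(double));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0;  // first touch on host

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0);
    cudaDeviceSynchronize();  // faults drive host-to-device migration

    printf("x[0] = %f\n", x[0]);  // page migrates back on host access
    cudaFree(x);
    return 0;
}
```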