GROMACS is a state-of-the-art molecular simulation package that employs extensive multi-level heterogeneous parallelization. Our new CUDA-based algorithms provide a 4x speedup over hand-tuned CPU SIMD assembly, along with unprecedented absolute performance. However, the heterogeneity of the hardware and the inherent bottlenecks involved make efficient resource utilization and strong scaling very challenging. This advanced session describes our recent work on multi-level load balancing, kernel execution strategies, CPU-GPU work splitting, and ways to exploit Kepler features such as Hyper-Q. Join us to talk about the current limits of GPU acceleration in MD, and how to take molecular dynamics simulations to 100 microseconds/iteration, equivalent to 10,000 fps, in the near future!