On the path to exascale, high-performance computing is adopting wider and wider processors that need more parallelism. The energy required to move data and the available bandwidth pose significant challenges. See how an efficient implementation of iterative Krylov solvers can help deal with these issues. As an example, we use the block conjugate gradient solver in QUDA, a library for lattice quantum chromodynamics. We demonstrate how an efficient implementation can overcome scaling issues and achieve a 10X speedup compared to a regular conjugate gradient solver.