We'll introduce cuTT, a tensor transpose library for GPUs that on average achieves over 70% of the attainable memory bandwidth, independent of tensor rank. Tensor transposes are important in many applications, including multi-dimensional fast Fourier transforms, deep learning, and quantum chemistry calculations. Until now, no runtime library has fully utilized the remarkable memory bandwidth of GPUs while also performing well independent of tensor rank. We'll describe two transpose algorithms, "Tiled" and "Packed," which achieve high memory bandwidth in most use cases, as well as variations of them that handle many important corner cases. We'll also discuss a heuristic method based on GPU performance modeling that helps cuTT choose the optimal algorithm for a particular use case. Finally, we'll present benchmarks for tensor ranks 2 to 12 and show that cuTT, a fully runtime library, performs as well as an approach based on code generation.
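For readers new to the operation, the following is a minimal CPU sketch of what a tensor transpose computes. It is not cuTT's implementation or API. The function names `permute_reference` and `transpose_tiled_2d`, the row-major layout, and the tile size of 32 are all assumptions made for illustration. The second function only mimics the general idea of tiling for the rank-2 case, where work is grouped into small blocks so that reads and writes both stay close together in memory.

```python
from itertools import product


def permute_reference(src, dims, perm):
    """Reference rank-N transpose on a flat, row-major list (hypothetical helper).

    Output dimension k corresponds to input dimension perm[k].
    Returns the permuted flat list and the output dimensions.
    """
    rank = len(dims)
    # Row-major strides of the input tensor.
    in_strides = [1] * rank
    for i in range(rank - 2, -1, -1):
        in_strides[i] = in_strides[i + 1] * dims[i + 1]
    out_dims = [dims[p] for p in perm]
    out = []
    # Walk output indices in row-major order and gather from the input.
    for out_idx in product(*(range(d) for d in out_dims)):
        offset = sum(out_idx[k] * in_strides[perm[k]] for k in range(rank))
        out.append(src[offset])
    return out, out_dims


def transpose_tiled_2d(src, rows, cols, tile=32):
    """Rank-2 transpose processed tile by tile (CPU illustration only)."""
    out = [None] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # On a GPU, a tile like this is typically staged in shared memory
            # so that both the global read and the global write are contiguous.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c * rows + r] = src[r * cols + c]
    return out


if __name__ == "__main__":
    matrix = list(range(6))  # 2 x 3 matrix stored row-major
    print(permute_reference(matrix, [2, 3], [1, 0]))
    print(transpose_tiled_2d(matrix, 2, 3, tile=2))
```

Both functions should agree on any rank-2 input; the reference version also accepts higher ranks and arbitrary permutations, which is the general case the abstract refers to.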