This session covers how to optimize OpenCL programs on NVIDIA GPUs, focusing on three areas: memory, execution configuration, and instruction throughput. For memory, we discuss how to increase effective bandwidth by coalescing global memory accesses and by using local memory. We then introduce the concept of occupancy and the considerations that go into choosing a kernel's execution configuration. Finally, we discuss techniques for improving instruction throughput.