GTC ON-DEMAND

Programming a Hybrid CPU-GPU Cluster using Unicorn
Abstract:
We present Unicorn, a novel parallel programming model for GPU clusters. Unicorn shows that distributed shared memory systems can be made efficient with transactional semantics and deferred data synchronization, so the simplicity of distributed shared memory carries over to the CPUs and GPUs of a cluster. Unicorn is designed for easy programmability and provides a deterministic execution environment. Device, node, and cluster management are handled entirely by the runtime; no related API is exposed to the application programmer. Load balancing, scheduling, and scalability are likewise fully transparent to application code. Programs written on one cluster run verbatim on a different cluster, and application code is agnostic to data placement within the cluster as well as to changes in network interfaces and data availability patterns. Being deterministic, Unicorn's programming model eliminates several classes of data races and deadlocks by design.

Unicorn's runtime employs several data optimizations, including prefetching and subtask streaming, to overlap communication with computation. It pipelines at two levels: first to hide data transfer costs among cluster nodes, and second to hide transfer latency between CPUs and GPUs within each node. Among other optimizations, Unicorn's work-stealing scheduler uses a two-level victim selection technique to reduce the overhead of steal operations, and a proactive, aggressive stealing mechanism to keep these pipelines from stalling during a steal.

We will showcase Unicorn's scalability and performance on several scientific workloads, and demonstrate the load balancing achieved in some of these experiments along with the time the runtime spends in communication. We find that parallelizing coarse-grained applications such as matrix multiplication or 2D FFT with our system requires only about 30 lines of C code to set up the runtime; the rest of the application code is a regular single-CPU/GPU implementation, which indicates how easily sequential code extends to a parallel environment. We will show the efficiency of our abstraction, with minimal loss of performance, on recent GPU architectures such as Pascal and Volta, and compare our approach to similar systems such as StarPU-MPI and G-Charm.
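
To illustrate the subtask pattern the abstract describes, here is a minimal, self-contained sketch applied to blocked matrix multiplication. All unicorn_* names and the callback structure are hypothetical stand-ins, not the actual Unicorn API; the stub "runtime" below runs subtasks serially, where Unicorn would distribute them across the CPUs and GPUs of a cluster and handle placement, stealing, and synchronization transparently.

    /* Hypothetical sketch: the unicorn_* name is illustrative only, not
     * the real Unicorn API. One subtask computes one BLOCK x BLOCK tile
     * of C = A * B with a regular sequential kernel; the stub runtime
     * loops over subtasks, where Unicorn would schedule them across the
     * cluster with work stealing. */
    #include <stdio.h>

    #define N     512                 /* matrix dimension */
    #define BLOCK 128                 /* tile computed by one subtask */
    #define TILES (N / BLOCK)

    static float A[N * N], B[N * N], C[N * N];

    /* Per-subtask kernel: a plain single-CPU GEMM on one output tile,
     * i.e. the "regular single CPU/GPU implementation" that makes up
     * the bulk of a Unicorn application. */
    static void subtask(int tr, int tc)
    {
        for (int i = tr * BLOCK; i < (tr + 1) * BLOCK; i++)
            for (int j = tc * BLOCK; j < (tc + 1) * BLOCK; j++) {
                float acc = 0.0f;
                for (int k = 0; k < N; k++)
                    acc += A[i * N + k] * B[k * N + j];
                C[i * N + j] = acc;
            }
    }

    /* Hypothetical runtime entry point. In Unicorn, device, node, and
     * cluster management live behind a call like this; here we iterate
     * serially to keep the sketch runnable on its own. */
    static void unicorn_run(void (*task)(int, int))
    {
        for (int tr = 0; tr < TILES; tr++)
            for (int tc = 0; tc < TILES; tc++)
                task(tr, tc);
    }

    int main(void)
    {
        for (int i = 0; i < N * N; i++) { A[i] = 1.0f; B[i] = 2.0f; }
        unicorn_run(subtask);
        printf("C[0][0] = %.1f (expected %.1f)\n", C[0], 2.0f * N);
        return 0;
    }

Note the division of labor this sketch assumes: only the short driver touches the (hypothetical) runtime, consistent with the abstract's claim that the remaining code is an ordinary sequential implementation.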
 
Topics: HPC and AI, HPC and Supercomputing
Type: Talk
Event: GTC Silicon Valley
Year: 2018
Session ID: S8565