GTC On-Demand

Abstract:
We will describe a new API that makes more effective use of the GPU when serving many independent single-instance inferences of the same RNN model. Many NLP applications have real-time latency requirements for multiple independent inference instances. Our proposed API accepts independent inference requests from an application and seamlessly combines them into a single large-batch execution. Time steps from independent inference tasks are combined so that we achieve high performance while staying within the application's per-time-step latency budget. We also discuss functionality that lets the user wait on completion of a specific time step, which is possible because our implementation consists mainly of non-blocking function calls. Finally, we'll present performance data from the Turing architecture for an example RNN model with LSTM cells and projections.
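
The batching pattern the abstract describes can be sketched in ordinary C++. This is a minimal, CPU-only illustration and not the API presented in the talk: the names (StreamBatcher, submit, step, wait_for_step) are hypothetical, and the single batched GPU kernel launch is reduced to a comment. It shows the two ideas the abstract highlights: gathering one pending time step from every active request into one batched execution, and letting a caller block on completion of a specific time step while the batching itself stays non-blocking.

// Hypothetical sketch of batching independent RNN inference requests
// across time steps. All names here are illustrative assumptions.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One independent inference request: a sequence of per-time-step inputs
// (a toy scalar per step) plus a count of completed time steps.
struct Stream {
    std::vector<float> inputs;
    size_t next_step = 0;   // next time step awaiting execution
    size_t done_steps = 0;  // time steps already completed
};

class StreamBatcher {
public:
    // Register an independent inference request; returns a stream id.
    size_t submit(std::vector<float> inputs) {
        std::lock_guard<std::mutex> lk(m_);
        Stream s;
        s.inputs = std::move(inputs);
        streams_.push_back(std::move(s));
        return streams_.size() - 1;
    }

    // Execute one batched time step: gather the pending step from every
    // active stream and run them together (stand-in for one batched RNN
    // cell launch on the GPU instead of one launch per request).
    void step() {
        std::lock_guard<std::mutex> lk(m_);
        std::vector<size_t> batch;
        for (size_t i = 0; i < streams_.size(); ++i)
            if (streams_[i].next_step < streams_[i].inputs.size())
                batch.push_back(i);
        // ... here, launch ONE kernel over all entries in `batch` ...
        for (size_t i : batch) {
            ++streams_[i].next_step;
            ++streams_[i].done_steps;
        }
        cv_.notify_all();  // wake any caller waiting on a time step
    }

    // Block until stream `id` has completed time step `t` (0-based).
    void wait_for_step(size_t id, size_t t) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return streams_[id].done_steps > t; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<Stream> streams_;
};

int main() {
    StreamBatcher b;
    size_t a = b.submit({1, 2, 3});  // request with 3 time steps
    b.submit({4, 5});                // independent request, 2 time steps
    std::thread worker([&] { for (int s = 0; s < 3; ++s) b.step(); });
    b.wait_for_step(a, 1);  // returns once time step 1 of request `a` is done
    worker.join();
    std::printf("all batched time steps complete\n");
}

In a real deployment the worker would pace its calls to step() against the per-time-step latency budget, trading a short wait for more requests per batch against the deadline for the requests already gathered.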
 
Topics:
AI Application Deployment and Inference
Type:
Talk
Event:
GTC Silicon Valley
Year:
2019
Session ID:
S9422