We'll discuss training techniques and deep learning architectures for high-precision landmark localization. In the first part of the session, we'll talk about ReCombinator Networks, which aim to maintain pixel-level image information for high-accuracy landmark localization. This model combines coarse-to-fine features, first observing global (coarse) image information and then recombining it with local (fine) information. Using this model, we report state-of-the-art results on three facial landmark datasets. The model can also be used for other tasks that require pixel-level accuracy (for example, image segmentation and image-to-image translation). In the second part, we'll talk about improving landmark localization in a semi-supervised setting, where less labeled data is provided. Specifically, we consider a scenario where few labeled landmarks are given during training, but many weaker labels (for example, facial emotions or hand gestures) that are easier to obtain are provided. We'll describe training techniques and model architectures that can leverage weaker labels to improve landmark localization.
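To make the coarse-to-fine recombination idea concrete, here is a minimal PyTorch sketch rather than the published ReCombinator Network architecture: a full-resolution branch is concatenated with an upsampled coarse branch before per-pixel landmark heatmaps are predicted. All layer sizes and names (RecombinationBlock, feat_ch, n_landmarks) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecombinationBlock(nn.Module):
    """Toy illustration of recombining coarse (global) and fine (local) features.

    Not the published ReCombinator Network; it only sketches the idea of
    upsampling a coarse branch and concatenating it with a full-resolution
    branch so pixel-level information is preserved for landmark heatmaps.
    """

    def __init__(self, in_ch=3, feat_ch=16, n_landmarks=68):
        super().__init__()
        self.fine = nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1)   # full-resolution branch
        self.coarse = nn.Sequential(                                      # downsampled branch
            nn.MaxPool2d(2),
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1),
        )
        self.head = nn.Conv2d(2 * feat_ch, n_landmarks, kernel_size=1)    # per-pixel heatmaps

    def forward(self, x):
        fine = F.relu(self.fine(x))
        coarse = F.relu(self.coarse(x))
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return self.head(torch.cat([fine, coarse_up], dim=1))             # landmark heatmaps

heatmaps = RecombinationBlock()(torch.randn(1, 3, 64, 64))
print(heatmaps.shape)  # torch.Size([1, 68, 64, 64])
```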
We'll discuss how GPUs are playing a central role in advancing Ion Torrent's targeted sequencing workflow, and talk about the S5 DNA sequencer from Ion Torrent, which, with the help of GPUs, is enabling the democratization of the sequencing market and accelerating research in precision medicine at a breathtaking pace. We'll highlight our work in liquid biopsy and non-invasive prenatal testing, and how the breadth of our semiconductor chip technology offerings lets us scale sequencing from small panels to exomes. We'll discuss our analysis pipeline and the latest in algorithm development and acceleration on GPUs, as well as our experiences ranging from the Fermi to the Pascal GPU architectures.
Machine Learning in Precision Medicine: Patient-Specific Treatment Enabled by Quantitative Medical Imaging, Artificial Intelligence, and GPU Efficiency
Attendees will learn about the need for and use of machine learning in today's patient-centered healthcare. The talk will focus on general approaches that require machine learning to obtain image-based quantitative features, reach patient diagnoses, predict disease outcomes, and identify proper precision-treatment strategies. While the presented methods are general in nature, examples from cardiovascular disease management will be used to demonstrate the need for and power of machine learning enabled by the performance advantages of GPU computation.
This talk will overview the fields of Personalised Computational Medicine and In Silico Clinical Trials, which are revolutionizing medicine and medical product development. The talk will introduce these concepts, provide examples of how they can transform healthcare, and explain why artificial intelligence and machine learning are relevant to them. We will also explain the limitations of these approaches and why it is paramount to engage in both phenomenological (data-driven) and mechanistic (principle-driven) modelling. Both areas are in desperate need of better infrastructures, both software and hardware, giving access to computational and storage resources. The talk will be thought-provoking and eye-opening as to the opportunities in this space for researchers and industries alike.
The Role of Data in Achieving Precision and Value in Healthcare
The goal of healthcare is to provide the most effective treatment to every patient in the most efficient way. Data plays a key role in every aspect of this process, from decision support systems that provide a clinician with the right information at the right time, to scheduling algorithms that predict patient flow and schedule accordingly, to analytics that coach and support patients in achieving or maintaining a healthy lifestyle. Achieving the vision of a data-informed healthcare system will require fundamental advances in many areas, including causal inference, inference on complex, high-dimensional, and heterogeneous data, missing data, process modeling, bias reduction, statistical validation, and model adaptation, to name a few. In this talk, I will illustrate some of these challenges through concrete examples within the Malone Center.
Road identification and route prediction in near real time remain challenging problems for many geographic regions, particularly in the case of natural disasters or crisis situations. Existing methods such as manual road labeling or aggregation of mobile GPS track data are currently insufficient in dynamic scenarios. The frequent revisits of satellite imaging constellations may accelerate efforts to rapidly update road networks and optimal path prediction, provided routing information can be extracted from imaging pixels. We'll demonstrate deep learning segmentation methods for identifying road center lines and intersections from satellite imagery, and for inferring networks from these road segments. We'll also explore data quality requirements by comparing open source labels with the high-precision labels created as part of the SpaceNet Roads challenge.
Mixed-precision training of deep neural networks provides tremendous benefits: it requires half the storage and data movement of single-precision values, and, starting with the Volta GPU's Tensor Cores, provides up to 120 TFLOPS of math throughput, an 8x speedup over FP32. In this talk, we first present the considerations and techniques involved in training with reduced precision, including master weights and automatic loss scaling. We then discuss real-world training in mixed precision, with a particular focus on the PyTorch and TensorFlow frameworks.
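As a rough illustration of the two techniques named above, the sketch below keeps an FP32 master copy of the weights and applies a fixed loss scale. It assumes the optimizer was built over the FP32 master parameters and that the model's weights have been cast to FP16; modern frameworks (for example, torch.cuda.amp) automate this recipe.

```python
import torch

def fp16_training_step(model_fp16, master_params_fp32, optimizer, loss_fn, x, y,
                       loss_scale=1024.0):
    """One mixed-precision step with FP32 master weights and static loss scaling.

    Simplified sketch of the general recipe, not any framework's internals.
    `optimizer` must have been constructed over `master_params_fp32`.
    """
    loss = loss_fn(model_fp16(x.half()).float(), y)      # forward in FP16, loss in FP32
    (loss * loss_scale).backward()                       # scale so FP16 grads stay representable

    for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
        p32.grad = p16.grad.float() / loss_scale         # unscale into FP32 master gradients
        p16.grad = None

    optimizer.step()                                     # update the FP32 master weights
    optimizer.zero_grad()

    with torch.no_grad():                                # copy updated masters back to the FP16 model
        for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
            p16.copy_(p32.half())
    return loss.item()
```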
Learn how to use the hidden computation capability of GPU texture units for general-purpose computation. We describe GRASSY, a system for stellar spectral synthesis where the core problem is interpolation between pre-computed intensity values. We map these pre-computed tables to the GPU's texture memory. Interpolation then becomes a texture lookup where the hardware automatically performs the interpolation, albeit at very low precision. Our mathematical framework reasons about the impact of this precision, and our performance results show 500X speedups. This work generalizes the GPU texture units as computation engines and opens up new problems for GPU acceleration.
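The effect of low-precision hardware interpolation can be mimicked on the CPU by quantizing the interpolation weight. The NumPy sketch below uses an arbitrary 8 fractional bits, not the exact texture-unit format, and only illustrates the kind of error the framework has to reason about.

```python
import numpy as np

def texture_style_lerp(table, x, frac_bits=8):
    """Linear interpolation with the fractional weight quantized to `frac_bits` bits,
    loosely mimicking the low-precision weights of GPU texture filtering hardware.
    Illustrative only; the real hardware format is not reproduced here."""
    i = np.floor(x).astype(int)
    f = x - i
    f_q = np.round(f * (1 << frac_bits)) / (1 << frac_bits)   # quantized fractional weight
    return (1.0 - f_q) * table[i] + f_q * table[i + 1]

table = np.sin(np.linspace(0, np.pi, 256))        # stand-in for a pre-computed intensity table
x = np.array([10.37, 100.51, 200.93])
exact = np.interp(x, np.arange(256), table)
approx = texture_style_lerp(table, x)
print(np.max(np.abs(exact - approx)))             # error introduced by the low-precision weight
```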
We'll present a summary of ongoing work that targets the use of newer GPU architecture (Pascal and Volta) features in real-time signal processing applications in radio astronomy telescopes, and outline the future growth path for this exciting new application of GPUs. For the Pascal and Volta architectures, we'll discuss the advantage of using higher memory bandwidth, half and single precision, and integer arithmetic in the existing GPU-based correlator pipeline code. This is an ongoing effort between the National Centre for Radio Astrophysics and NVIDIA. We'll look at the various processing stages in the pipeline to explore optimization possibilities, and highlight interesting results that were achieved. We'll address in detail the effect of using half precision with respect to accuracy, performance, and required library changes.
Epistasis is the interaction of two or more genes in coding for a biological property. Epistasis is believed to be an important factor in an individual's susceptibility to disease, and the search for epistasis is a major component in the development of personalized approaches to genomic medicine. Statistical tests for epistasis are typically confounded by the multiple-testing problem, that is, the aggregated loss of precision incurred through repeated hypothesis testing. One way to circumvent this problem is to simulate a false-discovery rate via resampling. We report success in using GPUs to accelerate these highly compute-intensive resampling techniques.
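A generic sketch of this kind of resampling-based significance testing is shown below; stat_fn, the permutation count, and the synthetic data are placeholders, and a real GPU implementation would evaluate the permutations in parallel rather than in a Python loop.

```python
import numpy as np

def permutation_pvalue(stat_fn, genotype_a, genotype_b, phenotype, n_perm=10000, rng=None):
    """Estimate the significance of an epistasis statistic by permutation resampling.

    Shuffling the phenotype breaks the genotype-phenotype association and yields an
    empirical null distribution, sidestepping analytic multiple-testing corrections.
    Illustrative sketch only; it is not the algorithm from the talk.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    observed = stat_fn(genotype_a, genotype_b, phenotype)
    null = np.array([stat_fn(genotype_a, genotype_b, rng.permutation(phenotype))
                     for _ in range(n_perm)])
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

# usage with a toy interaction statistic on synthetic data
rng = np.random.default_rng(1)
ga, gb = rng.integers(0, 3, 500), rng.integers(0, 3, 500)      # genotypes coded 0/1/2
pheno = 0.3 * (ga * gb) + rng.standard_normal(500)             # synthetic interaction effect
interaction_stat = lambda a, b, y: abs(np.corrcoef(a * b, y)[0, 1])
print(permutation_pvalue(interaction_stat, ga, gb, pheno, n_perm=2000))
```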
Numerical weather prediction is one of the major applications in high-performance computing and demands fast, high-precision simulation over fine-grained grids. In order to drastically shorten the runtime of a weather prediction code, we have rewritten the entire code from scratch in CUDA for GPU computing. The code, ASUCA, is a high-resolution mesoscale atmosphere model being developed by the Japan Meteorological Agency (JMA) for its next-generation weather forecasting service. A benchmark on 3,996 GPUs of TSUBAME 2.0 achieves extremely high performance of 145 TFLOPS in single precision for a 14,368 × 14,284 × 48 mesh. With the initial data and boundary conditions currently used in the JMA weather forecast, we have carried out a run with a 500 m horizontal resolution on a 4,792 × 4,696 × 48 mesh covering the whole of Japan, using 437 GPUs.
The AMBER molecular dynamics (MD) package is one of the fastest MD packages on commodity hardware and was one of the first widely used packages to exploit GPUs. We'll discuss the history of AMBER on NVIDIA GPUs and then highlight some of the newest advances in MD simulation that feature in the latest version 16 of AMBER. This includes extremely high-throughput thermodynamic integration free energy methods, explicit solvent constant pH simulations, advanced umbrella sampling restraints, multi-dimensional replica exchange methods, and asymmetric boundary conditions. We'll also discuss the development and validation of our latest precision model, SPXP, which is focused on maximizing the performance achievable from Maxwell-generation hardware without sacrificing accuracy.
We'll focus on one of the three pilots of the DOE and NCI partnership on precision oncology and the Cancer Moonshot, namely predicting tumor cell response to drug treatments with deep learning. Predicting tumor cell response to drug treatments is a critical challenge for fulfilling the promise of precision medicine in oncology. As part of a joint project between the DOE and NCI to develop advanced computing solutions for cancer, we are developing a deep learning-based framework for modeling tumor-drug interactions and predicting dose response in pre-clinical screening.
Learn about optimizations that significantly improve the performance of our CUDA conjugate gradient linear solver developed for OpenFOAM, a popular open-source CFD software toolbox. We describe the challenges in porting iterative solvers to CUDA, such as the overhead of data structure conversion and the need for a fast GPU preconditioner, and our approaches to tackling them. We explain our optimizations, such as reusing the preconditioner from previous time steps and always storing the preconditioner in low precision, and their impact on performance. Finally, we show how our implementation handles solving in parallel when the number of MPI processes per node exceeds the number of GPUs.
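As a sketch of the low-precision preconditioner idea (not the actual OpenFOAM solver code), the following preconditioned conjugate gradient keeps a simple Jacobi preconditioner in FP32 while the solve itself runs in FP64.

```python
import numpy as np

def pcg(A, b, M_inv_diag_fp32, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradient with a diagonal (Jacobi) preconditioner
    stored in single precision; illustrative of the talk's idea, not its code."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag_fp32.astype(np.float64) * r      # apply FP32 preconditioner to FP64 residual
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_inv_diag_fp32.astype(np.float64) * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# usage on a small symmetric positive definite system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
M_inv = (1.0 / np.diag(A)).astype(np.float32)       # preconditioner kept in single precision
print(pcg(A, b, M_inv))
```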
Learn how one of the leading institutes for global weather predictions, the European Centre for Medium-Range Weather Forecasts (ECMWF), is preparing for exascale supercomputing and the efficient use of future HPC computing hardware. I will name the main reasons why it is difficult to design efficient weather and climate models and provide an overview on the ongoing community effort to achieve the best possible model performance on existing and future HPC architectures. I will present the EU H2020 projects ESCAPE and ESiWACE and discuss recent approaches to increase computing performance in weather and climate modelling such as the use of reduced numerical precision and deep learning.
The fast multipole method (FMM) is a widely used numerical algorithm in computational engineering. Accelerating the FMM on CUDA-enabled GPUs is challenging because the FMM has a complicated data access pattern, mostly during the so-called multipole-to-local (M2L) operation. We have created several schemes to optimize the M2L and have attained a performance of over 350 (resp. 160) Gflop/s for single (double) precision arithmetic. The optimal algorithm was incorporated into a complete FMM code, which can accept any smooth kernel as specified by the user, making it very flexible. We have also developed a highly efficient CPU version.
Discover how data from experiments at heavy-ion colliders (the Relativistic Heavy Ion Collider at Brookhaven National Lab and the Large Hadron Collider at CERN) can immediately be compared with first-principles simulations of Quantum Chromodynamics (QCD) to quantitatively probe the fundamental properties of strongly interacting matter, i.e., quarks and gluons at high temperature. The conditions realized in the experiments governed the early evolution of the universe. The necessary high precision for these comparisons is obtained by performing our calculations entirely on the GPU. In doing so we simultaneously face a low flop/byte ratio and high register pressure. See how we deal with these complications and achieve high performance on the Bielefeld GPU cluster with 400 Fermi GPUs.
Learn how GPUs can be used to accelerate our understanding of sub-atomic physics. In this work, we deploy GPUs to probe the structure of the nucleus using lattice quantum chromodynamics (LQCD). While LQCD is the only known method that provides a non-perturbative study of quarks (the particles that make up the nucleus), it requires extremely powerful computational resources to achieve high precision. We accelerate the CPS application (Columbia Physics System, developed by Columbia University, Brookhaven National Laboratory, and UKQCD) using the QUDA (QCD on CUDA) library. QUDA is a low-level library built on the CUDA platform, designed to accelerate the algorithms that form the basis of LQCD applications. The conjugate gradient (CG) algorithm used to solve quark propagators with the 5-d domain-wall Dirac operator is one of the most time-consuming parts of LQCD calculations, and it is this algorithm that this work focuses on. Running on the Kepler-enabled K20 GPU cluster at Thomas Jefferson National Lab, we demonstrate sustained teraflops-scale CG performance with fewer than 10 GPUs. Furthermore, we have developed an alternative 4-d preconditioner for the domain-wall Dirac operator to implement a more efficient version of the operator (called Mobius). Lastly, we are exploring the use of eigenvectors, computed with the Lanczos algorithm, to further accelerate CG convergence.
This session will present a method for augmenting shaking and shot videos, an extension of VScreen, a tool that modifies a region of any video with another image or video in real time. The technique was initially introduced for the photo augmentation task, then extended to video augmentation, and finally applied to shaking and shot videos. Moving objects in the foreground (Fg) may occlude the augmented region in the background (Bg), so we use a procedure for Fg/Bg video segmentation implemented on NVIDIA video cards to meet the real-time requirement. Finally, we will show a quantitative evaluation in which we compare the precision and runtime of our binary segmentation method (QMMF) against the Graph Cut method (available in the NPP library).
Proving that such a complex system as an autonomous car is safe cannot be done using existing standards. A new method needs to be invented that is much more data driven and probability based. Traditional redundant solutions don't apply when trying to optimize a Precision-Recall curve. Getting acceptance from the regulatory bodies and the public will be much easier if the industry converges on what this new method shall be.
This presentation will provide an overview of Blue River Technology's use of GPUs in developing their See and Spray technology for Precision Agriculture. We will motivate the use of Deep Learning in detection and classification of crops and weeds in production environments, and highlight the ways in which NVIDIA GPUs have provided the tools and platform for training powerful models. NVIDIA GPUs have also helped us perform real-time inference on working machines in the field. This talk will show how these systems perform and provide videos of the machines in operation.
We'll discuss the GPU-accelerated Monte Carlo compute at JP Morgan, which was architected for C1060 cards and revamped a few times as new architectures were released. The key features of the code are exclusive use of double precision, data caching, and a code structure in which a significant amount of CPU pre-compute is followed by running multiple GPU kernels. On the latest devices, memory per flop is a throughput-limiting factor for a class of our GPU-accelerated models. As the byte/flop ratio continues to fall from one generation of GPU to the next, we are exploring ways to re-architect the Monte Carlo simulation code to decrease memory requirements and improve the TCO of the GPU-enabled compute. Obvious next steps are to store less, re-calculate more, and use unified memory.
Employing deep learning (DL), especially deep neural networks, for high-performance radiological and medical image computing is the main focus of this talk. We'll present the motivation, technical details, and quantitative results of our recent work at NIH on three core problems: 1) improving computer-aided detection (CAD) using convolutional neural networks and decompositional image representations; 2) robust bottom-up multi-level deep convolutional networks for automated organ segmentation; 3) text/image deep mining on a large-scale radiology image database for automated image interpretation. We validate some very promising observations of using DL both to significantly improve upon traditional CAD tasks in (1) and to enable exciting new research directions in (2, 3). This presentation is based on 11 recent papers published in MICCAI/CVPR/TMI/JMLR and three filed patents. We expect these methods to have positive impacts on both preventative and precision medicine.
We'll describe new algorithms used to train very deep networks with half-precision floats. Float16 has two major potential benefits: better training speed and reduced memory footprint. But float16 has a very narrow numerical range (roughly 0.00006 to 65504 for normal values). This narrow range can result in overflow (the "inf/nan" problem) or underflow ("vanishing gradients") during training of deep networks. We'll describe the new scaling algorithm, implemented in nvcaffe, which prevents these negative effects. With this algorithm, we successfully trained networks such as AlexNet, GoogLeNet, Inception_v3, and ResNets without any loss in accuracy. Other contributors to this work are S. Nikolaev, M. Houston, A. Kiswani, A. Gholaminejad, S. Migacz, H. Wu, A. Fit-Florea, and U. Kapasi.
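A tiny numerical illustration of why the narrow float16 range matters, and why scaling helps, is shown below; the values and the scale factor are illustrative and do not reflect nvcaffe internals.

```python
import numpy as np

scale = 1024.0

tiny_grad = np.float32(1.0e-8)                 # below float16's smallest subnormal (~6e-8)
print(np.float16(tiny_grad))                   # 0.0 -> the underflow / "vanishing gradient" side

scaled = np.float16(tiny_grad * scale)         # scaling the loss scales every gradient too
print(np.float32(scaled) / scale)              # ~1.0e-8 recovered after unscaling in FP32

huge_grad = np.float32(1.0e5)
print(np.float16(huge_grad))                   # inf -> the overflow ("inf/nan") side of the range
```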
We'll present new techniques for training machine learning models using low-precision computation and communication. We'll start by briefly outlining new theoretical results proving that, surprisingly, many fundamental machine learning tools, such as dense generalized linear models, can be trained end-to-end (samples, model, and gradients) using low precision (as little as one bit per value), while still guaranteeing convergence. We'll then explore the implications of these techniques with respect to two key practical applications: multi-GPU training of deep neural networks, and compressed sensing for medical and astronomical data.
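One-bit compression of gradients can be sketched as sign-plus-magnitude quantization with error feedback, as below; this is a generic illustration of the idea, not the specific algorithms or convergence analysis discussed in the talk.

```python
import numpy as np

def one_bit_quantize(grad, error_feedback):
    """Compress a gradient to one bit per value (its sign) plus a shared magnitude.

    The residual is fed back into the next step so quantization error does not
    accumulate. Illustrative sketch only; real algorithms differ in scaling and
    convergence conditions.
    """
    compensated = grad + error_feedback
    magnitude = np.mean(np.abs(compensated))
    quantized = magnitude * np.sign(compensated)     # 1 bit/value + one float per tensor
    new_error = compensated - quantized              # carried into the next step
    return quantized, new_error

g = np.random.randn(8).astype(np.float32)
q, err = one_bit_quantize(g, np.zeros_like(g))
print(g, q, sep="\n")
```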
Deep learning tools present a tremendous opportunity to improve healthcare. By increasing the efficiency and accuracy of diagnostic testing, and extracting meaning from vast troves of clinical data, deep learning provides a pathway to true precision care. However, there are challenges in the translation of this technology to the clinic: model performance, infrastructure development, data privacy, hospital policy, and vendor relationships are all critical components of this effort. We'll discuss the early experience of the MGH & BWH Center for Clinical Data Science in supporting the translation of deep learning technologies in medicine, touching upon many of the existing and emerging technical, clinical, and cultural challenges that this work presents.
We'll describe training of very deep networks with mixed-precision float ("float16") using Volta Tensor Cores. Float16 has two major potential benefits: higher training speed and reduced memory footprint. But float16 has a smaller numerical range than regular single-precision float, which can result in overflow or underflow ("vanishing gradients") during training. We'll describe a simple rescaling mechanism that solves these potential issues. With this rescaling algorithm, we successfully used mixed-precision training for networks such as AlexNet, GoogLeNet, Inception_v3, and ResNets without any loss in accuracy. Other contributors to this work are S. Nikolaev, M. Houston, A. Kiswani, A. Gholaminejad, S. Migacz, H. Wu, A. Fit-Florea, and U. Kapasi.
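One possible shape for such a rescaling mechanism is dynamic loss scaling, sketched below: halve the scale and skip the update when a gradient overflows, and grow the scale after a run of clean steps. This is an illustrative reimplementation, not the authors' code; torch.cuda.amp.GradScaler provides the production equivalent in PyTorch.

```python
import torch

class DynamicLossScaler:
    """Sketch of dynamic loss scaling: if any gradient overflows to inf/nan, skip the
    update and halve the scale; after `growth_interval` clean steps, double it."""

    def __init__(self, scale=2.0 ** 15, growth_interval=2000):
        self.scale, self.growth_interval, self.good_steps = scale, growth_interval, 0

    def step(self, loss, optimizer, params):
        (loss * self.scale).backward()
        grads = [p.grad for p in params if p.grad is not None]
        if any(not torch.isfinite(g).all() for g in grads):
            optimizer.zero_grad()                 # discard gradients, skip this update
            self.scale *= 0.5
            self.good_steps = 0
            return False
        for g in grads:
            g.div_(self.scale)                    # unscale before the weight update
        optimizer.step()
        optimizer.zero_grad()
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= 2.0
            self.good_steps = 0
        return True
```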
OpenSeq2Seq is an open-source, TensorFlow-based toolkit that supports a wide range of off-the-shelf models for natural language translation (GNMT, Transformer, ConvS2S), speech recognition (Wave2Letter, DeepSpeech2), speech synthesis (Tacotron 2), language modeling, and transfer learning for NLP tasks. OpenSeq2Seq is optimized for the latest GPUs and supports multi-GPU and mixed-precision training. Benchmarks on machine translation and speech recognition tasks show that models built using OpenSeq2Seq deliver state-of-the-art results with 1.5-3x faster training.
Many algorithms used in computational physics can be greatly accelerated by the use of GPUs. However, the full double-precision floating point operations to which scientists are accustomed can prove costly, especially in compute-intensive applications where floating-point computations rather than memory bandwidth limit performance. In fact, many scientific problems actually require double precision in only a small subset of the code. In these cases, the development of mixed-precision algorithms can bring substantial improvements without sacrificing overall accuracy. Double precision is used only where it is required; the remaining calculations are carried out in single precision. Although the task of identifying "necessary" double-precision code may be non-trivial, the performance payoff can be considerable. This approach also facilitates software emulation of double precision in key portions of the code, which can be effective in accelerating GPU double-precision performance. In practice, the emulation is neither complete nor IEEE-compliant, and is cumbersome to code without support at the compiler or processor levels, but impressive improvements in speed have been obtained. This roundtable will discuss progress made so far in mixed-precision calculations and emulation techniques for GPUs, and will consider the prospects for future development of these approaches.
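A textbook example of this pattern is iterative refinement for linear systems, where the bulk of the work is done in single precision and only the residual correction is accumulated in double precision. The NumPy sketch below is illustrative and is not drawn from any specific GPU code discussed in the roundtable.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Iterative refinement: solve in single precision, accumulate the residual
    correction in double precision. Illustrative mixed-precision sketch only."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)   # cheap FP32 solve
    for _ in range(iters):
        r = b - A @ x                                                    # residual in FP64
        dx = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += dx                                                          # correct in FP64
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)) + 200 * np.eye(200)   # well-conditioned test matrix
b = rng.standard_normal(200)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))                           # residual near double-precision level
```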
As a result of continuing improvements, NVIDIA offers GPU-accelerated floating-point performance in compliance with IEEE 754. It is our experience that a number of issues related to floating-point accuracy and compliance are a frequent source of confusion on both CPUs and GPUs. The purpose of this talk is to discuss the most common ones related to NVIDIA GPUs and to supplement the documentation in the CUDA C Programming Guide.
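Two of the most common sources of that confusion, decimal literals that are not exactly representable and the non-associativity of floating-point addition, can be reproduced in a few lines (NumPy is used here purely for fixed-width float32 types):

```python
import numpy as np

a = np.float32(0.1)
print(f"{a:.10f}")                 # 0.1 is not exactly representable in binary floating point

x = np.float32(1.0e8)
y = np.float32(-1.0e8)
z = np.float32(1.0)
print((x + y) + z, x + (y + z))    # 1.0 vs 0.0: summation order changes the result,
                                   # which is why CPU and GPU reductions can differ
```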
Learn how source-code optimizations can work alone and in combination to improve not only the performance but also the energy consumption and power draw of a modern compute GPU. In addition, understand how lowering the GPU frequency, enabling ECC, and switching from single to double precision affects runtime, energy, and power.
The Mantevo performance project is a collection of self-contained proxy applications that illustrate the main performance characteristics of important algorithms. miniFE is intended to be an approximation of an unstructured implicit finite element or finite volume application. Our work investigated algorithms for assembling a matrix on the GPU. Parallelization schemes using both 1 thread and 8 threads per element were investigated. Using these approaches, we achieved a significant speedup (over 60x in double precision) compared to the serial algorithm.
This presentation will explore an efficient GPU implementation of direct numerical simulation of turbulent viscous incompressible fluid in a 3D domain. We will discuss solving the full system of Navier-Stokes and energy equations using the Alternating Direction Implicit (ADI) numerical method, as well as implementation details of a fast tridiagonal matrix solver in CUDA. Finally, we will compare the performance of GPU and CPU on a particular modeling problem in which the GPU outperforms the latest multicore CPUs by an order of magnitude in double precision on the whole solver.
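The serial building block behind an ADI sweep is a tridiagonal solve along each grid line; a plain-Python Thomas algorithm is sketched below for reference. The talk's CUDA solver instead processes many such independent systems in parallel, which this sketch does not attempt.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Thomas algorithm for a tridiagonal system with sub-, main-, and super-diagonals
    a, b, c and right-hand side d. Serial reference implementation only."""
    n = len(d)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# usage on a small second-difference system
n = 5
a = np.full(n, -1.0); a[0] = 0.0                # sub-diagonal (a[0] unused)
c = np.full(n, -1.0); c[-1] = 0.0               # super-diagonal (c[-1] unused)
b = np.full(n, 2.0)
d = np.ones(n)
print(thomas_solve(a, b, c, d))
```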
This session will present the Buffer Management Library, a collection of C++ templates to simplify and improve data transfers between a GPU and a host computer. The library provides multiple known algorithms for data transfers, including chunked, double, and pooled buffers. In addition, the library allows data transformations to be performed concurrently with the transfers, simplifying the conversion of data formats between host and GPU, as is common for double-to-single precision conversions as well as AoS and SoA arrangements. Using the library significantly reduces the amount of pinned memory required for transfers and removes the limitation of having huge buffers locked for use by the GPU. Overlapping data transformations with data transfers can result in more than 20% performance improvement over separate convert and copy operations.
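The kinds of host-side transformations mentioned above can be sketched with NumPy as follows; this only illustrates double-to-single and AoS-to-SoA conversion, not the library's actual C++ templates or its overlap of conversion with transfers.

```python
import numpy as np

# Array-of-structures layout in double precision, as it might arrive from host code.
aos = np.zeros(4, dtype=[("x", np.float64), ("y", np.float64), ("z", np.float64)])
aos["x"] = [0, 1, 2, 3]

# Structure-of-arrays layout in single precision, the form a GPU kernel often prefers.
soa = {name: aos[name].astype(np.float32) for name in aos.dtype.names}
print(soa["x"].dtype, soa["x"])

# In the Buffer Management Library such conversions would run on chunks of a pinned
# staging buffer while earlier chunks are already in flight to the GPU; here they are
# shown only as plain host code.
```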
Learn how to write high-performance yet precise GPU programs by understanding the potential floating-point imprecision your programs could have. Floating-point accuracy is often a neglected issue in GPU program development, but it plays a critical role in assuring reliability. In this session, we will describe how performance tuning techniques, such as changing the floating-point type or changing the underlying algorithm, affect floating-point precision. Furthermore, see how to estimate floating-point imprecision, and how our novel affine-arithmetic-based and control-flow-sensitive analysis can help programmers make informed decisions regarding the performance/precision trade-off in their programs. These concepts will be illustrated with real examples from public GPU benchmark sets.
Learn how updates to the CUDA toolkit improve the performance of GPU-accelerated applications. Through benchmark results, we will review the impact of new libraries, updates to memory management and mixed precision programming. The session will cover performance of CUDA toolkit components including libraries and the compiler.
We adopted GPUs to accelerate large-scale, parallel finite-difference (FDTD) simulation of seismic wave propagation. An effective parallel implementation is needed because the memory of a single GPU is too small for real applications. We therefore describe the memory optimization, the three-dimensional domain decomposition, and the overlapping of communication and computation adopted in our program. So far we have achieved high performance of about 61 TFLOPS in single precision using 1,200 GPUs of TSUBAME 2.0, the GPU supercomputer at the Tokyo Institute of Technology, Japan. As an important application, we show results from the simulation of the 2011 Tohoku-Oki mega-quake.
NVIDIA TensorRT™ is a high-performance neural network inference engine for production deployment of deep learning applications. TensorRT can be used to rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data centers, embedded, or automotive product platforms. Developers can use TensorRT to deliver fast inference using INT8 or FP16 optimized precision that significantly reduces latency, as demanded by real-time services such as streaming video categorization in the cloud or object detection and segmentation on embedded and automotive platforms. With TensorRT, developers can focus on developing novel AI-powered applications rather than performance tuning for inference deployment. The TensorRT runtime ensures optimal inference performance that can meet the needs of even the most demanding throughput requirements.
This talk focuses on advances in deep learning applied to precision medicine and, especially, on "deep patient", a general-purpose patient representation derived from the electronic health records (EHRs) that facilitates clinical predictive modeling. Precision medicine raises big challenges in dealing with large and massive data from heterogeneous sources, such as EHRs, genomics, and wearables. Deep learning provides a unique opportunity to retrieve information from these complex and heterogeneous sources. Here, in particular, we show how a deep architecture was able to process aggregated EHRs from the Mount Sinai Health System data warehouse to derive domain-free patient representations that can improve automatic medical predictions given the patient clinical status.
Employing deep learning (DL), especially deep neural networks, for high-performance radiological image computing is the main focus. We'll present the motivation, method details, and quantitative results for three core problems: 1) improving computer-aided detection (CAD) using convolutional neural networks and decompositional image representations; 2) integrated bottom-up deep convolutional networks for robust automated organ segmentation; 3) text/image deep mining on a large-scale radiology image database for automated interpretation. We validate some very promising observations of using DL both to significantly improve upon CAD tasks in (1) and to enable exciting new research directions in (2, 3). We will discuss their positive impacts on both preventative and precision medicine.
TSUBAME 2.0 has been in successful production for the last two years, producing numerous research results and accolades. With a possible upgrade of the GPUs to Kepler 2s, it will have the capability to surpass 10-petaflops-class supercomputers in single-precision applications, without any increase in its average power consumption of 1 MW.
TSUBAME 2.5 succeeded TSUBAME 2.0 by upgrading all 4,224 Tesla M2050 GPUs to Kepler K20X GPUs, achieving 5.76 / 17.1 petaflops peak in double / single precision respectively, the latter being the fastest in Japan. By overcoming several technical challenges, TSUBAME 2.5 exhibits a 2-3x speedup and multi-petaflops performance for many applications, leading to TSUBAME 3.0 in 2015-16.
We will present early results from IBM Power8 systems equipped with NVLink connected NVIDIA P100 GPUs. We will show comparative results with previous NVIDIA GPU generations for a set of synthetic and application benchmarks, highlighting in particular the advances in the memory subsystem of P100. The talk will in particular demonstrate the impact of the new double precision atomic add capabilities, and will discuss some early exploration of the behavior of NVLink between the Power8 CPUs and the P100 GPUs.
In this talk we will describe our recently funded effort to create the CANcer Distributed Learning Environment (CANDLE toolkit) to address one of the challenges identified in the presidential "Precision Medicine Initiative" (PMI). The DOE laboratories in this project are drawing on their strengths in HPC, machine learning, and data analytics, and coupling those to the domain strengths of the NCI, particularly in cancer biology and cancer healthcare delivery, to bring the full promise of exascale computing to the problem of cancer and precision medicine. This project will focus on three driver cancer problems: the RAS protein pathway, drug response, and treatment strategies. We will provide a highlight of these problems, as well as a roadmap for the project's intended research and development efforts.
Come and learn about how we're using GPUs to enable advances in a wide variety of healthcare technologies, ranging from stem cell classification for regenerative medicine, to modeling of the optical forces applied by lasers onto micro-particles for the assembly of nano-component devices, to understanding traumatic brain injury via the study of patterns in diffusion kurtosis imaging (DKI) brain data. I will conclude with some of our ongoing research in the use of GPUs to create high-precision, next-generation virtual and augmented reality environments for surgery, medical training, and telemedicine.
The Cancer Moonshot was established in 2016 with the goal to double the rate of progress in cancer research -- to do in five years what normally would take 10. A major area for the acceleration of progress is the strategy to use modeling, simulation, and machine learning to advance our understanding of cancer biology and to integrate what is known into predictive models that can inform research and guide therapeutic developments. In 2015, the U.S. Department of Energy formed a collaboration with the National Cancer Institute for the joint development of advanced computing solutions for cancer.
Learn how to develop an Artificial Intelligence system to localize and recognize food on trays in order to generate a purchase ticket during the checkout process.
(1) Solving a real business problem using advanced Deep Learning technology based on object detection and localization.
(2) Combining a pipeline of models to improve accuracy and precision while maintaining reasonable recall levels.
(3) Discovering how to develop and train a model in the cloud to be deployed embedded on an NVIDIA Jetson TX1 device.
Detecting road users in real time is key to enabling safe autonomous driving applications in crowded urban environments. The talk presents a distributed sensor infrastructure being deployed in the city of Modena (Italy), at the heart of the Italian 'Motor Valley'. Modena's Automotive Smart Area (MASA) connects hundreds of smart cameras, supporting embedded GPU modules for edge-side real-time detection, with higher-performance GPU (fog) nodes at block level and low-latency wireless V2X communication. A distributed deep learning paradigm balances precision and response time to give autonomous vehicles the required sensing support in a densely populated urban environment. The infrastructure will exploit a novel software architecture to help programmers and big data practitioners combine data-in-motion and data-at-rest analysis while providing real-time guarantees. MASA, funded under the European project CLASS, is an open testbench where interested partners may deploy and test next-generation AD applications in a tightly connected setting.
In this talk we will highlight the work we have done to develop what we term the SPXP precision model. This is the first fully single- and fixed-precision hybrid model to provide conservation of energy in MD simulations equivalent to full double-precision runs, but without the need for double-precision arithmetic. By exploiting the nature of fixed-precision arithmetic and custom machine-code accumulator functions, we can effectively emulate double-precision performance on the latest generation of GPUs.
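The core idea, accumulating single-precision contributions into a wide fixed-point integer so that the additions themselves are exact, can be sketched in a few lines; the scale factor and accumulation scheme below are illustrative assumptions, not the actual SPXP design.

```python
import numpy as np

FIXED_SCALE = 2 ** 34                      # illustrative fixed-point scale, not AMBER's actual choice

def fixed_point_sum(values_fp32):
    """Accumulate single-precision values into a 64-bit integer (fixed-point) accumulator.

    Each FP32 contribution is rounded once when converted to fixed point; the integer
    additions that follow are exact and order-independent, avoiding accumulated rounding
    drift without FP64 arithmetic. Illustrative sketch of the general idea only.
    """
    acc = np.int64(0)
    for v in values_fp32:
        acc += np.int64(np.rint(np.float64(v) * FIXED_SCALE))
    return float(acc) / FIXED_SCALE

vals = (np.random.default_rng(0).standard_normal(100000) * 1e-3).astype(np.float32)
print(fixed_point_sum(vals), np.sum(vals.astype(np.float64)))   # compare with an FP64 reference sum
```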
Recent developments in artificial intelligence, advances in GPU computing hardware, and the availability of large-scale medical imaging datasets allow us to learn what the human brain truly looks like from a biological, physiological, anatomical, and pathological point of view. This learning process can be augmented by Electronic Healthcare Record data, cognitive examinations, and diagnostic/radiological report data, thus providing an integrated view of the human interpretation of neurological diseases. This talk will present how AI models can learn from big and unstructured neurological and neuroradiological data and be used as tools for precision medicine, with the aim of translating advanced imaging technologies and biomarkers to clinical practice, streamlining the clinical workflow, and improving the quality of care. It will also explore the technological translational process, which requires full clinical support, deep algorithmic integration into the radiological workflow, and the deployment of a high-throughput, hospital-integrated GPU computational platform.
We believe that medicine will become more precise and affordable. Physicians will integrate relevant patient data and insights at the point of decision for precise diagnostics. Therapy will be tailored to the characteristics of both the patient and the disease, resulting in the right treatment for the right patient at the right time. AI-powered decision support could help to balance the need for personalization when it matters with standardization to reduce unwarranted variations.
An efficient and highly scalable algorithm for molecular dynamics (MD) simulation of solid covalent crystals (using sophisticated many-body potentials) is presented. Its effective memory throughput on a single C2050 GPU board reached 102 GB/s (81% of peak), the instruction throughput reached 412 Ginstr/s (80% of peak), and 27% of the peak flops of a single GPU was obtained. The parallel efficiency of the algorithm can be as high as 95% on all 7,168 GPUs of Tianhe-1A, reaching possibly a record for high-performance MD simulation: 1.87 petaflops in single precision.
The car presents a particular challenge for creators of learning systems -- it is incredibly rich in data and context, its hardware and software environments are heterogeneous and fragmented, and drivers expect incredible precision from its interactions. CloudMade has pioneered an approach to machine learning in the automotive context that leverages the richness of car data, the emerging computational power of the car, and the existing computational power of the cloud to deliver an automotive-grade machine learning toolset. With CloudMade's solutions, automotive OEMs can deliver personalized experiences to customers that together create a self-learning car that anticipates the needs and desires of the user.
Throughout the ages, the diamond's dance of light has captured our fascination. This beauty does not happen by chance but is determined by the design and craft of the diamond's cut. This presentation will provide an overview of how GPUs are accelerating the process of converting rough stones into polished, sparkling diamonds. Technology has brought a huge spike in the number of diamond pieces being processed, from a meager 5 pieces per worker per day to around 150 pieces cleared every day by each worker. The industry challenge is to convert rough stones into diamonds with high precision and quality, minimal wastage, and at the speed of light. The computational visualization techniques where NVIDIA GPUs have helped address these challenges include GPU-based reconstruction, CUDA-based ray tracing, surface rendering of voxel data, density contrast, and OpenGL rendering. In this presentation, attendees will learn how GPU technologies are determining the 4Cs of the diamond industry: Clarity, Color, Cut, and Carat.