Computer vision today still relies on conventional camera capture, invented decades ago to accommodate the human visual system. Yet computers and humans see the world very differently: what works for human vision does not necessarily work for computer vision, and vice versa. This is especially true for computer vision based on deep convolutional networks. We need better vision for computer vision. Using GPUs and deep learning, we're able to reverse the resolution-degrading effects of conventional visual capture, then reconstruct on demand to radically improve the accuracy and processing efficiency of computer vision applications.
This talk explores how DeepStream enables developers to create high-stream-density applications with deep learning and accelerated multimedia image processing, building IVA solutions at scale. Leverage a heterogeneous, concurrent neural network architecture to combine different deep learning techniques for more intelligent insights. The framework makes it easy to create flexible, intuitive graph-based applications, resulting in highly optimized pipelines for maximum throughput.
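To make the graph-based pipeline idea concrete, here is a minimal sketch, assuming a DeepStream installation that provides the nvstreammux/nvinfer/nvdsosd GStreamer elements and Python GObject bindings; the media path and inference config path are placeholders, not part of the talk.

```python
# Minimal graph-based video-inference pipeline in the DeepStream style:
# decode -> batch (nvstreammux) -> inference (nvinfer) -> overlay (nvdsosd).
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    "filesrc location=/path/to/input.h264 ! h264parse ! nvv4l2decoder ! "
    "mux.sink_0 nvstreammux name=mux batch-size=1 width=1280 height=720 ! "
    "nvinfer config-file-path=/path/to/detector_config.txt ! "
    "nvvideoconvert ! nvdsosd ! fakesink"
)

pipeline.set_state(Gst.State.PLAYING)
# Block until end-of-stream or an error is posted on the bus.
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```

Because the pipeline is declared as a graph of named elements, swapping the detector, adding a secondary classifier, or fanning out to more streams is a change to the graph description rather than to application logic.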
Learn how we overcame the odds of certifying computer vision and AI systems in an industry as risk-averse as the air traffic control sector. We use off-the-shelf cameras deployed in an airport environment to provide an out-the-window view of the airfield and create an enriched augmented reality view for better situational awareness, contingency, and redundancy. In this talk, we take you through the steps from developing an AI using NVIDIA frameworks, to deploying a camera system at an airport for air traffic control use as both an imaging system and an AI-based tracking system built on artificial neural networks, all the way through user acceptance tests and certification. This talk is intended as a lessons-learned session for your next project in smart cities or aerospace. The main focus lies on the tools used to develop AI and the tools used to understand and visualize neural networks.
Cities are always looking for new ways to maintain high standards of living, better connect with citizens, and find ways to save money, all while serving growing populations. As city population densities increase and cities strive to increase walkability and mobility for their citizens, they have a big focus on a holistic approach to traffic safety. As part of their efforts to become smarter, more and more cities are turning to the Internet of Things (IoT) and Machine-to-Machine (M2M) technologies to improve municipal services, create additional sources of revenue, and enable city management in new and creative ways.
Learn how to develop an artificial intelligence system to localize and recognize food on trays to generate a purchase ticket in a checkout process; a minimal sketch of such a pipeline follows the list below.
(1) Solving a real business problem using advanced deep learning technology based on object detection and localization.
(2) Combining a pipeline of models to improve accuracy and precision while maintaining reasonable recall.
(3) Discovering how to develop and train a model in the cloud for embedded deployment on an NVIDIA Jetson TX1 device.
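As a rough illustration of the two-stage approach above (not the speakers' actual system), this sketch localizes items with an off-the-shelf torchvision detector and leaves a hook for a food-specific second-stage classifier; the image path and score threshold are assumptions.

```python
# Stage one: localize items on the tray with a generic detector.
# Stage two (stubbed): re-classify each crop with a food-specific model.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# pretrained=True uses older torchvision's weight-loading convention.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

image = Image.open("tray.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    detections = detector([to_tensor(image)])[0]

# Keep confident boxes; each crop would go to the food classifier next.
for box, score in zip(detections["boxes"], detections["scores"]):
    if score > 0.7:
        x0, y0, x1, y1 = box.tolist()
        crop = image.crop((x0, y0, x1, y1))
        # food_label, price = food_classifier(crop)  # second-stage model
```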
Detecting road users in real time is key to enabling safe autonomous driving applications in crowded urban environments. The talk presents a distributed sensor infrastructure being deployed in the city of Modena (Italy), at the heart of the Italian 'Motor Valley'. Modena's Automotive Smart Area (MASA) connects hundreds of smart cameras, supporting embedded GPU modules for edge-side real-time detection, with higher-performance GPU (fog) nodes at block level and low-latency wireless V2X communication. A distributed deep learning paradigm balances precision and response time to give autonomous vehicles the required sensing support in a densely populated urban environment. The infrastructure will exploit a novel software architecture to help programmers and big data practitioners combine data-in-motion and data-at-rest analysis while providing real-time guarantees. MASA, funded under the European project CLASS, is an open test bench where interested partners may deploy and test next-generation AD applications in a tightly connected setting.
Modern computing hardware and NVIDIA Jetson TX1/TX2 performance create new possibilities for smart city applications and the retail, parking lot, and drone industries. We'll present how the PIXEVIA system covers vision processing and AI tasks using deep neural networks; learning from computer-generated images for number plate recognition; and self-supervised learning for vehicle detection. We will explore methods for orchestrating and combining information from different types of neural networks (from SSDs and Mask R-CNNs to attention-based RNNs). Real-world use cases for parking lots (empty parking space detection, number plate recognition) and the retail industry (on-shelf stock calculation, people counting with age and gender recognition) will also be presented.
Experience how to make spaces aware of the state of the people and objects within them. Explore new techniques for building real-time systems that can understand scenes with the help of hemispherical point clouds and AI at the edge. The goal of this session is to learn new ways of developing the scene understanding needed for action and interaction in public spaces or smart homes. Capturing, recognizing, and understanding all external and internal degrees of freedom of persons and objects, together with their respective states, gives the full information about the observed space.
While hemispherical vision provides advantages for wide-area coverage from a single point of observation, it also introduces new challenges due to its distinct projection geometry. Using 3D people detection and posture recognition as an example, we explain different approaches to using deep neural networks to extract information from hemispherical RGB-D data. The talk focuses on providing an overview of methods that attendees can apply to custom projects and run on Jetson in real time.
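One common preprocessing step for hemispherical imagery, sketched below under assumed image-circle parameters, is unwarping the circular fisheye view into a panoramic strip so that conventional, perspective-trained CNNs see less radial distortion; the talk's own methods may differ.

```python
# Unwarp a circular hemispherical frame into a panorama via a
# polar-to-Cartesian lookup table consumed by cv2.remap.
import cv2
import numpy as np

img = cv2.imread("fisheye.png")          # placeholder input frame
cx, cy, radius = 640.0, 640.0, 600.0     # assumed image-circle center/radius

out_h, out_w = int(radius), int(2 * np.pi * radius)
theta = np.linspace(0, 2 * np.pi, out_w, dtype=np.float32)
r = np.linspace(0, radius, out_h, dtype=np.float32)
theta_grid, r_grid = np.meshgrid(theta, r)

# For each output pixel (radius, angle), look up the source pixel.
map_x = cx + r_grid * np.cos(theta_grid)
map_y = cy + r_grid * np.sin(theta_grid)

panorama = cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
cv2.imwrite("panorama.png", panorama)
```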
Learn how Verizon is helping create safer streets, reducing traffic congestion, aiding the navigation of both vehicles and pedestrians, and reducing energy costs and consumption through AI-enabled, sensor-based networks that leverage LED street lighting infrastructure. We will discuss our Vision Zero application and how we use deep learning to recognize, detect, classify, and concurrently track vehicles in traffic, pedestrians, bicyclists, and parked cars, and turn this into actionable data to help make better urban planning decisions and quantify the results.
Learn how to simulate transportation systems and crowds for smart city applications at massive scale. This talk will give insights into novel algorithms and techniques being applied to: 1) national (entire UK) scale road network flow simulations, 2) city-sized simulations of intelligent, individually modelled vehicles, and 3) integrated simulations of national infrastructure with pedestrian crowds, vehicles, and rail. Example techniques include low-density high-diameter graph traversal, multi-agent simulation, and virtual reality interaction using the OmniDeck treadmill and the Oculus Rift.
A variety of aspects of a modern video understanding platform will be discussed. We will cover topics such as single-stage CNN-based hierarchical object detection for low-level image semantics extraction, LSTM neural network architectures for efficient tracking and behavioral analysis, and object descriptors for cross-stream similarity. We will then discuss the time-optimized video analysis system architecture, which enables processing of multiple streams on a single NVIDIA GPU in near real time. Finally, we will demonstrate the system's performance on a variety of complex use cases.
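As a generic illustration of the LSTM-based tracking idea (not this platform's actual architecture), the sketch below folds per-frame appearance features into a track state and regresses the next bounding box; all dimensions are arbitrary assumptions.

```python
# Fold a sequence of per-frame detection features into a track state,
# then predict the next box from the latest hidden state.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)
head = nn.Linear(128, 4)                  # e.g., regress the next bounding box

track_feats = torch.rand(1, 30, 256)      # 30 frames of appearance features
outputs, (h, c) = lstm(track_feats)
next_box = head(outputs[:, -1])           # prediction from the latest state
print(next_box.shape)                     # torch.Size([1, 4])
```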
Parsing millions of video cameras in real time to provide situational awareness is an enormous challenge. We will discuss how YITU Tech has overcome this using GPUs and TensorRT. We learned from 1 billion faces to win first place in face identification accuracy in FRPC 2017, hosted by NIST. We will show how we analyze data from 10 million cameras using several thousand NVIDIA Tesla P4s and achieve 99% accuracy in identifying pedestrians with 100 days of data from the cameras. The result is the ability to do big data analysis on things like population density and traffic flows, enabling the development of smart cities.
We will describe a fast and accurate, AI-based, GPU-accelerated vehicle inspection system that scans the underside of moving vehicles to identify threatening objects or unlawful substances (bombs, concealed weapons, and drugs), vehicle leaks, wear and tear, and any damage that would previously have gone unnoticed.
How can we enable automated AI/deep learning-based machines to evolve their specialties through colonies of "social networks of intelligent machines" (SNIM)? We will give an example of Qylur's QyNet™ machines cloud concept and how we utilize the power of SNIMs, GPU-enabled deep learning, and execution at edge systems to enable a revolution in guest entry operations and physical security for public venues, from mega events to parks and museums. We will also dream a bit further about how other industrial intelligent machines can benefit from the QyNet SNIM, and touch on our responsibilities as humans as we enable this disruptive and beneficial revolution to take place.
Video is increasingly becoming a key sensor for maintaining security, business performance, and efficient operations. This session will discuss the technology and application of BriefCam's video analytics solutions. Topics will include how GPUs and deep learning generate rich metadata from video, and how that metadata solves a diverse range of problems across applications.
Smart and safe cities need AI. There are approximately 500 million cameras deployed globally today. When it comes to analyzing that data, traditional methods of video analytics often fall short. AI and deep learning can provide the level of accuracy needed to extract meaningful real-time insights. The result is improved public safety and more efficient city operations. NVIDIA Metropolis is the company's edge-to-cloud platform for the AI City. It includes solutions for deep learning at the edge, in on-prem servers, and in the cloud, as well as a comprehensive SDK. During this talk, we'll provide an overview of NVIDIA Metropolis, its different applications, and its critical role in the creation and expansion of smart and safe cities.
Learn how leveraging NatSec technology makes real-time video streaming from vehicles possible: zero-latency, secure, and affordable; and how applying the latest generation of FaceRec analytics ensures only authorised people are behind the wheel.
SeeQuestor uses Deep Learning and Affordable Supercomputers to provide Radically Faster Video Intelligence to police and law enforcement agencies who need to search hundreds or thousands of hours of CCTV or other video data as part of a criminal investigation or a search for a missing person. Developed with input from the Met Police and the British Transport Police, SeeQuestor is now in use by law enforcement agencies around the world. This session will focus on the technology used (Deep Learning and Affordable Supercomputers, powered by GPUs) and the academic pedigree (two leading computer vision research groups from the UK), and will illustrate the capabilities of the SeeQuestor platform with examples drawn from real use cases.
Why does building AI Cities matter now for America? Why should the U.S. industry and government aggressively develop and deploy AI and deep learning to solve important problems around public safety and operational efficiency in our urban centers? What are the global trends that make this the right time to drive these changes? We'll cover these topics and more.
We'll explore how deep learning techniques can be used to transform passive surveillance systems into active threat-detection platforms for environments ranging from retail to cities and campuses. Deep Science is deploying deep learning solutions to spot robberies and assaults as they're occurring in real time.
Government agencies and commercial companies demonstrate high demand for versatile, stable, and highly efficient person identification solutions supporting cross-domain face recognition and person database clusterization in both controlled and uncontrolled scenarios. It's now possible to resolve cross-domain face recognition challenges, and even tasks of quadratic complexity, using GPU-powered inference of CNN-based face recognition algorithms. We'll focus on (1) the concept of the GPU-powered platform for cross-domain face recognition; (2) its essential performance and critical technical characteristics; (3) an approach to reaching the demanded efficiency and quality by using the NVIDIA GPU; and (4) examples of completed and ongoing projects that demonstrate the achieved high performance and quality in real-life conditions.
Body-worn cameras have proven to strengthen trust and accountability between law enforcement agencies and the communities they serve. However, large-scale use of body-worn cameras has generated massive amounts of data, which is practically impossible for these agencies to use effectively. This has led to significant and unproductive time spent manually analyzing data. Axon Research is using the latest advances in deep learning and GPU acceleration to enable increased efficiency across the body-worn camera continuum by accelerating the many manual, time-consuming workflows in public safety, such as redacting footage in response to a public request. Attendees will hear about the potential impact of large-scale deep learning on law enforcement and public safety information management.
For security teams working to ensure public safety, the ability to minimize incident response time and speed forensic investigations is critical. We'll discuss a new end-to-end, deep learning-based, GPU-accelerated architecture and video search engine being deployed to solve this.
We'll showcase both the technology and use cases for applying convolutional neural networks and GPUs to reverse the resolution-degrading effects of optical blur and sensor sampling, reconstructing color video to nine times its captured pixel density.
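The presenters' network is not described here; as a generic sketch of the idea, the toy model below uses sub-pixel convolution to expand each pixel into a 3x3 block, giving nine times the input pixel density.

```python
# A small super-resolution CNN: the last conv emits scale^2 * 3 channels,
# which PixelShuffle rearranges into a (scale*H, scale*W) color image.
import torch
import torch.nn as nn

class SubPixelSR(nn.Module):
    def __init__(self, scale=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3 * scale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        return self.body(x)

frame = torch.rand(1, 3, 120, 160)        # low-resolution input
print(SubPixelSR()(frame).shape)          # torch.Size([1, 3, 360, 480])
```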
We'll introduce the AI City Challenge winners announced at GTC 2017. Honghui Shi from the University of Illinois at Urbana-Champaign will give a ten-minute presentation on multiple-kernel-based vehicle tracking using 3D deformable models. Zheng Tang will then present on effective object detection from traffic camera videos.
Through the application of artificial intelligence and deep learning, "computing at the edge" is changing how safety systems are detecting, capturing, analyzing, and applying reasoning to events. Using real-time analysis of the data from cameras and inertial sensors mounted on a vehicle, we can not only detect unsafe driving events but also analyze the chain of events that leads to unsafe situations. We can recognize a driver's positive performance in addition to areas where best practices need to be reinforced. Power-efficient yet powerful deep learning processors enable us to process all of this data in real time at the edge of the network. This allows us to create an accurate and comprehensive record of driving performance that fleet managers can use to create incentives for safer driving. Insurance companies can also use this information to set proper premiums customized for individual drivers and potentially adjusted dynamically to reflect the driving environment.
Today, billions of sensors gathering zettabytes of data are offering organizations a treasure trove of information that can help them better serve their customer needs. With the advances of 5G infrastructure, companies now have the ability to bring AI models to the edge, where the data is generated and real-time decisions need to be made. Kubernetes eliminates many of the manual processes involved in deploying, managing and scaling applications, and is becoming a standard for deployment from the data center to the edge. NVIDIA NGC product and engineering experts will walk through the latest enhancements to its GPU-accelerated software hub, and demonstrate how NVIDIA is facilitating the deployment and management of AI applications at the edge.
AnyVision Insights is a revolutionary deep data analytics platform for retailers. Learn how it can help you gain an end-to-end, accurate view of your stores' daily operations, customers' journeys, shopping patterns, and sales funnel. A live demo will illustrate Insights' vast capabilities through its cutting-edge recognition platform specially designed for real-world retailers, including unique people counting, heat and path mapping, focus of attention, duration in store, bounce rate, rush hours, demographics classification, VIP recognition, zone breakdown, and more.
This session will give an overview of the performance and efficiency of running the Malong RetailAI® software stack on the Dell EMC PowerEdge R7425 server for retail analytics. The objective is to show how the stack delivers high-throughput, low-latency inference performance on the NVIDIA T4 GPU. We will look into some of the use cases that can be solved by running the algorithms developed by Malong in the area of product recognition. We will talk about CurriculumNet, which is based on weakly supervised learning, and how these algorithms run on the NVIDIA T4 GPU by taking advantage of TensorRT. Attendees will also learn how we take advantage of the AMD CPU to deliver a higher-throughput, low-latency solution. By taking advantage of multiple PCIe lanes, we can provide a true scale-up intelligent video analytics solution with the T4 GPU.
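TensorRT's engine-building API varies across versions, so this hedged sketch shows only the common first step: exporting a trained model to ONNX, which TensorRT tooling such as trtexec can then optimize for the T4. The ResNet-50 stand-in and file names are placeholders, not Malong's stack.

```python
# Export a PyTorch classifier to ONNX as the hand-off point to TensorRT.
import torch
import torchvision

# pretrained weights stand in for a trained product-recognition model.
model = torchvision.models.resnet50(pretrained=True).eval()
dummy = torch.rand(1, 3, 224, 224)

torch.onnx.export(model, dummy, "product_recognizer.onnx",
                  input_names=["input"], output_names=["logits"],
                  opset_version=11)
# Then, e.g.: trtexec --onnx=product_recognizer.onnx --fp16 --saveEngine=product.engine
```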
We'll discuss how Motionloft uses advanced computer vision at the edge to help retailers increase customer satisfaction and revenue. Our sensors, convolutional neural networks, and NVIDIA GPUs allow retailers to analyze movements of people inside stores, and turn these movements into actionable insights. We'll explain how our technology can monitor and better understand the impact of factors like long lines, which cost retailers $15.8 billion in lost sales annually. In the past, this sort of measurement could only be done with handheld clickers. We'll also cover how retailers can track other elements that affect the customer experience including store occupancy, service transaction times, service-area patronage, and customer abandonment.
Artificial intelligence and the latest computer vision techniques are quickly reshaping the future of retail. By deploying modern deep learning techniques, technology companies are improving the overall retail shopping experience by getting rid of slow, cumbersome checkout lines. We'll talk about our work on autonomous checkout, which can make shopping a seamless, magical, and more human interaction. Standard Cognition, along with other technology innovators like Amazon Go, has announced plans to deploy thousands of autonomous checkout-enabled retail stores by 2021.
This session proposes a simple but robust method of data distribution for large-scale IoT deployments. Attendees will learn how to use a peer-to-peer publish/subscribe messaging technology based on data topics to facilitate collection of initial in-situ data, distribution of inferencing models, load-sharing between "worker nodes", collation of inferencing results from many nodes to a central "command", and collation of corner-case data to facilitate iterative updates to the trained model.
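The session doesn't name its middleware here; as a stand-in, this sketch shows the topic-based publish/subscribe pattern it describes using MQTT via paho-mqtt, with an assumed broker address and topic scheme.

```python
# Topic-based pub/sub: workers publish results on per-node topics,
# a central "command" subscribes with a wildcard and collates them.
import json
import time
import paho.mqtt.client as mqtt

BROKER = "broker.example.local"  # placeholder broker

def on_message(client, userdata, msg):
    # Command-side collation of inference results from many nodes.
    result = json.loads(msg.payload)
    print(f"{msg.topic}: {result}")

# paho-mqtt 1.x style client; 2.x additionally requires a CallbackAPIVersion.
command = mqtt.Client()
command.on_message = on_message
command.connect(BROKER, 1883)
command.subscribe("fleet/+/results")   # '+' matches any single worker node
command.loop_start()

worker = mqtt.Client()
worker.connect(BROKER, 1883)
# A worker reports a detection; model updates would flow the other way
# on a topic like "fleet/node42/model".
worker.publish("fleet/node42/results",
               json.dumps({"label": "vehicle", "confidence": 0.93}))
time.sleep(1)                          # let the background loop deliver it
```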
Designing an autonomous machine is about much more than just the AI. Electrical, mechanical, connectivity, and security are just a few of the disciplines where you will require expertise, and not all companies have complete expertise in all these areas. In this session, we will provide examples followed by design considerations, strategies, and solutions to begin to address these challenges.
Learn how VisionLabs' GPU-powered solutions contribute to creating a safer, smarter megacity: a metropolitan area with a total population in excess of ten million people. We'll do a deep dive into three implemented and ongoing huge-scale smart-city projects to understand the challenges, the technical specifics, and how GPU computing impacts each case: a face authentication-based immobilizer and driver monitoring system for municipal service vehicles, powered by the NVIDIA Jetson TX2 embedded platform; megacity-scale vehicle traffic analysis and anomaly detection, powered by NVIDIA Tesla P40 with over 80 million daily recognition requests; and a national-scale face identification platform for financial services with over 110 million faces in its database. The foundation of all these projects is VisionLabs LUNA, a cross-platform object recognition software based on a proprietary deep neural network (DNN) inference framework. To build cost-effective solutions, VisionLabs uses know-how in DNN quantization and acceleration. In terms of accuracy, VisionLabs is recognized as a top-three performer worldwide in the National Institute of Standards and Technology's Face Recognition Vendor Test and in the University of Massachusetts LFW challenge.
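VisionLabs' inference framework is proprietary, so the sketch below only illustrates the general DNN quantization idea using PyTorch's dynamic quantization on a toy descriptor head; it is not their implementation.

```python
# Replace float32 Linear weights with int8 to cut memory and speed up
# inference, keeping the same call interface.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 128),              # toy 128-D face descriptor head
)

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.rand(1, 512)
print(quantized(x).shape)             # same interface, smaller/faster weights
```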
Miovision presents a video-based traffic analytics system capable of tracking and classifying vehicles in real time throughout cities. The system leverages Jetson TX2 modules and on-device inference to accurately classify vehicles at over 50 frames per second using single-shot multibox detection and DAC, a VGG-based network. We'll cover many of the issues our teams worked through to design and implement the system, including data collection, annotation, training, incorporating continuous training, and deep learning iteration. We'll also illustrate how the measured traffic trends were used to reduce congestion and evaluate the health of traffic corridors.
A wide-area city surveillance solution for running real-time video analytics on thousands of 1080p video streams will be presented. The system hardware is an embedded computer cluster based on NVIDIA TX1/TX2 and NXP i.MX6 modules. Custom-designed system software manages job distribution, result collection, and system-wide diagnostics, including instantaneous voltage, power, and temperature readings. The system is fully integrated with custom-designed video management software, IP cameras, and network video recorders. Instead of drawing algorithm results on the processed video frames, re-encoding, and streaming back to the operator computer for display, only the obtained metadata is sent to the operator computer. The video management software streams the video sources independently and synchronizes decoded video frames with the corresponding metadata locally before presenting the processed frames to the operator.
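Here is a minimal sketch of the metadata-synchronization pattern described above, assuming millisecond timestamps and invented field names: the video management side pairs each decoded frame with the nearest analytics record by time before drawing overlays.

```python
# Pair decoded frames with analytics metadata by nearest timestamp.
import bisect

class MetadataSync:
    def __init__(self, tolerance_ms=40):
        self.timestamps = []          # sorted arrival times of metadata
        self.records = {}
        self.tolerance_ms = tolerance_ms

    def add_metadata(self, ts_ms, record):
        bisect.insort(self.timestamps, ts_ms)
        self.records[ts_ms] = record

    def match(self, frame_ts_ms):
        """Return the metadata record closest to a decoded frame's timestamp."""
        i = bisect.bisect_left(self.timestamps, frame_ts_ms)
        candidates = self.timestamps[max(0, i - 1):i + 1]
        if not candidates:
            return None
        best = min(candidates, key=lambda t: abs(t - frame_ts_ms))
        if abs(best - frame_ts_ms) <= self.tolerance_ms:
            return self.records[best]
        return None

sync = MetadataSync()
sync.add_metadata(1000, {"boxes": [(10, 20, 50, 80)]})
print(sync.match(1012))   # pairs the frame at t=1012 ms with the t=1000 ms record
```

Because only this small metadata stream crosses the network, the cluster avoids the re-encode-and-restream cost entirely.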
Robust object tracking requires knowledge and understanding of the object being tracked: its appearance, motion, and change over time. A tracker must be able to modify its underlying model and adapt to new observations. We present Re3, a real-time deep object tracker capable of incorporating temporal information into its model. Rather than focusing on a limited set of objects or training a model at test-time to track a specific instance, we pretrain our generic tracker on a large variety of objects and efficiently update on the fly; Re3 simultaneously tracks and updates the appearance model with a single forward pass. This lightweight model is capable of tracking objects at 150 FPS, while attaining competitive results on challenging benchmarks. We also show that our method handles temporary occlusion better than other comparable trackers using experiments that directly measure performance on sequences with occlusion.
Go beyond working with a single sensor and enter the realm of Intelligent Multi-Sensor Analytics (IMSA). We'll introduce concepts and methods for using deep learning with multi-sensor, or heterogeneous, data. There are many resources and examples available for learning how to leverage deep learning with public imagery datasets; however, few resources demonstrate how to combine and use these techniques to process multi-sensor data. As an example, we'll introduce some basic methods for using deep learning to process radio frequency (RF) signals and make them part of your intelligent video analytics solutions. We'll also introduce methods for adapting existing deep learning frameworks to multiple sensor signal types (for example, RF, acoustic, and radar). We'll share multiple use cases and examples for leveraging IMSA in smart city, telecommunications, and security applications.
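As one hedged example of feeding RF into an image-oriented deep learning stack, the sketch below converts raw IQ samples into a spectrogram "image" a CNN could consume; the synthetic signal and sample rate are assumptions standing in for real captures.

```python
# Turn complex IQ samples into a log-scaled spectrogram suitable as CNN input.
import numpy as np
from scipy.signal import spectrogram

fs = 1_000_000                         # 1 MS/s sample rate (assumed)
t = np.arange(100_000) / fs
iq = np.exp(2j * np.pi * 100_000 * t)  # synthetic 100 kHz carrier
iq += 0.1 * (np.random.randn(t.size) + 1j * np.random.randn(t.size))

# Complex input -> two-sided spectrogram; log-scale it like image pixels.
f, seg_t, sxx = spectrogram(iq, fs=fs, nperseg=256, return_onesided=False)
image = 10 * np.log10(np.abs(sxx) + 1e-12)

print(image.shape)                     # (freq_bins, time_steps) "image" for a CNN
```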
Modern computing hardware and NVIDIA Jetson TX1/TX2 performance create new possibilities for drones and enable autonomous AI systems, where image processing can be done on board during flight or near the camera. We'll present how the PIXEVIA system covers vision processing and AI tasks for drones, e.g., image stabilization, position estimation, object detection, tracking, and classification using deep neural networks, and self-improvement after deployment. We'll describe the software frameworks Caffe and TensorFlow with cuDNN, VisionWorks, and NVIDIA CUDA used to achieve real-time vision processing and object recognition. Real-world use cases with drone manufacturers Aerialtronics and Squadrons Systems, and smart city applications in Vilnius and Tallinn, will be presented during this talk.
We present a novel unsupervised method for face identity learning from video sequences. The method exploits convolutional neural networks for face detection and face description, together with a smart learning mechanism that exploits the temporal coherence of visual data in video streams. We introduce a novel feature matching solution based on Reverse Nearest Neighbour and a feature forgetting strategy that supports incremental learning with memory size control as time progresses. We show that the proposed learning procedure is asymptotically stable and can be effectively applied to relevant applications like multiple face tracking and online open-world face recognition from video streams. The whole system, including the smart incremental learning mechanism, takes advantage of the GPU.
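The paper's Reverse Nearest Neighbour matching is sketched below in its simplest form, as mutual nearest-neighbour acceptance between probe and gallery descriptors; this is a toy illustration under that simplifying assumption, not the authors' exact formulation.

```python
# Accept a probe/gallery pair only if each is the other's nearest neighbour.
import numpy as np

def mutual_nn_matches(probes, gallery):
    """probes: (P, D), gallery: (G, D) L2-normalized descriptors."""
    sims = probes @ gallery.T                  # cosine similarity matrix
    probe_best = sims.argmax(axis=1)           # nearest gallery item per probe
    gallery_best = sims.argmax(axis=0)         # nearest probe per gallery item
    return [(p, g) for p, g in enumerate(probe_best) if gallery_best[g] == p]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 128))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
probes = gallery[[2, 4]] + 0.05 * rng.normal(size=(2, 128))
probes /= np.linalg.norm(probes, axis=1, keepdims=True)
print(mutual_nn_matches(probes, gallery))      # expected: [(0, 2), (1, 4)]
```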
Computer vision with CNNs performs well for people detection, but this is not enough. A step forward can be taken to understand the appearance of people detected at low resolution or corrupted by occlusions in a crowd; to track them in the wild; to detect saliency and pay attention only to details; and to forecast motion and human actions. The next solutions will be provided by new neural architectures based on autoencoders and recurrent models, such as Generative Adversarial Networks and Long Short-Term Memory networks. The session will present how they work, how they can be implemented on GPUs, and how they are used in real applications, such as in AI cities with static and moving cameras and in collaborative environments.
In this talk, I will highlight the main research challenges facing the field of activity detection in untrimmed videos, as well as deep learning-based methods developed at KAUST to address them. Massive amounts of video data need to be processed for relevant semantic information, which predominantly concerns human activities (i.e., single-human, human-to-human, and human-to-object interactions). While this problem is encountered in many real-world applications (e.g., video surveillance, large-scale video summarization, and ad placement in video platforms), automated vision solutions have been hindered by several challenges, including the lack of large-scale datasets for learning and the need for real-time processing. I will highlight how deep learning can be used to tackle these challenges.
Learn how deep learning is used to process video streams to analyse human behaviour in real time. We will detail our solution for recognising fine-grained movement patterns as people perform everyday actions such as walking, eating, shaking hands, or talking to each other. The novelty of our technical solution is that our system learns these capabilities from watching lots of video snippets showing such actions. This is exciting because very different applications can be realised with the same algorithms, as we follow a purely data-driven, machine learning approach. We will explain what new types of deep neural networks we created and how we employ our Crowd Acting™ platform to cost-efficiently acquire hundreds of thousands of videos for this purpose.
We present a face recognition system that can recognize multiple persons in parallel in real time, running on a single Jetson TX2. Due to rapid progress in deep learning, face recognition accuracy has recently surpassed human level, and GPUs have become the major platform to train and run deep learning models. The speed of NVIDIA GPUs on deep learning tasks is increasing rapidly thanks to hardware and software optimizations. We present a system that combines the most accurate face detection and recognition models with the fastest software stack. Combined with a 4K camera, the system can recognize over 10 persons in parallel in crowd situations, even from a 10-meter range. The system can be deployed in low-power embedded environments such as drones.
Using cutting-edge AI technologies, self-driving cars have the potential to significantly reduce traffic fatalities, improve transportation mobility and accessibility, and increase productivity. To realize the promise of these many benefits, public policy must allow innovation to flourish by, for example, removing barriers to testing and clarifying state and federal regulatory authority. During this panel, we'll hear from top policymakers and industry representatives on what key policies need to be enacted to advance the deployment of self-driving cars, and how all stakeholders are working together to further that goal.
We'll introduce DeepStream, NVIDIA's solution for high-performance video analytics. One of the grand challenges of AI is to understand video content. Applications are endless: video surveillance, live video streaming, ad placement, and more. The problem is that deep learning, which has been boosting modern AI, is computationally expensive, and it's even more challenging when it comes to live video streams. That's why we're building the NVIDIA DeepStream SDK, which simplifies development of high-performance video analytics applications powered by deep learning. It's built on top of the NVIDIA Video SDK, which leverages the GPU's hardware encoding and decoding horsepower, and NVIDIA TensorRT, which accelerates deep neural network inference. We have seen successful large-scale deployment of such intelligent video analytics systems, and we see this as an unstoppable trend.
This talk is a lightning introduction to object detection and image segmentation for data scientists, engineers, and technical professionals. This task of computer-based image understanding underpins many major fields such as autonomous driving, smart cities, healthcare, national defense, and robotics. Ultimately, the goals of this talk are to provide a broad context and a clear roadmap from traditional computer vision techniques to the most recent state-of-the-art methods based on deep learning and convolutional neural networks (CNNs). Additional considerations for network deployment at the edge, or on the road in an autonomous vehicle, using NVIDIA's latest TensorRT release will be discussed.
Our cities are increasingly challenged to bring intelligence and safety to their citizens' lives. New technologies using AI are promising to change the way cities operate. How will citizens benefit? What are the policies and investments necessary to bring this about as soon as possible?
We'll talk about the current performance limitations of VDI and its associated slow adoption; solutions to these limitations that turn VDI into the preferred medium for your environment; creating a better-than-PC experience with excellent manageability; discovering the pitfalls of various system configurations and designs; and exploring the additional advantages of a VDI deployment for any environment.
We'll present a novel framework to combine multiple layers and modalities of deep neural networks for video classification, which is fundamental to intelligent video analytics, including automatic categorizing, searching, indexing, segmentation, and retrieval of videos. We'll first propose a multilayer strategy to simultaneously capture a variety of levels of abstraction and invariance in a network, where the convolutional and fully connected layers are effectively represented by the proposed feature aggregation methods. We'll further introduce a multimodal scheme that includes four highly complementary modalities to extract diverse static and dynamic cues at multiple temporal scales. In particular, for modeling the long-term temporal information, we propose a new structure, FC-RNN, to effectively transform the pre-trained fully connected layers into recurrent layers. A robust boosting model is then introduced to optimize the fusion of multiple layers and modalities in a unified way. In the extensive experiments, we achieve state-of-the-art results on benchmark datasets.
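Below is a minimal sketch of the FC-RNN idea, under the assumption that the recurrence simply adds a hidden-to-hidden term to a pretrained fully connected layer, i.e., h_t = relu(W_fc x_t + U h_{t-1} + b); details of the published structure may differ.

```python
# Turn a pretrained FC layer into a recurrent cell by reusing its weights
# as the input transform and adding a learned hidden-to-hidden transform.
import torch
import torch.nn as nn

class FCRNNCell(nn.Module):
    def __init__(self, pretrained_fc: nn.Linear):
        super().__init__()
        self.fc = pretrained_fc                      # reused W_fc and bias
        self.u = nn.Linear(pretrained_fc.out_features,
                           pretrained_fc.out_features, bias=False)

    def forward(self, x, h):
        return torch.relu(self.fc(x) + self.u(h))

fc = nn.Linear(2048, 512)                            # stands in for a pretrained layer
cell = FCRNNCell(fc)
h = torch.zeros(1, 512)
for frame_feat in torch.rand(16, 1, 2048):           # 16 video time steps
    h = cell(frame_feat, h)
print(h.shape)                                       # torch.Size([1, 512])
```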