Computer vision with CNNs performs well at people detection, but detection alone is not enough. The next step is to understand the appearance of people captured at low resolution or partially occluded in crowds, to track them in the wild, to detect salient regions and attend only to the relevant details, and to forecast human motion and actions. These capabilities are expected to come from new neural architectures: generative models, such as autoencoders and Generative Adversarial Networks (GANs), and recurrent architectures, such as Long Short-Term Memory (LSTM) networks. The session will present how these architectures work, how they can be implemented on GPUs, and how they are used in real applications, such as AI cities monitored by static and moving cameras, and collaborative environments.