Visual recognition using end-to-end learning methodology: theory and applications


Human information processing mechanisms suggest the need for deep architectures to extract complex structure and build internal representations of rich sensory inputs. With the growth of storage and computation capacity (notably the use of GPUs), end-to-end learning systems and big data began to attract the attention of the Computer Vision community. At present, deep convolutional neural networks trained on very large datasets of millions of images, such as ImageNet or Places, have shown remarkable improvements in several visual recognition tasks. The success of these approaches is mainly attributed to the simultaneous learning of the feature representation and the classification rule: end-to-end learning algorithms model the problem from pixels to outputs, and the descriptors are adjusted to the decision rule at learning time.
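As a minimal illustration of this idea (a generic sketch, not the project's own architecture), the code below trains a tiny two-layer network end to end on synthetic data with NumPy: the hidden layer plays the role of the learned feature representation, and backpropagating the classifier's gradient through it adjusts the descriptors to the decision rule, as described above. All sizes and hyperparameters are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "pixels": 200 samples of 16-dim inputs, 2 classes (toy stand-in for images).
X = rng.normal(size=(200, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# One hidden layer ("feature extractor") plus a linear classifier, learned jointly.
W1 = rng.normal(scale=0.1, size=(16, 8))
W2 = rng.normal(scale=0.1, size=(8, 2))

def forward(X):
    h = np.maximum(0, X @ W1)                                # ReLU features
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))   # stable softmax
    return h, e / e.sum(axis=1, keepdims=True)

lr = 0.5
for step in range(300):
    h, p = forward(X)
    # Cross-entropy gradient at the output.
    g = p.copy()
    g[np.arange(len(y)), y] -= 1
    g /= len(y)
    # Backpropagate: the classifier's gradient also shapes the features (end-to-end).
    gW2 = h.T @ g
    gh = g @ W2.T
    gh[h <= 0] = 0
    gW1 = X.T @ gh
    W2 -= lr * gW2
    W1 -= lr * gW1

_, p = forward(X)
accuracy = (p.argmax(axis=1) == y).mean()
```

Because the feature weights `W1` receive gradients from the classification loss, the representation is fit to the decision rule rather than designed by hand, which is the core of the end-to-end paradigm.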

This project proposes the design of new end-to-end learning methodologies to overcome several open problems in visual recognition. Among them, we focus on: the use of the large unlabelled datasets available in many problems to improve the performance of end-to-end learning algorithms; the design of end-to-end learning architectures to model temporal and sequential data; and the use of sparse representations in parameter learning to reduce the training time of deep architectures.

We will study these issues in the context of several specific computer vision objectives: visual representation and scene understanding, egocentric vision, medical image segmentation, and face and gesture recognition. We will propose novel large-scale models to learn non-visual concepts such as the popularity and emotional tags of images. Scene understanding will be addressed by exploring neural networks from object-centric and place-centric perspectives. Object recognition and scene understanding algorithms will be employed for event extraction in egocentric images, enabling automatic diary construction from first-person data. We will develop robust algorithms to extract semantic information about the events of a person wearing a camera, applying Deep Learning (DL) methodologies to video segmentation, object detection and recognition, place recognition, and social interaction analysis. Regarding automated social behaviour analysis, we will design, develop, and apply end-to-end learning methodologies to problems where behaviour and interaction play a key role and where a vast amount of data exists that, on the one hand, makes the problem unfeasible for human observers to solve and, on the other hand, enables robust learning models based on DL algorithms. Our achievements in DL will also be applied to define new image representations in medical imaging contexts.
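One concrete instance of exploiting sparsity to speed up training in large-class problems (cf. publication [1] below) is Winner-Take-All (WTA) hashing, which encodes a vector by the rank order of a few of its components. The sketch below is a generic NumPy illustration of the WTA hash itself (following Yagnik et al.'s formulation), not the project's training algorithm; the permutation count, window size `k`, and noise level are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(1)

def wta_hash(x, perms, k):
    """Winner-Take-All hash: for each fixed random permutation,
    output the argmax index among the first k permuted components."""
    return np.array([np.argmax(x[p[:k]]) for p in perms])

d, n_hashes, k = 32, 64, 4
perms = [rng.permutation(d) for _ in range(n_hashes)]

a = rng.normal(size=d)
b = a + 0.05 * rng.normal(size=d)   # near-duplicate of a
c = rng.normal(size=d)              # unrelated vector

# Fraction of matching hash components: high for similar vectors,
# close to 1/k for unrelated ones.
sim_ab = np.mean(wta_hash(a, perms, k) == wta_hash(b, perms, k))
sim_ac = np.mean(wta_hash(a, perms, k) == wta_hash(c, perms, k))
```

The codes depend only on the ordering of components, not their magnitudes, so they are cheap to compare and robust to small perturbations; this is the kind of sparse, rank-based structure that can be used to avoid computing full output layers over very many classes.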

Our scientific and technological achievements in the field of end-to-end learning algorithms will be developed in close collaboration with several prestigious international research teams and tested in real applications coming from the EOPs of the project. In particular, we will apply the theory developed: (i) to recognise non-visual concepts in repositories of millions of images; (ii) to assist mild cognitive impairment intervention through egocentric vision; (iii) to develop algorithms for monitoring animal behaviour; (iv) to define new biomarkers of brain disorders, such as ADHD; and (v) to characterise atherosclerotic plaque, in order to assist the epidemiological analysis for plaque rupture prediction.

Scientific outcomes from the project:

Journal publications

[1] Bakhtiary, A. H., Lapedriza, A., & Masip, D. (2017). Winner Takes All Hashing for speeding up the training of Neural Networks in Large Class Problems. Pattern Recognition Letters.
[2] Calvet, L., Ferrer, A., Gomes, M. I., Juan, A. A., & Masip, D. (2016). Combining statistical learning with metaheuristics for the Multi-Depot Vehicle Routing Problem with market segmentation. Computers & Industrial Engineering, 94, 93-104.
[3] Alonso, F., Baró, X., Escalera, S., Gonzàlez, J., MacKay, M., & Serrahima, A. (2016). Care respite: taking care of the caregivers. International Journal of Integrated Care, 16(6).
[4] Escalera, S., Gonzalez, J., Baró, X., & Shotton, J. (2016). Guest Editors’ Introduction to the Special Issue on Multimodal Human Pose Recovery and Behavior Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8), 1489-1491.


Conference papers

[5] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2921-2929).
[6] García del Arco, J. A., Masip, D., Sbragaglia, V., & Aguzzi, J. (2016). Using ORB, BoW and SVM to identify and track tagged Norway lobster Nephrops norvegicus (L.). Instrumentation Viewpoint, (19), 50-52. SARTI, MARTECH.
[7] García del Arco, J. A., Masip, D., Sbragaglia, V., & Aguzzi, J. (2016). Automated identification and tracking of Nephrops norvegicus (L.) using infrared and monochromatic blue light. In CCIA 2016 (pp. 9-18).
[8] Escalante, H. J., Ponce-López, V., Wan, J., Riegler, M. A., Chen, B., Clapés, A., Escalera, S., et al. (2016). ChaLearn joint contest on multimedia challenges beyond visual analysis: An overview. In Proceedings of ICPRW.
[9] Escalera, S., Torres Torres, M., Martinez, B., Baró, X., Escalante, H. J., Guyon, I., Tzimiropoulos, G., et al. (2016). ChaLearn Looking at People and Faces of the World: Face analysis workshop and challenge 2016. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 1-8).
[10] Ponce-López, V., Chen, B., Oliu, M., Corneanu, C., Clapés, A., Guyon, I., … & Escalera, S. (2016, October). ChaLearn LAP 2016: First round challenge on first impressions - dataset and results. In Computer Vision–ECCV 2016 Workshops (pp. 400-418). Springer International Publishing.


Technical reports

[11] Zhou, B., Khosla, A., Lapedriza, A., Torralba, A., & Oliva, A. (2016). Places: An image database for deep scene understanding. arXiv preprint arXiv:1610.02055.
[12] Escalera, S., Baró, X., Escalante, H. J., & Guyon, I. (2017). ChaLearn Looking at People: Events and Resources. arXiv preprint arXiv:1701.02664.