Computer vision has made huge progress over the last four years, mainly thanks to the appearance of large datasets of labelled images, such as ImageNet [1] and Places [2], and to the success of deep learning algorithms trained on such large amounts of data [2, 3]. Since this turning point, performance has increased in many computer vision applications, such as scene recognition, object detection and recognition, and image captioning.
However, despite this progress, some tasks remain very hard for a machine to solve, such as image question answering or describing the content of an image in detail. Humans perform these tasks easily not only because we can detect and recognize objects and places, but also because we can reason about what we see. Reasoning about something requires cognition. Today, computers cannot reason about visual information because computer vision systems do not include artificial cognition. One of the main obstacles to developing cognitive systems for computer vision has been the lack of training data. However, the recent Visual Genome work [4] presents the first dataset that enables the modelling of such systems and opens the door to new research goals.
This line of research aims to explore how to add cognition to vision systems, in order to create algorithms that can reason about visual information.
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “ImageNet: A large-scale hierarchical image database”. In Proc. CVPR, 2009.
[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. “Learning Deep Features for Scene Recognition using Places Database”. In Advances in Neural Information Processing Systems 27 (NIPS), 2014.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. “ImageNet classification with deep convolutional neural networks”. In Advances in Neural Information Processing Systems, 2012.
[4] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. 2016. https://visualgenome.org/