2013: ChaLearn Multimodal Gesture Recognition

In 2013, ChaLearn organized a challenge and workshop on multi-modal gesture recognition from 2D and 3D video data using Kinect, in conjunction with ICMI 2013, December 9-13, Sydney, Australia. Kinect is revolutionizing the field of gesture recognition given the set of input data modalities it provides, including RGB image, depth image (using an infrared sensor), and audio. Gesture recognition is important in many multi-modal interaction and computer vision applications, including image/video indexing, video surveillance, computer interfaces, and gaming. It also provides a demanding benchmark for algorithms. The recognition of continuous, natural signing is very challenging due to the multimodal nature of the visual cues (e.g., movements of fingers and lips, facial expressions, body pose), as well as technical limitations such as spatial and temporal resolution and unreliable depth cues.


The data generated for the challenge is made available for research purposes. If you use this data in your work, please cite the following reference.

Escalera, Sergio; Gonzàlez, Jordi; Baró, Xavier; Reyes, Miguel; Lopes, Oscar; Guyon, Isabelle; Athitsos, Vassilis; Escalante, Hugo Jair. Multi-modal Gesture Recognition Challenge 2013: Dataset and Results. In: Proceedings of the 15th ACM on International Conference on Multimodal Interaction, pp. 445–452, ACM, Sydney, Australia, 2013, ISBN: 978-1-4503-2129-7.



Sergio Escalera:  Dept. Applied Mathematics, Universitat de Barcelona & Computer Vision Center.  (sergio(at)maia.ub.es)
Jordi Gonzàlez:  Dept. Computer Science & Computer Vision Center (UAB), Barcelona. (poal(at)cvc.uab.es)
Xavier Baró: EIMT at the Open University of Catalonia & Computer Vision Center. (xbaro(at)uoc.edu)
Miguel Reyes: Dept. Applied Mathematics, Universitat de Barcelona & Computer Vision Center. (mreyes(at)gmail.com)
Oscar Lopes: Computer Vision Center. (oscar.pino.lopes(at)gmail.com)
Isabelle Guyon: ChaLearn, Berkeley, California. (guyon(at)chalearn.org)
Vassilis Athitsos: University of Texas. (athitsos(at)uta.edu)
Hugo J. Escalante: INAOE, Puebla, Mexico. (hugojair(at)inaoep.mx)


Position  Team Name                               Score    Code
1         IVA MM [Wu:2013:FMF:2522848.2532589]    0.12756
3         ET [Bayer:2013:MMA:2522848.2532592]     0.17105

ICMI2013: Winners with the organizers


The following research works use this dataset and are listed with their published scores. If you want to appear in this list, please send an email to xbaro(at)uoc.edu with the reference to your published work, the score obtained and, if your code is available, a link to it.

Lev. Distance  Reference                                                        Code
0.11802        Pavlakos et al. ICIP2014 [PavlakosICIP2014]                      N/A
0.12756        Wu et al. ICMI2013 [Wu:2013:FMF:2522848.2532589]                 N/A
0.17105        Bayer & Silbermann ICMI2013 [Bayer:2013:MMA:2522848.2532592]     N/A


Data description

Problem setting

The focus of the challenge is on multiple-instance, user-independent learning of gestures from multi-modal data, which means learning to recognize gestures from several instances for each category performed by different users, drawn from a vocabulary of 20 gesture categories. A gesture vocabulary is a set of unique gestures, generally related to a particular task. In this challenge we focus on the recognition of a vocabulary of 20 Italian cultural/anthropological signs.
In all the sequences, a single user is recorded in front of a Microsoft Kinect, performing natural communicative gestures and speaking in fluent Italian. The main characteristics of the dataset of gestures are:

  • 13,858 gesture samples recorded with the Microsoft Kinect camera, including audio, skeletal model, user mask, RGB, and depth images.
  • RGB video stream at 8-bit VGA resolution (640×480) with a Bayer color filter, and depth-sensing video stream at VGA resolution (640×480) with 11 bits. Both are acquired at 20 fps on average.
  • Audio data is captured using Microsoft Kinect 20 multi-array microphone.
  • A total number of 27 users appear in the data set.
  • The data set contains the following number of sequences: development: 393 (7,754 gestures), validation: 287 (3,362 gestures), and test: 276 (2,742 gestures). Each sequence lasts between 1 and 2 minutes and contains between 8 and 20 gesture samples, around 1,800 frames. The total number of frames in the data set is 1,720,800.
  • All the gesture samples belonging to 20 main gesture categories from an Italian gesture dictionary are annotated at frame level indicating the gesture label.
  • 81% of the participants were native Italian speakers, while the remaining 19% were not Italian but spoke Italian.
  • All the audio that appears in the data is from the Italian dictionary. In addition, sequences may contain distracter words and gestures, which are not annotated since they do not belong to the main dictionary of 20 gestures.
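
The figures in the list above can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers quoted in the list; the variable names are illustrative, and the frame total assumes the approximate value of 1,800 frames per sequence.

```python
# Split sizes as quoted above: (number of sequences, number of gestures).
splits = {
    "development": (393, 7754),
    "validation": (287, 3362),
    "test": (276, 2742),
}

total_sequences = sum(n_seq for n_seq, _ in splits.values())
total_gestures = sum(n_gest for _, n_gest in splits.values())

# Assumption: every sequence has about 1,800 frames, as stated in the list.
approx_frames_per_sequence = 1800
total_frames = total_sequences * approx_frames_per_sequence

# Average number of gestures per sequence (expected to lie between 8 and 20).
mean_gestures_per_sequence = total_gestures / total_sequences

print(total_sequences)  # 956
print(total_gestures)   # 13858
print(total_frames)     # 1720800
```

The per-split gesture counts add up to the 13,858 samples reported, and 956 sequences of roughly 1,800 frames each give the stated total of 1,720,800 frames.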

Examples of all Italian signs

Data structure

We provide the X_audio.ogg, X_color.mp4, X_depth.mp4, and X_user.mp4 files containing the audio, RGB, depth, and user mask videos for a sequence X, respectively.

Image modalities for Montalbano dataset
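
As an illustration of the per-sequence file layout described above, the following sketch builds the four file paths for a sequence identifier. It assumes that the identifier and the modality name are joined by an underscore; the function name `sequence_files`, the `data` directory, and the identifier `X` are hypothetical and not taken from the dataset distribution.

```python
from pathlib import Path

# Modalities and file suffixes as listed in the text.
MODALITY_FILES = {
    "audio": "audio.ogg",
    "color": "color.mp4",
    "depth": "depth.mp4",
    "user": "user.mp4",
}


def sequence_files(root, sequence_id, sep="_"):
    """Return a dict mapping each modality to its expected file path.

    Assumption: files are named "<sequence_id><sep><modality>.<ext>".
    """
    root = Path(root)
    return {
        modality: root / f"{sequence_id}{sep}{suffix}"
        for modality, suffix in MODALITY_FILES.items()
    }


# Hypothetical usage with a placeholder directory and sequence identifier.
files = sequence_files("data", "X")
for modality, path in files.items():
    print(modality, path.name)
```

If the actual files use a different separator, it can be passed through the `sep` argument.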

We also provide a script to export the data to Matlab. The exported data contains the following Matlab structures:

  • NumFrames: Total number of frames.
  • FrameRate: Frame rate of the video in fps.
  • Audio: structure that contains WAV audio data
    • y: Audio Data
    • fs: Sample rate for the data.
  • Labels: Structure that contains the data about the labels in the sequence, sorted in order of appearance. The labels correspond to the 20 gesture categories.
    • Name: The name given to this gesture.
      1. vattene
      2. vieniqui
      3. perfetto
      4. furbo
      5. cheduepalle
      6. chevuoi
      7. daccordo
      8. seipazzo
      9. combinato
      10. freganiente
      11. ok
      12. cosatifarei
      13. basta
      14. prendere
      15. noncenepiu
      16. fame
      17. tantotempo
      18. buonissimo
      19. messidaccordo
      20. sonostufo
    • JointPosition: Contains the joint positions in the following three representations:
      • WorldPosition: The world coordinates position structure represents the global position of a tracked joint. The format is X, Y, Z, which represent the x, y, and z components of the subject's global position (in millimeters).
      • PixelPosition: The pixel coordinates position structure represents the position of a tracked joint. The format of the Position structure is X, Y, which represent the x and y components of the joint location over the RGB map (in pixel coordinates).
      • WorldRotation: The world rotation structure contains the orientations of skeletal bones in terms of absolute transformations and is formed by a 20×4 matrix, where each row contains the W, X, Y, Z values of the quaternion related to the rotation. The world rotation structure provides the orientation of a bone in the 3D camera space. The orientation of a bone is relative to the child joint, and the Hip Center joint still contains the orientation of the player/subject.
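
To make the WorldRotation layout more concrete, here is a minimal sketch that converts each (W, X, Y, Z) quaternion row into a 3×3 rotation matrix using the standard unit-quaternion formula. It assumes the matrix has been loaded as a list of 20 rows of four numbers; the function names and the example pose are hypothetical, and the export script itself may represent the data differently.

```python
import math


def quaternion_to_matrix(w, x, y, z):
    """Convert a (W, X, Y, Z) quaternion to a 3x3 rotation matrix (list of rows)."""
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        # Assumption: an all-zero row carries no orientation information.
        raise ValueError("zero quaternion has no orientation")
    # Normalize so that the formula below yields a proper rotation.
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def world_rotation_to_matrices(world_rotation):
    """Map a 20x4 list of quaternion rows to a list of 20 rotation matrices."""
    return [quaternion_to_matrix(*row) for row in world_rotation]


# Hypothetical input: 20 joints, all with the identity orientation.
identity_pose = [[1.0, 0.0, 0.0, 0.0] for _ in range(20)]
matrices = world_rotation_to_matrices(identity_pose)

# A 90-degree rotation about the z axis, for comparison.
quarter_turn_z = quaternion_to_matrix(math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
```

Rotation matrices are only one possible target representation; the quaternions can also be used directly as features.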

Image with the provided joints. Image from

The following table shows the easy and challenging aspects of the dataset:

Easy:

  • Fixed camera
  • Near frontal view acquisition
  • Within a sequence the same user
  • Gestures performed mostly by arms and hands
  • Camera framing upper body
  • Several available modalities: audio, skeletal model, user mask, depth, and RGB
  • Several instances of each gesture for training
  • Single person present in the visual field

Challenging:

Within each sequence:

  • Continuous gestures without a resting pose
  • Many gesture instances are present
  • Distracter gestures out of the vocabulary may be present in terms of both gesture and audio

Between sequences:

  • High inter and intra-class variabilities of gestures in terms of both gesture and audio
  • Variations in background, clothing, skin color, lighting, temperature, resolution
  • Some parts of the body may be occluded
  • Different Italian dialects


For each unlabeled video, the participants were instructed to provide an ordered list of labels R corresponding to the recognized gestures. We compared this list with the truth labels T, i.e., the prescribed list of gestures that the user had to perform during data collection. We computed the Levenshtein distance L(R; T), that is, the minimum number of edit operations (substitution, insertion, or deletion) that one has to perform to go from R to T (or vice versa). The Levenshtein distance is also known as ‘edit distance’. For example:

L([1 2 4]; [3 2]) = 2, L([1]; [2]) = 1, L([2 2 2]; [2]) = 2.

The overall score is the sum of the Levenshtein distances for all the lines of the result file compared to the corresponding lines in the truth value file, divided by the total number of gestures in the truth value file. This score is analogous to an error rate. For simplicity, in what follows, we call it Error Rate, although it can exceed 1.0.
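
The scoring described above can be sketched in a few lines of Python. This is an illustrative reimplementation under the definitions given in the text, not the organizers' calcError.m or levenshtein.m scripts; the function names and the in-memory list format are assumptions.

```python
def levenshtein(r, t):
    """Minimum number of substitutions, insertions, and deletions turning r into t."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of r
    for i, r_item in enumerate(r, start=1):
        current = [i]
        for j, t_item in enumerate(t, start=1):
            cost = 0 if r_item == t_item else 1
            current.append(min(
                previous[j] + 1,         # delete r_item
                current[j - 1] + 1,      # insert t_item
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def error_rate(predictions, truths):
    """Sum of per-sequence distances divided by the total number of true gestures."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same number of lines")
    total_distance = sum(levenshtein(r, t) for r, t in zip(predictions, truths))
    total_gestures = sum(len(t) for t in truths)
    return total_distance / total_gestures


# The three examples from the text, treated as three sequences.
predictions = [[1, 2, 4], [1], [2, 2, 2]]
truths = [[3, 2], [2], [2]]

print([levenshtein(r, t) for r, t in zip(predictions, truths)])  # [2, 1, 2]
print(error_rate(predictions, truths))  # 1.25
```

In this toy case the distances sum to 5 while the truth contains only 4 gestures, which shows how the score can exceed 1.0.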


Matlab Code


Basic Matlab GUI to visualize the data (RGB, depth, and audio) and export it for further use.
Download [25/05/2013]


Basic Matlab scripts for different purposes:

calcError.m: Computes the Levenshtein distance between a provided ground-truth file and a predictions file.
getGestureID.m: Gets the gesture ID from the gesture description.
getSampleData.m: If you do not want to export all the data, this script lets you access all the data from a zipped sample file. If you do not use all the modalities, you can comment out some information to speed up the data generation.
levenshtein.m: Computes the Levenshtein distance.
addLabels.m: Adds the labels to the validation and test datasets.

Download [26/05/2013]

Once the form has been submitted, you will be able to see the information needed to access the data. If it does not appear, refresh the page.
