Context Based Emotion Recognition Using EMOTIC Dataset
Ronak Kosti, Universitat Oberta de Catalunya, Barcelona, Spain; Jose M. Alvarez, NVIDIA, Santa Clara, CA, USA; Adria Recasens, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA; Agata Lapedriza, Universitat Oberta de Catalunya, Barcelona, Spain
1 Introduction
Over the past years, the interest in developing automatic systems for recognizing emotional states has grown rapidly. Several recent works show how emotions can be inferred from cues like text, voice, or visual information. The automatic recognition of emotions has many applications in environments where machines need to interact with or monitor humans. For instance, an automatic tutor in an online learning platform could provide better feedback to a student according to her level of motivation or frustration. Similarly, a car with driver-assistance capabilities could intervene or raise an alarm if it detects that the driver is tired or nervous.
In this paper we focus on the problem of emotion recognition from visual information. Concretely, we want to recognize the apparent emotional state of a person in a given image. This problem has been broadly studied in computer vision mainly from two perspectives: (1) facial expression analysis, and (2) body posture and gesture analysis. Section 2 gives an overview of related work on these perspectives and also on some of the common public datasets for emotion recognition.
Although the face and body pose give a lot of information about the affective state of a person, our claim in this work is that scene context information is also a key component for understanding emotional states. Scene context includes the surroundings of the person, like the place category, the place attributes, the objects, or the actions occurring around the person. Fig. 1 illustrates the importance of scene context for emotion recognition. When we just see the kid it is difficult to recognize his emotion (from his facial expression it seems he is feeling Surprise). However, when we see the context (Fig. 2a) we see the kid is celebrating his birthday, blowing out the candles, probably with his family or friends at home. With this additional information we can interpret his face and posture much better and recognize that he probably feels engaged, happy, and excited.

Fig. 1. How is this kid feeling? Try to recognize his emotional states from the person bounding box, without scene context.

Fig. 2. Sample images in the EMOTIC dataset along with their annotations.
The importance of context in emotion perception is well supported by different studies in psychology. In general situations, facial expression is not sufficient to determine the emotional state of a person, since the perception of the emotion is heavily influenced by different types of context, including the scene context.
In this work, we present two main contributions. Our first contribution is the creation and publication of the EMOTIC (from EMOTions In Context) Dataset. The EMOTIC database is a collection of images of people annotated according to their apparent emotional states. Images are spontaneous and unconstrained, showing people doing different things in different environments. Fig. 2 shows some examples of images in the EMOTIC database along with their corresponding annotations. As shown, annotations combine 2 different types of emotion representation: Discrete Emotion Categories and 3 Continuous Emotion Dimensions (Valence, Arousal, and Dominance). The EMOTIC dataset is now publicly available for download at the EMOTIC website. Details of the dataset construction process and dataset statistics can be found in Section 3.
Our second contribution is the creation of a baseline system for the task of emotion recognition in context. In particular, we present and test a Convolutional Neural Network (CNN) model that jointly processes the window of the person and the whole image to predict the apparent emotional state of the person. Section 4 describes the CNN model and the implementation details, while Section 5 presents our experiments and a discussion of the results. All the trained models resulting from this work are also publicly available at the EMOTIC website.¹

¹ This paper is an extension of the conference paper “Emotion Recognition in Context”, presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017. We present here an extended version of the EMOTIC dataset, with further statistical dataset analysis, an analysis of scene-centric algorithms on the data, and a study of the annotation consistency among different annotators. This new release of the EMOTIC database contains 44.4% more annotated people than its previous, smaller version. With the new extended dataset, we retrained all the proposed baseline CNN models with additional loss functions. We also present a comparative analysis of two different scene context features, showing how the context contributes to recognizing emotions in the wild.
2 Related Work
Emotion recognition has been broadly studied by the Computer Vision community. Most of the existing work has focused on the analysis of facial expression to predict emotions. The basis of these methods is the Facial Action Coding System, which encodes the facial expression using a set of specific localized movements of the face, called Action Units. These facial-based approaches usually use facial-geometry-based features or appearance features to describe the face. Afterwards, the extracted features are used to recognize Action Units and the basic emotions proposed by Ekman and Friesen: anger, disgust, fear, happiness, sadness, and surprise. Currently, state-of-the-art systems for emotion recognition from facial expression analysis use CNNs to recognize emotions or Action Units.
In terms of emotion representation, some recent works based on facial expression use the continuous dimensions of the VAD Emotional State Model. The VAD model describes emotions using 3 numerical dimensions: Valence (V), which measures how positive or pleasant an emotion is, ranging from negative to positive; Arousal (A), which measures the agitation level of the person, ranging from non-active/calm to agitated/ready to act; and Dominance (D), which measures the level of control a person feels over the situation, ranging from submissive/non-control to dominant/in-control. On the other hand, Du et al. proposed a set of 21 facial emotion categories, defined as different combinations of the basic emotions, like ‘happily surprised’ or ‘happily disgusted’. With this categorization the authors can give fine-grained detail about the expressed emotion.
Although research on emotion recognition from a computer vision perspective is mainly focused on the analysis of the face, some works also consider additional visual cues or multimodal approaches. For instance, the location of the shoulders has been used as additional information to the face features to recognize basic emotions. More generally, Schindler et al. used the body pose to recognize 6 basic emotions, performing experiments on a small dataset of non-spontaneous poses acquired under controlled conditions. Mou et al. presented a system of affect analysis in still images of groups of people, recognizing group-level arousal and valence by combining face, body, and contextual information.
Emotion Recognition in Scene Context and Image Sentiment Analysis are different problems that share some characteristics. Emotion Recognition aims to identify the emotions of a person depicted in an image. Image Sentiment Analysis consists of predicting what a person will feel when observing a picture, which does not necessarily contain a person. When an image does contain a person, there can be a difference between the emotions experienced by the person in the image and the emotions felt by observers of the image. For example, in the image of Fig. 2c, we see a kid who seems to be annoyed at having an apple instead of chocolate and another who seems happy to have chocolate. However, as observers, we might not have any of those sentiments when looking at the photo. Instead, we might think the situation is not fair and feel disapproval. Also, if we see an image of an athlete who has lost a match, we can recognize that the athlete feels sad. However, an observer of the image may feel happy if they are a fan of the team that won the match.
2.1 Emotion Recognition Datasets
Most of the existing datasets for emotion recognition using computer vision are centered on facial expression analysis. For example, the GENKI database contains frontal face images of a single person with different illumination, geographic, personal, and ethnic settings. Images in this dataset are labelled as smiling or non-smiling. Another common facial expression analysis dataset is the ICML Face-Expression Recognition dataset, which contains 28,000 images annotated with the 6 basic emotions and a neutral category. On the other hand, the UCDSEE dataset has a set of 9 emotion expressions acted by 4 persons. The lab setting is kept strictly the same in order to focus mainly on the facial expression of the person.
Dynamic body movement is also an essential source for estimating emotion. Several studies establish the relationship between affect and body posture, using the base-rate of human observers as ground truth. The data consist of a spontaneous set of images acquired under a restrictive setting (people playing Wii games). The GEMEP database is multi-modal (audio and video) and has 10 actors playing 18 affective states; it contains videos of actors showing emotions through acting, combining body pose and facial expression.
The Looking at People (LAP) challenges and competitions involve specialized datasets containing images, sequences of images, and multi-modal data. The main focus of these datasets is the complexity and variability of human body configurations, with data related to personality traits (spontaneous), gesture recognition (acted), apparent age recognition (spontaneous), cultural event recognition (spontaneous), action/interaction recognition, and human pose recognition (spontaneous).
The Emotion Recognition in the Wild (EmotiW) challenges host 3 databases: (1) the AFEW database, which focuses on emotion recognition from video frames taken from movies and TV shows, where the actions are annotated with attributes like name, age of actor, age of character, pose, gender, expression of person, the overall clip expression, and the 6 basic emotions plus a neutral category; (2) the SFEW database, a subset of AFEW containing face-frame images annotated specifically with the 6 basic emotions and a neutral category; and (3) the HAPPEI database, which addresses the problem of group-level emotion estimation and thus offers a first attempt to use context for the problem of predicting happiness in groups of people.
Finally, the COCO dataset has recently been annotated with object attributes, including some emotion categories for people, such as happy and curious. These attributes show some overlap with the categories that we define in this paper. However, COCO attributes are not intended to be exhaustive for emotion recognition, and not all the people in the dataset are annotated with affect attributes.
3 Emotic Dataset
The EMOTIC dataset is a collection of images of people in unconstrained environments annotated according to their apparent emotional states. The dataset contains 23,571 images and 34,320 annotated people. Some of the images were manually collected from the Internet using the Google search engine. For that, we used a combination of queries containing various places, social environments, different activities, and a variety of keywords on emotional states. The rest of the images belong to 2 public benchmark datasets: COCO and ADE20K. Overall, the images show a wide diversity of contexts, containing people in different places, in different social settings, and doing different activities.
Fig. 2 shows three examples of annotated images in the EMOTIC dataset. Images were annotated using Amazon Mechanical Turk (AMT). Annotators were asked to label each image according to what they think the people in the images are feeling. Notice that we have the capacity to make reasonable guesses about other people's emotional state thanks to our capacity for empathy, putting ourselves in another's situation, and also because of our common sense knowledge and our ability to reason about visual information. For example, in Fig. 2b, the person is performing an activity that requires Anticipation to adapt to the trajectory. Since he is doing a thrilling activity, he seems excited about it and is engaged or focused on this activity. In Fig. 2c, the kid feels a strong desire (yearning) to eat the chocolate instead of the apple. Because of his situation, we can interpret his facial expression as disquietment and annoyance. Notice that images are also annotated according to the continuous dimensions Valence, Arousal, and Dominance. We describe the emotion annotation modalities of the EMOTIC dataset and the annotation process in Sections 3.1 and 3.2, respectively.
After the first round of annotations (1 annotator per image), we divided the images into three sets: Training (70 percent), Validation (10 percent), and Testing (20 percent), maintaining a similar affective category distribution across the different sets. After that, the Validation and Testing sets were annotated by 4 and 2 extra annotators, respectively. As a consequence, images in the Validation set are annotated by a total of 5 annotators, while images in the Testing set are annotated by 3 annotators (these numbers can vary slightly for some images since we removed noisy annotations).
We used the annotations from the Validation set to study the consistency of the annotations across different annotators. This study is shown in Section 3.3. The data statistics and algorithmic analysis on the EMOTIC dataset are detailed in Sections 3.4 and 3.5, respectively.
3.1 Emotion Representation
The EMOTIC dataset combines two different types of emotion representation:
Continuous Dimensions. Images are annotated according to the VAD model, which represents emotions by a combination of 3 continuous dimensions: Valence, Arousal, and Dominance. In our representation each dimension takes an integer value in the range [1-10]. Fig. 4 shows examples of people annotated with different values of each dimension.
Emotion Categories. In addition to VAD, we also established a list of 26 emotion categories that represent various emotional states. The list of the 26 emotional categories and their corresponding definitions can be found in Table 1. Also, Fig. 3 shows (per category) examples of people showing different emotional categories.

Fig. 3. Examples of annotated people in EMOTIC dataset for each of the 26 emotion categories (Table 1). The person in the red bounding box is annotated by the corresponding category.

Fig. 4. Examples of annotated images in EMOTIC dataset for each of the 3 continuous dimensions Valence, Arousal & Dominance. The person in the red bounding box has the corresponding value of the given dimension.

Notice that the final list of affective categories also includes the 6 basic emotions (categories 2, 5, 16, 17, 21, 24), but we used the more general term Aversion for the category Disgust. Thus, the category Aversion also includes the subcategories dislike, repulsion, and hate, apart from disgust.
3.2 Collecting Annotations
We used the Amazon Mechanical Turk (AMT) crowd-sourcing platform to collect the annotations of the EMOTIC dataset. We designed two Human Intelligence Tasks (HITs), one for each of the 2 formats of emotion representation. The two annotation interfaces are shown in Fig. 5. Each annotator is shown a person-in-context enclosed in a red bounding box along with the annotation format next to it. Fig. 5a shows the interface for discrete category annotation, while Fig. 5b displays the interface for continuous dimension annotation. Notice that, in the last box of the continuous dimension interface, we also ask AMT workers to annotate the gender and estimate the age (range) of the person enclosed in the red bounding box. The design of the annotation interface had two main goals: i) making the task easy to understand, and ii) fitting the HIT in one screen to avoid scrolling.

Fig. 5. AMT interface designs (a) For Discrete Categories’ annotations & (b) For Continuous Dimensions’ annotations.
To make sure annotators understood the task, we showed them how to annotate the images step by step, explaining two examples in detail. Also, instructions and examples were attached at the bottom of each page as a quick reference for the annotator. Finally, a summary of the detailed instructions was shown at the top of each page (Table 2).

3.3 Agreement Level among Different Annotators
Since emotion perception is a subjective task, different people can perceive different emotions after seeing the same image. For example, in both Figs. 6a and 6b, the person in the red box seems to feel Affection, Happiness, and Pleasure, and the annotators have consistently selected these categories. However, not everyone has selected all these emotions. Also, we see that annotators do not agree on the emotions Excitement and Engagement. We consider, however, that these categories are reasonable in this situation. Another example is that of Roger Federer hitting a tennis ball in Fig. 6c. He is seen predicting the ball (Anticipation) and clearly looks Engaged in the activity. He also seems Confident about getting the ball.

Fig. 6. Annotations of five different annotators for 3 images in EMOTIC.
After these observations, we conducted several quantitative analyses of the annotation agreement. We focused first on analyzing the agreement level in the category annotations. Given a category assigned to a person in an image, we consider as an agreement measure the number of annotators agreeing on that particular category. Accordingly, we calculated, for each category and for each annotation in the validation set, the agreement amongst the annotators and sorted those values across categories. Fig. 7 shows the distribution of the percentage of annotators agreeing on an annotated category across the validation set.

Fig. 7. Representation of agreement between multiple annotators. Categories sorted in decreasing order according to the average number of annotators who agreed for that category.
We also computed the agreement between all the annotators for a given person using Fleiss’ Kappa ($\kappa$). Fleiss’ Kappa is a common measure to evaluate the agreement level among a fixed number of annotators when assigning categories to data. In our case, given a person to annotate, there is a subset of 26 categories. If we have $N$ annotators per image, then each of the 26 categories can be selected by $n$ annotators, where $0 \leq n \leq N$. Given an image, we compute the Fleiss’ Kappa for each emotion category first, and then the overall agreement level for this image is computed as the average of these Fleiss’ Kappa values across the different emotion categories. We found that more than 50 percent of the images have $\kappa > 0.30$. Fig. 8a shows the distribution of kappa values for all the annotated people in the validation set, sorted in decreasing order. Random annotations or total disagreement produce $\kappa \sim 0$; in our case, $\kappa \sim 0.3$ on average, suggesting a significant agreement level even though the task of emotion recognition is subjective.
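As a concrete illustration of this agreement measure, the following Python sketch computes one plausible per-person Fleiss’ Kappa under the reading described above: for a single annotated person, each of the 26 categories is treated as a binary item (selected or not selected) rated by the N annotators, and the kappa is computed over the resulting count table. The data layout and the use of statsmodels are assumptions, not the authors' implementation.

```python
# Hedged sketch: per-person annotator agreement via Fleiss' kappa.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

def person_agreement(annotations, num_categories=26):
    """annotations: list of N sets, each holding the category indices
    selected by one annotator for the same person."""
    n_annotators = len(annotations)
    # counts[i] = (#annotators that selected category i,
    #              #annotators that did not select it)
    counts = np.zeros((num_categories, 2), dtype=int)
    for cat in range(num_categories):
        selected = sum(cat in a for a in annotations)
        counts[cat] = (selected, n_annotators - selected)
    return fleiss_kappa(counts)  # rows = items, columns = ratings

# Toy usage: 5 annotators labelling the same person (category indices are made up).
example = [{7, 12}, {7, 12, 3}, {7}, {7, 12}, {12, 3}]
print(person_agreement(example))
```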

Fig. 8. (a) Kappa values (sorted) and (b) Standard deviation (sorted), for each annotated person in validation set.
For the continuous dimensions, the agreement is measured by the standard deviation (SD) of the different annotations. The average SD across the Validation set is 1.04, 1.57, and 1.84 for Valence, Arousal, and Dominance, respectively, indicating that Dominance has a higher dispersion (±1.84) than the other dimensions. This reflects that annotators disagree more often on Dominance than on the other dimensions, which is understandable since Dominance is more difficult to interpret than Valence or Arousal. As a summary, Fig. 8b shows the standard deviations for all the annotated people in the validation set for the 3 dimensions, sorted in decreasing order.
3.4 Dataset Statistics
The EMOTIC dataset contains 34,320 annotated people, of which 66 percent are males and 34 percent are females. Among them, 10 percent are children, 7 percent are teenagers, and 83 percent are adults.
Fig. 9a shows the number of annotated people for each of the 26 emotion categories, sorted in decreasing order. Notice that the data are unbalanced, which makes the dataset particularly challenging. An interesting observation is that there are more examples for categories associated with positive emotions, like Happiness or Pleasure, than for categories associated with negative emotions, like Pain or Embarrassment. The category with the most examples is Engagement. This is because in most of the images people are doing something or are involved in some activity, showing some degree of engagement. Figs. 9b, 9c, and 9d show the number of annotated people for each value of the 3 continuous dimensions. In this case we also observe unbalanced data, but the values are fairly distributed across the 3 dimensions, which is good for modelling.

Fig. 9. Dataset Statistics. (a) Number of people annotated for each emotion category; (b), (c) & (d) Number of people annotated for every value of the three continuous dimensions viz. Valence, Arousal & Dominance.
Fig. 10 shows the co-occurrence rates of any two categories. Every value $(r, c)$ in the matrix (where $r$ is the row category and $c$ the column category) is the co-occurrence probability (in %) of category $r$ when the annotation also contains category $c$, that is, $P(r|c)$. We observe, for instance, that when a person is labelled with the category Annoyance, there is a 46.05 percent probability that this person is also annotated with the category Anger. This means that when a person seems to be feeling Annoyance it is likely (by 46.05 percent) that this person might also be feeling Anger. We also used K-Means clustering on the category annotations to find groups of categories that frequently occur together. We found, for example, that these category groups are common in the EMOTIC annotations: {Anticipation, Engagement, Confidence}, {Affection, Happiness, Pleasure}, {Doubt/Confusion, Disapproval, Annoyance}, {Yearning, Annoyance, Disquietment}.
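For reference, the co-occurrence statistics described above can be reproduced with a short computation over the multi-label annotations. The sketch below assumes a binary person-by-category matrix, which is an assumed data layout rather than the released annotation format.

```python
# Hedged sketch: conditional co-occurrence matrix in the spirit of Fig. 10.
import numpy as np

def cooccurrence_matrix(Y):
    """Y: (num_people, num_categories) binary matrix; Y[p, i] = 1 if person
    p is annotated with category i. Returns M with M[r, c] = P(r | c) in %."""
    Y = Y.astype(float)
    joint = Y.T @ Y                        # joint[r, c] = #people with both r and c
    per_cat = np.diag(joint)               # per_cat[c]  = #people with category c
    M = 100.0 * joint / np.maximum(per_cat[None, :], 1)
    return M

# Toy usage: 4 people, 3 categories.
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 0]])
print(np.round(cooccurrence_matrix(Y), 1))
```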


Fig. 10. Co-occurrence between the 26 emotion categories. Each row represents the occurrence probability of every other category given the category of that particular row.
Fig. 11 shows the distribution of each continuous dimension across the different emotion categories. For every plot, categories are arranged in increasing order of their average value of the given dimension (calculated over all the instances containing that particular category). Thus, we observe from Fig. 11a that emotion categories like Suffering, Annoyance, and Pain correlate with low Valence values (feeling less positive) on average, whereas emotion categories like Pleasure, Happiness, and Affection correlate with higher Valence values (feeling more positive). It is also interesting to note that a category like Disconnection lies in the mid-range of Valence, which makes sense. When we observe Fig. 11b, it is easy to understand that emotion categories like Disconnection, Fatigue, and Sadness show low Arousal values, while we see high activeness for emotion categories like Anticipation, Confidence, and Excitement. Finally, Fig. 11c shows that people are not in control when they show emotion categories like Suffering, Pain, and Sadness, whereas when Dominance is high, emotion categories like Esteem, Excitement, and Confidence occur more often.

Fig. 11. Distribution of continuous dimension values across emotion categories. Average value of a dimension is calculated for every category and then plotted in increasing order for every distribution.
An important remark about the EMOTIC dataset is that there are people whose faces are not visible. More than 25 percent of the people in EMOTIC have their faces partially occluded or at very low resolution, so we cannot rely on facial expression analysis to recognize their emotional state.
3.5 Algorithmic Scene Context Analysis
This section illustrates how current scene-centric systems can be used to extract contextual information that can be potentially useful for emotion recognition. In particular, we illustrate this idea with a CNN trained on the Places dataset and with the SentiBank Adjective-Noun Pair (ANP) detectors, a Visual Sentiment Ontology for Image Sentiment Analysis. As a reference, Fig. 12 shows Places and ANP outputs for sample images of the EMOTIC dataset.

Fig. 12. Illustration of 2 current scene-centric methods for extracting contextual features from the scene: AlexNet Places CNN outputs (place categories and attributes) and SentiBank ANP outputs for three example images of the EMOTIC dataset.
We used the AlexNet Places CNN to predict the scene category and scene attributes for the images in EMOTIC. This information allows us to divide the analysis into place categories and place attributes. We observed that the distribution of emotions varies significantly among different place categories. For example, we found that people on a ‘ski_slope’ frequently experience Anticipation or Excitement, which are associated with the activities that usually happen in this place category. Comparing sport-related and working-environment-related images, we find that people in sport-related images usually show Excitement, Anticipation, and Confidence, while they show Sadness or Annoyance less frequently. Interestingly, Sadness and Annoyance appear with higher frequency in working environments. We also observe interesting patterns when correlating the continuous dimensions with place attributes and categories. For instance, people usually show high Dominance in sport-related places and with sport-related attributes. On the contrary, low Dominance is shown in ‘jail_cell’ or with attributes like ‘enclosed_area’ or ‘working’, where the freedom of movement is restricted. In Fig. 12, the predictions by the Places CNN describe the scene in general; for example, in the top image there is a girl sitting in a ‘kindergarten_classroom’ (place category), which is usually an enclosed area with ‘no_horizon’ (attributes).
We also find interesting patterns when we compute the correlation between detected ANPs and the emotions labelled in the image. For example, in images with people labelled with Affection, the most frequent ANP is ‘young_couple’, while in images with people labelled with Excitement we frequently find the ANPs ‘last_game’ and ‘playing_field’. Also, we observe a high correlation between images with Peace and ANPs like ‘old_couple’ and ‘domestic_scenes’, and between Happiness and the ANPs ‘outdoor_wedding’, ‘outdoor_activities’, ‘happy_family’, or ‘happy_couple’.
Overall, these observations suggest that some common-sense knowledge patterns relating emotions and context could potentially be extracted automatically from the data, as sketched below.
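As a minimal sketch of this kind of analysis, the following Python snippet tallies how often each emotion category appears given a predicted place category. The input structures (a list of place predictions and a parallel list of per-person emotion sets) are assumptions for illustration, not the authors' pipeline.

```python
# Hedged sketch: emotion-category frequencies conditioned on predicted place.
from collections import Counter, defaultdict

def emotions_by_place(place_preds, emotion_sets):
    """place_preds[i]: predicted place category for person i's image.
    emotion_sets[i]: set of emotion categories annotated for person i.
    Returns {place: {emotion: relative frequency}}."""
    counts = defaultdict(Counter)
    totals = Counter()
    for place, emotions in zip(place_preds, emotion_sets):
        totals[place] += 1
        counts[place].update(emotions)
    return {place: {emo: n / totals[place] for emo, n in c.items()}
            for place, c in counts.items()}

# Toy usage.
places = ["ski_slope", "ski_slope", "office"]
emotions = [{"Excitement", "Anticipation"}, {"Excitement"}, {"Sadness"}]
print(emotions_by_place(places, emotions))
```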
4 CNN Model for Emotion Recognition in Scene Context
We propose a baseline CNN model for the problem of emotion recognition in context. The pipeline of the model is shown in Fig. 13 and is divided into three modules: body feature extraction, image (context) feature extraction, and a fusion network. The first module takes the visible body of the person and generates body-related features. The second module takes the whole image as input and generates scene-related features. Finally, the third module combines these features to perform a fine-grained regression of the two types of emotion representation (Section 3.1).

Fig. 13. Proposed end-to-end model for emotion recognition in context. The model consists of two feature extraction modules and a fusion network for jointly estimating the discrete categories and the continuous dimensions.
The body feature extraction module takes the visible part of the body of the target person as input and generates body-related features. These features include important cues like face and head appearance, pose, and body appearance. In order to capture these aspects, this module is pre-trained on ImageNet, an object-centric dataset that includes the category person.
The image feature extraction module takes the whole image as input and generates scene-context features. These contextual features can be interpreted as an encoding of the scene category, its attributes, the objects present in the scene, or the dynamics between other people present in the scene. To capture these aspects, we pre-train this module on the scene-centric Places dataset.
The fusion module combines features of the two feature extraction modules and estimates the discrete emotion categories and the continuous emotion dimensions.
Both feature extraction modules are based on a one-dimensional-filter CNN. These networks provide competitive performance with a low number of parameters. Each network consists of 16 convolutional layers with 1-dimensional kernels alternating between horizontal and vertical orientations, effectively modeling 8 layers with 2-dimensional kernels. Then, to maintain the location of different parts of the image, we use a global average pooling layer to reduce the features of the last convolutional layer. To avoid internal covariate shift, we add a batch normalization layer after each convolutional layer, followed by rectified linear units to speed up the training.
The fusion network module consists of two fully connected (FC) layers. The first FC layer is used to reduce the dimensionality of the features to 256, and then a second fully connected layer is used to learn independent representations for each task. The output of this second FC layer branches off into 2 separate representations, one with 26 units representing the discrete emotion categories, and the other with 3 units representing the 3 continuous dimensions (Section 3.1).
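A minimal PyTorch sketch of this two-branch architecture is given below. The backbones and the exact branching of the fusion layers are assumptions (the paper uses 1-D-filter CNNs pre-trained on ImageNet and Places); only the 256-d fusion layer and the 26/3 output units follow the description above.

```python
# Hedged sketch: body branch + image (context) branch + fusion network.
import torch
import torch.nn as nn

class EmotionInContext(nn.Module):
    def __init__(self, body_backbone, image_backbone, feat_dim):
        super().__init__()
        self.body_backbone = body_backbone      # body-feature extractor (assumed module)
        self.image_backbone = image_backbone    # scene-context extractor (assumed module)
        self.fc1 = nn.Linear(2 * feat_dim, 256) # dimensionality reduction to 256
        # Second FC stage branching into the two task-specific representations.
        self.head_disc = nn.Linear(256, 26)     # 26 discrete emotion categories
        self.head_cont = nn.Linear(256, 3)      # Valence, Arousal, Dominance

    def forward(self, body_crop, whole_image):
        fb = self.body_backbone(body_crop)      # (batch, feat_dim)
        fi = self.image_backbone(whole_image)   # (batch, feat_dim)
        x = torch.relu(self.fc1(torch.cat([fb, fi], dim=1)))
        return self.head_disc(x), self.head_cont(x)
```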
4.1 Loss Function and Training Setup
We define the loss function as a weighted combination of two separate losses. A prediction $\hat{y}$ is composed of the predictions for the 26 discrete categories and the 3 continuous dimensions, $\hat{y} = (\hat{y}^{disc}, \hat{y}^{cont})$. In particular, $\hat{y}^{disc} = (\hat{y}^{disc}_1, \ldots, \hat{y}^{disc}_{26})$ and $\hat{y}^{cont} = (\hat{y}^{cont}_1, \hat{y}^{cont}_2, \hat{y}^{cont}_3)$. Given a prediction $\hat{y}$, the loss on this prediction is defined by $L = \lambda_{disc} L_{disc} + \lambda_{cont} L_{cont}$, where $L_{disc}$ and $L_{cont}$ represent the losses corresponding to learning the discrete categories and the continuous dimensions, respectively. The parameters $\lambda_{(disc,cont)}$ weight the contribution of each loss and are set empirically using the validation set.
Criterion for Discrete Categories ($L_{disc}$). The discrete category estimation is a multilabel problem with an inherent class imbalance issue, as the number of training examples is not the same for each class (see Fig. 9a).

In our experiments, we use a weighted Euclidean loss for the discrete categories. Empirically, we found the Euclidean loss to be more effective than the Kullback-Leibler divergence or a multi-class multi-classification hinge loss. More precisely, given a prediction $\hat{y}^{disc}$, the weighted Euclidean loss is defined as follows:
$$L^{2}_{disc}(\hat{y}^{disc}) = \sum_{i=1}^{26} w_i \left(\hat{y}^{disc}_i - y^{disc}_i\right)^2, \qquad (1)$$
where $\hat{y}^{disc}_i$ is the prediction for the $i$-th category and $y^{disc}_i$ is the ground-truth label. The parameter $w_i$ is the weight assigned to each category. Weight values are defined as $w_i = \frac{1}{\ln(c + p_i)}$, where $p_i$ is the probability of the $i$-th category and $c$ is a parameter that controls the range of valid values for $w_i$. With this weighting scheme the values of $w_i$ remain bounded as the number of instances of a category approaches 0. This is particularly relevant in our case, as we set the weights based on the occurrence of each category within each batch. Experimentally, we obtained better results with this approach than by setting global weights based on the entire dataset.
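A possible PyTorch implementation of this weighted loss, with per-batch weights $w_i = 1/\ln(c + p_i)$, is sketched below; the value of $c$ and the multi-hot target layout are assumptions, and this is not the authors' released code.

```python
# Hedged sketch: weighted Euclidean loss of Eq. (1) with per-batch weights.
import torch

def weighted_euclidean_loss(pred_disc, target_disc, c=1.2):
    """pred_disc, target_disc: (batch, 26) tensors; targets are multi-hot."""
    target_disc = target_disc.float()
    p = target_disc.mean(dim=0)                    # p_i: category frequency in this batch
    w = 1.0 / torch.log(c + p)                     # w_i stays bounded as p_i -> 0
    sq_err = (pred_disc - target_disc) ** 2
    return (w * sq_err).sum(dim=1).mean()          # average over the batch

# Toy usage.
pred = torch.rand(4, 26)
target = (torch.rand(4, 26) > 0.8).float()
print(weighted_euclidean_loss(pred, target))
```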
Criterion for Continuous Dimensions ($L_{cont}$). We model the estimation of the continuous dimensions as a regression problem. Since the data are annotated by multiple annotators based on subjective evaluation, we compare the performance of two different robust losses: (1) a margin Euclidean loss $L^{2}_{cont}$, and (2) the Smooth $L1$ loss $SL^{1}_{cont}$. The former defines a margin of error within which the error is not considered when computing the loss. The margin Euclidean loss for the continuous dimensions is defined as:
$$L^{2}_{cont}(\hat{y}^{cont}) = \sum_{k=1}^{3} v_k \left(\hat{y}^{cont}_k - y^{cont}_k\right)^2, \qquad (2)$$
where $\hat{y}^{cont}_k$ and $y^{cont}_k$ are the prediction and the ground truth for the $k$-th dimension, respectively, and $v_k \in \{0, 1\}$ is a binary weight that implements the error margin: $v_k = 0$ if $|\hat{y}^{cont}_k - y^{cont}_k| < \theta$, and $v_k = 1$ otherwise. If a prediction is within the error margin, i.e., its error is smaller than $\theta$, then it does not contribute to updating the weights of the network.

The Smooth $L1$ loss refers to the absolute error, using the squared error when the error is less than a threshold (set to 1 in our experiments). This loss has been widely used for object detection and, in our experiments, has shown to be less sensitive to outliers. Precisely, the Smooth $L1$ loss is defined as follows:
$$SL^{1}_{cont}(\hat{y}^{cont}) = \sum_{k=1}^{3} v_k \begin{cases} 0.5\, x_k^2, & \text{if } |x_k| < 1 \\ |x_k| - 0.5, & \text{otherwise,} \end{cases} \qquad (3)$$
where $x_k = (\hat{y}^{cont}_k - y^{cont}_k)$, and $v_k$ is a weight assigned to each of the continuous dimensions, set to 1 in our experiments.
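Both continuous-dimension losses can be sketched in a few lines of PyTorch, as shown below. The margin $\theta$ is an assumed value, and the per-dimension weights $v_k$ of the Smooth $L1$ loss are implicitly set to 1, as in the paper.

```python
# Hedged sketch: the two continuous losses of Eqs. (2) and (3).
import torch

def margin_euclidean_loss(pred_cont, target_cont, theta=0.5):
    """pred_cont, target_cont: (batch, 3) tensors for Valence, Arousal, Dominance."""
    err = pred_cont - target_cont
    v = (err.abs() >= theta).float()          # v_k = 0 inside the error margin
    return (v * err ** 2).sum(dim=1).mean()

def smooth_l1_cont_loss(pred_cont, target_cont):
    x = (pred_cont - target_cont).abs()
    loss = torch.where(x < 1, 0.5 * x ** 2, x - 0.5)
    return loss.sum(dim=1).mean()

# Toy usage.
pred = torch.tensor([[5.0, 6.0, 4.0]])
target = torch.tensor([[5.2, 7.5, 4.0]])
print(margin_euclidean_loss(pred, target), smooth_l1_cont_loss(pred, target))
```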
We train our recognition system end-to-end, learning the parameters jointly using stochastic gradient descent with momentum. The first two modules are initialized with models pre-trained on ImageNet (body) and Places (image context), while the fusion network is trained from scratch. The batch size is set to 52, twice the number of discrete emotion categories. After testing multiple batch sizes (including 26, 52, 78, and 108), we found empirically that a batch size of 52 gives the best performance on the validation set. A sketch of one training step is given below.
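The following sketch shows how one such training step could look, reusing the model sketch from Section 4 and the loss sketches above; the learning rate, momentum, and loss weights are assumed hyperparameters, not the values used in the paper.

```python
# Hedged sketch: one SGD-with-momentum step on the combined weighted loss.
# Assumes EmotionInContext, weighted_euclidean_loss, and smooth_l1_cont_loss
# from the earlier sketches are in scope.
import torch

def make_optimizer(model, lr=0.01, momentum=0.9):
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)

def train_step(model, optimizer, body, image, y_disc, y_cont,
               lambda_disc=1.0, lambda_cont=1.0):
    optimizer.zero_grad()
    pred_disc, pred_cont = model(body, image)
    loss = (lambda_disc * weighted_euclidean_loss(pred_disc, y_disc)
            + lambda_cont * smooth_l1_cont_loss(pred_cont, y_cont))
    loss.backward()
    optimizer.step()
    return loss.item()
```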
5 Experiments
We trained four different instances of our CNN model, which are the combinations of two different input types and the two different continuous loss functions described in Section 4.1. The input types are body (i.e., the upper branch in Fig. 13), denoted by B, and body plus image (i.e., both branches shown in Fig. 13), denoted by B+I. The continuous loss types are denoted in the experiments by $L2$ for the Euclidean loss (Equation (2)) and $SL1$ for the Smooth $L1$ loss (Equation (3)).
Results for the discrete categories, in the form of Average Precision per category (the higher, the better), are summarized in Table 3. Notice that the B+I model outperforms the B model in all categories except one. The combination of body and image features (the B+I($SL1$) model) is better than the B model.

Results for the continuous dimensions, in the form of Average Absolute Error (AAE) per dimension (the lower, the better), are summarized in Table 4. In this case, all the models provide similar results and the differences are not significant.

Fig. 14a shows the Jaccard Coefficient (JC) for all the samples in the test set. The JC coefficient is computed as follows: for each category, we use as the detection threshold the value where Precision = Recall. Then, the JC coefficient is computed as the number of detected categories that are also present in the ground truth (the number of categories in the intersection of detections and ground truth) divided by the total number of categories that are either in the ground truth or detected (the union of detected categories and ground-truth categories). The higher this JC is, the better, with a maximum value of 1, reached when the detected categories and the ground-truth categories are exactly the same. In the graphic, examples are sorted in decreasing order of the JC coefficient. Notice that these results also support that the B+I model outperforms the B model.
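For reference, a minimal NumPy version of this per-sample Jaccard Coefficient is sketched below, assuming the per-category thresholds at the Precision = Recall operating point have already been computed and are passed in.

```python
# Hedged sketch: per-sample Jaccard Coefficient over detected categories.
import numpy as np

def jaccard_coefficient(scores, gt, thresholds):
    """scores: (26,) predicted category scores for one person.
    gt: (26,) binary ground-truth vector. thresholds: (26,) per-category
    detection thresholds. Returns |detections ∩ gt| / |detections ∪ gt|."""
    detected = scores >= thresholds
    gt = gt.astype(bool)
    union = np.logical_or(detected, gt).sum()
    if union == 0:
        return 1.0  # nothing detected and nothing annotated
    return np.logical_and(detected, gt).sum() / union

# Toy usage with 5 categories for brevity.
scores = np.array([0.9, 0.2, 0.7, 0.1, 0.4])
gt = np.array([1, 0, 1, 1, 0])
thr = np.full(5, 0.5)
print(jaccard_coefficient(scores, gt, thr))  # 2 / 3
```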

Fig. 14. Results per sample (test set, sorted): (a) Jaccard Coefficient (JC) of the recognized discrete categories; (b) Average Absolute Error (AAE) in the estimation of the three continuous dimensions.
For the continuous dimensions, Fig. 14b shows the Average Absolute Error (AAE) obtained for each sample in the test set. Samples are sorted in increasing order (best performances on the left). Consistent with the results shown in Table 4, we do not observe a significant difference among the different models.
Finally, Fig. 15 shows qualitative predictions for the best B and B+I models. These examples were randomly selected among samples with high JC in B+I (a-b) and samples with low JC in B+I (g-h). Incorrect category recognition is indicated in red. As shown, the B+I model generally outperforms B, although there are some exceptions, like Fig. 15c.

Fig. 15. Ground truth and results on randomly selected images with different JC scores.
5.1 Context Features Comparison
The goal of this section is to compare different context features for the problem of emotion recognition in context. A key aspect of incorporating the context in an emotion recognition model is being able to obtain information from the context that is actually relevant for emotion recognition. Since context information extraction is a scene-centric task, the information extracted from the context should be based on a scene-centric feature extraction system. That is why our baseline model uses a Places CNN for the context feature extraction module. However, recent works in sentiment analysis (detecting the emotion of a person when he/she observes an image) also provide systems for scene feature extraction that can be used to encode the relevant contextual information for emotion recognition.
To compute body features, denoted by $B_f$, we fine-tune an AlexNet ImageNet CNN on the EMOTIC database and use the average pooling of the last convolutional layer as features. For the context (image), we compare two different feature types, denoted by $I_f$ and $I_S$. $I_f$ is obtained by fine-tuning an AlexNet Places CNN on the EMOTIC database and taking the average pooling of the last convolutional layer as features (similar to $B_f$), while $I_S$ is a feature vector composed of the sentiment scores produced by the SentiBank ANP detectors.
To fairly compare the contribution of the different context features, we train logistic regressors on the following features and combinations of features: (1) $B_f$, (2) $B_f$+$I_f$, and (3) $B_f$+$I_S$. For the discrete categories we obtain mean APs of 23.00, 27.70, and 29.45, respectively. For the continuous dimensions, we obtain AAEs of 0.0704, 0.0643, and 0.0713, respectively. We observe that, for the discrete categories, both $I_f$ and $I_S$ contribute relevant information to emotion recognition in context. Interestingly, $I_S$ performs better than $I_f$, even though these features have not been trained on EMOTIC. However, these features are specifically designed for sentiment analysis, which is a problem closely related to extracting relevant contextual information for emotion recognition, and they are trained with a large dataset of images.
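A simple way to reproduce this kind of comparison is to train one logistic regressor per category on each feature combination and report the mean Average Precision, as in the hedged scikit-learn sketch below; the feature arrays and their names are placeholders, not the authors' released features.

```python
# Hedged sketch: per-category logistic regressors over feature combinations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

def mean_ap(train_X, train_Y, test_X, test_Y):
    """train_Y/test_Y: (n, 26) multi-hot labels; one regressor per category."""
    aps = []
    for i in range(train_Y.shape[1]):
        clf = LogisticRegression(max_iter=1000).fit(train_X, train_Y[:, i])
        scores = clf.predict_proba(test_X)[:, 1]
        aps.append(average_precision_score(test_Y[:, i], scores))
    return float(np.mean(aps))

# Usage pattern (with assumed feature arrays B_f, I_f, I_S and labels Y):
# print(mean_ap(B_f_tr, Y_tr, B_f_te, Y_te))                         # (1) B_f
# print(mean_ap(np.hstack([B_f_tr, I_f_tr]), Y_tr,
#               np.hstack([B_f_te, I_f_te]), Y_te))                  # (2) B_f + I_f
# print(mean_ap(np.hstack([B_f_tr, I_S_tr]), Y_tr,
#               np.hstack([B_f_te, I_S_te]), Y_te))                  # (3) B_f + I_S
```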
6 Conclusions
In this paper we pointed out the importance of considering the person's scene context in the problem of automatic emotion recognition in the wild. We presented the EMOTIC database, a dataset of 23,571 natural, unconstrained images with 34,320 people labeled according to their apparent emotions. The images in the dataset are annotated using two different emotion representations: 26 discrete categories, and the 3 continuous dimensions Valence, Arousal, and Dominance. We described the annotation process in depth and analyzed the annotation consistency across different annotators. We also provided different statistics and algorithmic analyses of the data, showing the characteristics of the EMOTIC database. In addition, we proposed a baseline CNN model for emotion recognition in scene context that combines the information of the person (body bounding box) with the scene context information (whole image). We also compared two different feature types for encoding the contextual information. Our results show the relevance of using contextual information to recognize emotions and, in conjunction with the EMOTIC dataset, motivate further research in this direction. All the data and trained models are publicly available to the research community on the project website.
Acknowledgments
This work has been partially supported by the Ministerio de Economia, Industria y Competitividad (Spain) , under the Grants Ref. TIN2015-66951-C2-2-R and RTI2018-095232-B-C22, and by Innovation and Universities (FEDER funds). The authors also thank NVIDIA for their generous hardware donations. Project Page: http://sunai.uoc.edu/emotic/.
---
Ronak Kosti received the master's degree in machine intelligence from the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT) in 2014. His master's research was on depth estimation from a single image using artificial neural networks. He is working toward the PhD degree at the Universitat Oberta de Catalunya, Spain, advised by Prof. Agata Lapedriza. He works with the Scene Understanding and Artificial Intelligence (SUNAI) group on computer vision, specifically in the area of affective computing.
Jose M. Alvarez received the PhD degree from the Autonomous University of Barcelona in 2010, under the supervision of Prof. Antonio Lopez and Prof. Theo Gevers. He is a senior research scientist with NVIDIA. Previously, he was a senior deep learning researcher with Toyota Research Institute, and prior to that he was a researcher with Data61, CSIRO, Australia (formerly NICTA). Before CSIRO, he worked as a postdoctoral researcher with the Courant Institute of Mathematical Sciences, New York University, under the supervision of Prof. Yann LeCun.
Adria Recasens received the Telecommunications Engineer's degree and the Mathematics Licentiate degree from the Universitat Politècnica de Catalunya. He is working toward the PhD degree in computer vision in the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, advised by Professor Antonio Torralba. His research interests include various topics in computer vision and machine learning. He is focusing most of his research on automatic gaze-following.
Agata Lapedriza received the MS degree in mathematics from the Universitat de Barcelona and the PhD degree in computer science from the Computer Vision Center, Universitat Autonoma de Barcelona. She is an associate professor with the Universitat Oberta de Catalunya. She worked as a visiting researcher with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), from 2012 until 2015, and since September 2017 she has been a visiting researcher with the Affective Computing group at the MIT Media Lab. Her research interests include image understanding, scene recognition and characterization, and affective computing.






