Uncertainty-aware Label Distribution Learning for Facial Expression Recognition (Part 1)
Facial expression recognition (FER) plays an important role in understanding people's feelings and interactions between humans. Recently, automatic emotion recognition has gained a lot of attention from the research community due to its wide range of applications in education, healthcare, human analysis, surveillance, and human-robot interaction. Recent FER methods are mostly based on deep learning and can achieve impressive results. The success of deep models can be attributed to large-scale FER datasets [1][2]. However, the ambiguity of facial expressions is still a key challenge in FER. Specifically, people with different backgrounds might perceive and interpret facial expressions differently, which can lead to noisy and inconsistent annotations. In addition, real-life facial expressions usually manifest a mixture of feelings rather than only a single emotion.
Motivation and Proposed Solution
As an example, Figure 1 shows that people may have different opinions about the expressed emotion, particularly in ambiguous images. Consequently, a distribution over emotion categories is better than a single label because it takes all sentiment classes into account and can cover various interpretations, thus mitigating the effect of ambiguity. However, existing large-scale FER datasets only provide a single label for each sample instead of a label distribution, which means we do not have a comprehensive description for each facial expression. This can lead to insufficient supervision during training and pose a big challenge for many FER systems.
To overcome the ambiguity problem in FER, we propose a new uncertainty-aware label distribution learning method that constructs emotion distributions for training samples. Specifically, we leverage the neighborhood information of samples that have similar expressions to construct the emotion distributions from single labels and utilize them as the training supervision signal.
Methodology
Preliminaries
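We denote $x \in \mathcal{X}$ as the instance variable in the input space $\mathcal{X}$ and $x^i$ as the particular $i$-th instance. The label set is denoted as $\mathcal{Y} = \{y_1, y_2, \ldots, y_m\}$, where $m$ is the number of classes and $y_j$ is the label value of the $j$-th class. The logical label vector of $x^i$ is indicated by $l^i = (l^i_{y_1}, l^i_{y_2}, \ldots, l^i_{y_m})$ with $l^i_{y_j} \in \{0, 1\}$ and $\|l^i\|_1 = 1$. We define the label distribution of $x^i$ as $d^i = (d^i_{y_1}, d^i_{y_2}, \ldots, d^i_{y_m})$ with $\|d^i\|_1 = 1$ and $d^i_{y_j} \in [0, 1]$ representing the relative degree to which $x^i$ belongs to the class $y_j$.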
Most existing FER datasets assign only a single class or, equivalently, a logical label $l^i$ for each training sample $x^i$. In particular, the given training dataset is a collection of $n$ samples with logical labels, $D_l = \{(x^i, l^i) \mid 1 \le i \le n\}$. However, we find that a label distribution $d^i$ is a more comprehensive and suitable annotation for the image than a single label.
Inspired by the recent success of label distribution learning (LDL) in addressing label ambiguity [3], we aim to construct an emotion distribution $d^i$ for each training sample $x^i$, thus transforming the training set $D_l$ into $D_d = \{(x^i, d^i) \mid 1 \le i \le n\}$, which can provide richer supervision information and help mitigate the ambiguity issue. We use cross-entropy to measure the discrepancy between the model's prediction and the constructed target distribution. Hence, the model can be trained by minimizing the following classification loss:
$$\mathcal{L}_{cls} = \sum_{i=1}^{n} \mathrm{CE}\big(d^i, f(x^i; \theta)\big) = -\sum_{i=1}^{n} \sum_{j=1}^{m} d^i_{y_j} \log f_j(x^i; \theta)$$
where $f(x; \theta)$ is a neural network with parameters $\theta$ followed by a softmax layer to map the input image into an emotion distribution, and $f_j$ denotes its $j$-th output.
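The classification loss can be illustrated with a minimal NumPy sketch. It assumes the network outputs raw logits that are passed through a softmax, and that the loss is summed over samples; the function names `softmax` and `ldl_cross_entropy` and the toy values are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def ldl_cross_entropy(target_dist, logits, eps=1e-12):
    """Cross-entropy between target distributions and softmax predictions.

    target_dist: (n, m) array, each row a label distribution summing to 1.
    logits:      (n, m) array of raw network outputs before the softmax layer.
    Returns the loss summed over the n samples (ASSUMPTION: sum, not mean).
    """
    pred = softmax(logits)
    return float(-(target_dist * np.log(pred + eps)).sum())

# Toy example with 2 samples and 3 emotion classes (values are made up).
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
one_hot = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])   # logical labels
soft = np.array([[0.7, 0.2, 0.1],
                 [0.3, 0.4, 0.3]])      # constructed label distributions

loss_one_hot = ldl_cross_entropy(one_hot, logits)
loss_soft = ldl_cross_entropy(soft, logits)
```

With a one-hot target the expression reduces to the usual negative log-likelihood of the labeled class, so the same function covers both logical labels and label distributions.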
Overview
An overview of our method is presented in Figure 2. To construct the label distribution for each training instance $x^i$, we leverage its neighborhood information in the valence-arousal space. Particularly, we identify $K$ neighbor instances for each training sample $x^i$ and utilize our adaptive similarity mechanism to determine their contribution degrees to the target distribution $d^i$. Then, we combine the neighbors' predictions and their corresponding contribution degrees with the provided label $l^i$ and $x^i$'s uncertainty factor to obtain the label distribution $d^i$. The constructed distribution $d^i$ will be used as supervision information to train the model via label distribution learning.
Adaptive Similarity
We assume that the label distribution of the main instance $x^i$ can be computed as a linear combination of its neighbors' distributions. To determine the contribution of each neighbor, we propose an adaptive similarity mechanism that not only leverages the relationships between $x^i$ and its neighbors in the auxiliary space but also utilizes their feature vectors extracted from the backbone. We choose the valence-arousal space [4] as the auxiliary space to construct the target label distribution. We use the $K$-Nearest Neighbor algorithm to identify the $K$ closest points for each training sample $x^i$, denoted as $N(i)$. We calculate the adaptive contribution degrees of neighbor instances as the product of the local similarity $s^i_k$ and the calibration score $r^i_k$ as follows:
$$c^i_k = s^i_k \, r^i_k, \quad x^k \in N(i)$$
where the local similarity $s^i_k$ is defined based on the distance between the instance $x^i$ and its neighbor $x^k$ in the valence-arousal space.
We utilize a multilayer perceptron (MLP) with parameters $\phi$ to calculate the adaptive calibration score $r^i_k$ from the extracted features of the two instances, $v^i$ and $v^k$, obtained from the backbone.
The proposed adaptive similarity can correct similarity errors in the valence-arousal space. Such errors arise because valence-arousal values are not always available in practice, in which case we leverage an existing method to generate pseudo valence-arousal values.
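The adaptive similarity mechanism can be sketched in NumPy as follows. The text above does not fix the exact similarity kernel or the MLP architecture, so this sketch makes two assumptions: the local similarity is a Gaussian kernel on the squared valence-arousal distance with a hypothetical bandwidth `delta`, and the calibration score comes from a one-hidden-layer MLP with a sigmoid output applied to the concatenated features. All function and parameter names are illustrative.

```python
import numpy as np

def knn_indices(va, k):
    """Indices of the k nearest neighbours of every point in valence-arousal space."""
    dist2 = ((va[:, None, :] - va[None, :, :]) ** 2).sum(axis=-1)  # (n, n) squared distances
    np.fill_diagonal(dist2, np.inf)                                  # a point is not its own neighbour
    return np.argsort(dist2, axis=1)[:, :k], dist2

def local_similarity(dist2, delta=1.0):
    """ASSUMPTION: Gaussian kernel on the squared valence-arousal distance."""
    return np.exp(-dist2 / delta ** 2)

def calibration_score(feat_i, feat_k, w1, b1, w2, b2):
    """ASSUMPTION: one-hidden-layer ReLU MLP on concatenated features, sigmoid output."""
    hidden = np.maximum(0.0, np.concatenate([feat_i, feat_k]) @ w1 + b1)
    return 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))

def contribution_degrees(va, feats, k, mlp_params, delta=1.0):
    """Contribution degree = local similarity * calibration score, for each neighbour."""
    idx, dist2 = knn_indices(va, k)
    contrib = np.zeros((va.shape[0], k))
    for i in range(va.shape[0]):
        for slot, j in enumerate(idx[i]):
            s = local_similarity(dist2[i, j], delta)
            r = calibration_score(feats[i], feats[j], *mlp_params)
            contrib[i, slot] = s * r
    return idx, contrib

# Toy data: 4 samples with 2-D valence-arousal values and 8-D backbone features.
rng = np.random.default_rng(0)
va = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
feats = rng.normal(size=(4, 8))
mlp_params = (rng.normal(size=(16, 4)), np.zeros(4), rng.normal(size=4), 0.0)
neighbors, c = contribution_degrees(va, feats, k=2, mlp_params=mlp_params)
```

Because the calibration score depends on backbone features rather than on valence-arousal values, it can down-weight a neighbor that is close in the auxiliary space only because of a noisy pseudo valence-arousal estimate.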
Uncertainty-aware Label Distribution Construction
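After obtaining the contribution degree $c^i_k$ of each neighbor $x^k \in N(i)$, we can now generate the target label distribution $d^i$ for the main instance $x^i$. The target label distribution is calculated using the logical label $l^i$ and the aggregated distribution $\tilde{d}^i$ defined as follows:
$$\tilde{d}^i = \frac{\sum_{x^k \in N(i)} c^i_k \, f(x^k; \theta)}{\sum_{x^k \in N(i)} c^i_k}, \qquad d^i = (1 - \lambda^i)\, l^i + \lambda^i\, \tilde{d}^i$$
where $\lambda^i \in [0, 1]$ is the uncertainty factor for the logical label. It controls the balance between the provided label and the aggregated distribution from the local neighborhood.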
Intuitively, a high value of $\lambda^i$ indicates that the logical label is highly uncertain, which can be caused by an ambiguous expression or a low-quality input image, so we should put more weight on the neighborhood information $\tilde{d}^i$. Conversely, when $\lambda^i$ is small, the label distribution $d^i$ should be close to $l^i$ since we are certain about the provided manual label. In our implementation, $\lambda^i$ is a trainable parameter for each instance and will be optimized jointly with the model's parameters using gradient descent.
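The construction step can be sketched with a small worked example. The sketch assumes that the aggregated distribution is the contribution-weighted average of the neighbors' predicted distributions (normalized so that it sums to 1) and that the uncertainty factor is a plain number in [0, 1]; how the trainable factor is kept in that range is not specified here. The function name and toy values are hypothetical.

```python
import numpy as np

def construct_label_distribution(logical, neighbor_preds, contrib, lam):
    """Mix a logical label with the neighbourhood's aggregated prediction.

    logical:        (m,) one-hot logical label of the main instance.
    neighbor_preds: (K, m) predicted distributions of the K neighbours.
    contrib:        (K,) non-negative contribution degrees of the neighbours.
    lam:            uncertainty factor in [0, 1]; larger means trust neighbours more.
    """
    # ASSUMPTION: normalise by the total contribution so the result is a distribution.
    aggregated = (contrib[:, None] * neighbor_preds).sum(axis=0) / contrib.sum()
    return (1.0 - lam) * logical + lam * aggregated

# Toy example with 3 classes and 2 neighbours (values are made up).
logical = np.array([1.0, 0.0, 0.0])
neighbor_preds = np.array([[0.6, 0.3, 0.1],
                           [0.2, 0.7, 0.1]])
contrib = np.array([0.75, 0.25])

target = construct_label_distribution(logical, neighbor_preds, contrib, lam=0.4)
# Aggregated distribution is [0.5, 0.4, 0.1]; target is approximately [0.8, 0.16, 0.04].
```

Setting `lam=0` returns the logical label unchanged, while `lam=1` returns the aggregated neighborhood distribution, which matches the two extremes described above.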
Loss Function
To enhance the model's ability to discriminate between ambiguous emotions, we also propose a discriminative loss to reduce the intra-class variations of the learned facial representations. We incorporate the label uncertainty factor to adaptively penalize the distance between the sample and its corresponding class center. For instances with high uncertainty, the network can effectively tolerate their features in the optimization process. Furthermore, we also add pairwise distances between class centers to encourage large margins between different classes, thus enhancing the discriminative power. Our discriminative loss is calculated as follows:
where $y^i$ is the class index of the $i$-th sample while $\mu_j$, $\mu_k$, and $\mu_{y^i}$ are the center vectors of the $j$-th, $k$-th, and $y^i$-th classes, respectively. Intuitively, the first term of $\mathcal{L}_D$ encourages the feature vectors of one class to be close to their corresponding center while the second term improves the inter-class discrimination by pushing the cluster centers far away from each other. Finally, the total loss for training is computed as:
$$\mathcal{L} = \mathcal{L}_{cls} + \gamma \mathcal{L}_D$$
where $\gamma$ is the balancing coefficient between the two losses.
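One possible form of the discriminative loss and the total loss is sketched below. The exact expression is not recoverable from the description alone, so the sketch rests on labeled assumptions: the pull term weights each squared feature-to-center distance by one minus the uncertainty factor, and the push term sums an exponential of the negative squared distance between every ordered pair of distinct centers, scaled by the square root of the feature dimension. Names and values are illustrative.

```python
import numpy as np

def discriminative_loss(feats, labels, centers, lam):
    """Sketch of an uncertainty-weighted center loss with a center-separation term.

    feats:   (n, V) feature vectors from the backbone.
    labels:  (n,) integer class index of each sample.
    centers: (m, V) class center vectors.
    lam:     (n,) uncertainty factors in [0, 1].
    """
    # Pull term. ASSUMPTION: weight (1 - lam), so uncertain samples are penalised less.
    sq_dist_to_center = ((feats - centers[labels]) ** 2).sum(axis=1)
    pull = 0.5 * ((1.0 - lam) * sq_dist_to_center).sum()

    # Push term. ASSUMPTION: exp(-||mu_j - mu_k||^2 / sqrt(V)) over all pairs j != k,
    # which shrinks as the centers move apart.
    diff = centers[:, None, :] - centers[None, :, :]
    center_dist2 = (diff ** 2).sum(axis=-1)
    off_diagonal = ~np.eye(len(centers), dtype=bool)
    push = np.exp(-center_dist2[off_diagonal] / np.sqrt(feats.shape[1])).sum()

    return float(pull + push)

def total_loss(cls_loss, disc_loss, gamma):
    """Classification loss plus the discriminative loss scaled by the balancing coefficient."""
    return cls_loss + gamma * disc_loss

# Toy example: 2 classes in a 2-D feature space (values are made up).
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
labels = np.array([0, 1, 1])
feats = np.array([[0.5, 0.0], [2.0, 1.0], [1.0, 0.0]])
lam = np.array([0.0, 0.5, 1.0])

disc = discriminative_loss(feats, labels, centers, lam)
loss = total_loss(cls_loss=1.2, disc_loss=disc, gamma=0.1)
```

In this sketch the third sample has an uncertainty factor of 1, so its distance to the class center contributes nothing to the pull term, which reflects the idea that highly uncertain samples are tolerated during optimization.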
References
[1] Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 2019.
[2] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, 2017.
[3] B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 2017.