/ / 122 min read

Uncertainty-aware Label Distribution Learning for Facial Expression Recognition (Part 1)

Using Label Distribution Learning technique to address the ambiguity problem in facial expression when people express multiple emotions on their faces.

Facial expression recognition (FER) plays an important role in understanding people's feelings and interactions between humans. Recently, automatic emotion recognition has gained a lot of attention from the research community due to its tremendous applications in education, healthcare, human analysis, surveillance or human-robot interaction. Recent FER methods are mostly based on deep learning and can achieve impressive results. The success of deep models can be attributed to large-scale FER datasets [1][2]. However, ambiguities of facial expression is still a key challenge in FER. Specifically, people with different backgrounds might perceive and interpret facial expressions differently, which can lead to noisy and inconsistent annotations. In addition, real-life facial expressions usually manifest a mixture of feelings rather than only a single emotion.

Motivation and Proposed Solution

Figure 1. Examples of real-world ambiguous facial expressions that can lead to noisy and inconsistent annotation.

As an example, Figure 1 shows that people may have different opinions about the expressed emotion, particularly in ambiguous images. Consequently, a distribution over emotion categories is better than a single label because it takes all sentiment classes into account and can cover various interpretations, thus mitigating the effect of ambiguity. However, existing large-scale FER datasets only provide a single label for each sample instead of a label distribution, which means we do not have a comprehensive description for each facial expression. This can lead to insufficient supervision during training and pose a big challenge for many FER systems.

To overcome the ambiguity problem in FER, we proposes a new uncertainty-aware label distribution learning method that constructs emotion distributions for training samples. Specifically, we leverage the neighborhood information of samples that have similar expressions to construct the emotion distributions from single labels and utilize them as training supervision signal.



We denote xX\mathbf{x} \in \mathcal{X} as the instance variable in the input space X\mathcal{X} and xi\mathbf{x}^{i} as the particular ii-th instance. The label set is denoted as Y={y1,y2,...,ym}\mathcal{Y} = \{y_1, y_2,..., y_m\} where mm is the number of classes and yjy_j is the label value of the jj-th class. The logical label vector of xi\mathbf{x}^{i} is indicated by li\mathbf{l}^{i} = (ly1i,ly2i,...,lymi)(l^{i}_{y_1}, l^{i}_{y_2}, ..., l^{i}_{y_m}) with lyji{0,1}\mathbf{l}^{i}_{y_j} \in \{0, 1\} and l1=1\| \mathbf{l} \| _1 = 1. We define the label distribution of xi\mathbf{x}^{i} as di\mathbf{d}^{i} = (dy1i,dy2i,...,dymi)(d^{i}_{y_1}, d^{i}_{y_2}, ..., d^{i}_{y_m}) with d1=1\| \mathbf{d} \| _1 = 1 and dyji[0,1]d^{i}_{y_j} \in [0, 1] representing the relative degree that xi\mathbf{x}^{i} belongs to the class yjy_j.

Most existing FER datasets assign only a single class or equivalently, a logical label li\mathbf{l}^{i} for each training sample xi\mathbf{x}^{i}. In particular, the given training dataset is a collection of nn samples with logical labels DlD_l = {(xi,li)1in}\{ (\mathbf{x}^{i}, \mathbf{l}^{i}) \vert 1 \le i \le n\}. However, we find that a label distribution di\mathbf{d}^i is a more comprehensive and suitable annotation for the image than a single label.

Inspired by the recent success of label distribution learning (LDL) in addressing label ambiguity [3], we aim to construct an emotion distribution di\mathbf{d}^i for each training sample xi\mathbf{x}^i, thus transform the training set DlD_l into DdD_d = {(xi,di)1in}\{ (\mathbf{x}^{i}, \mathbf{d}^{i}) \vert 1 \le i \le n\}, which can provide richer supervision information and help mitigate the ambiguity issue. We use cross-entropy to measure the discrepancy between the model's prediction and the constructed target distribution. Hence, the model can be trained by minimizing the following classification loss:

Lcls=i=1nCE(di,f(xi;θ))=i=1nj=1mdjilogfj(xi;θ).\mathcal{L}_{cls} = \sum_{i=1}^n \text{CE}\left(\mathbf{d}^i, f(\mathbf{x}^i; \theta)\right) = -\sum_{i=1}^n \sum_{j=1}^m \mathbf{d}_j^{i} \log f_j(\mathbf{x}^{i};\theta).

where f(x;θ)f(\mathbf{x}; \theta) is a neural network with parameters θ\theta followed by a softmax layer to map the input image x\mathbf{x} into a emotion distribution.


Figure 2. An overview of our Label Distribution Learning with Valence-Arousal (LDLVA) for facial expression recognition under ambiguity.

An overview of our method is presented in Figure 2. To construct the label distribution for each training instance xi\mathbf{x}^i, we leverage its neighborhood information in the valence-arousal space. Particularly, we identify KK neighbor instances for each training sample xi\mathbf{x}^i and utilize our adaptive similarity mechanism to determine their contribution degrees to the target distribution di\mathbf{d}^i. Then, we combine the neighbors' predictions and their corresponding contribution degrees with the provided label li\mathbf{l}^i and li\mathbf{l}^i's uncertainty factor to obtain the label distribution di\mathbf{d}^i. The constructed distribution di\mathbf{d}^i will be used as supervision information to train the model via label distribution learning.

Adaptive Similarity

We assume that the label distribution of the main instance xi\mathbf{x}^i can be computed as a linear combination of its neighbors' distributions. To determine the contribution of each neighbor, we propose an adaptive similarity mechanism that not only leverages the relationships between xi\mathbf{x}^i and its neighbors in the auxiliary space but also utilizes their feature vectors extracted from the backbone. We choose the valence-arousal [4] as the auxiliary space to construct the target label distribution. We use the KK-Nearest Neighbor algorithm to identify KK closest points for each training sample xi\mathbf{x}^i, denoted as N(i)N(i). We calculate the adaptive contribution degrees of neighbor instances as the product of the local similarity skis^i_k and the calibration score ζki\zeta^i_k as follows:

cki={ζkiski,for xkN(i),0,otherwise.c^i_k = \begin{cases} \zeta^i_k s^i_k, &\text{for } \mathbf{x}^k \in N(i), \\ 0, &\text{otherwise}. \end{cases}

where the local similarity skis^i_k is defined based on the distance between the instance and its neighbor in the valence-arousal space ai\mathbf{a}^i and ak\mathbf{a}^k

ski=exp(aiak22δ2),xkN(i)s^i_k = \exp\left(-\frac{\| \mathbf{a}^i - \mathbf{a}^k \|^2_2}{\delta^2}\right), \quad \forall \mathbf{x}^k \in N(i)

We utilize a multilayer perceptron (MLP) gg with parameter ϕ\phi to calculate the adaptive calibration score from the extracted features of the two instances vi\mathbf{v}^i and vk\mathbf{v}^k obtained from the backbone.

ζki=Sigmoid(g([vi,vk];ϕ))\zeta^i_k = Sigmoid\left(g([\mathbf{v}^i,\mathbf{v}^{k}];\phi)\right)

The proposed adaptive similarity can correct the similarity errors in the valance-arousal space, as the valence-arousal values are not always available in practice and we leverage an existing method to generate pseudo-valence-arousal.

Uncertainty-aware Label Distribution Construction

After obtaining the contribution degree of each neighbor xkN(i)\mathbf{x}^k \in N(i), we can now generate the target label distribution di\mathbf{d}^i for the main instance xi\mathbf{x}^i. The target label distribution is calculated using the logical label li\mathbf{l}^i and the aggregated distribution d~i\tilde{\mathbf{d}}^i defined as follows:

di~=kckif(xk;θ)kcki,di=(1λi)li+λidi~\tilde{\mathbf{d}^i} = \frac{\sum_k c^i_k f(\mathbf{x}^{k};\theta)}{\sum_k c^i_k}, \\ \mathbf{d}^i = (1-\lambda^i) \mathbf{l}^i + \lambda^i \tilde{\mathbf{d}^i}

where λi[0,1]\lambda^i \in [0,1] is the uncertainty factor for the logical label. It controls the balance between the provided label li\mathbf{l}^i and the aggregated distribution di~\tilde{\mathbf{d}^i} from the local neighborhood.

Intuitively, a high value of λi\lambda^i indicates that the logical label is highly uncertain, which can be caused by ambiguous expression or low-quality input images, thus we should put more weight towards neighborhood information di~\tilde{\mathbf{d}^i}. Conversely, when λi\lambda^i is small, the label distribution di\mathbf{d}^i should be close to li\mathbf{l}^i since we are certain about the provided manual label. In our implementation, λi\lambda^i is a trainable parameter for each instance and will be optimized jointly with the model's parameters using gradient descent.

Loss Function

To enhance the model's ability to discriminate between ambiguous emotions, we also propose a discriminative loss to reduce the intra-class variations of the learned facial representations. We incorporate the label uncertainty factor λi\lambda^i to adaptively penalize the distance between the sample and its corresponding class center. For instances with high uncertainty, the network can effectively tolerate their features in the optimization process. Furthermore, we also add pairwise distances between class centers to encourage large margins between different classes, thus enhancing the discriminative power. Our discriminative loss is calculated as follows:

LD=12i=1n(1λi)viμyi22+j=1mk=1kjmexp(μjμk22V)\mathcal{L}_D = \frac{1}{2}\sum_{i=1}^n (1-\lambda^i)\Vert \mathbf{v}^i - \mathbf{\mu}_{y^i} \Vert_2^2 + \sum_{j=1}^m \sum_{\substack{k=1 \\ k \neq j}}^m \exp \left(-\frac{\Vert\mathbf{\mu}_{j}-\mathbf{\mu}_{k}\Vert_2^2}{\sqrt{V}}\right)

where yiy^i is the class index of the ii-th sample while μj\mathbf{\mu}_{j}, μk\mathbf{\mu}_{k}, and μyi\mathbf{\mu}_{y^i} RV\in \mathbb{R}^V are the center vectors of the j{j}-th, k{k}-th, and yiy^i-th classes, respectively. Intuitively, the first term of LD\mathcal{L}_D encourages the feature vectors of one class to be close to their corresponding center while the second term improves the inter-class discrimination by pushing the cluster centers far away from each other. Finally, the total loss for training is computed as:

L=Lcls+γLD\mathcal{L} = \mathcal{L}_{cls} + \gamma\mathcal{L}_D

where γ\gamma is the balancing coefficient between the two losses.


[1] Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 2019

[2] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, 2017.

[3] B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 2017.

Like What You See?