# Global-Local Attention for Context-aware Emotion Recognition (Part 1)

Propose new method for emotion recognition by jointly learn both the facial information and the salient information of the surrounding context.

Automatic emotion recognition has been a longstanding problem in both academia and industry. It enables a wide range of applications in various domains, ranging from healthcare, surveillance to robotics and human-computer interaction. Recently, significant progress has been made in the field and many methods have demonstrated promising results. However, recent works mainly focus on facial regions while ignoring the surrounding context, which is shown to play an important role in the understanding of the perceived emotion, especially when the emotions on the face are ambiguous or weakly expressed (see the examples in Figure 1).

Figure 1. Facial expression information is not always sufficient to infer people's emotions, especially when facial regions can not be seen clearly or are occluded.

We hypothesize that the local information (i.e., facial region) and global information (i.e., context background) have a correlative relationship, and by simultaneously learning the attention using both of them, the accuracy of the network can be improved. This is based on the fact that the emotion of one person can be indicated by not only the face’s emotion (i.e., local information) but also other context information such as the gesture, pose, or emotion/pose of a nearby person. To that end, we propose a new deep network, namely, Global-Local Attention for Emotion Recognition Network (GLAMOR-Net), to effectively recognize human emotions using a novel global-local attention mechanism. Our network is designed to extract features from both facial and context regions independently, then learn them together using the attention module. In this way, both the facial and contextual information is used altogether to infer human emotions.

## #Overview

Figure 2. The architecture of our proposed network. The whole process includes three steps. We extract the facial information (local) and context information (global) using two Encoding Modules. We then perform attention inference on the global context using the Global-Local Attention mechanism. Lastly, we fuse both features to determine the emotion.
Figure 2 shows an overview of our method. Specifically, we assume that emotions can be recognized by understanding the context components of the scene together with the facial expression. Our method aims to do emotion recognition in the wild by incorporating both facial information of the person’s face and contextual information surrounding that person. Our model consists of three components: Encoding Module, Global-Local Attention (GLA) Module, and Fusion Module. Our key design is the novel GLA module, which utilizes facial features as the local information to attend better to salient locations in the global context.

### #Face and Context Encoding

Our Encoding Module comprises the Facial Encoding Module to learn the face-specific features, and the Context Encoding Module to learn the context-specific features. Specifically, both the Face Encoding and Context Enconding Module are built on several convolutional layers to extract meaningful features from the corresponding input. Each module is comprised of five convolutional layers followed by a Batch Normalization layer an ReLU activation function. The number of filters starts with 32 in the first layer, increasing by a factor of 2 at each subsequent layer except the last one. Our network ends up with 256-channel feature map, which is the embedded representation with respect to the input image. In practice, we also mask the facial regions in the raw input to prevent the attention module from only focusing on the facial region while omitting the context information in other parts of the image.

## #Global-Local Attention

Inspired by the attention mechanism [1], to model the associative relationship of the local information (i.e., the facial region in our work) and global information (i.e., the surrounding context background), we propose the Global-Local Attention Module to guide the network focus on meaningful regions (Figure 3). In particular, our attention mechanism models the hidden correlation between the face and different regions in the context by capturing their similarity.

Figure 3. The proposed Global-Local Attention module takes the extracted face feature vector and the context feature map as the input to perform context attention inference.

We first reduce the facial feature map $\mathbf{F}_f$ into vector representation using the Global Pooling operator, denoted as $\mathbf{v}_f$. The context feature map can be viewed as a set of $W_c \times H_c$ vectors with $D_c$ dimensions, each vector in each cell $(i,j)$ represents the embedded features at that location with the corresponding patch in the input image. Therefore, at each region $(i,j)$ in the context feature map, we have $\mathbf{F}_c^{(i,j)} = \mathbf{v}_{i,j}$.

We then concatenate $[\mathbf{v}_f; \mathbf{v}_{i,j}]$ into a holistic vector $\bar{\mathbf{v}}_{i,j}$, which contains both information about the face and some small regions of the scene. We then employ a feed-forward neural network to compute the score corresponding to that region by feeding $\bar{\mathbf{v}}_{i,j}$ into the network. By applying the same process for all regions, each region $(i,j)$ will output a raw score value $s_{i,j}$, we spatially apply the Softmax function to produce the attention map $a_{i,j} = \text{Softmax}(s_{i,j})$. To obtain the final context representation vector, we squish the feature maps by taking the average over all the regions weighted by $a_{i,j}$ as follow:

$\mathbf{v}_c = \Sigma_i\Sigma_j(a_{i,j} \odot \mathbf{v}_{i,j})$

where $\mathbf{v}_c \in \mathbb{R}^{D_c}$ is the final single vector encoding the context information Intuively, $\mathbf{v}_c$ mainly contains information from regions that have high attention, while other nonessential parts of the context are mostly ignored. With this design, our attention module can guide the network focus on important areas based on both facial information and context information of the image.

## #Face and Context Fusion

Figure 4. Detailed illustration of the Adaptive Fusion.

The Fusion Module takes the face $\mathbf{v}_f$ and the context reprsentation $\mathbf{v}_c$ as inputs, then the face score and context score are produced separately by two neural networks:

$s_f = \mathcal{F}(\mathbf{v}_f; \phi_f), \quad\quad s_c = \mathcal{F}(\mathbf{v}_c; \phi_c)$

where $\phi_f$ and $\phi_c$ are the network parameters of the face branch and context branch, respectively. Next, we normalize these scores by the Softmax function to produce weights for each face and context branch

$w_f = \frac{\exp(s_f)}{\exp(s_f)+\exp(s_c)}, \quad w_c = \frac{\exp(s_c)}{\exp(s_f)+\exp(s_c)}$

In this way, we let the two networks competitively determine which branch is more useful than the other. Then we amplify the more useful branch and lower the effect of the other by multiplying the extracted features with the corresponding weight:

$\mathbf{v}_f \leftarrow \mathbf{v}_f \odot w_f , \quad\quad \mathbf{v}_c \leftarrow \mathbf{v}_c \odot w_c$

Finally, we use these vectors to estimate the emotion category. Specifically, in our experiments, after multiplying both $\mathbf{v}_f$ and $\mathbf{v}_c$ by their corresponding weights, we concatenate them together as the input for a network to make final predictions. Figure 4 shows our fusion procedure in detail.

## #References

[1] Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In NIPS, 2015.

[2] Lee, J., Kim, S., Kim, S., Park, J., Sohn, K.: Context-aware emotion recognition networks. In ICCV, 2019.