Uncertainty-aware Label Distribution Learning for Facial Expression Recognition (Part 2)

December 2, 2022 · 4 min read

In the previous post, we have introduced the our proposed method for Facial Expression Recognition. In this post, we will examine the effectiveness and efficiency of the proposal.

Experimental Results

Noisy and Inconsistent Labels

Table 1. Test performance with synthetic noise.

We conduct experiments to study the robustness of our LDLVA on mislabelled data by adding synthetic noise to AffectNet, RAF-DB, and SFEW datasets. Specifically, we randomly flip the manual labels to one of the other categories. . We report the mean accuracy and standard error in Table 1. The results clearly show that our method consistently outperforms other approaches in all cases. We also observe that the improvements are even more apparent when the noise ratio increases, for example, the accuracy improvement on RAF-DB is 4.7\% with 10\% noise and 6.93\% with 30\% noise. The consistent results under various settings demonstrate the ability of our method to effectively deal with noisy annotation, which is crucial in the robustness against label ambiguity.

Table 2. Test performance with inconsistent labels between cross-datasets.

Since the annotations for large-scale FER data are commonly obtained via crowd-sourcing, this can create label inconsistency, especially between different datasets. To examine the effectiveness of our proposed methods in dealing with this problem, we also perform experiment with the cross-dataset protocol. Table 2 shows that our method achieves the best performance on all three datasets and the highest average accuracy and surpasses the current state-of-the-art methods. This confirms the advantages of our method over previous works and demonstrates the generalization ability to data with label inconsistency, which is essential for real-world FER applications.

Comparison with state of the arts

Table 3. Comparison with recent methods on the original datasets.

We further compare our method with several state of the arts on the original AffectNet, RAF-DB, and SFEW to evaluate the robustness of our method to the uncertainty and ambiguity that unavoidably exists in real-world FER datasets. The results are presented in Table 3. By leveraging label distribution learning on valence-arousal space, our model outperforms other methods and achieves state-of-the-art performance on AffectNet, RAF-DB, and SFEW. Although these datasets are considered to be "clean", the results suggest that they indeed suffer from uncertainty and ambiguity.

Qualitative Analysis

Real-world Ambiguity: To understand more about real-world ambiguous expressions, we conducted a user study in which we asked participants to choose the most clearly expressed emotion on random test images. We compare our model's predictions with the survey results in Figure 3. We can see that these images are ambiguous as they express a combination of different emotions, hence the participants do not fully agree and have different opinions about the most prominent emotion on the faces. It is further shown that our model can give consistent results and agree with the perception of humans to some degree.

Figure 3. Comparison of the results from our user study and our model.

Uncertainty Factor: Figure 4 shows the estimated uncertainty factors of some training images and their original labels. The uncertainty values decrease from top to bottom. Highly uncertain labels can be caused by low-quality inputs (as shown in Angry and Surprise columns) or ambiguous facial expressions. In contrast, when the emotions can be easily recognized as those in the last row, the uncertainty factors are assigned low values. This characteristic can guide the model to decide whether to put more weight on the provided label or the neighborhood information. Therefore, the model can be more robust against uncertainty and ambiguity.

Figure 4. Visualization of uncertainty values of some examples from RAF-DB dataset.

Conclusion

We have introduced a new label distribution learning method for facial expression recognition by leveraging structure information in the valence-arousal space to recover the intensities distributed over emotion categories. The constructed label distribution provides rich information about the emotions, thus can effectively describe the ambiguity degree of the facial image. Intensive experiments on popular datasets demonstrate the effectiveness of our method over previous approaches under inconsistency and uncertainty conditions in facial expression recognition.

References

[1] Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 2019

[2] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, 2017.

[3] B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 2017.

Uncertainty-aware Label Distribution Learning for Facial Expression Recognition (Part 1)

December 2, 2022 · 8 min read

Facial expression recognition (FER) plays an important role in understanding people's feelings and interactions between humans. Recently, automatic emotion recognition has gained a lot of attention from the research community due to its tremendous applications in education, healthcare, human analysis, surveillance or human-robot interaction. Recent FER methods are mostly based on deep learning and can achieve impressive results. The success of deep models can be attributed to large-scale FER datasets [1][2]. However, ambiguities of facial expression is still a key challenge in FER. Specifically, people with different backgrounds might perceive and interpret facial expressions differently, which can lead to noisy and inconsistent annotations. In addition, real-life facial expressions usually manifest a mixture of feelings rather than only a single emotion.

Motivation and Proposed Solution

Figure 1. Examples of real-world ambiguous facial expressions that can lead to noisy and inconsistent annotation.

As an example, Figure 1 shows that people may have different opinions about the expressed emotion, particularly in ambiguous images. Consequently, a distribution over emotion categories is better than a single label because it takes all sentiment classes into account and can cover various interpretations, thus mitigating the effect of ambiguity. However, existing large-scale FER datasets only provide a single label for each sample instead of a label distribution, which means we do not have a comprehensive description for each facial expression. This can lead to insufficient supervision during training and pose a big challenge for many FER systems.

To overcome the ambiguity problem in FER, we proposes a new uncertainty-aware label distribution learning method that constructs emotion distributions for training samples. Specifically, we leverage the neighborhood information of samples that have similar expressions to construct the emotion distributions from single labels and utilize them as training supervision signal.

Methodology

Preliminaries

We denote $\mathbf{x} \in \mathcal{X}$ as the instance variable in the input space $\mathcal{X}$ and $\mathbf{x}^{i}$ as the particular $i$ -th instance. The label set is denoted as $\mathcal{Y} = \{y_1, y_2,..., y_m\}$ where $m$ is the number of classes and $y_j$ is the label value of the $j$ -th class. The logical label vector of $\mathbf{x}^{i}$ is indicated by $\mathbf{l}^{i}$ = $(l^{i}_{y_1}, l^{i}_{y_2}, ..., l^{i}_{y_m})$ with $\mathbf{l}^{i}_{y_j} \in \{0, 1\}$ and $\| \mathbf{l} \| _1 = 1$ . We define the label distribution of $\mathbf{x}^{i}$ as $\mathbf{d}^{i}$ = $(d^{i}_{y_1}, d^{i}_{y_2}, ..., d^{i}_{y_m})$ with $\| \mathbf{d} \| _1 = 1$ and $d^{i}_{y_j} \in [0, 1]$ representing the relative degree that $\mathbf{x}^{i}$ belongs to the class $y_j$ .

Most existing FER datasets assign only a single class or equivalently, a logical label $\mathbf{l}^{i}$ for each training sample $\mathbf{x}^{i}$ . In particular, the given training dataset is a collection of $n$ samples with logical labels $D_l$ = $\{ (\mathbf{x}^{i}, \mathbf{l}^{i}) \vert 1 \le i \le n\}$ . However, we find that a label distribution $\mathbf{d}^i$ is a more comprehensive and suitable annotation for the image than a single label.

Inspired by the recent success of label distribution learning (LDL) in addressing label ambiguity [3], we aim to construct an emotion distribution $\mathbf{d}^i$ for each training sample $\mathbf{x}^i$ , thus transform the training set $D_l$ into $D_d$ = $\{ (\mathbf{x}^{i}, \mathbf{d}^{i}) \vert 1 \le i \le n\}$ , which can provide richer supervision information and help mitigate the ambiguity issue. We use cross-entropy to measure the discrepancy between the model's prediction and the constructed target distribution. Hence, the model can be trained by minimizing the following classification loss:

\mathcal{L}_{cls} = \sum_{i=1}^n \text{CE}\left(\mathbf{d}^i, f(\mathbf{x}^i; \theta)\right) = -\sum_{i=1}^n \sum_{j=1}^m \mathbf{d}_j^{i} \log f_j(\mathbf{x}^{i};\theta).

where $f(\mathbf{x}; \theta)$ is a neural network with parameters $\theta$ followed by a softmax layer to map the input image $\mathbf{x}$ into a emotion distribution.

Overview

Figure 2. An overview of our Label Distribution Learning with Valence-Arousal (LDLVA) for facial expression recognition under ambiguity.

An overview of our method is presented in Figure 2. To construct the label distribution for each training instance $\mathbf{x}^i$ , we leverage its neighborhood information in the valence-arousal space. Particularly, we identify $K$ neighbor instances for each training sample $\mathbf{x}^i$ and utilize our adaptive similarity mechanism to determine their contribution degrees to the target distribution $\mathbf{d}^i$ . Then, we combine the neighbors' predictions and their corresponding contribution degrees with the provided label $\mathbf{l}^i$ and $\mathbf{l}^i$ 's uncertainty factor to obtain the label distribution $\mathbf{d}^i$ . The constructed distribution $\mathbf{d}^i$ will be used as supervision information to train the model via label distribution learning.

Adaptive Similarity

We assume that the label distribution of the main instance $\mathbf{x}^i$ can be computed as a linear combination of its neighbors' distributions. To determine the contribution of each neighbor, we propose an adaptive similarity mechanism that not only leverages the relationships between $\mathbf{x}^i$ and its neighbors in the auxiliary space but also utilizes their feature vectors extracted from the backbone. We choose the valence-arousal [4] as the auxiliary space to construct the target label distribution. We use the $K$ -Nearest Neighbor algorithm to identify $K$ closest points for each training sample $\mathbf{x}^i$ , denoted as $N(i)$ . We calculate the adaptive contribution degrees of neighbor instances as the product of the local similarity $s^i_k$ and the calibration score $\zeta^i_k$ as follows:

c^i_k = \begin{cases} \zeta^i_k s^i_k, &\text{for } \mathbf{x}^k \in N(i), \\ 0, &\text{otherwise}. \end{cases}

where the local similarity $s^i_k$ is defined based on the distance between the instance and its neighbor in the valence-arousal space $\mathbf{a}^i$ and $\mathbf{a}^k$

s^i_k = \exp\left(-\frac{\| \mathbf{a}^i - \mathbf{a}^k \|^2_2}{\delta^2}\right), \quad \forall \mathbf{x}^k \in N(i)

We utilize a multilayer perceptron (MLP) $g$ with parameter $\phi$ to calculate the adaptive calibration score from the extracted features of the two instances $\mathbf{v}^i$ and $\mathbf{v}^k$ obtained from the backbone.

\zeta^i_k = Sigmoid\left(g([\mathbf{v}^i,\mathbf{v}^{k}];\phi)\right)

The proposed adaptive similarity can correct the similarity errors in the valance-arousal space, as the valence-arousal values are not always available in practice and we leverage an existing method to generate pseudo-valence-arousal.

Uncertainty-aware Label Distribution Construction

After obtaining the contribution degree of each neighbor $\mathbf{x}^k \in N(i)$ , we can now generate the target label distribution $\mathbf{d}^i$ for the main instance $\mathbf{x}^i$ . The target label distribution is calculated using the logical label $\mathbf{l}^i$ and the aggregated distribution $\tilde{\mathbf{d}}^i$ defined as follows:

\tilde{\mathbf{d}^i} = \frac{\sum_k c^i_k f(\mathbf{x}^{k};\theta)}{\sum_k c^i_k}, \\ \mathbf{d}^i = (1-\lambda^i) \mathbf{l}^i + \lambda^i \tilde{\mathbf{d}^i}

where $\lambda^i \in [0,1]$ is the uncertainty factor for the logical label. It controls the balance between the provided label $\mathbf{l}^i$ and the aggregated distribution $\tilde{\mathbf{d}^i}$ from the local neighborhood.

Intuitively, a high value of $\lambda^i$ indicates that the logical label is highly uncertain, which can be caused by ambiguous expression or low-quality input images, thus we should put more weight towards neighborhood information $\tilde{\mathbf{d}^i}$ . Conversely, when $\lambda^i$ is small, the label distribution $\mathbf{d}^i$ should be close to $\mathbf{l}^i$ since we are certain about the provided manual label. In our implementation, $\lambda^i$ is a trainable parameter for each instance and will be optimized jointly with the model's parameters using gradient descent.

Loss Function

To enhance the model's ability to discriminate between ambiguous emotions, we also propose a discriminative loss to reduce the intra-class variations of the learned facial representations. We incorporate the label uncertainty factor $\lambda^i$ to adaptively penalize the distance between the sample and its corresponding class center. For instances with high uncertainty, the network can effectively tolerate their features in the optimization process. Furthermore, we also add pairwise distances between class centers to encourage large margins between different classes, thus enhancing the discriminative power. Our discriminative loss is calculated as follows:

\mathcal{L}_D = \frac{1}{2}\sum_{i=1}^n (1-\lambda^i)\Vert \mathbf{v}^i - \mathbf{\mu}_{y^i} \Vert_2^2 + \sum_{j=1}^m \sum_{\substack{k=1 \\ k \neq j}}^m \exp \left(-\frac{\Vert\mathbf{\mu}_{j}-\mathbf{\mu}_{k}\Vert_2^2}{\sqrt{V}}\right)

where $y^i$ is the class index of the $i$ -th sample while $\mathbf{\mu}_{j}$ , $\mathbf{\mu}_{k}$ , and $\mathbf{\mu}_{y^i}$ $\in \mathbb{R}^V$ are the center vectors of the ${j}$ -th, ${k}$ -th, and $y^i$ -th classes, respectively. Intuitively, the first term of $\mathcal{L}_D$ encourages the feature vectors of one class to be close to their corresponding center while the second term improves the inter-class discrimination by pushing the cluster centers far away from each other. Finally, the total loss for training is computed as:

\mathcal{L} = \mathcal{L}_{cls} + \gamma\mathcal{L}_D

where $\gamma$ is the balancing coefficient between the two losses.

References

[1] Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 2019

[2] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, 2017.

[3] B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 2017.

Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge (Part 3)

July 31, 2022 · 7 min read

In this part, we will show the effectivness and the ablation studies of Light-weight Deformable Registration Network and Adversarial Learning Algorithm with Distilling Knowledge.

Dataset

As mentioned in [1], we train method on two types of scans: Liver CT scans and Brain MRI scans.

For Liver CT scans, we use 5 datasets:

LiTS contains 131 liver segmentation scans.
MSD has 70 liver tumor CT scans, 443 hepatic vessels scans, and 420 pancreatic tumor scans.
BFH is a smaller dataset with 92 scans.
SLIVER is a challenging dataset with 20 liver segmentation scans and annotated by 3 expert doctors.
LSPIG (Liver Segmentation of Pigs) contains 17 pairs of CT scans from pigs, provided by the First Affiliated Hospital of Harbin Medical University.

For Brain MRI scans, we use 4 datasets: 1. ADNI contains 66 scans. 2. ABIDE contains 1287 scans. 3. ADHD contains 949 scans. 4. LPBA has 40 scans, each featuring a segmentation ground truth of 56 anatomical structures.

Baselines

We compare LDR ALDK method with the following recent deformable registration methods:

ANTs SyN and Elastix B-spline are methods that find an optimal transformation by iteratively update the parameters of the defined alignment.
VoxelMorph predicts a dense deformation in an unsupervised manner by using deconvolutional layers.
VTN is an end-to-end learning framework that uses convolutional neural networks to register 3D medical images, especially large displaced ones.
RCN is a recent recursive deep architecture that utilizes learnable cascade and performs progressive deformation for each warped image.

Results

Table 1 summarizes the overall performance, testing speed, and the number of parameters compared with recent state-of-the-art methods in the deformable registration task. The results clearly show that Light-weight Deformable Registration network (LDR) accompanied by Adversarial Learning with Distilling Knowledge (ALDK) algorithm significantly reduces the inference time and the number of parameters during the inference phase. Moreover, the method achieves competitive accuracy with the most recent highly performed but expensive networks, such as VTN or VoxelMorph. We notice that this improvement is consistent across all experiments on different datasets SLIVER, LiTS, LSPIG, and LPBA.

In particular, we observe that on the SLIVER dataset the Dice score of best model with 3 cascades (3-cas LDR + ALDK) is 0.3% less than the best result of 3-cas VTN + Affine, while inference speed is ?21 times faster on a CPU and the parameters used during inference is ~8 times smaller. Including benchmarking results in three other datasets, i.e., LiTS, LSPIG, and LPBA, light-weight model only trades off an average of 0.5% in Dice score and 1.25% in Jacc score for a significant gain of speed and a massive reduction in the number of parameters. We also notice that method is the only work that achieves the inference time of approximately 1s on a CPU. This makes method well suitable for deployment as it does not require expensive GPU hardware for inference.

Fig-1

Table 1: COMPARISON AMONG LDR ALDK MODEL WITH RECENT APPROACHES.

Ablation Study

Effectiveness of ALDK. Table 2 summarizes the effectiveness of Adversarial Learning with Distilling Knowledge (ALDK) when being integrated into the light-weight student network. Note that LDR without ALDK is trained using only the reconstruction loss in an unsupervised learning setup. From this table, we clearly see that ALDK algorithm improves the Dice score of the LDR tested in the SLIVER dataset by 3.4%, 4.0%, and 3.1% for 1-cas, 2-cas, and 3-cas setups, respectively. Additionally, using ALDK also increases the Jacc score by 5.2%, 4.9%, and 3.9% for 1-cas LDR, 2-cas LDR, and 3-cas LDR. These results verify the stability of adversarial learning algorithm in the inference phase, under the differences evaluation metrics, as well as the number of cascades setups. Furthermore, Table 2 also clearly shows the effectiveness and generalization of ALDK when being applied to the student network. Since the deformations extracted from the teacher are used only in the training period, adversarial learning algorithm fully maintains the speed and the number of parameters for the light-weight student network during inference. All results indicate that student network incorporated with the adversarial learning algorithm successfully achieves the performance goal, while maintaining the efficient computational cost of the light-weight setup.

Fig-2

Table 2: COMPARISON AMONG LDR ALDK MODEL WITH RECENT APPROACHES.

Accuracy vs. Complexity. Figure 1 demonstrates the experimental results from the SLIVER dataset between LDR + ALDK and the baseline VTN under multiple recursive cascades setup on both CPU and GPU. On the CPU (Figure 1-a), in terms of the 1-cascade setup, the Dice score of method is 0.2% less than VTN while the speed is ~15 times faster. The more the number of cascades is leveraged, the higher the speed gap between LDR + ALDK and the baseline VTN, e.g. the CPU speed gap is increased to ~21 times in a 3-cascades setup. We also observe the same effect on GPU (Figure 1-b), where method achieves slightly lower accuracy results than VTN, while clearly reducing the inference time. These results indicate that LDR + ALDK can work well with the teacher network to improve the accuracy while significantly reducing the inference time on both CPU and GPU in comparison with the baseline VTN network.

Fig-3

Figure 1:Plots of Dice score and Inference speed with respect to the number of cascades of the baseline Affine + VTN and LDR + ALDK. (a) for CPU speed and (b) for GPU speed. Note that results are reported for the SLIVER dataset; bars represent the CPU speed; lines represent the Dice score. All methods use an Intel Xeon E5-2690 v4 CPU and Nvidia GeForce GTX 1080 Ti GPU for inference.

Visualization

Figure 2 illustrates the visual comparison among 1-cas LDR, 1-cas LDR + ALDK, and the baseline 1-cas RCN. Five different moving images in a volume are selected to apply the registration to a chosen fixed image. It is important to note that though the sections of the warped segmentations can be less overlap with those of the fixed one, the segmentation intersection over union is computed for the volume and not the sections. In the segmented images in Figure 2, besides the matched area colored by white, we also marked the miss-matched areas by red for an easy-to-read purpose.

From Figure 2, we can see that the segmentation resutls of 1-cas LDR network without using ALDK (Figure 2-a) contains many miss-matched areas (denoted in red color). However, when we apply ALDK to the student network, the registration results are clearly improved (Figure 2-b). Overall, LDR + ALDK visualization results in Figure 2-b are competitive with the baseline RCN network (Figure 2-c). This visualization confirms that framework for deformable registration can achieve comparable results with the recent RCN network.

Fig-3

Figure 2:The visualization comparison between LDR (a), LDR + ALDK (b), and the baseline RCN (c). The left images are sections of the warped images; the right images are sections of the warped segmentation (white color represents the matched areas between warped image and fixed image, red color denotes the miss-matched areas). The segmentation visualization indicates that LDR + ALDK (b) method reduces the miss-matched areas of the student network LDR (a) significantly. Best viewed in color.

Reference

[1] Tran, Minh Q., et al. "Light-weight deformable registration using adversarial learning with distilling knowledge." IEEE Transactions on Medical Imaging, 2022.

Open Source

🐱 Github: https://github.com/aioz-ai/LDR_ALDK

Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge (Part 2)

June 20, 2022 · 6 min read

In this part, we will introduce the Architecture of Light-weight Deformable Registration Network and Adversarial Learning Algorithm with Distilling Knowledge.

The Architecture of Light-weight Deformable Registration Network

In practice, recent deformation networks follow an encoder-decoder architecture and use 3D convolution to progressively down-sample the image, and deconvolution (transposed convolution) to recover spatial resolution [1, 3]. However, this setup consumes a large number of parameters. Therefore, the built models are computationally expensive and time-consuming. To overcome this problem we design a new light-weight student network as illustrated in Figure 1.

In particular, the proposed light-weight network has four convolution layers and three deconvolution layers. Each convolutional layer has a bank of $4 \times 4 \times 4$ filters with strides of $2 \times 2 \times 2$ , followed by a ReLU activation function. The number of output channels of the convolutional layers starts with $16$ at the first layer, doubling at each subsequent layer, and ends up with $128$ . Skip connections between the convolutional layers and the deconvolutional layers are added to help refine the dense prediction. The subnetwork outputs a dense flow prediction field, i.e., a $3$ channels volume feature map with the same size as the input.

In comparison with the current state-of-the-art dense deformable registration network [3], the number of parameters of our proposed light-weight student network is reduced approximately $10$ times. In practice, this significant reduction may lead to an accuracy drop. Therefore, we propose a new Adversarial Learning with Distilling Knowledge algorithm to effectively leverage the teacher deformations $\phi_t$ to our introduced student network, making it light-weight but achieving competitive performance.

Fig-1

Figure 1: The structure of Light-weight Deformable Registration student network. The number of channels is annotated above the layer. Curved arrows represent skip paths (layers connected by an arrow are concatenated before transposed convolution). Smaller canvas means lower spatial resolution (Source).

Adversarial Learning Algorithm with Distilling Knowledge

Our adversarial learning algorithm aims to improve the student network accuracy through the distilled teacher deformations extracted from the teacher network. The learning method comprises a deformation-based adversarial loss $\mathcal{L}_{adv}$ and its accompanying learning strategy (Algorithm 1).

Fig-2

Figure 2: Adversarial Learning Strategy(Source).

Adversarial Loss. The loss function for the light-weight student network is a combination of the discrimination loss $l_{dis}$ and the reconstruction loss $l_{res}$ . However, the forward and backward process through loss function is controlled by the Algorithm 1. In particular, the last deformation loss $\mathcal{L}_{adv}$ that outputs the final warped image can be written as:

\mathcal{L}_{adv} = \gamma l_{rec} + (1 - \gamma) l_{dis}

where $\gamma$ controls the contribution between $l_{rec}$ and $l_{dis}$ . Note that, the $\mathcal{L}_{adv}$ is only applied for the final warped image.

Discrimination Loss. In the student network the discrimination loss is computed in Equation below}.

l_{{dis}} = \left\lVert D_\mathbf{\theta}(\phi_{s}) - D_\mathbf{\theta}(\phi_{t}) \right\lVert_2^{2} + \lambda\bigg(\left\lVert \nabla_{\hat\phi_{s}}D_\mathbf{\theta}(\hat\phi_{s}) \right\lVert_2 - 1\bigg)^{2}

where $\lambda$ controls gradient penalty regularization. The joint deformation $\hat\phi_{s}$ is computed from the teacher deformation $\phi_{t}$ and the predicted student deformation $\phi_{s}$ as follow:

\hat\phi_{s} = \beta \phi_{t} + (1 - \beta) \phi_{s}

where $\beta$ control the effect of the teacher deformation.

In Discrimination Loss, $D_\mathbf{\theta}$ is the discriminator, formed by a neural network with learnable parameters ${\theta}$ . The details of $D_\mathbf{\theta}$ is shown in Figure 3. In particular, $D_\mathbf{\theta}$ consists of six $3D$ convolutional layers, the first layer is $128 \times 128 \times 128 \times 3$ and takes the $c \times c \times c \times 1$ deformation as input. $c$ is equaled to the scaled size of the input image. The second layer is $64 \times 64 \times 64 \times 16$ . From the second layer to the last convolutional layer, each convolutional layer has a bank of $4 \times 4 \times 4$ filters with strides of $2 \times 2 \times 2$ , followed by a ReLU activation function except for the last layer which is followed by a sigmoid activation function. The number of output channels of the convolutional layers starts with $16$ at the second layer, doubling at each subsequent layer, and ends up with $256$ .

Basically, this is to inject the condition information with a matched tensor dimension and then leave the network learning useful features from the condition input. The output of the last neural layer is the mean feature of the discriminator, denoted as $M$ . Note that in the discrimination loss, a gradient penalty regularization is applied to deal with critic weight clipping which may lead to undesired behavior in training adversarial networks.

Fig-3

Figure 3: The structure of the discriminator $D_\mathbf{\theta}$ used in the Discrimination Loss ( $l_{dis}$ ) of our Adversarial Learning with Distilling Knowledge algorithm (Source).

Reconstruction Loss. The reconstruction loss $l_{rec}$ is an important part of a deformation estimator. Follow the VTN [3] baseline, the reconstruction loss is written as:

l_{{rec}} (\textbf{\textit{I}}_m^h,\textbf{\textit{I}}_f) = 1 - CorrCoef [\textbf{\textit{I}}_m^h,\textbf{\textit{I}}_f]

where

CorrCoef[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] = \frac{Cov[\textbf{\textit{I}}_1,\textbf{\textit{I}}_2]}{\sqrt{Cov[\textbf{\textit{I}}_1,\textbf{\textit{I}}_1]Cov[\textbf{\textit{I}}_2,\textbf{\textit{I}}_2]}}

Cov[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] = \frac{1}{|\omega|}\sum_{x \in \omega} \textbf{\textit{I}}_1(x)\textbf{\textit{I}}_2(x) - \frac{1}{|\omega|^{2}}\sum_{x \in \omega} \textbf{\textit{I}}_1(x)\sum_{y \in \omega}\textbf{\textit{I}}_2(y)

where $CorrCoef[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2]$ is the correlation between two images $\textbf{\textit{I}}_1$ and $\textbf{\textit{I}}_2$ , $Cov[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2]$ is the covariance between them. $\omega$ denotes the cuboid (or grid) on which the input images are defined.

Learning Strategy. The forward and backward of the aforementioned $\mathcal{L}_{adv}$ is controlled by the adversarial learning strategy described in Algorithm 1.

In our deformable registration setup, the role of real data and attacking data is reversed when compared with the traditional adversarial learning strategy. In adversarial learning, the model uses unreal (generated) images as attacking data, while image labels are ground truths. However, in our deformable registration task, the model leverages the unreal (generated) deformations from the teacher as attacking data, while the image is the ground truth for the model to reconstruct the input information. As a consequence, the role of images and the labels are reversed in our setup. Since we want the information to be learned more from real data, the generator will need to be considered more frequently. Although the knowledge in the discriminator is used as attacking data, the information it supports is meaningful because the distilled information is inherited from the high-performed teacher model. With these characteristics of both the generator and discriminator, the light-weight student network is expected to learn more effectively and efficiently.

Reference

[1] S. Zhao, Y. Dong, E. I. Chang, Y. Xu, et al., Recursive cascaded networks for unsupervised medical image registration, in ICCV, 2019.

[2] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, ArXiv, 2015.

[3] S. Zhao, T. Lau, J. Luo, I. Eric, C. Chang, and Y. Xu, Unsupervised 3d end-to-end medical image registration with volume tweening network, IEEE J-BHI, 2019.

Open Source

🐱 Github: https://github.com/aioz-ai/LDR_ALDK

Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge

April 19, 2022 · 7 min read

Introduction: Medical image registration

Medical image registration is the process of systematically placing separate medical images in a common frame of reference so that the information they contain can be effectively integrated or compared. Applications of image registration include combining images of the same subject from different modalities, aligning temporal sequences of images to compensate for the motion of the subject between scans, aligning images from multiple subjects in cohort studies, or navigating with image guidance during interventions. Since many organs do deform substantially while being scanned, the rigid assumption can be violated as a result of scanner-induced geometrical distortions that differ between images. Therefore, performing deformable registration is an essential step in many medical procedures.

Previous Studies, Remaining Challenges, and Motivation

Recently, learning-based methods have become popular to tackle the problem of deformable registration. These methods can be split into two groups: (i) supervised methods that rely on the dense ground-truth flows obtained by either traditional algorithms or simulating intra-subject deformations. Although these works achieve state-of-the-art performance, they require a large amount of manually labeled training data, which are expensive to obtain; and (ii) unsupervised learning methods that use a similarity measurement between the moving and the fixed image to utilize a large amount of unlabelled data. These unsupervised methods achieve competitive results in comparison with supervised methods. However, their deformations are reconstructed without the direct ground-truth guidance, hence leading to the limitation of leveraging learnable information. Furthermore, recent unsupervised methods all share an issue of great complexity as the network parameters increase significantly when multiple progressive cascades are taken into account. This leads to the fact that these works can not achieve real-time performance during inference while requiring intensively computational resources when deploying.

In practice, there are many scenarios when medical image registration are needed to be fast - consider matching preoperative and intra-operative images during surgery, interactive change detection of CT or MRI data for a radiologist, deformation compensation or 3D alignment of large histological slices for a pathologist, or processing large amounts of images from high-throughput imaging methods. Besides, in many image-guided robotic interventions, performing real-time deformable registration is an essential step to register the images and deal with organs that deform substantially. Economically, the development of a CPU-friendly solution for deformable registration will significantly reduce the instrument costs equipped for the operating theatre, as it does not require GPU or cloud-based computing servers, which are costly and consume much more power than CPU. This will benefit patients in low- and middle-income countries, where they face limitations in local equipment, personnel expertise, and budget constraints infrastructure. Therefore, design an efficient model which is fast and accurate for deformable registration is a crucial task and worth for study in order to improve a variety of surgical interventions.

Contribution

Deformable registration is a crucial step in many medical procedures such as image-guided surgery and radiation therapy. Most recent learning-based methods focus on improving the accuracy by optimizing the non-linear spatial correspondence between the input images. Therefore, these methods are computationally expensive and require modern graphic cards for real-time deployment. Thus, we introduce a new Light-weight Deformable Registration network that significantly reduces the computational cost while achieving competitive accuracy (Fig.1). In particular, we propose a new adversarial learning with distilling knowledge algorithm that successfully leverages meaningful information from the effective but expensive teacher network to the student network. We design the student network such as it is light-weight and well suitable for deployment on a typical CPU. The extensively experimental results on different public datasets show that our proposed method achieves state-of-the-art accuracy while significantly faster than recent methods. We further show that the use of our adversarial learning algorithm is essential for a time-efficiency deformable registration method.

Fig-1

(a)

(b) Figure 1: Comparison between typical deep learning-based methods for deformable registration (a) and our approach using adversarial learning with distilling knowledge for deformable registration (b). In our work, the expensive Teacher Network is used only in training; the Student Network is light-weight and inherits helpful knowledge from the Teacher Network via our Adversarial Learning algorithm. Therefore, the Student Network has high inference speed, while achieving competitive accuracy (Source).

Methodology

Method overview

We describe our method for Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge. Our method is composed of three main components: (i)) a Knowledge Distillation module which extracts meaningful deformations $\bm{\phi_t}$ from the Teacher Network; (ii) a Light-weight Deformable Registration (LDR) module which outputs a high-speed Student Network; and (iii) an Adversarial Learning with Distilling Knowledge (ALDK) algorithm which effectively leverages teacher deformations $\bm{\phi}_t$ to the student deformations. An overview of our proposed deformable registration method can be found in Fig.2.

Fig-2

Figure 2: An overview of our proposed Light-weight Deformable Registration (LDR) method using Adversarial Learning with Distilling Knowledge (ALDK). Firstly, by using knowledge distillation, we extract the deformations from the Teacher Network as meaningful ground-truths. Secondly, we design a light-weight student network, which has competitive speed. Finally, We employ the Adversarial Learning with Distilling Knowledge algorithm to effectively transfer the meaningful knowledge of distilled deformations from the Teacher Network to the Student Network (Source).

Since the content may over-length, in this part, we introduce the background theory for Deformable Registration and Knowledge Distillation for Deformation. In the next part, we will introduce the Architecture of Light-weight Deformable Registration Network and Adversarial Learning Algorithm with Distilling Knowledge. In the final part, we will introduce the effectiveness of the method in comparison with recent states of the arts and detailed analysis.

Background: Deformable Registration

We follow RCN [1] to define deformable registration task recursively using multiple cascades. Let $\textbf{\textit{I}}_m, \textbf{\textit{I}}_f$ denote the moving image and the fixed image respectively, both defined over $d$ -dimensional space $\bm{\Omega}$ . A deformation is a mapping $\bm{\phi} : \bm{\Omega} \rightarrow \bm{\Omega}$ . A reasonable deformation should be continuously varying and prevented from folding. The deformable registration task is to construct a flow prediction function $\textbf{F}$ which takes $\textbf{\textit{I}}_m, \textbf{\textit{I}}_ f$ as inputs and predicts a dense deformation $\bm{\phi}$ that aligns $\textbf{\textit{I}}_m$ to $\textbf{\textit{I}}_f$ using a warp operator $\circ$ as follows:

\textbf{F}^{(n)}(\textbf{\textit{I}}^{(n-1)}_m,\textbf{\textit{I}}_f)=\phi^{(n)} \circ \textbf{F}^{(n-1)}(\phi^{(n-1)} \circ \textbf{\textit{I}}^{(n-2)}_m,\textbf{\textit{I}}_f)

where $\textbf{F}^{(n-1)}$ is the same as $\textbf{F}^{(n)}$ , but in a different flow prediction function. Assuming for $n$ cascades in total, the final output is a composition of all predicted deformations, i.e.,

\textbf{F}(\textbf{\textit{I}}_m, \textbf{\textit{I}}_f)=\phi^{(n)} \circ...\circ \phi^{(1)},

and the final warped image is constructed by

\textbf{\textit{I}}_{m}^{(n)}=\textbf{F}(\textbf{\textit{I}}_m,\textbf{\textit{I}}_f) \circ \textbf{\textit{I}}_m

In general, previous Equations form the hypothesis function $\mathcal{F}$ under the learnable parameter $\mathbf{W}$ ,

\mathcal{F}(\textbf{\textit{I}}_{m}, \textbf{\textit{I}}_f, \mathbf{W}) = (\mathbf{v}_{\phi}, \textbf{\textit{I}}_m^{(n)})

where $\mathbf{v}_{\phi} = [\bm{\phi}^{(1)}, \bm{\phi}^{(2)}, ..., \bm{\phi}^{(k)},..., \bm{\phi}^{(n)}]$ is a vector containing predicted deformations of all cascades. Each deformation $\bm{\phi}^{(k)}$ can be computed as

\bm{\phi}^{(k)} = {\mathcal{F}}^{(k)}\left(\textbf{\textit{I}}_{m}^{(k-1)}, \textbf{\textit{I}}_f, \mathbf{W}_{\phi^{(k)}}\right)

To estimate and achieve a good deformation, different networks are introduced to define and optimize the learnable parameter $\mathbf{W}$ .

Knowledge Distillation for Deformation

Knowledge distillation is the process of transferring knowledge from a cumbersome model (teacher model) to a distilled model (student model). The popular way to achieve this goal is to train the student model on a transfer set using a soft target distribution produced by the teacher model.

Different from the typical knowledge distillation methods that target the output softmax of neural networks as the knowledge, in the deformable registration task, we leverage the teacher deformation $\bm{\phi}_t$ as the transferred knowledge. As discussed in [2], teacher networks are usually high-performed networks with good accuracy. Therefore, our goal is to leverage the current state-of-the-art Recursive Cascaded Networks (RCN) [1] as the teacher network for extracting meaningful deformations to the student network. The RCN network contains an affine transformation and a large number of dense deformable registration sub-networks designed by VTN [3]. Although the teacher network has expensive computational costs, it is only applied during the training and will not be used during the inference.

Reference

[1] S. Zhao, Y. Dong, E. I. Chang, Y. Xu, et al., Recursive cascaded networks for unsupervised medical image registration, in ICCV, 2019.

[2] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, ArXiv, 2015.

[3] S. Zhao, T. Lau, J. Luo, I. Eric, C. Chang, and Y. Xu, Unsupervised 3d end-to-end medical image registration with volume tweening network, IEEE J-BHI, 2019.

Open Source

🐱 Github: https://github.com/aioz-ai/LDR_ALDK

Multiple Meta-model Quantifying for Medical Visual Question Answering

July 5, 2021 · 8 min read

Motivation

A medical Visual Question Answering (VQA) system can provide meaningful references for both doctors and patients during the treatment process. Extracting image features is one of the most important steps in a medical VQA framework which outputs essential information to predict answers. Transfer learning, in which the pretrained deep learning models that are trained on the large scale labeled dataset such as ImageNet, is a popular way to initialize the feature extraction process. However, due to the difference in visual concepts between ImageNet images and medical images, finetuning process is not sufficient. Recently, Model Agnostic Meta-Learning (MAML) has been introduced to overcome the aforementioned problem by learning meta-weights that quickly adapt to visual concepts. However, MAML is heavily impacted by the meta-annotation phase for all images in the medical dataset. Different from normal images, transfer learning in medical images is more challenging due to:

(i) noisy labels may occur when labeling images in an unsupervised manner;
(ii) high-level semantic labels cause uncertainty during learning;
(iii) difficulty in scaling up the process to all unlabeled images in medical datasets.

Deep Metric Learning Meets Deep Clustering - An Novel Unsupervised Approach for Feature Embedding

May 25, 2021 · 8 min read

Motivation

The objective of deep distance metric learning (DML) is to train a deep learning model that maps training samples into feature embeddings that are close together for samples that belong to the same category and far apart for samples from different categories. Traditional DML approaches require supervised information, i.e., class labels, to supervise the training. Although the supervised DML achieves impressive results on different tasks, it requires large amount of annotated training samples to train the model. Unfortunately, such large datasets are not always available and they are costly to annotate for specific domains. That disadvantage also limits the transferability of supervised DML to new domain/applications which do not have labeled data. These reasons have motivated recent studies aiming at learning feature embeddings without annotated datasets --- unsupervised deep distance metric learning (UDML). Our study is in that same direction, i.e., learning embeddings from unlabeled data.

There are two main challenges for UDML:

Firstly, how to define positive and negative samples for a given anchor data point, such that we can apply distance-based losses, e.g., pairwise loss or triplet loss, in the embedding space.
Secondly, how to make the training efficient, given a large number of pairs or triplets of samples, in the order of $\mathcal{O}(N^2)$ or $\mathcal{O}(N^3)$ , respectively, in which $N$ is the number of training samples.

In this paper, we propose a new method that utilizes deep clustering for deep metric learning to address the two challenges mentioned above. In particular,

We propose to use a deep clustering loss to learn centroids, i.e., pseudo labels, that represent semantic classes.
During learning, these centroids are also used to reconstruct the input samples. It hence ensures the representativeness of centroids — each centroid represents visually similar samples. Therefore, the centroids give information about positive (visually similar) and negative (visually dissimilar) samples.
Based on pseudo labels, we propose a novel unsupervised metric loss which enforces the positive concentration and negative separation of samples in the embedding space.

Methodology

Figure 1: Illustration of the proposed framework which consists of an encoder (G), an embedding module (F), a decoder (D) and three losses, i.e., clustering loss $L_{rim}$ , reconstruction loss $L_{rec}$ and metric loss $L_m$ . The details are presented in the text (Source).

The proposed framework is presented in Figure 1.

For every original image in a batch, we make an augmented version by using a random geometric transformation.
The input images are fed into the backbone network which is also considered as the encoder (G) to get image representations.
The image representations are passed through the embedding module which consists of fully connected and L2 normalization layers (F), which results in unit norm image embeddings.
The clustering module takes image embeddings as inputs, performs the clustering with a clustering loss, and outputs the cluster assignments.
Given the cluster assignments, centroid representations are computed from image representations, which are then passed through the decoder (D) with a reconstruction loss to reconstruct images that belong to the corresponding clusters.
The centroid representations are also passed through the embedding module (F) to get centroid embeddings. The centroid embeddings and image embeddings are used as inputs for the metric loss.

Discriminative Clustering

We formulate the clustering of embedding features as a classification problem. Given a set of embedding features $X=\{x_i\}_{i=1}^m \in \mathbb{R}^{128 \times m}$ in a batch and the number of clusters $K\le m$ (i.e., the number of clusters $K$ is limited by the batch size $m$ ). The cluster assignment for $x$ then is estimated by $c^* = argmax_c y$ . Let $Y=\{y_i\}_{i=1}^m$ be the set of softmax outputs for $X$ .

Inspired by Regularized Information Maximization (RIM), we use the following objective function (1) for the clustering.

L_{rim} = \mathcal{R}(\theta) - \lambda \left[ H(Y) - H(Y|X) \right] (1)

where $H(.)$ and $H(.|.)$ are entropy and conditional entropy, respectively; $\mathcal{R}(\theta)$ regularizes the classifier parameters (in this work we use $l_2$ regularization); $\lambda$ is a weighting factor to control the importance of two terms.

Minimizing (1) is equivalent to maximizing $H(Y)$ and minimizing $H(Y|X)$ . Increasing the marginal entropy $H(Y)$ encourages cluster balancing, while decreasing the conditional entropy $H(Y|X)$ encourages cluster separation.

Reconstruction

In order to enhance the representativeness of centroids, we introduce a reconstruction loss (2) that penalizes high reconstruction errors from centroids to corresponding samples. Specifically, the decoder takes a centroid representation of a cluster and minimizes the difference between input images that belong to the cluster and the reconstructed image from the centroid representation.

L_{rec} = \frac{1}{m} \sum_{j=1}^K\sum\limits_{I_i\in X_j}||I_i - D(r_j)||^2, (2)

where $D(.)$ is the decoder which reconstructs samples in the batch using their corresponding centroid representations and $m$ is the number of images in the batch.

Metric loss

Let $f_i \in \mathbb{R}^{128}$ and $\hat{f_i} \in \mathbb{R}^{128}$ be the image embeddings of $I_i$ and $\hat{I_i}$ , respectively. The proposed metric loss (3) aims to minimize the distance between $f_i$ and $\hat{f_i}$ while pushing $f_i$ far away from negative clusters.

L_{m} = -\sum_i log \left( l(I_i,\hat{I_i}) \right) - \sum_i\sum\limits_{j=1, j \neq q}^K log \left(l(I_i,c_j) \right) (3)

Final Loss

The network in Figure 1 is trained in an end-to-end manner with the following multi-task loss.

L = \alpha L_{m} + \beta L_{rim} + \gamma L_{rec}, (4)

where $L_m$ is the center-based softmax loss (3) for deep metric learning, $L_{rim}$ is the clustering loss (1), and $L_{rec}$ is the reconstruction loss (2).

Experimental Results

Ablation Study

We denote our model with:

only clustering loss (1) as only $L_{rim}$ .
both clustering and the metric losses (2) and (3) as Center-based Softmax (CBS).
Center-based Softmax with Reconstruction (CBSwR).

Table 1: The impact of each loss component on the performance on CUB200-2011 dataset and the comparison to the baseline (Source).

Table 2: The impact of each loss component on the performance on Car196 dataset and the comparison to the baseline (Source).

Tables 1 and 2 present the comparative results between methods. The results show that using only the clustering loss, the accuracy is significantly lower than the baseline SME. However, when using the centroids from the clustering for calculating the metric loss (i.e., CBS), it gives the performance boost over the baseline (i.e., SME). Furthermore, the reconstruction loss enhances the representativeness of centroids, as confirmed by the improvements of CBSwR over CBS on both datasets.

Table 3: The training time (seconds) of different methods on CUB200-2011 and Car196 datasets with 20 epochs. The models are trained on a NVIDIA GeForce GTX 1080-Ti GPU (Source).

Table 4: The impact of the number of clusters of the final model CBSwR on the performance on CUB200-2011 dataset (Source).

Table 3 presents the training time of different methods on the CUB200-2011 and Car196 datasets. Although the asymptotic complexity of CBSwR for training one batch is $\mathcal{O}(Km)$ , it also consists of a decoder part which affects the real training. It is worth noting that the decoder is only involved during training. During testing, our method has similar computational complexity as SME.

Table 4 presents the impact of the number of clusters $K$ in the clustering loss on the CUB200-2011 dataset with our proposed model CBSwR (recall that the number of clusters $K$ is limited by the batch size $m$ ). During training, the number of samples per clusters vary depending on batches and the number of clusters. At $K=32$ which is our final setting, the number of samples per cluster varies from 2 to 11, on the average. The retrieval performance is just slightly different for the different number of clusters. This confirms the robustness of the proposed method w.r.t. the number of clusters.

Comparison to the state of the art

Table 5: Clustering and Recall performance on the CUB200-2011 dataset (Source).

Table 6: Clustering and Recall performance on the Car196 dataset (Source).

Table 5 presents the comparative results on CUB200-2011 dataset. In terms of clustering quality (NMI metric), the proposed method and the state-of-the-art UDML methods MOM and SME achieve comparable accuracy. However, in terms of retrieval accuracy R@K, our method outperforms other approaches. Our proposed method is also competitive to most of the supervised DML methods.

Table 6 presents comparative results on Car196 dataset. Compared to unsupervised methods, the proposed method outperforms other approaches in terms of retrieval accuracy at all ranks of K. Our method is comparable to other unsupervised methods in terms of clustering quality.

Quantity Visualization

Figure 2: Barnes-Hut t-SNE visualization of our embedding on the CUB200-2011 dataset.

Figure 2 shows the t-SNE plots on our learned embedding features on CUB200-2011. We can see that our embedding produces reasonable results in grouping similar visual objects despite the significant variations in view-point, pose, and configuration.

Conclusion

We propose a new method that utilizes deep clustering for deep metric learning to address the two challenges in UDML, i.e., positive/negative mining and efficient training. The method is based on a novel loss that consists of a learnable clustering function, a reconstruction function, and a center-based metric loss function. Our experiments on CUB200-2011 and Car196 datasets show state-of-the-art performance on the retrieval task, compared to other unsupervised learning methods.

Open Source

🐱 Github: https://github.com/aioz-ai/BMVC20_CBSwR

Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering.

April 24, 2021 · 8 min read

Different approaches have been proposed to Visual Question Answering (VQA). However, few works are aware of the behaviors of varying joint modality methods over question type prior knowledge extracted from data in constraining answer search space, of which information gives a reliable cue to reason about answers for questions asked in input images. In this blog, we share a novel VQA model that utilizes the question-type prior information to improve VQA by leveraging the multiple interactions between different joint modality methods based on their behaviors in answering questions from different types. The solid experiments on two benchmark datasets, i.e., VQA 2.0 and TDIUC, indicate that the proposed method yields the best performance with the most competitive approaches.

Introduction

There are works that consider types of question as the side information whichgives a strong cue to reason about the answer. However, the relation between question types and answers from training data have not been investi-gated yet. Fig. 1 shows the correlation between question types and some answersin the VQA 2.0 dataset. It suggests that a question regarding the quantityshould be answered by a number, not a color. The observation indicated that theprior information got from the correlations between question types and answers open an answer search space constrain for the VQA model. The search spaceconstrain is useful for VQA model to give out final prediction and thus, improvethe overall performance. The Fig. 1 is consistent with our observation, e.g., itclearly suggests that a question regarding the quantity should be answered by anumber, not a color.
Fig1

Figure 1. The distribution of candidate answers in each question type in VQA 2.0. Although different joint modality methods or attention mechanisms have been proposed, we hypothesize that each method may capture different aspects of the input. That means different attentions may provide different answers for questions belonged to different question types.
Fig.2 shows examples in which the attention models (SAN and BAN) attend on different regions of input images when dealing with questions from different types. Unfortunately, most of recent VQA systems are based on single attention models BAN2, SAN, MLP, MCB, STL. From the above observation, it is necessary to develop a VQA system which leverages the power of different attention models to deal with questions from different question types.
Fig1

Figure 2. Examples of attention maps of different attention mechanisms. BAN and SAN identify different visual areas when answering questions from different types.

Methodology

The proposed multiple interaction learning with question-type prior knowledge (MILQT) is illustrated in Fig. 3. Similar to the most of the VQA systems, multiple interaction learning with question-type prior knowledge (MILQT) consists of the joint learning solution for input questions and images, followed by a multi-class classification over a set of predefined candidate answers. However, MILQT allows to leverage multiple joint modality methods under the guiding of question-types to output better answers.
Fig3

Figure 3. The introduced MILQT for VQA. As in Fig.3, MILQT consists of two modules: Question-type awareness

\mathcal{A}

, and Multi-hypothesis interaction learning

\mathcal{M}

. The first module aims to learn the question-type representation, which is further used to enhance the joint visual-question embedding features and to constrain answer search space through prior knowledge extracted from data. Based on the question-type information, the second module aims to identify the behaviors of multiple joint learning methods and then justify adjust contributions to giving out final predictions.

Question representation. Given an input question, follow the recent state-of-the-art BAN, we trim the question to a maximum of 12 words. The questions that are shorter than 12 words are zero-padded. Each word is then represented by a 600-D vector that is a concatenation of the 300-D GloVe word embedding and the augmenting embedding from training data. This step results in a sequence of word embeddings with size of $12 \times 600$ and is denoted as $f_w$ . In order to obtain the intent of question, the $f_w$ is passed through a Gated Recurrent Unit (GRU) which results in a 1024-D vector representation $f_q$ for the input question.

Image representation. We use bottom-up attention, i.e. an object detection which takes as FasterRCNN backbone, to extract image representation. At first, the input image is passed through bottom-up networks to get $K \times 2048$ bounding box representation which is denotes as $f_v$ in Fig. 3.

Multi-level multi-modal fusion. Unlike the previous works that perform only one level of fusion between linguistic and visual features that may limit the capacity of these models to learn a good joint semantic space. In our work, a multi-level multi-modal fusion that encourages the model to learn a better joint semantic space is introduced which takes the question-type representation got from question-type classification component as one of inputs.

First level multi-modal fusion: The first level fusion is similar to previous works. Given visual features $f_v$ , question features $f_{q}$ , and any joint modality mechanism,
we combines visual features with question features and learn attention weights to weight for visual and/or linguistic features. Different attention mechanisms have different ways for learning the joint semantic space. The output of first level multi-modal fusion is denoted as $f_{att}$ in the Fig.3.
Second level multi-modal fusion: In order to enhance the joint semantic space, the output of the first level multi-modal fusion $f_{att}$ is combined with the question-type feature $f_{qt}$ , which is the output of the last FC layer of the Question-type classification component.
We try two simple but effective operators, i.e. element-wise multiplication --- EWM or element-wise addition --- EWA, to combine $f_{att}$ and $f_{qt}$ . The output of the second level multi-modal fusion, which is denoted as $f_{att-qt}$ in Fig.3, can be seen as an attention representation that is aware of the question-type information.

Given an attention mechanism, the $f_{att-qt}$ will be used as the input for a classifier that predicts an answer for the corresponding question. This is shown at the Answer prediction boxes in the Fig.3.

Multi-hypothesis interaction learning As presented in Fig.3, MILQT allows to utilize multiple hypotheses (i.e., joint modality mechanisms). Specifically, we propose a multi-hypothesis interaction learning design $\mathcal{M}$ that takes answer predictions produced by different joint modality mechanisms and interactively learn to combine them.
Let $g \in \R^{A \times J}$ be the matrix of predicted probability distributions over $A$ answers from the $J$ joint modality mechanisms. $\mathcal{M}$ outputs the distribution $\rho \in \R^{A}$ , which is calculated from $g$ through Equation below:

\rho = \mathcal{M} \left(g,w_{mil}\right) = \sum_{j}\left(m^T_{qt-ans}w_{mil} \odot g\right)

$w_ {mil} \in \textbf{R}^{P \times J}$ is the learnable weight which control the contributions of $J$ considered joint modality mechanisms on predicting answer based on the guiding of $P$ question types; $\odot$ denotes Hardamard product.

Results

Experiments on VQA 2.0 test-dev and test-standard.

We evaluate MILQTon the test-dev and test-standard of VQA 2.0 dataset. To train the model,similar to previous works, we use both training set and validationset of VQA 2.0. We also use the Visual Genome as additional training data. MILQT consists of three joint modality mechanisms, i.e., BAN-2, BAN-2-Counter, and SAN accompanied with the EWM for the multi-modal fusion, andthe predicted question type together with the prior information to augment theVQA loss. Table 4 presents the results of different methods on test-dev and test-std of VQA 2.0. The results show that our MILQT yields the good performance with the most competitive approaches.

Table 1. Comparison to the state of the arts on the test-dev and test-standard of VQA 2.0. For fair comparison, Glove embedding and GRU are leveraged for question embedding and Bottom-up features are used to extract visual information. CMP, i.e.Cross-Modality with Pooling, is the LXMERT with the aforementioned setup (Source).

Experiments on TDIUC.

In order to prove the stability of MILQT, we evaluate MILQT on TDIUC dataset.
The results in Table.2 show that the proposed model establishes the state-of-the-art results on both evaluation metrics Arithmetic MPT and Harmonic MPT. Specifically, our model significantly outperforms the recent QTA, i.e., on the overall, we improve over QTA $6.1\%$ and $11.1\%$ with Arithemic MPT and Harmonic MPT metrics, respectively. It is worth noting that the results of QTA in Table. 2, which are cited from QTA, are achieved when QTA used the one-hot predicted question type of testing question to weight visual features. When using the groundtruth question type to weight visual features, QTA reported $69.11\%$ and $60.08\%$ for Arithemic MPT and Harmonic MPT metrics, respectively. Our model also outperforms these performances a large margin, i.e., the improvements are $3.9\%$ and $6.8\%$ for Arithemic MPT and Harmonic MPT metrics, respectively.

Table 2. The comparative results between the proposed model and other models onthe validation set of TDIUC (Source).

Conclusion

We present a multiple interaction learning with question-type prior knowledge for constraining answer search space--- MILQT that takes into account the question-type information to improve the VQA performance at different stages. The system also allows to utilize and learn different attentions under a unified model in an interacting manner. The extensive experimental results show that all proposed components improve the VQA performance. We yields the best performance with the most competitive approaches on VQA 2.0 and TDIUC dataset.

Open Source

Github: https://github.com/aioz-ai/ECCVW20_MILQT

A Brief Introduction to Visual Question Answering

April 22, 2021 · 5 min read

1. Visual Question Answering - Overview

Visual Question Answering (VQA) aims to figure out a correct answer for a given question consistent with the visual content of a given image. The overarching goal of this issue is to create systems that can comprehend the contents of an image in the same way that humans do and communicate effectively about that image in natural language. It is indeed a challenging task as it necessitates the interaction and complementation of both image feature extractor and natural language processor.

There are two main variants of VQA which are Free-Form Opened-Ended (FFOE) VQA and Multiple Choice (MC) VQA. In FFOE VQA, an answer is a free-form response to a given image-question pair input, while in MC VQA, an answer is chosen from an answer list for a given image-question pair input. The discussion of VQA variants will be shared in the next post.

2. Approaches for solving VQA task.

There are three main approaches for VQA:

*Compositional VQA models:* the questions are interpreted as a set of many sub-tasks.
Bayesian and Question-Aware models: this method is not suitable for use in systems that respond to image-related questions. Since the algorithm based on this method does not try looking at the picture and instead predicts the response based on the Bayesian model by determining the probability of the words in the dataset's responses.
Attention based models: this method try to learn the interaction between image and question features in VQA task through a module called attention. Then, the joint features got from that module are leveraged for answering the corresponding question.

The final one is the most successful approach since recent states of the arts included attention mechanisms.

3. Attention based VQA approach.

In general, attention based VQA approaches have four main steps (See Figure 1):

Visual Representation: Encode the information from the image into vector(s) by using Convolutional Neural Network (CNN)
Textual Representation: Encode the information of question into vector(s) by using Embedding.
Joint Representation: A further step to learn the interaction between question(s) and image(s). Output joint features can be vector(s).
Answer prediction: the joint features from the previous step are then passed through this module to obtain the predicted answer. This module is mostly formed by a Classification.

Overall

Figure 1. The general approach for Visual Question Answering.

3.1. Visual Representation

The basic attributes or aspects that clearly help us recognize a specific object, image, or something are known as features. The distinguishing characteristics are the distinct properties. When operating on a VQA dataset, we must extract the features of various images in order to separate the images based on specific features or aspects. Image features are one of the most important pieces of information for a VQA system to output the correct answer.

Convolutional neural networks have emerged as the gold standard for image pattern recognition. An input image is converted into image features after it is passed through a convolutional network. Each filter in a CNN layer detects various patterns, such as corners, vertex, shapes, curves, and symmetries (See Figure 2).

Img Extraction

Figure 2. An example of feature extraction in VQA classification.

The majority of VQA literature employs CNNs for image processing. The network's final layer is removed, and the remaining network is used to extract image features. For image representation in VQA, objects in images represented by features extracted from an object detector such as the Faster-RCNN bottom-up model.

3.2. Textual Representation

Textual embeddings can be offered in a variety of ways. Count-based and frequency-based techniques such as count vectorization and TF-IDF are examples of older approaches. There are also prediction-based approaches such as a continuous bag of words and skip grams. Pretrained Word2Vec models are also openly accessible. Embeddings can also be created using deep learning architectures such as RNNs, LSTMs, GRUs, and 1-D CNNs. LSTMs are one of the most often used in VQA literature. For question embedding in VQA, Glove or BERT are used widely for capturing the representation of words and sentences in different contexts (See Figure 3 for a sample structure of question embedding).

QEmb

Figure 3. An example of question embedding for VQA.

3.3. Joint Representation

In current VQA systems, the joint modality component plays an essential role since it would learn meaningful joint representations between linguistic and visual inputs by applying the attention mechanism. There are many works that learn the interaction between question and image. For instance, a novel trilinear interaction model which simultaneously learns high level associations between image, question and answer information- CTI (Do et al. 2018). See Figure 4 for more details.

CTI

Figure 4. Compact Trilinear Interaction mechanism for VQA (Source).

3.4. Answer Prediction

In most recent works, the joint features got from the attention mechanism is then passed through a classifier to output predicted answer. However, more modules can also be applied to produce external knowledge and deal with difficult questions. ````

Overcoming Data Limitation in Medical Visual Question Answering

April 22, 2021 · 8 min read

What are the difficulties when dealing with Medical VQA task?

Visual Question Answering (VQA) aims to provide a correct answer to a given question such that the answer is consistent with the visual content of a given image.

In medical domain, VQA could benefit both doctors and patients. For example, doctors could use answers provided by VQA system as support materials in decision making, while patients could ask VQA questions related to their medical images for better understanding their health.

Figure 1: An example of Medical VQA (Source).

However, one major problem with medical VQA is the lack of large scale labeled training data which usually requires huge efforts to build.

The first attempt for building the dataset for medical VQA is by ImageCLEF-Med. In this, images were automatically captured from PubMed Central articles. The questions and answers were automatically generated from corresponding captions of images. By that construction, the data has high noisy level, i.e., the dataset includes many images that are not useful for direct patient care and it also contains questions that do not make any sense.
Recently, the first manually constructed VQA-RAD dataset for medical VQA task is released. Unfortunately, it contains only 315 images, which prevents to directly apply the powerful deep learning models for the VQA problem. One may think about the use of transfer learning in which the pretrained deep learning models that are trained on the large scale labeled dataset such as ImageNet are used for finetuning on the medical VQA. However, due to difference in visual concepts between ImageNet images and medical images, finetuning with very few medical images is not sufficient.

Therefore it is necessary to develop a new VQA framework that can improve the accuracy while still only needs a small labeled training data.

The motivation for our approach to overcome the data limitation of medical VQA comes from two observations:

Firstly, we observe that there are large scale unlabeled medical images available. These images are from same domain with medical VQA images. Hence if we train an unsupervised deep learning model using these unlabeled images, the trained weights may be easier to be adapted to the medical VQA problem than the pretrained weights on ImageNet images.
Another observation is that although the labeled dataset VQA-RAD is primarily designed for VQA, by spending a little effort, we can extract the new class labels for that dataset. The new class labels allow us to apply the recent meta-learning technique for learning meta-weights, that can be quickly adapted to the VQA problem later.

Methodology

The proposed medical VQA framework is presented in Figure 2. In our framework, the image feature extraction component is initialized by pretrained weights from MAML and CDAE. After that, the VQA framework will be finetuned in an end-to-end manner on the medical VQA data. In the following sections, we detail the architectures of MAML, CDAE, and our framework.

Figure 2: The proposed medical VQA. The image feature extraction is denoted as 'Mixture of Enhanced Visual Features (MEVF)' and is marked with the red dashed box. The weights of MEVF are intialized by MAML and CDAE (Source).

Model-Agnostic Meta-Learning -- MAML

The MAML model consists of four $3\times3$ convolutional layers with stride $2$ and is ended with a mean pooling layer; each convolutional layer has $64$ filters and is followed by a ReLu layer.

We create the dataset for training MAML by manually reviewing around three thousand question-answer pairs from the training set of VQA-RAD dataset. In our annotation process, images are split into three parts based on its body part labels (head, chest, abdomen). Images from each body part are further divided into three subcategories based on the interpretation from the question-answer pairs corresponding to the images. These subcategories are: 1. normal images in which no pathology is found. 2. abnormal present images in which there are the existence of fluid, air, mass, or tumor. 3. abnormal organ images in which the organs are large in size or in wrong position.

Thus, all the images are categorized into 9 classes:

| head normal      | head abnormal present     | head abnormal organ           |
 | chest normal     | chest abnormal organ      | chest abnormal present        |
 | abdominal normal | abdominal abnormal organ  | abdominal abnormal present    |

For every iteration of MAML training (line 3 in Alg.1), 5 tasks are sampled per iteration. For each task, we randomly select 3 classes (from 9 classes). For each class, we randomly select 6 images in which 3 images are used for updating task models and the remaining 3 images are used for updating meta-model.

Denoising Auto Encoder -- CDAE

The encoder maps an image $x'$ , which is the noisy version of the original image $x$ , to a latent representation $z$ which retains useful amount of information. The decoder transforms $z$ to the output $y$ . The training algorithm aims to minimize the reconstruction error between $y$ and the original image $x$ as follows

L_{rec} = \left \| x-y \right \|_2^2

In our design, the encoder is a stack of convolutional layers; each of them is followed by a max pooling layer. The decoder is a stack of deconvolutional and convolutional layers. The noisy version $x'$ is achieved by adding Gaussian noise to the original image $x$ .

To train CDAE, we collect $11,779$ unlabeled images available online which are brain MRI images, chest X-ray images and CT abdominal images. The dataset is split into train set with $9,423$ images and test set with $2,356$ images. We use Gaussian noise to corrupt the input images before feeding them to the encoder.

Our VQA framework

After training MAML and CDAE, we use their trained weights to initialize the MEVF image feature extraction component in the VQA framework. We then finetune the whole VQA model using the training set of VQA-RAD dataset.

To train the proposed model, we introduce a multi-task loss func-tion to incorporate the effectiveness of the CDAE to VQA. Formally, our lossfunction is defined as follows:

L = \alpha_1 L_{vqa} + \alpha_2 L_{rec}

where $L_{vqa}$ is a Cross Entropy loss for VQA classification and $L_{rec}$ stands for the reconstruction loss of CDAE . The whole VQA model is finetuned in an end-to-end manner.

Results

Table 1: VQA results on VQA-RAD test set. All reference methods differ at the image feature extraction component. Other components are similar. The Stacked Attention Network (SAN) is used as the attention mechanism in all methods (Source).

Table 1 presents VQA accuracy in both VQA-RAD open-ended and close-ended questions on the test set. The results show that for both MAML and CDAE, by firstly pretraining then finetuning, the finetuning significantly improves the performance over the training from scratch using only VQA-RAD.

In addition, the results also show that our pretraining and finetuning of MAML and CDAE give better performance than the finetuning of VGG-16 which is pretrained on the ImageNet dataset. Our proposed image feature extraction MEVF which leverages both pretrained weights of MAML and CDAE, then finetuning them give the best performance. This confirms the effectiveness of the proposed MEVF for dealing with the limitation of labeled training data for medical VQA.

Table 2: Performance comparison on VQA-RAD test set (Source).

Table 2 presents comparative results between methods. Note that for the image feature extraction, the baselines use the pretrained models (VGG or ResNet) that have been trained on ImageNet and then finetune on the VQA-RAD dataset. For the question feature extraction, all baselines and our framework use the same pretrained models (i.e., Glove) and finetuning on VQA-RAD. The results show that when BAN or SAN is used as the attention mechanism in our framework, it significantly outperforms the baseline frameworks BAN and SAN. Our best setting, i.e. the one with BAN as the attention, achieves the state-of-the-art results and it significantly outperforms the best baseline framework BAN, i.e., the improvements are $16.3\%$ and $8.6\%$ on open-ended and close-ended VQA, respectively.

Conclusion

In this paper, we proposed a novel medical VQA framework that leverages the meta-learning MAML and denoising auto-encoder CDAE for image feature extraction in order to overcome the limitation of labeled training data. Specifically, CDAE helps to leverage information from the large scale unlabeled images, while MAML helps to learn meta-weights that can be quickly adapted to the VQA problem. We establish new state-of-the-art results on VQA-RAD dataset for both close-ended and open-ended questions.

Open Source

🐱 Github: https://github.com/aioz-ai/MICCAI19-MedVQA

Graph-based Person Signature for Person Re-Identifications

April 10, 2021 · 9 min read

Motivation

Person re-identification (ReID) aims to retrieve a particular person image in a collection of images captured by multiple cameras from various viewpoints across time.

The challenges of the person ReID task come from significant variations of human attributes such as poses, gaits, clothes, as well as challenging environmental settings like illumination, complex background, and occlusions. With the rise of deep learning, most of the recent studies utilize Convolutional Neural Network (CNN) to tackle the person ReID problem.

Recently, attribute-based methods have shown great success in providing semantic features for the deep network. Unlike the person identity label, which offers only coarse information to identify one identity among all other person identities, the attributes are the detailed descriptions that are highly intuitive and mostly unchanged between images captured from different cameras. Therefore, they can be used to explicitly guide the model to learn a robust person representation by defining human characteristics.

In this work, we propose to utilize the person attribute information with its associated body part to encode the visual person signature in one unified framework.

We hypothesize that the detailed person descriptions (attributes labels) can be integrated with visual features (body parts and global features) to create a unique signature for a particular person.
Since both body parts and attributes provide local representations, by linking them together, the network can have a better understanding of the relationship between visual features and attribute descriptions.
Although previous works have investigated how person identity, body parts, and attributes benefit the task of person ReID, our key difference is that we utilize Graph Convolutional Networks (GCN) to effectively construct and model the correlation between attributes and body parts with global features. In particular, we treat body part regions and attributes as nodes in a graph and utilize a GCN to learn the topological structure of a person's signatures. The GCN propagates messages on a graph structure. After message traversal on the graph, the node's final representations are obtained from its data and from other node's information. Figure 1 shows the effectiveness of our approach.

Figure 1: The effectiveness of our GPS in improving retrieval results on Market-1501 dataset. The details are presented in the text (Source).

Methodology

Figure 2: Illustration of our proposed framework including two branches: (1) global branch which extracts person global features; (2) GPS branch which performs reasoning the person attributes and body parts using GCN. The details are presented in the text (Source).

The proposed framework is presented in Figure 2.

We denote $I$ is a probe person image. This probe image $I$ is first passed through a backbone CNN to get the feature map $\mathbf{F}$ .
By utilizing a human parsing pretrained model, we extract the body part masks to obtain the visual features of each part.
The person attributes are then represented by a lookup word embedding.
Given body part features and attribute features, we construct the Graph-based Person Signature which includes attribute nodes and body part nodes conditioned on the correlation matrix. We employ the GCN for reasoning on the person signature graph and encoding the graph into more representativeness features.

Our proposed method is a multi-branch multi-task framework for person ReID, where the main branch performs the verification task by optimizing two well-known loss functions: Triplet loss and Center loss. The auxiliary branch performs reasoning on the proposed person signature graph and solves the attribute recognition as well as the person identity classification tasks.