15 posts tagged with "type: post"

View All Tags

Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge (Part 3)

In this part, we will show the effectivness and the ablation studies of Light-weight Deformable Registration Network and Adversarial Learning Algorithm with Distilling Knowledge.


As mentioned in [1], we train method on two types of scans: Liver CT scans and Brain MRI scans.

For Liver CT scans, we use 5 datasets:

  1. LiTS contains 131 liver segmentation scans.
  2. MSD has 70 liver tumor CT scans, 443 hepatic vessels scans, and 420 pancreatic tumor scans.
  3. BFH is a smaller dataset with 92 scans.
  4. SLIVER is a challenging dataset with 20 liver segmentation scans and annotated by 3 expert doctors.
  5. LSPIG (Liver Segmentation of Pigs) contains 17 pairs of CT scans from pigs, provided by the First Affiliated Hospital of Harbin Medical University.

For Brain MRI scans, we use 4 datasets: 1. ADNI contains 66 scans. 2. ABIDE contains 1287 scans. 3. ADHD contains 949 scans. 4. LPBA has 40 scans, each featuring a segmentation ground truth of 56 anatomical structures.


We compare LDR ALDK method with the following recent deformable registration methods:

  • ANTs SyN and Elastix B-spline are methods that find an optimal transformation by iteratively update the parameters of the defined alignment.
  • VoxelMorph predicts a dense deformation in an unsupervised manner by using deconvolutional layers.
  • VTN is an end-to-end learning framework that uses convolutional neural networks to register 3D medical images, especially large displaced ones.
  • RCN is a recent recursive deep architecture that utilizes learnable cascade and performs progressive deformation for each warped image.


Table 1 summarizes the overall performance, testing speed, and the number of parameters compared with recent state-of-the-art methods in the deformable registration task. The results clearly show that Light-weight Deformable Registration network (LDR) accompanied by Adversarial Learning with Distilling Knowledge (ALDK) algorithm significantly reduces the inference time and the number of parameters during the inference phase. Moreover, the method achieves competitive accuracy with the most recent highly performed but expensive networks, such as VTN or VoxelMorph. We notice that this improvement is consistent across all experiments on different datasets SLIVER, LiTS, LSPIG, and LPBA.

In particular, we observe that on the SLIVER dataset the Dice score of best model with 3 cascades (3-cas LDR + ALDK) is 0.3% less than the best result of 3-cas VTN + Affine, while inference speed is ?21 times faster on a CPU and the parameters used during inference is ~8 times smaller. Including benchmarking results in three other datasets, i.e., LiTS, LSPIG, and LPBA, light-weight model only trades off an average of 0.5% in Dice score and 1.25% in Jacc score for a significant gain of speed and a massive reduction in the number of parameters. We also notice that method is the only work that achieves the inference time of approximately 1s on a CPU. This makes method well suitable for deployment as it does not require expensive GPU hardware for inference.



Ablation Study

Effectiveness of ALDK. Table 2 summarizes the effectiveness of Adversarial Learning with Distilling Knowledge (ALDK) when being integrated into the light-weight student network. Note that LDR without ALDK is trained using only the reconstruction loss in an unsupervised learning setup. From this table, we clearly see that ALDK algorithm improves the Dice score of the LDR tested in the SLIVER dataset by 3.4%, 4.0%, and 3.1% for 1-cas, 2-cas, and 3-cas setups, respectively. Additionally, using ALDK also increases the Jacc score by 5.2%, 4.9%, and 3.9% for 1-cas LDR, 2-cas LDR, and 3-cas LDR. These results verify the stability of adversarial learning algorithm in the inference phase, under the differences evaluation metrics, as well as the number of cascades setups. Furthermore, Table 2 also clearly shows the effectiveness and generalization of ALDK when being applied to the student network. Since the deformations extracted from the teacher are used only in the training period, adversarial learning algorithm fully maintains the speed and the number of parameters for the light-weight student network during inference. All results indicate that student network incorporated with the adversarial learning algorithm successfully achieves the performance goal, while maintaining the efficient computational cost of the light-weight setup.



Accuracy vs. Complexity. Figure 1 demonstrates the experimental results from the SLIVER dataset between LDR + ALDK and the baseline VTN under multiple recursive cascades setup on both CPU and GPU. On the CPU (Figure 1-a), in terms of the 1-cascade setup, the Dice score of method is 0.2% less than VTN while the speed is ~15 times faster. The more the number of cascades is leveraged, the higher the speed gap between LDR + ALDK and the baseline VTN, e.g. the CPU speed gap is increased to ~21 times in a 3-cascades setup. We also observe the same effect on GPU (Figure 1-b), where method achieves slightly lower accuracy results than VTN, while clearly reducing the inference time. These results indicate that LDR + ALDK can work well with the teacher network to improve the accuracy while significantly reducing the inference time on both CPU and GPU in comparison with the baseline VTN network.


Figure 1:Plots of Dice score and Inference speed with respect to the number of cascades of the baseline Affine + VTN and LDR + ALDK. (a) for CPU speed and (b) for GPU speed. Note that results are reported for the SLIVER dataset; bars represent the CPU speed; lines represent the Dice score. All methods use an Intel Xeon E5-2690 v4 CPU and Nvidia GeForce GTX 1080 Ti GPU for inference.


Figure 2 illustrates the visual comparison among 1-cas LDR, 1-cas LDR + ALDK, and the baseline 1-cas RCN. Five different moving images in a volume are selected to apply the registration to a chosen fixed image. It is important to note that though the sections of the warped segmentations can be less overlap with those of the fixed one, the segmentation intersection over union is computed for the volume and not the sections. In the segmented images in Figure 2, besides the matched area colored by white, we also marked the miss-matched areas by red for an easy-to-read purpose.

From Figure 2, we can see that the segmentation resutls of 1-cas LDR network without using ALDK (Figure 2-a) contains many miss-matched areas (denoted in red color). However, when we apply ALDK to the student network, the registration results are clearly improved (Figure 2-b). Overall, LDR + ALDK visualization results in Figure 2-b are competitive with the baseline RCN network (Figure 2-c). This visualization confirms that framework for deformable registration can achieve comparable results with the recent RCN network.


Figure 2:The visualization comparison between LDR (a), LDR + ALDK (b), and the baseline RCN (c). The left images are sections of the warped images; the right images are sections of the warped segmentation (white color represents the matched areas between warped image and fixed image, red color denotes the miss-matched areas). The segmentation visualization indicates that LDR + ALDK (b) method reduces the miss-matched areas of the student network LDR (a) significantly. Best viewed in color.


[1] Tran, Minh Q., et al. "Light-weight deformable registration using adversarial learning with distilling knowledge." IEEE Transactions on Medical Imaging, 2022.

Open Source

🐱 Github: https://github.com/aioz-ai/LDR_ALDK

Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge (Part 2)

In this part, we will introduce the Architecture of Light-weight Deformable Registration Network and Adversarial Learning Algorithm with Distilling Knowledge.

The Architecture of Light-weight Deformable Registration Network

In practice, recent deformation networks follow an encoder-decoder architecture and use 3D convolution to progressively down-sample the image, and deconvolution (transposed convolution) to recover spatial resolution [1, 3]. However, this setup consumes a large number of parameters. Therefore, the built models are computationally expensive and time-consuming. To overcome this problem we design a new light-weight student network as illustrated in Figure 1.

In particular, the proposed light-weight network has four convolution layers and three deconvolution layers. Each convolutional layer has a bank of 4Γ—4Γ—44 \times 4 \times 4 filters with strides of 2Γ—2Γ—22 \times 2 \times 2, followed by a ReLU activation function. The number of output channels of the convolutional layers starts with 1616 at the first layer, doubling at each subsequent layer, and ends up with 128128. Skip connections between the convolutional layers and the deconvolutional layers are added to help refine the dense prediction. The subnetwork outputs a dense flow prediction field, i.e., a 33 channels volume feature map with the same size as the input.

In comparison with the current state-of-the-art dense deformable registration network [3], the number of parameters of our proposed light-weight student network is reduced approximately 1010 times. In practice, this significant reduction may lead to an accuracy drop. Therefore, we propose a new Adversarial Learning with Distilling Knowledge algorithm to effectively leverage the teacher deformations Ο•t\phi_t to our introduced student network, making it light-weight but achieving competitive performance.


Figure 1: The structure of Light-weight Deformable Registration student network. The number of channels is annotated above the layer. Curved arrows represent skip paths (layers connected by an arrow are concatenated before transposed convolution). Smaller canvas means lower spatial resolution (Source).

Adversarial Learning Algorithm with Distilling Knowledge

Our adversarial learning algorithm aims to improve the student network accuracy through the distilled teacher deformations extracted from the teacher network. The learning method comprises a deformation-based adversarial loss Ladv\mathcal{L}_{adv} and its accompanying learning strategy (Algorithm 1).


Figure 2: Adversarial Learning Strategy(Source).

Adversarial Loss. The loss function for the light-weight student network is a combination of the discrimination loss ldisl_{dis} and the reconstruction loss lresl_{res}. However, the forward and backward process through loss function is controlled by the Algorithm 1. In particular, the last deformation loss Ladv\mathcal{L}_{adv} that outputs the final warped image can be written as:

Ladv=Ξ³lrec+(1βˆ’Ξ³)ldis\mathcal{L}_{adv} = \gamma l_{rec} + (1 - \gamma) l_{dis}

where Ξ³\gamma controls the contribution between lrecl_{rec} and ldisl_{dis}. Note that, the Ladv\mathcal{L}_{adv} is only applied for the final warped image.

Discrimination Loss. In the student network the discrimination loss is computed in Equation below}.

ldis=βˆ₯DΞΈ(Ο•s)βˆ’DΞΈ(Ο•t)βˆ₯22+Ξ»(βˆ₯βˆ‡Ο•^sDΞΈ(Ο•^s)βˆ₯2βˆ’1)2l_{{dis}} = \left\lVert D_\mathbf{\theta}(\phi_{s}) - D_\mathbf{\theta}(\phi_{t}) \right\lVert_2^{2} + \lambda\bigg(\left\lVert \nabla_{\hat\phi_{s}}D_\mathbf{\theta}(\hat\phi_{s}) \right\lVert_2 - 1\bigg)^{2}

where Ξ»\lambda controls gradient penalty regularization. The joint deformation Ο•^s\hat\phi_{s} is computed from the teacher deformation Ο•t\phi_{t} and the predicted student deformation Ο•s\phi_{s} as follow:

Ο•^s=Ξ²Ο•t+(1βˆ’Ξ²)Ο•s\hat\phi_{s} = \beta \phi_{t} + (1 - \beta) \phi_{s}

where Ξ²\beta control the effect of the teacher deformation.

In Discrimination Loss, DΞΈD_\mathbf{\theta} is the discriminator, formed by a neural network with learnable parameters ΞΈ{\theta}. The details of DΞΈD_\mathbf{\theta} is shown in Figure 3. In particular, DΞΈD_\mathbf{\theta} consists of six 3D3D convolutional layers, the first layer is 128Γ—128Γ—128Γ—3128 \times 128 \times 128 \times 3 and takes the cΓ—cΓ—cΓ—1c \times c \times c \times 1 deformation as input. cc is equaled to the scaled size of the input image. The second layer is 64Γ—64Γ—64Γ—1664 \times 64 \times 64 \times 16. From the second layer to the last convolutional layer, each convolutional layer has a bank of 4Γ—4Γ—44 \times 4 \times 4 filters with strides of 2Γ—2Γ—22 \times 2 \times 2, followed by a ReLU activation function except for the last layer which is followed by a sigmoid activation function. The number of output channels of the convolutional layers starts with 1616 at the second layer, doubling at each subsequent layer, and ends up with 256256.

Basically, this is to inject the condition information with a matched tensor dimension and then leave the network learning useful features from the condition input. The output of the last neural layer is the mean feature of the discriminator, denoted as MM. Note that in the discrimination loss, a gradient penalty regularization is applied to deal with critic weight clipping which may lead to undesired behavior in training adversarial networks.


Figure 3: The structure of the discriminator DΞΈD_\mathbf{\theta} used in the Discrimination Loss (ldisl_{dis}) of our Adversarial Learning with Distilling Knowledge algorithm (Source).

Reconstruction Loss. The reconstruction loss lrecl_{rec} is an important part of a deformation estimator. Follow the VTN [3] baseline, the reconstruction loss is written as:

lrec(Imh,If)=1βˆ’CorrCoef[Imh,If]l_{{rec}} (\textbf{\textit{I}}_m^h,\textbf{\textit{I}}_f) = 1 - CorrCoef [\textbf{\textit{I}}_m^h,\textbf{\textit{I}}_f]


CorrCoef[I1,I2]=Cov[I1,I2]Cov[I1,I1]Cov[I2,I2]CorrCoef[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] = \frac{Cov[\textbf{\textit{I}}_1,\textbf{\textit{I}}_2]}{\sqrt{Cov[\textbf{\textit{I}}_1,\textbf{\textit{I}}_1]Cov[\textbf{\textit{I}}_2,\textbf{\textit{I}}_2]}}
Cov[I1,I2]=1βˆ£Ο‰βˆ£βˆ‘xβˆˆΟ‰I1(x)I2(x)βˆ’1βˆ£Ο‰βˆ£2βˆ‘xβˆˆΟ‰I1(x)βˆ‘yβˆˆΟ‰I2(y)Cov[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] = \frac{1}{|\omega|}\sum_{x \in \omega} \textbf{\textit{I}}_1(x)\textbf{\textit{I}}_2(x) - \frac{1}{|\omega|^{2}}\sum_{x \in \omega} \textbf{\textit{I}}_1(x)\sum_{y \in \omega}\textbf{\textit{I}}_2(y)

where CorrCoef[I1,I2]CorrCoef[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] is the correlation between two images I1\textbf{\textit{I}}_1 and I2\textbf{\textit{I}}_2, Cov[I1,I2]Cov[\textbf{\textit{I}}_1, \textbf{\textit{I}}_2] is the covariance between them. Ο‰\omega denotes the cuboid (or grid) on which the input images are defined.

Learning Strategy. The forward and backward of the aforementioned Ladv\mathcal{L}_{adv} is controlled by the adversarial learning strategy described in Algorithm 1.

In our deformable registration setup, the role of real data and attacking data is reversed when compared with the traditional adversarial learning strategy. In adversarial learning, the model uses unreal (generated) images as attacking data, while image labels are ground truths. However, in our deformable registration task, the model leverages the unreal (generated) deformations from the teacher as attacking data, while the image is the ground truth for the model to reconstruct the input information. As a consequence, the role of images and the labels are reversed in our setup. Since we want the information to be learned more from real data, the generator will need to be considered more frequently. Although the knowledge in the discriminator is used as attacking data, the information it supports is meaningful because the distilled information is inherited from the high-performed teacher model. With these characteristics of both the generator and discriminator, the light-weight student network is expected to learn more effectively and efficiently.


[1] S. Zhao, Y. Dong, E. I. Chang, Y. Xu, et al., Recursive cascaded networks for unsupervised medical image registration, in ICCV, 2019.

[2] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, ArXiv, 2015.

[3] S. Zhao, T. Lau, J. Luo, I. Eric, C. Chang, and Y. Xu, Unsupervised 3d end-to-end medical image registration with volume tweening network, IEEE J-BHI, 2019.

Open Source

🐱 Github: https://github.com/aioz-ai/LDR_ALDK

Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge

Introduction: Medical image registration

Medical image registration is the process of systematically placing separate medical images in a common frame of reference so that the information they contain can be effectively integrated or compared. Applications of image registration include combining images of the same subject from different modalities, aligning temporal sequences of images to compensate for the motion of the subject between scans, aligning images from multiple subjects in cohort studies, or navigating with image guidance during interventions. Since many organs do deform substantially while being scanned, the rigid assumption can be violated as a result of scanner-induced geometrical distortions that differ between images. Therefore, performing deformable registration is an essential step in many medical procedures.

Previous Studies, Remaining Challenges, and Motivation

Recently, learning-based methods have become popular to tackle the problem of deformable registration. These methods can be split into two groups: (i) supervised methods that rely on the dense ground-truth flows obtained by either traditional algorithms or simulating intra-subject deformations. Although these works achieve state-of-the-art performance, they require a large amount of manually labeled training data, which are expensive to obtain; and (ii) unsupervised learning methods that use a similarity measurement between the moving and the fixed image to utilize a large amount of unlabelled data. These unsupervised methods achieve competitive results in comparison with supervised methods. However, their deformations are reconstructed without the direct ground-truth guidance, hence leading to the limitation of leveraging learnable information. Furthermore, recent unsupervised methods all share an issue of great complexity as the network parameters increase significantly when multiple progressive cascades are taken into account. This leads to the fact that these works can not achieve real-time performance during inference while requiring intensively computational resources when deploying.

In practice, there are many scenarios when medical image registration are needed to be fast - consider matching preoperative and intra-operative images during surgery, interactive change detection of CT or MRI data for a radiologist, deformation compensation or 3D alignment of large histological slices for a pathologist, or processing large amounts of images from high-throughput imaging methods. Besides, in many image-guided robotic interventions, performing real-time deformable registration is an essential step to register the images and deal with organs that deform substantially. Economically, the development of a CPU-friendly solution for deformable registration will significantly reduce the instrument costs equipped for the operating theatre, as it does not require GPU or cloud-based computing servers, which are costly and consume much more power than CPU. This will benefit patients in low- and middle-income countries, where they face limitations in local equipment, personnel expertise, and budget constraints infrastructure. Therefore, design an efficient model which is fast and accurate for deformable registration is a crucial task and worth for study in order to improve a variety of surgical interventions.


Deformable registration is a crucial step in many medical procedures such as image-guided surgery and radiation therapy. Most recent learning-based methods focus on improving the accuracy by optimizing the non-linear spatial correspondence between the input images. Therefore, these methods are computationally expensive and require modern graphic cards for real-time deployment. Thus, we introduce a new Light-weight Deformable Registration network that significantly reduces the computational cost while achieving competitive accuracy (Fig.1). In particular, we propose a new adversarial learning with distilling knowledge algorithm that successfully leverages meaningful information from the effective but expensive teacher network to the student network. We design the student network such as it is light-weight and well suitable for deployment on a typical CPU. The extensively experimental results on different public datasets show that our proposed method achieves state-of-the-art accuracy while significantly faster than recent methods. We further show that the use of our adversarial learning algorithm is essential for a time-efficiency deformable registration method.


Figure 1: Comparison between typical deep learning-based methods for deformable registration (a) and our approach using adversarial learning with distilling knowledge for deformable registration (b). In our work, the expensive Teacher Network is used only in training; the Student Network is light-weight and inherits helpful knowledge from the Teacher Network via our Adversarial Learning algorithm. Therefore, the Student Network has high inference speed, while achieving competitive accuracy (Source).


Method overview

We describe our method for Light-weight Deformable Registration using Adversarial Learning with Distilling Knowledge. Our method is composed of three main components: (i)) a Knowledge Distillation module which extracts meaningful deformations Ο•t\bm{\phi_t} from the Teacher Network; (ii) a Light-weight Deformable Registration (LDR) module which outputs a high-speed Student Network; and (iii) an Adversarial Learning with Distilling Knowledge (ALDK) algorithm which effectively leverages teacher deformations Ο•t\bm{\phi}_t to the student deformations. An overview of our proposed deformable registration method can be found in Fig.2.


Figure 2: An overview of our proposed Light-weight Deformable Registration (LDR) method using Adversarial Learning with Distilling Knowledge (ALDK). Firstly, by using knowledge distillation, we extract the deformations from the Teacher Network as meaningful ground-truths. Secondly, we design a light-weight student network, which has competitive speed. Finally, We employ the Adversarial Learning with Distilling Knowledge algorithm to effectively transfer the meaningful knowledge of distilled deformations from the Teacher Network to the Student Network (Source).

Since the content may over-length, in this part, we introduce the background theory for Deformable Registration and Knowledge Distillation for Deformation. In the next part, we will introduce the Architecture of Light-weight Deformable Registration Network and Adversarial Learning Algorithm with Distilling Knowledge. In the final part, we will introduce the effectiveness of the method in comparison with recent states of the arts and detailed analysis.

Background: Deformable Registration

We follow RCN [1] to define deformable registration task recursively using multiple cascades. Let Im,If\textbf{\textit{I}}_m, \textbf{\textit{I}}_f denote the moving image and the fixed image respectively, both defined over dd-dimensional space Ξ©\bm{\Omega}. A deformation is a mapping Ο•:Ξ©β†’Ξ©\bm{\phi} : \bm{\Omega} \rightarrow \bm{\Omega}. A reasonable deformation should be continuously varying and prevented from folding. The deformable registration task is to construct a flow prediction function F\textbf{F} which takes Im,If\textbf{\textit{I}}_m, \textbf{\textit{I}}_ f as inputs and predicts a dense deformation Ο•\bm{\phi} that aligns Im\textbf{\textit{I}}_m to If\textbf{\textit{I}}_f using a warp operator ∘\circ as follows:

F(n)(Im(nβˆ’1),If)=Ο•(n)∘F(nβˆ’1)(Ο•(nβˆ’1)∘Im(nβˆ’2),If)\textbf{F}^{(n)}(\textbf{\textit{I}}^{(n-1)}_m,\textbf{\textit{I}}_f)=\phi^{(n)} \circ \textbf{F}^{(n-1)}(\phi^{(n-1)} \circ \textbf{\textit{I}}^{(n-2)}_m,\textbf{\textit{I}}_f)

where F(nβˆ’1)\textbf{F}^{(n-1)} is the same as F(n)\textbf{F}^{(n)}, but in a different flow prediction function. Assuming for nn cascades in total, the final output is a composition of all predicted deformations, i.e.,

F(Im,If)=Ο•(n)∘...βˆ˜Ο•(1),\textbf{F}(\textbf{\textit{I}}_m, \textbf{\textit{I}}_f)=\phi^{(n)} \circ...\circ \phi^{(1)},

and the final warped image is constructed by

Im(n)=F(Im,If)∘Im\textbf{\textit{I}}_{m}^{(n)}=\textbf{F}(\textbf{\textit{I}}_m,\textbf{\textit{I}}_f) \circ \textbf{\textit{I}}_m

In general, previous Equations form the hypothesis function F\mathcal{F} under the learnable parameter W\mathbf{W},

F(Im,If,W)=(vΟ•,Im(n))\mathcal{F}(\textbf{\textit{I}}_{m}, \textbf{\textit{I}}_f, \mathbf{W}) = (\mathbf{v}_{\phi}, \textbf{\textit{I}}_m^{(n)})

where vΟ•=[Ο•(1),Ο•(2),...,Ο•(k),...,Ο•(n)]\mathbf{v}_{\phi} = [\bm{\phi}^{(1)}, \bm{\phi}^{(2)}, ..., \bm{\phi}^{(k)},..., \bm{\phi}^{(n)}] is a vector containing predicted deformations of all cascades. Each deformation Ο•(k)\bm{\phi}^{(k)} can be computed as

Ο•(k)=F(k)(Im(kβˆ’1),If,WΟ•(k))\bm{\phi}^{(k)} = {\mathcal{F}}^{(k)}\left(\textbf{\textit{I}}_{m}^{(k-1)}, \textbf{\textit{I}}_f, \mathbf{W}_{\phi^{(k)}}\right)

To estimate and achieve a good deformation, different networks are introduced to define and optimize the learnable parameter W\mathbf{W}.

Knowledge Distillation for Deformation

Knowledge distillation is the process of transferring knowledge from a cumbersome model (teacher model) to a distilled model (student model). The popular way to achieve this goal is to train the student model on a transfer set using a soft target distribution produced by the teacher model.

Different from the typical knowledge distillation methods that target the output softmax of neural networks as the knowledge, in the deformable registration task, we leverage the teacher deformation Ο•t\bm{\phi}_t as the transferred knowledge. As discussed in [2], teacher networks are usually high-performed networks with good accuracy. Therefore, our goal is to leverage the current state-of-the-art Recursive Cascaded Networks (RCN) [1] as the teacher network for extracting meaningful deformations to the student network. The RCN network contains an affine transformation and a large number of dense deformable registration sub-networks designed by VTN [3]. Although the teacher network has expensive computational costs, it is only applied during the training and will not be used during the inference.


[1] S. Zhao, Y. Dong, E. I. Chang, Y. Xu, et al., Recursive cascaded networks for unsupervised medical image registration, in ICCV, 2019.

[2] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, ArXiv, 2015.

[3] S. Zhao, T. Lau, J. Luo, I. Eric, C. Chang, and Y. Xu, Unsupervised 3d end-to-end medical image registration with volume tweening network, IEEE J-BHI, 2019.

Open Source

🐱 Github: https://github.com/aioz-ai/LDR_ALDK

Medical Visual Question Answering Challenges

Visual Question Answering in Medical Domain

Visual Question Answering (VQA) aims to provide a correct answer for a given question consistent with the visual content of a given image. The overarching goal of this issue is to create systems that can comprehend the contents of an image in the same way that humans do and communicate effectively about that image in natural language. It is indeed a challenging task as it necessitates the interaction and complementation of both image feature extractor and natural language processor.

In medical domain, VQA could benefit both doctors and patients. VQA systems capable of understanding medical images and answering questions related to their content could support clinical education, clinical decision, and patient education. For example, doctors could use answers provided by VQA system as support materials in decision making. It can also help doctors to get a second opinion in diagnosis and reduce the high cost of training medical professionals. While patients could ask VQA questions related to their medical images for better understanding their health.

Compared with VQA in the general domain, Med-VQA is a much more challenging problem. On one hand, clinical questions are more difficult but need to be answered with higher accuracy since they relate to health and safety


Figure 1: An example of Medical VQA (Source).

There are challenges that need to be addressed when using VQA in Medical domain:

  • Lack of large scale labeled datasets.
  • Using Transfer learning in medical domain.

In this blog, we will analyze this challenges.

Challenges in Medical VQA

1. Lack of large scale labeled datasets.

One major problem with medical VQA is the lack of large scale labeled training data which usually requires huge efforts to build. Difficulties in building a medical VQA dataset

  • (i) designing goal-oriented VQA systems and datasets
  • (ii) categorizing the clinical questions
  • (iii) selecting (clinically) relevant images
  • (iv) capturing the context and the medical knowledge.

The first attempt for building the dataset for medical VQA is by ImageCLEF-Med. In this, images were automatically captured from PubMed Central articles. The questions and answers were automatically generated from corresponding captions of images. By that construction, the data has high noisy level, i.e., the dataset includes many images that are not useful for direct patient care and it also contains questions that do not make any sense.

Well-annotated datasets for training Medical VQA systems are extremely lacking, but it is very laborious and expensive to obtain high-quality annotations by medical experts. The first manually constructed VQA-RAD dataset for medical VQA task is released. Unfortunately, it contains only 315 images, which prevents to directly apply the powerful deep learning models for the VQA problem.

2. Using Transfer learning in a specific domain.

Transfer learning is an important step to extract meaningful features and overcome the data limitation in the medical Visual Question Answering (VQA) task. Transfer learning, in which the pretrained deep learning models that are trained on the large scale labeled dataset such as ImageNet, is a popular way to initialize the feature extraction process. However, due to the difference in visual concepts between ImageNet images and medical images, finetuning process is not sufficient. Recently, Model Agnostic Meta-Learning (MAML) has been introduced to overcome the aforementioned problem by learning meta-weights that quickly adapt to visual concepts. However, MAML is heavily impacted by the meta-annotation phase for all images in the medical dataset. Different from normal images, transfer learning in medical images is more challenging due to:

  • (i) noisy labels may occur when labeling images in an unsupervised manner;
  • (ii) high-level semantic labels cause uncertainty during learning;
  • (iii) difficulty in scaling up the process to all unlabeled images in medical datasets.

Multiple Meta-model Quantifying for Medical Visual Question Answering


A medical Visual Question Answering (VQA) system can provide meaningful references for both doctors and patients during the treatment process. Extracting image features is one of the most important steps in a medical VQA framework which outputs essential information to predict answers. Transfer learning, in which the pretrained deep learning models that are trained on the large scale labeled dataset such as ImageNet, is a popular way to initialize the feature extraction process. However, due to the difference in visual concepts between ImageNet images and medical images, finetuning process is not sufficient. Recently, Model Agnostic Meta-Learning (MAML) has been introduced to overcome the aforementioned problem by learning meta-weights that quickly adapt to visual concepts. However, MAML is heavily impacted by the meta-annotation phase for all images in the medical dataset. Different from normal images, transfer learning in medical images is more challenging due to:

  • (i) noisy labels may occur when labeling images in an unsupervised manner;
  • (ii) high-level semantic labels cause uncertainty during learning;
  • (iii) difficulty in scaling up the process to all unlabeled images in medical datasets.

Federated Learning, Challenges and Hot Trends for research

Federated learning and its remaining challenge

Federated learning is the process of training statistical models via a network of distant devices or siloed data centers, such as mobile phones or hospitals, while keeping data locally. In terms of federated learning, there are five major obstacles that have a direct impact on the paper publishing trend.

1. Expensive Communication

Due to the internet connection, huge number of users, and administrative costs, there is a bottleneck in communication between devices and server-devices.

2. Systems Heterogeneity

Because of differences in hardware (CPU, RAM), network connection (3G, 4G, 5G, wifi), and power, each device in federated networks may have different storage, computational, and communication capabilities (battery level).

3. Statistical Heterogeneity

Devices routinely create and collect data in non-identically dispersed ways across the network; for example, in the context of a next word prediction, mobile phone users employ a variety of languages. Furthermore, the quantity of data points on different devices may differ greatly, and an underlying structure may exist that describes the interaction between devices and their related distributions. This data generation paradigm violates frequently-used independent and identically distributed (I.I.D. problem) assumptions in distributed optimization, increases the likelihood of stragglers, and may add complexity in terms of modeling, analysis, and evaluation.

4. Privacy Concerns

In federated learning applications, privacy is a crucial problem. Federated learning takes a step toward data protection by sharing model changes, such as gradient information, rather than the raw data created on each device. Nonetheless, transmitting model updates during the training process may divulge sensitive information to a third party or the central server.

5. Domain transfer

Not any task can be applied to the federated learning paradigm to finish their training process due to the aforementioned four challenges.

Hot trends

Data distribution heterogeneity and label inadequacy.

  • Distributed Optimization
  • Non-IID and Model Personalization
  • Semi-Supervised Learning
  • Vertical Federated Learning
  • Decentralized FL
  • Hierarchical FL
  • Neural Architecture Search
  • Transfer Learning
  • Continual Learning
  • Domain Adaptation
  • Reinforcement Learning
  • Bayesian Learning

Security, privacy, fairness, and incentive mechanisms:

  • Adversarial-Attack-and-Defense
  • Privacy
  • Fairness
  • Interpretability
  • Incentive Mechanism

Communication and computational resource constraints, software and hardware heterogeneity, the FL system

  • Communication-Efficiency
  • Straggler Problem
  • Computation Efficiency
  • Wireless Communication and Cloud Computing
  • FL System Design

Models and Applications

  • Models
  • Natural language Processing
  • Computer Vision
  • Health Care
  • Transportation
  • Recommendation System
  • Speech
  • Finance
  • Smart City
  • Robotics
  • Networking
  • Blockchain
  • Other

Benchmark, Dataset, and Survey

  • Benchmark and Dataset
  • Survey

Autonomous Navigation in Complex Environments with Deep Multimodal Fusion Network


Autonomous navigation is a long-standing field of robotics research, which provides an essential capability for mobile robot to execute a series of tasks on the same environments performed by human everyday. In general, the task of autonomous navigation is to control a robot navigate around the environment without colliding with obstacles. It can be seen that navigation is an elementary skill for intelligent agents, which requires decision-making across a diverse range of scales in time and space. In practice, autonomous navigation is not a trivial task since the robot needs to close the perception-control loop under the uncertainty in order to obtain the autonomy.

Furthermore, autonomous navigation with intelligent robots in complex environments is also a crucial task, especially in time-sensitive scenarios such as disaster response, search and rescue, etc. Recently, the DARPA Subterranean Challenge was organized to explore novel methods for quickly mapping, navigating, and searching in complex underground environments such as human-made tunnel systems, urban underground, and natural cave networks.

Inspired by the DARPA Subterranean Challenge, in this work, we propose a learning-based system for end-to-end mobile robot navigation in complex environments such as collapsed cities, collapsed houses, and natural caves. To overcome the difficulty in data collection, we build a large scale simulation environment which allows us to collect the training data and deploy the learned control policy. We then propose a new Navigation Multimodal Fusion Network (NMFNet) that effectively learns the visual perception from sensor fusion and allows the robot to autonomously navigate in complex environments. To summarize, our main contributions are as follows:

  • We introduce new simulation models that can be used to record large-scale datasets for autonomous navigation in complex environments.
  • We present a new deep learning method that fuses both laser data, 2D images, and 3D point cloud to improve the navigation ability of the robot in complex environments.
  • We show that the use of multiple visual modalities is essential to learn a robust robot control policy for autonomous navigation in complex environments in order to deploy in real-world scenarios


Figure 1: A visualization of a collapsed city, a complex environment from our simulation. Top: The collapsed city from a top view; Bottom: A mobile robot’s view..


2D Features

In this work, we use ResNet8 to extract deep features from the input RGB image and laser distance map. The ResNet8 has 3 residual blocks, each block consists of a convolutional layer, ReLU, skip links and batch normalization operations. A detailed visualization of ResNet8 architecture can be found in Fig. 2. We choose ResNet8 to extract deep features from the 2D images since it is a light weight architecture and can achieve competitive performance while being robust again the vanishing/exploding gradient problems during training.


Figure 2: A detailed visualization of ResNet8.

3D Point Cloud

To extract the point cloud feature vector, our network has to learn a model that is invariant to input permutation. We extract the features from the unordered point cloud by learning a symmetric function on transformed elements. Given an unordered point set {x1,x2,...,xn}\{x_1, x_2,..., x_n\} with xi∈R3x_i \in \mathbb{R}^3 , we can define a symmetric function f:Xβ†’Rf : X \rightarrow \mathbb{R} that maps a set of unordered points to a feature vector as follow:

f(x1,x2,...,xn)=Ξ΄(MAXi=1..n{Ξ³(xi)})f(x_1, x_2,...,x_n) = \delta(\underset{i=1..n}{MAX}\{\gamma(x_i)\})

where MAXMAX is a vector max operator that takes nn input vectors and returns a new vector of the element-wise maximum; Ξ΄\delta and Ξ³\gamma are usually presented by neural networks. In practice, Ξ΄\delta and Ξ³\gamma function are approximated by an affine transformation matrix with a mini multi-layer preception network (i.e., T-net) and a matrix multiplication operation.

Multimodal Fusion

Given the features from the point cloud branch and the RGB image branch, we first do an early fusion by concatenating the features extracted from the input cloud with the deep features extracted from the RGB image. This concatenated feature is then fed through two 1Γ—11 Γ— 1 convolutional layers. Finally, we combine the features from 2D-3D fusion with the extracted features from the distance map branch. The steering angle is predicted from a final fully connected layer keeping all the features from the multimodal fusion network.


To train an end-to-end navigation network, two popular approaches are used: classification loss and regression loss. Our training NMFNet with three branches to handle three visual modalities is illustrated in Fig. 3.


Figure 3: An overview of our NMFNet architecture. The network consists of three branches: The first branch learns the depth features from the distance map. The second branch extracts the features from RGB images, and the third branch handles the point cloud data. We then employ the 2D-3D fusion to predict the steering angle.


Data Collection

We create the simulation models of these environments in Gazebo and collect the visual data from simulation. In particular, we collect the data from three types of complex environment:

  • Collapsed house: The house suffered from an accident or a disaster (e.g. an earthquake) with random objects on the ground.
  • Collapsed city: Similar to the collapsed house, but for the outdoor environment. In this scenario, the road has many debris from the collapsed house/wall.
  • Natural cave: A long natural tunnel in a poor brightness condition with irregular geological structures.

For each environment, we use a mobile robot model equipped with a laser sensor and a depth camera mounted on top of the robot to collect the visual data. The robot is controlled manually to navigate around each environment. We collect the visual data when the robot is moving. All the visual data are synchronized with a current steering signal of the robot at each timestamp. Examples of different robot’s views in our simulation of complex environments (collapsed city, natural cave, and collapsed house) are described in Fig. 4.


Figure 4: Top row: RGB images from robot viewpoint in normal setting. Other rows: The same viewpoint when applying domain randomization to the simulated environments.


Table I summarizes the regression results using Root Mean Square Error (RMSE) of our NMFNet and other state-of-theart methods. From the table, we can see that our NMFNet outperforms other methods by a significant margin. In particular, our NMFNet trained with domain randomisation data achieves 0.389 RMSE which is a clear improvement over other methods using only RGB images such as DroNet [29]. This also confirms that using multi visual modalities input as in our fusion network is the key to successfully navigate in complex environments.



Table II shows the RMSE scores when different modalities are used to train the system. We first notice that the network that uses only point cloud data as the input does not converge. This confirms that learning meaningful features from point cloud data is challenging, especially in complex environments. On the other hand, we achieve a surprisingly good result when the distance map modality is used as the input. The other combinations between two modalities show reasonable accuracy, however, we achieve the best result when the network is trained end-to-end using the fusion from all three modalities: rgb, distance map from laser camera, and point cloud from the depth camera.




We propose NMFNet, an end-to-end and real-time deep learning framework for autonomous navigation in complex environments. Our network has three branches and effectively learns the visual fusion input data. Furthermore, we show that the use of mutilmodal sensor data is essential for autonomous navigation in complex environments. The extensively experimental results show that our NMFNet outperforms recent state-of-the-art methods by a fair margin while achieving real-time performance and generalizing well on unseen environments.

Deep Metric Learning Meets Deep Clustering - An Novel Unsupervised Approach for Feature Embedding


The objective of deep distance metric learning (DML) is to train a deep learning model that maps training samples into feature embeddings that are close together for samples that belong to the same category and far apart for samples from different categories. Traditional DML approaches require supervised information, i.e., class labels, to supervise the training. Although the supervised DML achieves impressive results on different tasks, it requires large amount of annotated training samples to train the model. Unfortunately, such large datasets are not always available and they are costly to annotate for specific domains. That disadvantage also limits the transferability of supervised DML to new domain/applications which do not have labeled data. These reasons have motivated recent studies aiming at learning feature embeddings without annotated datasets --- unsupervised deep distance metric learning (UDML). Our study is in that same direction, i.e., learning embeddings from unlabeled data.

There are two main challenges for UDML:

  • Firstly, how to define positive and negative samples for a given anchor data point, such that we can apply distance-based losses, e.g., pairwise loss or triplet loss, in the embedding space.
  • Secondly, how to make the training efficient, given a large number of pairs or triplets of samples, in the order of O(N2)\mathcal{O}(N^2) or O(N3)\mathcal{O}(N^3), respectively, in which NN is the number of training samples.

In this paper, we propose a new method that utilizes deep clustering for deep metric learning to address the two challenges mentioned above. In particular,

  • We propose to use a deep clustering loss to learn centroids, i.e., pseudo labels, that represent semantic classes.
  • During learning, these centroids are also used to reconstruct the input samples. It hence ensures the representativeness of centroids β€” each centroid represents visually similar samples. Therefore, the centroids give information about positive (visually similar) and negative (visually dissimilar) samples.
  • Based on pseudo labels, we propose a novel unsupervised metric loss which enforces the positive concentration and negative separation of samples in the embedding space.



Figure 1: Illustration of the proposed framework which consists of an encoder (G), an embedding module (F), a decoder (D) and three losses, i.e., clustering loss LrimL_{rim}, reconstruction loss LrecL_{rec} and metric loss LmL_m. The details are presented in the text (Source).

The proposed framework is presented in Figure 1.

  • For every original image in a batch, we make an augmented version by using a random geometric transformation.
  • The input images are fed into the backbone network which is also considered as the encoder (G) to get image representations.
  • The image representations are passed through the embedding module which consists of fully connected and L2 normalization layers (F), which results in unit norm image embeddings.
  • The clustering module takes image embeddings as inputs, performs the clustering with a clustering loss, and outputs the cluster assignments.
  • Given the cluster assignments, centroid representations are computed from image representations, which are then passed through the decoder (D) with a reconstruction loss to reconstruct images that belong to the corresponding clusters.
  • The centroid representations are also passed through the embedding module (F) to get centroid embeddings. The centroid embeddings and image embeddings are used as inputs for the metric loss.

Discriminative Clustering

We formulate the clustering of embedding features as a classification problem. Given a set of embedding features X={xi}i=1m∈R128Γ—mX=\{x_i\}_{i=1}^m \in \mathbb{R}^{128 \times m} in a batch and the number of clusters K≀mK\le m (i.e., the number of clusters KK is limited by the batch size mm). The cluster assignment for xx then is estimated by cβˆ—=argmaxcyc^* = argmax_c y. Let Y={yi}i=1mY=\{y_i\}_{i=1}^m be the set of softmax outputs for XX.

Inspired by Regularized Information Maximization (RIM), we use the following objective function (1) for the clustering.

Lrim=R(ΞΈ)βˆ’Ξ»[H(Y)βˆ’H(Y∣X)](1)L_{rim} = \mathcal{R}(\theta) - \lambda \left[ H(Y) - H(Y|X) \right] (1)

where H(.)H(.) and H(.∣.)H(.|.) are entropy and conditional entropy, respectively; R(θ)\mathcal{R}(\theta) regularizes the classifier parameters (in this work we use l2l_2 regularization); λ\lambda is a weighting factor to control the importance of two terms.

Minimizing (1) is equivalent to maximizing H(Y)H(Y) and minimizing H(Y∣X)H(Y|X). Increasing the marginal entropy H(Y)H(Y) encourages cluster balancing, while decreasing the conditional entropy H(Y∣X)H(Y|X) encourages cluster separation.


In order to enhance the representativeness of centroids, we introduce a reconstruction loss (2) that penalizes high reconstruction errors from centroids to corresponding samples. Specifically, the decoder takes a centroid representation of a cluster and minimizes the difference between input images that belong to the cluster and the reconstructed image from the centroid representation.

Lrec=1mβˆ‘j=1Kβˆ‘Ii∈Xj∣∣Iiβˆ’D(rj)∣∣2,(2)L_{rec} = \frac{1}{m} \sum_{j=1}^K\sum\limits_{I_i\in X_j}||I_i - D(r_j)||^2, (2)

where D(.)D(.) is the decoder which reconstructs samples in the batch using their corresponding centroid representations and mm is the number of images in the batch.

Metric loss

Let fi∈R128f_i \in \mathbb{R}^{128} and fi^∈R128\hat{f_i} \in \mathbb{R}^{128} be the image embeddings of IiI_i and Ii^\hat{I_i}, respectively. The proposed metric loss (3) aims to minimize the distance between fif_i and fi^\hat{f_i} while pushing fif_i far away from negative clusters.

Lm=βˆ’βˆ‘ilog(l(Ii,Ii^))βˆ’βˆ‘iβˆ‘j=1,jβ‰ qKlog(l(Ii,cj))(3)L_{m} = -\sum_i log \left( l(I_i,\hat{I_i}) \right) - \sum_i\sum\limits_{j=1, j \neq q}^K log \left(l(I_i,c_j) \right) (3)

Final Loss

The network in Figure 1 is trained in an end-to-end manner with the following multi-task loss.

L=Ξ±Lm+Ξ²Lrim+Ξ³Lrec,(4)L = \alpha L_{m} + \beta L_{rim} + \gamma L_{rec}, (4)

where LmL_m is the center-based softmax loss (3) for deep metric learning, LrimL_{rim} is the clustering loss (1), and LrecL_{rec} is the reconstruction loss (2).

Experimental Results

Ablation Study

We denote our model with:

  • only clustering loss (1) as only LrimL_{rim}.
  • both clustering and the metric losses (2) and (3) as Center-based Softmax (CBS).
  • Center-based Softmax with Reconstruction (CBSwR).

Table 1: The impact of each loss component on the performance on CUB200-2011 dataset and the comparison to the baseline (Source).


Table 2: The impact of each loss component on the performance on Car196 dataset and the comparison to the baseline (Source).

Tables 1 and 2 present the comparative results between methods. The results show that using only the clustering loss, the accuracy is significantly lower than the baseline SME. However, when using the centroids from the clustering for calculating the metric loss (i.e., CBS), it gives the performance boost over the baseline (i.e., SME). Furthermore, the reconstruction loss enhances the representativeness of centroids, as confirmed by the improvements of CBSwR over CBS on both datasets.


Table 3: The training time (seconds) of different methods on CUB200-2011 and Car196 datasets with 20 epochs. The models are trained on a NVIDIA GeForce GTX 1080-Ti GPU (Source).


Table 4: The impact of the number of clusters of the final model CBSwR on the performance on CUB200-2011 dataset (Source).

Table 3 presents the training time of different methods on the CUB200-2011 and Car196 datasets. Although the asymptotic complexity of CBSwR for training one batch is O(Km)\mathcal{O}(Km), it also consists of a decoder part which affects the real training. It is worth noting that the decoder is only involved during training. During testing, our method has similar computational complexity as SME.

Table 4 presents the impact of the number of clusters KK in the clustering loss on the CUB200-2011 dataset with our proposed model CBSwR (recall that the number of clusters KK is limited by the batch size mm). During training, the number of samples per clusters vary depending on batches and the number of clusters. At K=32K=32 which is our final setting, the number of samples per cluster varies from 2 to 11, on the average. The retrieval performance is just slightly different for the different number of clusters. This confirms the robustness of the proposed method w.r.t. the number of clusters.

Comparison to the state of the art


Table 5: Clustering and Recall performance on the CUB200-2011 dataset (Source).


Table 6: Clustering and Recall performance on the Car196 dataset (Source).

Table 5 presents the comparative results on CUB200-2011 dataset. In terms of clustering quality (NMI metric), the proposed method and the state-of-the-art UDML methods MOM and SME achieve comparable accuracy. However, in terms of retrieval accuracy R@K, our method outperforms other approaches. Our proposed method is also competitive to most of the supervised DML methods.

Table 6 presents comparative results on Car196 dataset. Compared to unsupervised methods, the proposed method outperforms other approaches in terms of retrieval accuracy at all ranks of K. Our method is comparable to other unsupervised methods in terms of clustering quality.

Quantity Visualization


Figure 2: Barnes-Hut t-SNE visualization of our embedding on the CUB200-2011 dataset.

Figure 2 shows the t-SNE plots on our learned embedding features on CUB200-2011. We can see that our embedding produces reasonable results in grouping similar visual objects despite the significant variations in view-point, pose, and configuration.


We propose a new method that utilizes deep clustering for deep metric learning to address the two challenges in UDML, i.e., positive/negative mining and efficient training. The method is based on a novel loss that consists of a learnable clustering function, a reconstruction function, and a center-based metric loss function. Our experiments on CUB200-2011 and Car196 datasets show state-of-the-art performance on the retrieval task, compared to other unsupervised learning methods.

Open Source

🐱 Github: https://github.com/aioz-ai/BMVC20_CBSwR

Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering.

Different approaches have been proposed to Visual Question Answering (VQA). However, few works are aware of the behaviors of varying joint modality methods over question type prior knowledge extracted from data in constraining answer search space, of which information gives a reliable cue to reason about answers for questions asked in input images. In this blog, we share a novel VQA model that utilizes the question-type prior information to improve VQA by leveraging the multiple interactions between different joint modality methods based on their behaviors in answering questions from different types. The solid experiments on two benchmark datasets, i.e., VQA 2.0 and TDIUC, indicate that the proposed method yields the best performance with the most competitive approaches.


There are works that consider types of question as the side information whichgives a strong cue to reason about the answer. However, the relation between question types and answers from training data have not been investi-gated yet. Fig. 1 shows the correlation between question types and some answersin the VQA 2.0 dataset. It suggests that a question regarding the quantityshould be answered by a number, not a color. The observation indicated that theprior information got from the correlations between question types and answers open an answer search space constrain for the VQA model. The search spaceconstrain is useful for VQA model to give out final prediction and thus, improvethe overall performance. The Fig. 1 is consistent with our observation, e.g., itclearly suggests that a question regarding the quantity should be answered by anumber, not a color.

Figure 1. The distribution of candidate answers in each question type in VQA 2.0.
Although different joint modality methods or attention mechanisms have been proposed, we hypothesize that each method may capture different aspects of the input. That means different attentions may provide different answers for questions belonged to different question types.
Fig.2 shows examples in which the attention models (SAN and BAN) attend on different regions of input images when dealing with questions from different types. Unfortunately, most of recent VQA systems are based on single attention models BAN2, SAN, MLP, MCB, STL. From the above observation, it is necessary to develop a VQA system which leverages the power of different attention models to deal with questions from different question types.
Figure 2. Examples of attention maps of different attention mechanisms. BAN and SAN identify different visual areas when answering questions from different types.


The proposed multiple interaction learning with question-type prior knowledge (MILQT) is illustrated in Fig. 3. Similar to the most of the VQA systems, multiple interaction learning with question-type prior knowledge (MILQT) consists of the joint learning solution for input questions and images, followed by a multi-class classification over a set of predefined candidate answers. However, MILQT allows to leverage multiple joint modality methods under the guiding of question-types to output better answers.

Figure 3. The introduced MILQT for VQA.
As in Fig.3, MILQT consists of two modules: Question-type awareness A\mathcal{A}, and Multi-hypothesis interaction learning M\mathcal{M}. The first module aims to learn the question-type representation, which is further used to enhance the joint visual-question embedding features and to constrain answer search space through prior knowledge extracted from data. Based on the question-type information, the second module aims to identify the behaviors of multiple joint learning methods and then justify adjust contributions to giving out final predictions.

Question representation. Given an input question, follow the recent state-of-the-art BAN, we trim the question to a maximum of 12 words. The questions that are shorter than 12 words are zero-padded. Each word is then represented by a 600-D vector that is a concatenation of the 300-D GloVe word embedding and the augmenting embedding from training data. This step results in a sequence of word embeddings with size of 12Γ—60012 \times 600 and is denoted as fwf_w. In order to obtain the intent of question, the fwf_w is passed through a Gated Recurrent Unit (GRU) which results in a 1024-D vector representation fqf_q for the input question.

Image representation. We use bottom-up attention, i.e. an object detection which takes as FasterRCNN backbone, to extract image representation. At first, the input image is passed through bottom-up networks to get KΓ—2048K \times 2048 bounding box representation which is denotes as fvf_v in Fig. 3.

Multi-level multi-modal fusion. Unlike the previous works that perform only one level of fusion between linguistic and visual features that may limit the capacity of these models to learn a good joint semantic space. In our work, a multi-level multi-modal fusion that encourages the model to learn a better joint semantic space is introduced which takes the question-type representation got from question-type classification component as one of inputs.

  • First level multi-modal fusion: The first level fusion is similar to previous works. Given visual features fvf_v, question features fqf_{q}, and any joint modality mechanism,
    we combines visual features with question features and learn attention weights to weight for visual and/or linguistic features. Different attention mechanisms have different ways for learning the joint semantic space. The output of first level multi-modal fusion is denoted as fattf_{att} in the Fig.3.
  • Second level multi-modal fusion: In order to enhance the joint semantic space, the output of the first level multi-modal fusion fattf_{att} is combined with the question-type feature fqtf_{qt}, which is the output of the last FC layer of the Question-type classification component.
    We try two simple but effective operators, i.e. element-wise multiplication --- EWM or element-wise addition --- EWA, to combine fattf_{att} and fqtf_{qt}. The output of the second level multi-modal fusion, which is denoted as fattβˆ’qtf_{att-qt} in Fig.3, can be seen as an attention representation that is aware of the question-type information.

Given an attention mechanism, the fattβˆ’qtf_{att-qt} will be used as the input for a classifier that predicts an answer for the corresponding question. This is shown at the Answer prediction boxes in the Fig.3.

Multi-hypothesis interaction learning As presented in Fig.3, MILQT allows to utilize multiple hypotheses (i.e., joint modality mechanisms). Specifically, we propose a multi-hypothesis interaction learning design M\mathcal{M} that takes answer predictions produced by different joint modality mechanisms and interactively learn to combine them.
Let g∈RAΓ—Jg \in \R^{A \times J} be the matrix of predicted probability distributions over AA answers from the JJ joint modality mechanisms. M\mathcal{M} outputs the distribution ρ∈RA\rho \in \R^{A}, which is calculated from gg through Equation below:

ρ=M(g,wmil)=βˆ‘j(mqtβˆ’ansTwmilβŠ™g)\rho = \mathcal{M} \left(g,w_{mil}\right) = \sum_{j}\left(m^T_{qt-ans}w_{mil} \odot g\right)

wmil∈RPΓ—Jw_ {mil} \in \textbf{R}^{P \times J} is the learnable weight which control the contributions of JJ considered joint modality mechanisms on predicting answer based on the guiding of PP question types; βŠ™\odot denotes Hardamard product.


Experiments on VQA 2.0 test-dev and test-standard.

We evaluate MILQTon the test-dev and test-standard of VQA 2.0 dataset. To train the model,similar to previous works, we use both training set and validationset of VQA 2.0. We also use the Visual Genome as additional training data. MILQT consists of three joint modality mechanisms, i.e., BAN-2, BAN-2-Counter, and SAN accompanied with the EWM for the multi-modal fusion, andthe predicted question type together with the prior information to augment theVQA loss. Table 4 presents the results of different methods on test-dev and test-std of VQA 2.0. The results show that our MILQT yields the good performance with the most competitive approaches.


Table 1. Comparison to the state of the arts on the test-dev and test-standard of VQA 2.0. For fair comparison, Glove embedding and GRU are leveraged for question embedding and Bottom-up features are used to extract visual information. CMP, i.e.Cross-Modality with Pooling, is the LXMERT with the aforementioned setup (Source).

Experiments on TDIUC.

In order to prove the stability of MILQT, we evaluate MILQT on TDIUC dataset.
The results in Table.2 show that the proposed model establishes the state-of-the-art results on both evaluation metrics Arithmetic MPT and Harmonic MPT. Specifically, our model significantly outperforms the recent QTA, i.e., on the overall, we improve over QTA 6.1%6.1\% and 11.1%11.1\% with Arithemic MPT and Harmonic MPT metrics, respectively. It is worth noting that the results of QTA in Table. 2, which are cited from QTA, are achieved when QTA used the one-hot predicted question type of testing question to weight visual features. When using the groundtruth question type to weight visual features, QTA reported 69.11%69.11\% and 60.08%60.08\% for Arithemic MPT and Harmonic MPT metrics, respectively. Our model also outperforms these performances a large margin, i.e., the improvements are 3.9%3.9\% and 6.8%6.8\% for Arithemic MPT and Harmonic MPT metrics, respectively.


Table 2. The comparative results between the proposed model and other models onthe validation set of TDIUC (Source).


We present a multiple interaction learning with question-type prior knowledge for constraining answer search space--- MILQT that takes into account the question-type information to improve the VQA performance at different stages. The system also allows to utilize and learn different attentions under a unified model in an interacting manner. The extensive experimental results show that all proposed components improve the VQA performance. We yields the best performance with the most competitive approaches on VQA 2.0 and TDIUC dataset.

Open Source

Github: https://github.com/aioz-ai/ECCVW20_MILQT

Overcoming Data Limitation in Medical Visual Question Answering

What are the difficulties when dealing with Medical VQA task?

Visual Question Answering (VQA) aims to provide a correct answer to a given question such that the answer is consistent with the visual content of a given image.

In medical domain, VQA could benefit both doctors and patients. For example, doctors could use answers provided by VQA system as support materials in decision making, while patients could ask VQA questions related to their medical images for better understanding their health.


Figure 1: An example of Medical VQA (Source).

However, one major problem with medical VQA is the lack of large scale labeled training data which usually requires huge efforts to build.

  • The first attempt for building the dataset for medical VQA is by ImageCLEF-Med. In this, images were automatically captured from PubMed Central articles. The questions and answers were automatically generated from corresponding captions of images. By that construction, the data has high noisy level, i.e., the dataset includes many images that are not useful for direct patient care and it also contains questions that do not make any sense.
  • Recently, the first manually constructed VQA-RAD dataset for medical VQA task is released. Unfortunately, it contains only 315 images, which prevents to directly apply the powerful deep learning models for the VQA problem. One may think about the use of transfer learning in which the pretrained deep learning models that are trained on the large scale labeled dataset such as ImageNet are used for finetuning on the medical VQA. However, due to difference in visual concepts between ImageNet images and medical images, finetuning with very few medical images is not sufficient.

Therefore it is necessary to develop a new VQA framework that can improve the accuracy while still only needs a small labeled training data.

The motivation for our approach to overcome the data limitation of medical VQA comes from two observations:

  • Firstly, we observe that there are large scale unlabeled medical images available. These images are from same domain with medical VQA images. Hence if we train an unsupervised deep learning model using these unlabeled images, the trained weights may be easier to be adapted to the medical VQA problem than the pretrained weights on ImageNet images.
  • Another observation is that although the labeled dataset VQA-RAD is primarily designed for VQA, by spending a little effort, we can extract the new class labels for that dataset. The new class labels allow us to apply the recent meta-learning technique for learning meta-weights, that can be quickly adapted to the VQA problem later.


The proposed medical VQA framework is presented in Figure 2. In our framework, the image feature extraction component is initialized by pretrained weights from MAML and CDAE. After that, the VQA framework will be finetuned in an end-to-end manner on the medical VQA data. In the following sections, we detail the architectures of MAML, CDAE, and our framework.


Figure 2: The proposed medical VQA. The image feature extraction is denoted as 'Mixture of Enhanced Visual Features (MEVF)' and is marked with the red dashed box. The weights of MEVF are intialized by MAML and CDAE (Source).

Model-Agnostic Meta-Learning -- MAML

The MAML model consists of four 3Γ—33\times3 convolutional layers with stride 22 and is ended with a mean pooling layer; each convolutional layer has 6464 filters and is followed by a ReLu layer.

We create the dataset for training MAML by manually reviewing around three thousand question-answer pairs from the training set of VQA-RAD dataset. In our annotation process, images are split into three parts based on its body part labels (head, chest, abdomen). Images from each body part are further divided into three subcategories based on the interpretation from the question-answer pairs corresponding to the images. These subcategories are: 1. normal images in which no pathology is found. 2. abnormal present images in which there are the existence of fluid, air, mass, or tumor. 3. abnormal organ images in which the organs are large in size or in wrong position.

Thus, all the images are categorized into 9 classes:

| head normal | head abnormal present | head abnormal organ |
| chest normal | chest abnormal organ | chest abnormal present |
| abdominal normal | abdominal abnormal organ | abdominal abnormal present |

For every iteration of MAML training (line 3 in Alg.1), 5 tasks are sampled per iteration. For each task, we randomly select 3 classes (from 9 classes). For each class, we randomly select 6 images in which 3 images are used for updating task models and the remaining 3 images are used for updating meta-model.


Denoising Auto Encoder -- CDAE

The encoder maps an image xβ€²x', which is the noisy version of the original image xx, to a latent representation zz which retains useful amount of information. The decoder transforms zz to the output yy. The training algorithm aims to minimize the reconstruction error between yy and the original image xx as follows

Lrec=βˆ₯xβˆ’yβˆ₯22L_{rec} = \left \| x-y \right \|_2^2

In our design, the encoder is a stack of convolutional layers; each of them is followed by a max pooling layer. The decoder is a stack of deconvolutional and convolutional layers. The noisy version xβ€²x' is achieved by adding Gaussian noise to the original image xx.

To train CDAE, we collect 11,77911,779 unlabeled images available online which are brain MRI images, chest X-ray images and CT abdominal images. The dataset is split into train set with 9,4239,423 images and test set with 2,3562,356 images. We use Gaussian noise to corrupt the input images before feeding them to the encoder.

Our VQA framework

After training MAML and CDAE, we use their trained weights to initialize the MEVF image feature extraction component in the VQA framework. We then finetune the whole VQA model using the training set of VQA-RAD dataset.

To train the proposed model, we introduce a multi-task loss func-tion to incorporate the effectiveness of the CDAE to VQA. Formally, our lossfunction is defined as follows:

L=Ξ±1Lvqa+Ξ±2LrecL = \alpha_1 L_{vqa} + \alpha_2 L_{rec}

where LvqaL_{vqa} is a Cross Entropy loss for VQA classification and LrecL_{rec} stands for the reconstruction loss of CDAE . The whole VQA model is finetuned in an end-to-end manner.



Table 1: VQA results on VQA-RAD test set. All reference methods differ at the image feature extraction component. Other components are similar. The Stacked Attention Network (SAN) is used as the attention mechanism in all methods (Source).

Table 1 presents VQA accuracy in both VQA-RAD open-ended and close-ended questions on the test set. The results show that for both MAML and CDAE, by firstly pretraining then finetuning, the finetuning significantly improves the performance over the training from scratch using only VQA-RAD.

In addition, the results also show that our pretraining and finetuning of MAML and CDAE give better performance than the finetuning of VGG-16 which is pretrained on the ImageNet dataset. Our proposed image feature extraction MEVF which leverages both pretrained weights of MAML and CDAE, then finetuning them give the best performance. This confirms the effectiveness of the proposed MEVF for dealing with the limitation of labeled training data for medical VQA.


Table 2: Performance comparison on VQA-RAD test set (Source).

Table 2 presents comparative results between methods. Note that for the image feature extraction, the baselines use the pretrained models (VGG or ResNet) that have been trained on ImageNet and then finetune on the VQA-RAD dataset. For the question feature extraction, all baselines and our framework use the same pretrained models (i.e., Glove) and finetuning on VQA-RAD. The results show that when BAN or SAN is used as the attention mechanism in our framework, it significantly outperforms the baseline frameworks BAN and SAN. Our best setting, i.e. the one with BAN as the attention, achieves the state-of-the-art results and it significantly outperforms the best baseline framework BAN, i.e., the improvements are 16.3%16.3\% and 8.6%8.6\% on open-ended and close-ended VQA, respectively.


In this paper, we proposed a novel medical VQA framework that leverages the meta-learning MAML and denoising auto-encoder CDAE for image feature extraction in order to overcome the limitation of labeled training data. Specifically, CDAE helps to leverage information from the large scale unlabeled images, while MAML helps to learn meta-weights that can be quickly adapted to the VQA problem. We establish new state-of-the-art results on VQA-RAD dataset for both close-ended and open-ended questions.

Open Source

🐱 Github: https://github.com/aioz-ai/MICCAI19-MedVQA

A Brief Introduction to Visual Question Answering

1. Visual Question Answering - Overview

Visual Question Answering (VQA) aims to figure out a correct answer for a given question consistent with the visual content of a given image. The overarching goal of this issue is to create systems that can comprehend the contents of an image in the same way that humans do and communicate effectively about that image in natural language. It is indeed a challenging task as it necessitates the interaction and complementation of both image feature extractor and natural language processor.

There are two main variants of VQA which are Free-Form Opened-Ended (FFOE) VQA and Multiple Choice (MC) VQA. In FFOE VQA, an answer is a free-form response to a given image-question pair input, while in MC VQA, an answer is chosen from an answer list for a given image-question pair input. The discussion of VQA variants will be shared in the next post.

2. Approaches for solving VQA task.

There are three main approaches for VQA:

  • *Compositional VQA models:* the questions are interpreted as a set of many sub-tasks.
  • Bayesian and Question-Aware models: this method is not suitable for use in systems that respond to image-related questions. Since the algorithm based on this method does not try looking at the picture and instead predicts the response based on the Bayesian model by determining the probability of the words in the dataset's responses.
  • Attention based models: this method try to learn the interaction between image and question features in VQA task through a module called attention. Then, the joint features got from that module are leveraged for answering the corresponding question.

The final one is the most successful approach since recent states of the arts included attention mechanisms.

3. Attention based VQA approach.

In general, attention based VQA approaches have four main steps (See Figure 1):

  • Visual Representation: Encode the information from the image into vector(s) by using Convolutional Neural Network (CNN)
  • Textual Representation: Encode the information of question into vector(s) by using Embedding.
  • Joint Representation: A further step to learn the interaction between question(s) and image(s). Output joint features can be vector(s).
  • Answer prediction: the joint features from the previous step are then passed through this module to obtain the predicted answer. This module is mostly formed by a Classification.


Figure 1. The general approach for Visual Question Answering.

3.1. Visual Representation

The basic attributes or aspects that clearly help us recognize a specific object, image, or something are known as features. The distinguishing characteristics are the distinct properties. When operating on a VQA dataset, we must extract the features of various images in order to separate the images based on specific features or aspects. Image features are one of the most important pieces of information for a VQA system to output the correct answer.

Convolutional neural networks have emerged as the gold standard for image pattern recognition. An input image is converted into image features after it is passed through a convolutional network. Each filter in a CNN layer detects various patterns, such as corners, vertex, shapes, curves, and symmetries (See Figure 2).

Img Extraction

Figure 2. An example of feature extraction in VQA classification.

The majority of VQA literature employs CNNs for image processing. The network's final layer is removed, and the remaining network is used to extract image features. For image representation in VQA, objects in images represented by features extracted from an object detector such as the Faster-RCNN bottom-up model.

3.2. Textual Representation

Textual embeddings can be offered in a variety of ways. Count-based and frequency-based techniques such as count vectorization and TF-IDF are examples of older approaches. There are also prediction-based approaches such as a continuous bag of words and skip grams. Pretrained Word2Vec models are also openly accessible. Embeddings can also be created using deep learning architectures such as RNNs, LSTMs, GRUs, and 1-D CNNs. LSTMs are one of the most often used in VQA literature. For question embedding in VQA, Glove or BERT are used widely for capturing the representation of words and sentences in different contexts (See Figure 3 for a sample structure of question embedding).


Figure 3. An example of question embedding for VQA.

3.3. Joint Representation

In current VQA systems, the joint modality component plays an essential role since it would learn meaningful joint representations between linguistic and visual inputs by applying the attention mechanism. There are many works that learn the interaction between question and image. For instance, a novel trilinear interaction model which simultaneously learns high level associations between image, question and answer information- CTI (Do et al. 2018). See Figure 4 for more details.


Figure 4. Compact Trilinear Interaction mechanism for VQA (Source).

3.4. Answer Prediction

In most recent works, the joint features got from the attention mechanism is then passed through a classifier to output predicted answer. However, more modules can also be applied to produce external knowledge and deal with difficult questions. ````

Data Augmentation for Colon Polyp Detection: A systematic Study

Colorectal cancer (CRC)β™‹, also known as bowel cancer or colon cancer, is a cancer development from the colon or rectum called a polyp. Detecting polyps is a common approach in screening colonoscopies to prevent CRC at an early stage. Early colon polyp detection from medical images is still an unsolved problem due to the considerable variation of polyps in shape, texture, size, color, illumination, and the lack of publicly annotated datasets. At AIOZ, we adopt a recently proposed auto-augmentation method for polyp detection. We also conduct a systematic study on the performance of different data augmentation methods for colon polyp detection. The experimental results show that the auto-augmentation achieves the best performance comparing to other augmentation strategies.


Colorectal cancer (CRC) is the third-largest cause of worldwide cancer deaths in men and the second cause in women, with the number of patients, died each year up to 700,000 [1]. Detection and removal of colon polyps at an early stage will reduce the mortality from CRC. There are several methods for colon screening, such as CT colonography or wireless capsule endoscopy, but the gold standard is colonoscopy [2].

The colonoscopy is performed by an experienced doctor who uses a colonoscope to screen and scan for abnormalities such as intestinal signs, symptoms, colon cancer, and polyps. Abnormal polyps can be removed, and small amounts of tissue can be detached for analysis during the colonoscopy. However, the most crucial drawback of colonoscopy is polyp miss rate, especially with polyp more diminutive than10mm. Several factors cause the miss rate. They are both subjective factors such as bowel preparation, the specific choice of an endoscope, video processor, clinician skill, and objective factors such as polyp appearance and camera movement condition. For these reasons, automatic polyp detection is a potential approach to assist clinicians in improving the sensitivity of the diagnosis.

Previous research shows that automatic polyp detection using deep learning-based methods outperforms hand-craft-based methods demonstrated by both top two results in the MICCAI 2015 challenge [3]. For deep learning-based approaches and model architectures, data augmentation is also a critical factor in making significant improvements due to the lack of annotated data. The recent work [4]shows that learning an optimal policy from data for auto augmentation instead of hand-crafted defining data augmentation strategies can generalize objects better. Thus, studying auto augmentation for polyp detection problems is necessary. In this research, we adapt Faster R-CNN [5] together withAutoAugment [6] to detect polyp from colonoscopy video frames. Besides, we also evaluate traditional data augmentation [7] to see the effectiveness of different augmentation strategies.


1. Polyp Detector

Thanks to the power of deep learning, recent works [5, 12,11] show that deep-based detection methods give impressive detection performance. In this work, we use the Faster RCNN object detector [5] with Resnet101 [13] backbone pre-trained on COCO dataset. Our experiments show that this architecture gives a competitive performance on the Polyp detection problem. The experimental setting for the detector is set as follows. The network is trained using stochastic gradient descent (SGD)with 0.9 momentum; learning rate with initial value is set to3e-4 and will be decreased to 3e-5 from the iteration 900k.The number of anchor boxes per location used in our model is 12 (4 scales, i.e.,64Γ—64, 128Γ—128, 256Γ—256, 512Γ—512 and 3 aspect ratios, i.e.,1 : 2,1 : 1,2 : 1).

2. Data Augmentation


Fig. 1. Example of applying learned augmentation policies tocolonoscopy image.

Data augmentation can be split into two types: self-defined data augmentation (a.k.a traditional augmentation) and auto augmentation [6]. In this study, we adopt an automated data augmentation approach for object detection, i.e., Auto-augment [6], which finds optimal data augmentation policies during training. In Auto-augment, an augmentation policy consists of several sub-policies; each sub-policy consists of two operations. Each operation is an image transformation containing two parameters: probability and the magnitude of the shift. There are three types of transformations used in Auto-augment for object detection [4], which are

  • Color operations: distort color channels without impacting the locations of the bounding boxes
  • Geometric operations: geometrically distort the image, which correspondingly alters the location and size of the bounding box annotations
  • Bounding box operations: only distort the pixel content contained within the bounding box annotations. One of the essential conclusions in [4] is that the learned policy found on COCO can be directly applied to other detection datasets and models to improve predictive accuracy. Hence, in this study, we apply the learned policies from [4]to augment data when training the detector in Sec. 3.1. The learned policed we use for training our detector is summarized in Table 1. In Table 1, each operation is a triple which describes the transformation, the probability, and the magnitude of the transformations. Due to the space limitation, we refer the reader to [4] for detail on the descriptions of trans-formation. Fig. 1 showed augmented examples when applying the learned augmentation policy on a polyp image from the training dataset.


Table 1. Sub-policies and operations used in our experiment.

In addition to auto-augmentation, we also investigate the effect of traditional augmentation and the combination of traditional and automatic augmentation. We randomly apply several transformations to the image for traditional data augmentation, such as rotation, mirroring, sheering, translation, and zoom. We propose different strategies to combine those data augmentation types, i.e., (1) firstly, the detector is trained with Auto-augment; after that, it is trained with the traditional data augmentation; (2) training with the traditional augmentation, then with auto augmentation; (3) training with AutoAugmenton the original data and the data generated by traditional augmentation. All augmentation strategies are evaluated with the same model architecture and training configuration. This allows us to explore which data augmentation method is suitable for the polyp detection problem.


We use CVC-ClinicDB [14] for training and ETIS-Larib[15] for testing. This allows us to make a fair comparison with MICCAI2015 challenge results which are reported on the same dataset. The CVC-CLINIC database contains 612polyp image frames of 31 unique polyps from 31 different colonoscopy videos. The ETIS-LARIB dataset contains 196high resolution image frames of 44 different polyps.


Fig. 2. Examples of false positive detection on testing dataset. Green boxes and blues boxes are ground truths and predictions, respectively.


Fig. 3. Examples of false-negative detection on the testing dataset. Green boxes and blues boxes are ground truths and predictions, respectively.

Fig. 2 and Fig. 3 visualize several failed results from our model in the testing dataset in which the blue boxes are the predicted locations, and green boxes are ground truths. These false-positive samples (Fig. 2) caused by a shortcoming in bowel preparation (i.e., leftovers of food and fluid in the colon), while false negative (Fig. 3) samples are caused by the variations of polyp type and appearance (i.e., small polyp, flat polyp, similarities of polyp and colon vein)


Table 2. Comparison among traditional data augmentation (TDA), auto augmentation (AA) and their combinations.
Table 2 shows the comparative results between different augmentation strategies. The results show that the third combination method (AA-TDA-3) achieves higher performance in Precision than AutoAugment, i.e., 75.90% and 74.51%, respectively. However, overall, Auto-augment (AA) achieves the best results because of its performance in covering polyp miss rate (i.e., 152) with an acceptable false-positive rate (i.e., 52). The competitive performance of auto augmentation (AA) confirms the transferable learned data augmentation policies on the COCO dataset [4].


Table 3. Comparative results between our model and the state of the art.
Table 3 presents the comparative results between the auto augmentation in our model and other state-of-the-art results. Among compared methods, CUMED, OUR, and UNS- UCLAN are end-to-end deep learning-based approaches. The results show that compared to methods from MICCAI challenge [3], auto augmentation achieves better performance on all metrics. Comparing to the recent method [10], auto augmentation also achieves better performance on all metrics but FP. These results confirm the effectiveness of auto augmentation for polyp detection problems.


This study adopts a deep learning-based object detection method with auto data augmentation for polyp detection problems. Different augmentation strategies are evaluated. The experimental results show that the learned auto augmentation policies learned from the general object detection dataset are well transferred to the polyp detection problem. Although auto augmentation achieves competitive results, it still has a high FP compared to the state of the art. This weakness can be improved by several post-processing, such as false-positive learning.

Open Source

πŸ… Github: https://github.com/aioz-ai/polyp-detection

πŸ“ Blog post: https://ai.aioz.io/blog/polyp-detection


This research was conducted by Phong Nguyen, Quang Tran, Erman Tjiputra, and Toan Do. We’d like to give special thanks to the other AIOZ AI team members for their supports and feedbacks.

πŸŽ‰ All the above contributions were incredibly enabling for this research. πŸŽ‰


[1] Hamidreza Sadeghi Gandomani, Mohammad Aghajani, et al., β€œColorectal cancer in the world: incidence, mortality and risk factors,”Biomedical Research and Therapy, 2017.

[2] Florence B ́enard, Alan N Barkun, et al., β€œSystematic review of colorectal cancer screening guidelines for average-risk adults: Summarizing the current global recommendations,”World journal of gastroenterology, 2018.

[3] Jorge Bernal, Nima Tajkbaksh, et al., β€œComparative validation of polyp detection methods in video colonoscopy: results from the miccai 2015 endoscopic vision challenge,”IEEE Transactions on Medical Imaging, pp. 1231–1249, 2017.

[4] Barret Zoph, Ekin D Cubuk, et al., β€œLearning data augmentation strategies for object detection,”arXiv, 2019.

[5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, β€œFaster R-CNN: Towards real-time object detection with region proposal networks,” inNIPS, 2015.

[6] Ekin D Cubuk, Barret Zoph, et al.,β€œAutoaugment: Learning augmentation strategies from data,” in CVPR, 2019.

[7] Younghak Shin, Hemin Ali Qadir, et al., β€œAutomatic colon polyp detection using region based deep cnn and post learning approaches,”IEEE Access, 2018

[8] Yangqing Jia, Evan Shelhamer, et al., β€œCaffe: Convolutional architecture for fast feature embedding,” in ACMMM, 2014.

Introduction to Federated Learning

Federated Learning: machine learning over a distributed dataset, where user devices (e.g., desktop, mobile phones, etc.) are utilized to collaboratively learn a shared prediction model while keeping all training data locally on the device. This approach decouples the ability to do machine learning from storing the data in the cloud.

Conceptually, federated learning proposes a mechanism to train a high-quality centralized model. Simultaneously, training data remains distributed over many clients, each with unreliable and relatively slow network connections.

The idea behind federated learning is as conceptually simple as its technologically complex. Traditional machine learning programs relied on a centralized model for training in which a group of servers runs a specific model against training and validation datasets. That centralized training approach can work very efficiently in many scenarios. Still, it has also proven to be challenging in use cases involving a large number of endpoints using and improving the model. The prototypical example of the limitation of the centralized training model can be found in mobile or internet of things(IoT) scenarios. The quality of a model depends on the information processed across hundreds of thousands or millions of devices. Each endpoint can contribute to a machine learning model's training in its own autonomous way in those scenarios. In other words, knowledge is federated.

Blockchain: large, distributed dataset, where no-one can edit/delete an old entry, nor fake a new entry. The data is enforced by fundamental limits of computations (i.e., Proof of Work).

Smart Contract: dataset stored on the Blockchain, which includes: Data (i.e., ledgers, events, statistics), State (today's ledger, today's events), Code (rules for changing state).

Math Examples

Fundamental Theorem of Calculus Let f:[a,b]β†’Rf:[a,b] \to \R be Riemann integrable. Let F:[a,b]β†’RF:[a,b]\to\R be F(x)=∫axf(t)dtF(x)= \int_{a}^{x}f(t)dt. Then FF is continuous, and at all xx such that ff is continuous at xx, FF is differentiable at xx with Fβ€²(x)=f(x)F'(x)=f(x).

Lift(LL) can be determined by Lift Coefficient (CLC_L) like the following equation.

L=12ρv2SCLL = \frac{1}{2} \rho v^2 S C_L


[1] J.Rodriguez. Whats New in Deep Learning Research: Understanding Federated Learning.

Graph-based Person Signature for Person Re-Identifications


Person re-identification (ReID) aims to retrieve a particular person image in a collection of images captured by multiple cameras from various viewpoints across time.

The challenges of the person ReID task come from significant variations of human attributes such as poses, gaits, clothes, as well as challenging environmental settings like illumination, complex background, and occlusions. With the rise of deep learning, most of the recent studies utilize Convolutional Neural Network (CNN) to tackle the person ReID problem.

Recently, attribute-based methods have shown great success in providing semantic features for the deep network. Unlike the person identity label, which offers only coarse information to identify one identity among all other person identities, the attributes are the detailed descriptions that are highly intuitive and mostly unchanged between images captured from different cameras. Therefore, they can be used to explicitly guide the model to learn a robust person representation by defining human characteristics.

In this work, we propose to utilize the person attribute information with its associated body part to encode the visual person signature in one unified framework.

  • We hypothesize that the detailed person descriptions (attributes labels) can be integrated with visual features (body parts and global features) to create a unique signature for a particular person.
  • Since both body parts and attributes provide local representations, by linking them together, the network can have a better understanding of the relationship between visual features and attribute descriptions.
  • Although previous works have investigated how person identity, body parts, and attributes benefit the task of person ReID, our key difference is that we utilize Graph Convolutional Networks (GCN) to effectively construct and model the correlation between attributes and body parts with global features. In particular, we treat body part regions and attributes as nodes in a graph and utilize a GCN to learn the topological structure of a person's signatures. The GCN propagates messages on a graph structure. After message traversal on the graph, the node's final representations are obtained from its data and from other node's information. Figure 1 shows the effectiveness of our approach.


Figure 1: The effectiveness of our GPS in improving retrieval results on Market-1501 dataset. The details are presented in the text (Source).



Figure 2: Illustration of our proposed framework including two branches: (1) global branch which extracts person global features; (2) GPS branch which performs reasoning the person attributes and body parts using GCN. The details are presented in the text (Source).

The proposed framework is presented in Figure 2.

  • We denote II is a probe person image. This probe image II is first passed through a backbone CNN to get the feature map F\mathbf{F}.
  • By utilizing a human parsing pretrained model, we extract the body part masks to obtain the visual features of each part.
  • The person attributes are then represented by a lookup word embedding.
  • Given body part features and attribute features, we construct the Graph-based Person Signature which includes attribute nodes and body part nodes conditioned on the correlation matrix. We employ the GCN for reasoning on the person signature graph and encoding the graph into more representativeness features.

Our proposed method is a multi-branch multi-task framework for person ReID, where the main branch performs the verification task by optimizing two well-known loss functions: Triplet loss and Center loss. The auxiliary branch performs reasoning on the proposed person signature graph and solves the attribute recognition as well as the person identity classification tasks.

Experimental Results

Ablation Study

Loss Contribution. In Table 2, we show the contribution of each loss to the final performance on the Market1501 dataset. The person ID classification loss, triplet loss, center loss, and attribute recognition loss are denoted as Lid\mathbf{\mathcal{L}_{id}}, Ltriplet\mathbf{\mathcal{L}_{triplet}}, Lcenter\mathbf{\mathcal{L}_{center}}, and La\mathbf{\mathcal{L}_{a}}, respectively. The performance is improved when we incorporate all losses to the framework, which justifies the effectiveness of our proposed method. By using only Lid\mathbf{\mathcal{L}_{id}}, we still achieve comparative results with other mask-guided and attribute-based methods. While the triplet loss Ltriplet\mathbf{\mathcal{L}_{triplet}} demonstrates its capability on improving the performance, the center loss Lcenter\mathbf{\mathcal{L}_{center}} shows a slight impact on the performance. Notably, the attribute loss La\mathbf{\mathcal{L}_{a}} shows stability when being incorporated with other loss functions.


Table 2: The contribution of losses to the performance of person ReID task on Market1501 dataset. Note that the experiments are conducted with ResNet-50 as backbone CNN network (Source).

Model Interpretability. In this section, we conduct cross-dataset experiments to evaluate the effectiveness of GPS. The model is trained on the source dataset and test directly on the target dataset without finetuning. As shown in Table 3, our GPS archives a significant improvement over the Bag-of-Tricks baseline (BoT). This demonstrates the interpretability of our proposed method as well as confirms the effectiveness of learning the attributes for the person ReID task.


Table 3: The transferable ability of our GPS evaluated on cross-dataset (Source).

Training Parameters. We also provide the number of training parameters of our GPS and the baseline BoT in Table 4 to show the complexity of each method. Overall, our GPS slightly increases about 3M parameters in comparison with the baseline BoT while achieving much better performance.


Table 4: The number of parameters of our GPS in comparision with the baseline BoT on Market1501 and DukeMTMC-ReID datasets using ResNet-50 as the backbone network. #nParam indicates the number of parameters and 1K=1000 (Source).

Comparison to the state of the art

GPS vs. Baseline. The last two rows of Table 5 show the result of our GPS when being integrated into the baseline BoT. The results clearly show that our GPS significantly improves the performance of BoT in both Market-1501 and DukeMTMC-ReID dataset. This demonstrates the effectiveness of our GPS and confirms the usefulness of learning the attributes in the ReID task.


Table 5: Comparison with state-of-the-art methods on Market-1501 and DukeMTMC-ReID datasets. The cyan and yellow boxes are the best results corresponding to mask-guided/attribute-based and other approaches, respectively. Note that no post-processing is applied to our method (Source).

Evaluation on Market-1501. We evaluate our GPS with other methods on Market-1501 dataset in Table 5. The results show that our method outperforms the state-of-the-art attribute-based methods AANet that use attribute and body part information in all evaluation metrics. Specifically, we outperforms AANet by 5.3% and 1.3% at mAP and R-1, respectively. Our GPS also outperforms the state-of-the-art mask-guided methods, and especially, we outperform P2^2-Net by 2.2% at mAP. At the same time, we also get comparative results when comparing with other recent ReID approaches.

Evaluation on DukeMTMC-ReID. Table 5 also summaries the results of our GPS and other methods on DukeMTMC-ReID dataset. Our GPS significantly outperforms other attribute-based methods in all metrics. Specifically, our method outperforms the recent state-of-the-art attribute-based method AANet by 6.1% at mAP and 1.8% at R-1. In addition, we also outperforms ADPR by 9.0%, 3.9%, 2.8%, 2.0% at mAP, R-1, R-5, R-10, respectively. Moreover, our GPS outperforms the state-of-the-art mask-guided method P2^2-Net by 4.9%, 1.7%, 2.1%, 1.7% at mAP, R-1, R-5, R-10, respectively. Besides, we also achieve comparative results with other ReID approaches.

Attributed-based and Mask-guided vs. Other approaches. From Table 5, we notice that although our GPS shows a definite improvement over mask-guided and attributed-based methods, it achieves competitive results with methods from other approaches and particularly being outperformed by st-ReID method. Note that the results of st-ReID also completely dominate all methods from all other approaches. The effectiveness of st-ReID comes from the fact that it also uses the spatial-temporal information (i.e., the spatial map of camera setting and temporal information from video timestamp) into the network. This extra information allows the network to encode the person identity from multiple viewpoints, which significantly reduces the effect of different poses, viewpoints, or ambiguity challenges. From experiments, we have observed that our GPS, as well as other attribute-based and mask-guided methods, suffers from the fact that the pretrained body part network cannot provide adequate segmentation masks, so the retrieval results are also affected.

Quantity Visualization


Figure 3: Top 5 retrieval results of some queries on Market-1501 dataset. Note that the green/red boxes denote true/false retrieval results, respectively.

We present some retrieval examples with five retrieved images for each query in Figure 3. As in the visualization, our GPS obtained better retrieval results than the baseline. In the first row of Figure 3, the baseline gets the false retrieval result at Rank-5 due to the similarity of gender, wearing a hat, etc., except the color of the clothes. By leveraging our GPS, the extracted features are more robust to attribute and body part information, then, lead to better retrieval results for ReID model. In the second row, the model with our GPS gives better results by extracting more information about the relationship between `backpack' attribute and this person identity, thereby eliminating false cases. We also show an example that our GPS does not yet produce entirely correct retrieval results in the third line of the Figure 3. In this case, the lower body of the probe image is partly covered by the bicycle. Thus, the extracted features (i.e., the color of the pants) are not fully captured, which results in the feature misalignment between the probe image and retrieval results.


This paper proposes Graph-based Person Signature (GPS) that effectively captures the dependencies of person attributes and body parts information. We utilize the GCN on the GPS to propagate the information among nodes in the graph and integrate the graph features into a novel multi-branch multi-task network. The experimental results on benchmark datasets confirm the effectiveness of our GPS and demonstrate that our GPS performs better than recent state-of-the-art attribute-based and mask-guided ReID methods.

Open Source

🐱 Github: https://github.com/aioz-ai/CVPRW21_GPS

Compact Trilinear Interaction for Visual Question Answering

In Visual Question Answering (VQA), answers have a great correlation with question meaning and visual contents. Thus, to selectively utilize image, question and answer information, we propose a novel trilinear interaction model which simultaneously learns high level associations between these three inputs. In addition, to overcome the interaction complexity, we introduce a multimodal tensor-based PARALIND decomposition which efficiently parameterizes trilinear interaction between the three inputs. Moreover, knowledge distillation is applied in Free-form Opened-ended VQA. It is not only for reducing the computational cost and required memory but also for transferring knowledge from trilinear interactionmodel to bilinear interaction model. The extensive experiments on benchmarking datasets TDIUC, VQA-2.0, and Visual7W show that the proposed compact trilinear interaction model achieves state-of-the-art results on all three datasets.

For free-form opened-ended VQA task, CTI achieved 67.4 on VQA-2.0 and 87.0 on TDIUC dataset in VQA accuracy metric.

For multiple choice VQA task, CTI achieved 72.3 on Visual7W dataset in MC-VQA accuracy metric.

Compact Trilinear Interaction in VQA.

Let M={M1,M2,M3}M = \{M_1, M_2, M_3\} be the representations of three inputs. Mt∈RntΓ—dtM_t \in \textbf{R}^{n_t \times d_t}, where ntn_t is the number of channels of the input MtM_t and dtd_t is the dimension of each channel.
For example, if M1M_1 is the region-based representation for an image, then n1n_1 is the number of regions and d1d_1 is the dimension of the feature representation for each region. Let mte∈R1Γ—dtm_{t_e} \in \textbf{R}^{1 \times d_{t}} be the ethe^{th} row of MtM_t, i.e., the feature representation of ethe^{th} channel in MtM_t, where t∈{1,2,3}t \in \{1, 2, 3\}.

The input for training VQA is set of (V,Q,A)(V,Q,A) in which VV is an image representation; V∈RvΓ—dvV \in \textbf{R}^{v \times d_v} where vv is the number of interested regions (or bounding boxes) in the image and dvd_v is the dimension of the representation for a region; QQ is a question representation; Q∈RqΓ—dqQ \in \textbf{R}^{q \times d_q }
where qq is the number of hidden states and dqd_q is the dimension for each hidden state.
AA is an answer representation; A∈RaΓ—daA \in \textbf{R}^{a \times d_a}
where aa is the number of hidden states and dad_a is the dimension for each hidden state.

We firstly compute the attention map M\mathcal{M} as follows:

M=βˆ‘r=1R⟦Gr;VWvr,QWqr,AWar⟧\mathcal{M} = \sum^R_{r=1} {\llbracket \mathcal{G}_r; V W_{v_r}, Q W_{q_r}, A W_{a_r}\rrbracket}

Then the joint representation zz is computed as follows:

zT=βˆ‘i=1vβˆ‘j=1qβˆ‘k=1aMijk(ViWzv∘QjWzq∘AkWza)z^T= \sum_{i=1}^{v}\sum_{j=1}^{q}\sum_{k=1}^{a} \mathcal{M}_{ijk}\left( V_{i}W_{z_v} \circ Q_{j}W_{z_q} \circ A_{k}W_{z_a}\right)

where Wvr,Wqr,WarW_{v_r},W_{q_r}, W_{a_r} and Wzv,Wzq,WzaW_{z_v},W_{z_q}, W_{z_a} are learnable factor matrices; each Gr\mathcal{G}_r is a learnable Tucker tensor.

Integrate CTI into different VQA task

For multiple choice VQA


Figure 1. The model when CTI is applied to MC VQA.

Each input question and each answer are trimmed to a maximum of 12 words which will then be zero-padded if shorter than 12 words. Each word is then represented by a 300-D GloVe word embedding. Each image is represented by a 14Γ—14Γ—204814 \times 14 \times 2048 grid feature (i.e., 196196 cells; each cell is with a 20482048-D feature), extracted from the second last layer of ResNet-152 which is pre-trained on ImageNet.

Input samples are divided into positive samples and negative samples. A positive sample, which is labelled as 11 in binary classification, contains image, question and the right answer. A negative sample, which is labelled as 00 in binary classification, contains image, question, and the wrong answer. These samples are then passed through CTI to get the joint representation zz. The joint representation is passed through a binary classifier to get the prediction. The Binary Cross Entropy loss is used for training the model.

For free-form opened-ended VQA


Figure 2. The model when CTI is applied to FFOE VQA.

Unlike MC VQA, FFOE VQA treats the answering as a classification problem over the set of predefined answers. Hence the set possible answers for each question-image pair is much more than the case of MC VQA. For each question-image input, the model takes every possible answers from its answer list to computed the joint representation, causes high computational cost.

In addition, CTI requires all three V,Q,AV, Q, A inputs to compute the joint representation. However, during the testing, there are no available answer information in FFOE VQA. To overcome these challenges, we propose to use Knowledge Distillation to transfer the learned knowledge from a teacher model to a student model.

The loss function for the student model is defined as:

LKD=Ξ±T2LCE(QSΟ„,QTΟ„)+(1βˆ’Ξ±)LCE(QS,ytrue)\mathcal{L}_{KD} = \alpha T^2 \mathcal{L}_{CE}(Q^\tau_S, Q^\tau_T) + (1-\alpha)\mathcal{L}_{CE}(Q_S,y_{true})

where LCE\mathcal{L}_{CE} stands for Cross Entropy loss; QSQ_S is the standard softmax output of the student; ytruey_{true} is the ground-truth answer labels;
Ξ±\alpha is a hyper-parameter for controlling the importance of each loss component; QSΟ„,QTΟ„Q^\tau_S, Q^\tau_T are the softened outputs of the student and the teacher using the same temperature parameter TT, which are computed as follows:

QiΟ„=exp(li/T)βˆ‘iexp(li/T)Q^\tau_i = \frac{exp(l_i/T)}{\sum_{i} exp(l_i/T)}

where for both teacher and the student models, the logit ll is the predictions outputted by the corresponding classifiers.



Table 1. Performance of CTI and BAN2, SAN in VQA-2.0 validation set and test-dev set. BAN2-CTI and SANCTI are student models trained under the teacher model.

To further evaluate the effectiveness of CTI, we conduct a detailed comparison with the current state of the art. For FFOE VQA, we compare CTI with the recent state-of-the-art methods on TDIUC and VQA-2.0 datasets. For MC VQA, we compare with the state-of-the-art methods on Visual7W dataset.


Table 2. Performance comparison between different approaches with different evaluation metrics on TDIUC validation set. BAN2-CTI and SAN-CTI are the student models trained under compact trilinear interaction teacher model.

Regarding FFOE VQA, Table 1 and Table 2 show comparative results on VQA-2.0 and TDIUC respectively. Specifcaly, Table 1 shows that distilled student BAN2-CTI outperforms all compared methods over all metrics by a large margin, i.e., the model outperforms the current state-of-the-art QTA on TDIUC by 3.4%3.4\% and 5.4%5.4\% on Ari and Har metrics, respectively. The results confirm that trilinear interaction has learned informative representations from the three inputs and the learned information is effectively transferred to student models by distillation.


Table 3. Performance comparison between different approaches on Visual7W test set. Both training set and validation set are used for training. All models but CTIwBoxes are trained with same image and question representations. Both train set and validation set are used for training. Note that CTIwBoxes is the CTI model using Bottom-up features. instead of grid features for image representation.

Regarding MC VQA, Table 3 shows that the CTI outperforms compared methods by a noticeable margin. This model outperforms the current state-of-the-art STL by 1.1%. Again, this validates the effectiveness of the proposed joint presentation learning, which precisely and simultaneously learns interactions between the three inputs. We note that when comparing with other methods on Visual7W, for image representations, we used the grid features extracted from ResNet-512 for a fair comparison. Our proposed model can achieve further improvements by using the object detection-based features used in FFOE VQA. With new features, the model denoted as CTIwBoxes in Table 3 achieve 72.3% accuracy with Acc-MC metric which improves over the current state-of-the-art STL 4.1%.


A novel compact trilinear interaction is introduced to simultaneously learns high level associations between image, question, and answer in both MC VQA and FFOE VQA. In addition, knowledge distillation is the first time applied to FFOE VQA to overcome the computational complexity and memory issue of the interaction. The extensive experimental results show that these models achieve the state-of-the-art results on three benchmarking datasets.

Open Source

Github: https://github.com/aioz-ai/ICCV19_VQA-CTI