5 posts tagged with "vqa"

View All Tags

Multiple Meta-model Quantifying for Medical Visual Question Answering


A medical Visual Question Answering (VQA) system can provide meaningful references for both doctors and patients during the treatment process. Extracting image features is one of the most important steps in a medical VQA framework which outputs essential information to predict answers. Transfer learning, in which the pretrained deep learning models that are trained on the large scale labeled dataset such as ImageNet, is a popular way to initialize the feature extraction process. However, due to the difference in visual concepts between ImageNet images and medical images, finetuning process is not sufficient. Recently, Model Agnostic Meta-Learning (MAML) has been introduced to overcome the aforementioned problem by learning meta-weights that quickly adapt to visual concepts. However, MAML is heavily impacted by the meta-annotation phase for all images in the medical dataset. Different from normal images, transfer learning in medical images is more challenging due to:

  • (i) noisy labels may occur when labeling images in an unsupervised manner;
  • (ii) high-level semantic labels cause uncertainty during learning;
  • (iii) difficulty in scaling up the process to all unlabeled images in medical datasets.

Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering.

Different approaches have been proposed to Visual Question Answering (VQA). However, few works are aware of the behaviors of varying joint modality methods over question type prior knowledge extracted from data in constraining answer search space, of which information gives a reliable cue to reason about answers for questions asked in input images. In this blog, we share a novel VQA model that utilizes the question-type prior information to improve VQA by leveraging the multiple interactions between different joint modality methods based on their behaviors in answering questions from different types. The solid experiments on two benchmark datasets, i.e., VQA 2.0 and TDIUC, indicate that the proposed method yields the best performance with the most competitive approaches.


There are works that consider types of question as the side information whichgives a strong cue to reason about the answer. However, the relation between question types and answers from training data have not been investi-gated yet. Fig. 1 shows the correlation between question types and some answersin the VQA 2.0 dataset. It suggests that a question regarding the quantityshould be answered by a number, not a color. The observation indicated that theprior information got from the correlations between question types and answers open an answer search space constrain for the VQA model. The search spaceconstrain is useful for VQA model to give out final prediction and thus, improvethe overall performance. The Fig. 1 is consistent with our observation, e.g., itclearly suggests that a question regarding the quantity should be answered by anumber, not a color.

Figure 1. The distribution of candidate answers in each question type in VQA 2.0.
Although different joint modality methods or attention mechanisms have been proposed, we hypothesize that each method may capture different aspects of the input. That means different attentions may provide different answers for questions belonged to different question types.
Fig.2 shows examples in which the attention models (SAN and BAN) attend on different regions of input images when dealing with questions from different types. Unfortunately, most of recent VQA systems are based on single attention models BAN2, SAN, MLP, MCB, STL. From the above observation, it is necessary to develop a VQA system which leverages the power of different attention models to deal with questions from different question types.
Figure 2. Examples of attention maps of different attention mechanisms. BAN and SAN identify different visual areas when answering questions from different types.


The proposed multiple interaction learning with question-type prior knowledge (MILQT) is illustrated in Fig. 3. Similar to the most of the VQA systems, multiple interaction learning with question-type prior knowledge (MILQT) consists of the joint learning solution for input questions and images, followed by a multi-class classification over a set of predefined candidate answers. However, MILQT allows to leverage multiple joint modality methods under the guiding of question-types to output better answers.

Figure 3. The introduced MILQT for VQA.
As in Fig.3, MILQT consists of two modules: Question-type awareness A\mathcal{A}, and Multi-hypothesis interaction learning M\mathcal{M}. The first module aims to learn the question-type representation, which is further used to enhance the joint visual-question embedding features and to constrain answer search space through prior knowledge extracted from data. Based on the question-type information, the second module aims to identify the behaviors of multiple joint learning methods and then justify adjust contributions to giving out final predictions.

Question representation. Given an input question, follow the recent state-of-the-art BAN, we trim the question to a maximum of 12 words. The questions that are shorter than 12 words are zero-padded. Each word is then represented by a 600-D vector that is a concatenation of the 300-D GloVe word embedding and the augmenting embedding from training data. This step results in a sequence of word embeddings with size of 12×60012 \times 600 and is denoted as fwf_w. In order to obtain the intent of question, the fwf_w is passed through a Gated Recurrent Unit (GRU) which results in a 1024-D vector representation fqf_q for the input question.

Image representation. We use bottom-up attention, i.e. an object detection which takes as FasterRCNN backbone, to extract image representation. At first, the input image is passed through bottom-up networks to get K×2048K \times 2048 bounding box representation which is denotes as fvf_v in Fig. 3.

Multi-level multi-modal fusion. Unlike the previous works that perform only one level of fusion between linguistic and visual features that may limit the capacity of these models to learn a good joint semantic space. In our work, a multi-level multi-modal fusion that encourages the model to learn a better joint semantic space is introduced which takes the question-type representation got from question-type classification component as one of inputs.

  • First level multi-modal fusion: The first level fusion is similar to previous works. Given visual features fvf_v, question features fqf_{q}, and any joint modality mechanism,
    we combines visual features with question features and learn attention weights to weight for visual and/or linguistic features. Different attention mechanisms have different ways for learning the joint semantic space. The output of first level multi-modal fusion is denoted as fattf_{att} in the Fig.3.
  • Second level multi-modal fusion: In order to enhance the joint semantic space, the output of the first level multi-modal fusion fattf_{att} is combined with the question-type feature fqtf_{qt}, which is the output of the last FC layer of the Question-type classification component.
    We try two simple but effective operators, i.e. element-wise multiplication --- EWM or element-wise addition --- EWA, to combine fattf_{att} and fqtf_{qt}. The output of the second level multi-modal fusion, which is denoted as fattqtf_{att-qt} in Fig.3, can be seen as an attention representation that is aware of the question-type information.

Given an attention mechanism, the fattqtf_{att-qt} will be used as the input for a classifier that predicts an answer for the corresponding question. This is shown at the Answer prediction boxes in the Fig.3.

Multi-hypothesis interaction learning As presented in Fig.3, MILQT allows to utilize multiple hypotheses (i.e., joint modality mechanisms). Specifically, we propose a multi-hypothesis interaction learning design M\mathcal{M} that takes answer predictions produced by different joint modality mechanisms and interactively learn to combine them.
Let gRA×Jg \in \R^{A \times J} be the matrix of predicted probability distributions over AA answers from the JJ joint modality mechanisms. M\mathcal{M} outputs the distribution ρRA\rho \in \R^{A}, which is calculated from gg through Equation below:

ρ=M(g,wmil)=j(mqtansTwmilg)\rho = \mathcal{M} \left(g,w_{mil}\right) = \sum_{j}\left(m^T_{qt-ans}w_{mil} \odot g\right)

wmilRP×Jw_ {mil} \in \textbf{R}^{P \times J} is the learnable weight which control the contributions of JJ considered joint modality mechanisms on predicting answer based on the guiding of PP question types; \odot denotes Hardamard product.


Experiments on VQA 2.0 test-dev and test-standard.

We evaluate MILQTon the test-dev and test-standard of VQA 2.0 dataset. To train the model,similar to previous works, we use both training set and validationset of VQA 2.0. We also use the Visual Genome as additional training data. MILQT consists of three joint modality mechanisms, i.e., BAN-2, BAN-2-Counter, and SAN accompanied with the EWM for the multi-modal fusion, andthe predicted question type together with the prior information to augment theVQA loss. Table 4 presents the results of different methods on test-dev and test-std of VQA 2.0. The results show that our MILQT yields the good performance with the most competitive approaches.


Table 1. Comparison to the state of the arts on the test-dev and test-standard of VQA 2.0. For fair comparison, Glove embedding and GRU are leveraged for question embedding and Bottom-up features are used to extract visual information. CMP, i.e.Cross-Modality with Pooling, is the LXMERT with the aforementioned setup (Source).

Experiments on TDIUC.

In order to prove the stability of MILQT, we evaluate MILQT on TDIUC dataset.
The results in Table.2 show that the proposed model establishes the state-of-the-art results on both evaluation metrics Arithmetic MPT and Harmonic MPT. Specifically, our model significantly outperforms the recent QTA, i.e., on the overall, we improve over QTA 6.1%6.1\% and 11.1%11.1\% with Arithemic MPT and Harmonic MPT metrics, respectively. It is worth noting that the results of QTA in Table. 2, which are cited from QTA, are achieved when QTA used the one-hot predicted question type of testing question to weight visual features. When using the groundtruth question type to weight visual features, QTA reported 69.11%69.11\% and 60.08%60.08\% for Arithemic MPT and Harmonic MPT metrics, respectively. Our model also outperforms these performances a large margin, i.e., the improvements are 3.9%3.9\% and 6.8%6.8\% for Arithemic MPT and Harmonic MPT metrics, respectively.


Table 2. The comparative results between the proposed model and other models onthe validation set of TDIUC (Source).


We present a multiple interaction learning with question-type prior knowledge for constraining answer search space--- MILQT that takes into account the question-type information to improve the VQA performance at different stages. The system also allows to utilize and learn different attentions under a unified model in an interacting manner. The extensive experimental results show that all proposed components improve the VQA performance. We yields the best performance with the most competitive approaches on VQA 2.0 and TDIUC dataset.

Open Source

Github: https://github.com/aioz-ai/ECCVW20_MILQT

Overcoming Data Limitation in Medical Visual Question Answering

What are the difficulties when dealing with Medical VQA task?

Visual Question Answering (VQA) aims to provide a correct answer to a given question such that the answer is consistent with the visual content of a given image.

In medical domain, VQA could benefit both doctors and patients. For example, doctors could use answers provided by VQA system as support materials in decision making, while patients could ask VQA questions related to their medical images for better understanding their health.


Figure 1: An example of Medical VQA (Source).

However, one major problem with medical VQA is the lack of large scale labeled training data which usually requires huge efforts to build.

  • The first attempt for building the dataset for medical VQA is by ImageCLEF-Med. In this, images were automatically captured from PubMed Central articles. The questions and answers were automatically generated from corresponding captions of images. By that construction, the data has high noisy level, i.e., the dataset includes many images that are not useful for direct patient care and it also contains questions that do not make any sense.
  • Recently, the first manually constructed VQA-RAD dataset for medical VQA task is released. Unfortunately, it contains only 315 images, which prevents to directly apply the powerful deep learning models for the VQA problem. One may think about the use of transfer learning in which the pretrained deep learning models that are trained on the large scale labeled dataset such as ImageNet are used for finetuning on the medical VQA. However, due to difference in visual concepts between ImageNet images and medical images, finetuning with very few medical images is not sufficient.

Therefore it is necessary to develop a new VQA framework that can improve the accuracy while still only needs a small labeled training data.

The motivation for our approach to overcome the data limitation of medical VQA comes from two observations:

  • Firstly, we observe that there are large scale unlabeled medical images available. These images are from same domain with medical VQA images. Hence if we train an unsupervised deep learning model using these unlabeled images, the trained weights may be easier to be adapted to the medical VQA problem than the pretrained weights on ImageNet images.
  • Another observation is that although the labeled dataset VQA-RAD is primarily designed for VQA, by spending a little effort, we can extract the new class labels for that dataset. The new class labels allow us to apply the recent meta-learning technique for learning meta-weights, that can be quickly adapted to the VQA problem later.


The proposed medical VQA framework is presented in Figure 2. In our framework, the image feature extraction component is initialized by pretrained weights from MAML and CDAE. After that, the VQA framework will be finetuned in an end-to-end manner on the medical VQA data. In the following sections, we detail the architectures of MAML, CDAE, and our framework.


Figure 2: The proposed medical VQA. The image feature extraction is denoted as 'Mixture of Enhanced Visual Features (MEVF)' and is marked with the red dashed box. The weights of MEVF are intialized by MAML and CDAE (Source).

Model-Agnostic Meta-Learning -- MAML

The MAML model consists of four 3×33\times3 convolutional layers with stride 22 and is ended with a mean pooling layer; each convolutional layer has 6464 filters and is followed by a ReLu layer.

We create the dataset for training MAML by manually reviewing around three thousand question-answer pairs from the training set of VQA-RAD dataset. In our annotation process, images are split into three parts based on its body part labels (head, chest, abdomen). Images from each body part are further divided into three subcategories based on the interpretation from the question-answer pairs corresponding to the images. These subcategories are: 1. normal images in which no pathology is found. 2. abnormal present images in which there are the existence of fluid, air, mass, or tumor. 3. abnormal organ images in which the organs are large in size or in wrong position.

Thus, all the images are categorized into 9 classes:

| head normal | head abnormal present | head abnormal organ |
| chest normal | chest abnormal organ | chest abnormal present |
| abdominal normal | abdominal abnormal organ | abdominal abnormal present |

For every iteration of MAML training (line 3 in Alg.1), 5 tasks are sampled per iteration. For each task, we randomly select 3 classes (from 9 classes). For each class, we randomly select 6 images in which 3 images are used for updating task models and the remaining 3 images are used for updating meta-model.


Denoising Auto Encoder -- CDAE

The encoder maps an image xx', which is the noisy version of the original image xx, to a latent representation zz which retains useful amount of information. The decoder transforms zz to the output yy. The training algorithm aims to minimize the reconstruction error between yy and the original image xx as follows

Lrec=xy22L_{rec} = \left \| x-y \right \|_2^2

In our design, the encoder is a stack of convolutional layers; each of them is followed by a max pooling layer. The decoder is a stack of deconvolutional and convolutional layers. The noisy version xx' is achieved by adding Gaussian noise to the original image xx.

To train CDAE, we collect 11,77911,779 unlabeled images available online which are brain MRI images, chest X-ray images and CT abdominal images. The dataset is split into train set with 9,4239,423 images and test set with 2,3562,356 images. We use Gaussian noise to corrupt the input images before feeding them to the encoder.

Our VQA framework

After training MAML and CDAE, we use their trained weights to initialize the MEVF image feature extraction component in the VQA framework. We then finetune the whole VQA model using the training set of VQA-RAD dataset.

To train the proposed model, we introduce a multi-task loss func-tion to incorporate the effectiveness of the CDAE to VQA. Formally, our lossfunction is defined as follows:

L=α1Lvqa+α2LrecL = \alpha_1 L_{vqa} + \alpha_2 L_{rec}

where LvqaL_{vqa} is a Cross Entropy loss for VQA classification and LrecL_{rec} stands for the reconstruction loss of CDAE . The whole VQA model is finetuned in an end-to-end manner.



Table 1: VQA results on VQA-RAD test set. All reference methods differ at the image feature extraction component. Other components are similar. The Stacked Attention Network (SAN) is used as the attention mechanism in all methods (Source).

Table 1 presents VQA accuracy in both VQA-RAD open-ended and close-ended questions on the test set. The results show that for both MAML and CDAE, by firstly pretraining then finetuning, the finetuning significantly improves the performance over the training from scratch using only VQA-RAD.

In addition, the results also show that our pretraining and finetuning of MAML and CDAE give better performance than the finetuning of VGG-16 which is pretrained on the ImageNet dataset. Our proposed image feature extraction MEVF which leverages both pretrained weights of MAML and CDAE, then finetuning them give the best performance. This confirms the effectiveness of the proposed MEVF for dealing with the limitation of labeled training data for medical VQA.


Table 2: Performance comparison on VQA-RAD test set (Source).

Table 2 presents comparative results between methods. Note that for the image feature extraction, the baselines use the pretrained models (VGG or ResNet) that have been trained on ImageNet and then finetune on the VQA-RAD dataset. For the question feature extraction, all baselines and our framework use the same pretrained models (i.e., Glove) and finetuning on VQA-RAD. The results show that when BAN or SAN is used as the attention mechanism in our framework, it significantly outperforms the baseline frameworks BAN and SAN. Our best setting, i.e. the one with BAN as the attention, achieves the state-of-the-art results and it significantly outperforms the best baseline framework BAN, i.e., the improvements are 16.3%16.3\% and 8.6%8.6\% on open-ended and close-ended VQA, respectively.


In this paper, we proposed a novel medical VQA framework that leverages the meta-learning MAML and denoising auto-encoder CDAE for image feature extraction in order to overcome the limitation of labeled training data. Specifically, CDAE helps to leverage information from the large scale unlabeled images, while MAML helps to learn meta-weights that can be quickly adapted to the VQA problem. We establish new state-of-the-art results on VQA-RAD dataset for both close-ended and open-ended questions.

Open Source

🐱 Github: https://github.com/aioz-ai/MICCAI19-MedVQA

A Brief Introduction to Visual Question Answering

1. Visual Question Answering - Overview

Visual Question Answering (VQA) aims to figure out a correct answer for a given question consistent with the visual content of a given image. The overarching goal of this issue is to create systems that can comprehend the contents of an image in the same way that humans do and communicate effectively about that image in natural language. It is indeed a challenging task as it necessitates the interaction and complementation of both image feature extractor and natural language processor.

There are two main variants of VQA which are Free-Form Opened-Ended (FFOE) VQA and Multiple Choice (MC) VQA. In FFOE VQA, an answer is a free-form response to a given image-question pair input, while in MC VQA, an answer is chosen from an answer list for a given image-question pair input. The discussion of VQA variants will be shared in the next post.

2. Approaches for solving VQA task.

There are three main approaches for VQA:

  • *Compositional VQA models:* the questions are interpreted as a set of many sub-tasks.
  • Bayesian and Question-Aware models: this method is not suitable for use in systems that respond to image-related questions. Since the algorithm based on this method does not try looking at the picture and instead predicts the response based on the Bayesian model by determining the probability of the words in the dataset's responses.
  • Attention based models: this method try to learn the interaction between image and question features in VQA task through a module called attention. Then, the joint features got from that module are leveraged for answering the corresponding question.

The final one is the most successful approach since recent states of the arts included attention mechanisms.

3. Attention based VQA approach.

In general, attention based VQA approaches have four main steps (See Figure 1):

  • Visual Representation: Encode the information from the image into vector(s) by using Convolutional Neural Network (CNN)
  • Textual Representation: Encode the information of question into vector(s) by using Embedding.
  • Joint Representation: A further step to learn the interaction between question(s) and image(s). Output joint features can be vector(s).
  • Answer prediction: the joint features from the previous step are then passed through this module to obtain the predicted answer. This module is mostly formed by a Classification.


Figure 1. The general approach for Visual Question Answering.

3.1. Visual Representation

The basic attributes or aspects that clearly help us recognize a specific object, image, or something are known as features. The distinguishing characteristics are the distinct properties. When operating on a VQA dataset, we must extract the features of various images in order to separate the images based on specific features or aspects. Image features are one of the most important pieces of information for a VQA system to output the correct answer.

Convolutional neural networks have emerged as the gold standard for image pattern recognition. An input image is converted into image features after it is passed through a convolutional network. Each filter in a CNN layer detects various patterns, such as corners, vertex, shapes, curves, and symmetries (See Figure 2).

Img Extraction

Figure 2. An example of feature extraction in VQA classification.

The majority of VQA literature employs CNNs for image processing. The network's final layer is removed, and the remaining network is used to extract image features. For image representation in VQA, objects in images represented by features extracted from an object detector such as the Faster-RCNN bottom-up model.

3.2. Textual Representation

Textual embeddings can be offered in a variety of ways. Count-based and frequency-based techniques such as count vectorization and TF-IDF are examples of older approaches. There are also prediction-based approaches such as a continuous bag of words and skip grams. Pretrained Word2Vec models are also openly accessible. Embeddings can also be created using deep learning architectures such as RNNs, LSTMs, GRUs, and 1-D CNNs. LSTMs are one of the most often used in VQA literature. For question embedding in VQA, Glove or BERT are used widely for capturing the representation of words and sentences in different contexts (See Figure 3 for a sample structure of question embedding).


Figure 3. An example of question embedding for VQA.

3.3. Joint Representation

In current VQA systems, the joint modality component plays an essential role since it would learn meaningful joint representations between linguistic and visual inputs by applying the attention mechanism. There are many works that learn the interaction between question and image. For instance, a novel trilinear interaction model which simultaneously learns high level associations between image, question and answer information- CTI (Do et al. 2018). See Figure 4 for more details.


Figure 4. Compact Trilinear Interaction mechanism for VQA (Source).

3.4. Answer Prediction

In most recent works, the joint features got from the attention mechanism is then passed through a classifier to output predicted answer. However, more modules can also be applied to produce external knowledge and deal with difficult questions. ````

Compact Trilinear Interaction for Visual Question Answering

In Visual Question Answering (VQA), answers have a great correlation with question meaning and visual contents. Thus, to selectively utilize image, question and answer information, we propose a novel trilinear interaction model which simultaneously learns high level associations between these three inputs. In addition, to overcome the interaction complexity, we introduce a multimodal tensor-based PARALIND decomposition which efficiently parameterizes trilinear interaction between the three inputs. Moreover, knowledge distillation is applied in Free-form Opened-ended VQA. It is not only for reducing the computational cost and required memory but also for transferring knowledge from trilinear interactionmodel to bilinear interaction model. The extensive experiments on benchmarking datasets TDIUC, VQA-2.0, and Visual7W show that the proposed compact trilinear interaction model achieves state-of-the-art results on all three datasets.

For free-form opened-ended VQA task, CTI achieved 67.4 on VQA-2.0 and 87.0 on TDIUC dataset in VQA accuracy metric.

For multiple choice VQA task, CTI achieved 72.3 on Visual7W dataset in MC-VQA accuracy metric.

Compact Trilinear Interaction in VQA.

Let M={M1,M2,M3}M = \{M_1, M_2, M_3\} be the representations of three inputs. MtRnt×dtM_t \in \textbf{R}^{n_t \times d_t}, where ntn_t is the number of channels of the input MtM_t and dtd_t is the dimension of each channel.
For example, if M1M_1 is the region-based representation for an image, then n1n_1 is the number of regions and d1d_1 is the dimension of the feature representation for each region. Let mteR1×dtm_{t_e} \in \textbf{R}^{1 \times d_{t}} be the ethe^{th} row of MtM_t, i.e., the feature representation of ethe^{th} channel in MtM_t, where t{1,2,3}t \in \{1, 2, 3\}.

The input for training VQA is set of (V,Q,A)(V,Q,A) in which VV is an image representation; VRv×dvV \in \textbf{R}^{v \times d_v} where vv is the number of interested regions (or bounding boxes) in the image and dvd_v is the dimension of the representation for a region; QQ is a question representation; QRq×dqQ \in \textbf{R}^{q \times d_q }
where qq is the number of hidden states and dqd_q is the dimension for each hidden state.
AA is an answer representation; ARa×daA \in \textbf{R}^{a \times d_a}
where aa is the number of hidden states and dad_a is the dimension for each hidden state.

We firstly compute the attention map M\mathcal{M} as follows:

M=r=1RGr;VWvr,QWqr,AWar\mathcal{M} = \sum^R_{r=1} {\llbracket \mathcal{G}_r; V W_{v_r}, Q W_{q_r}, A W_{a_r}\rrbracket}

Then the joint representation zz is computed as follows:

zT=i=1vj=1qk=1aMijk(ViWzvQjWzqAkWza)z^T= \sum_{i=1}^{v}\sum_{j=1}^{q}\sum_{k=1}^{a} \mathcal{M}_{ijk}\left( V_{i}W_{z_v} \circ Q_{j}W_{z_q} \circ A_{k}W_{z_a}\right)

where Wvr,Wqr,WarW_{v_r},W_{q_r}, W_{a_r} and Wzv,Wzq,WzaW_{z_v},W_{z_q}, W_{z_a} are learnable factor matrices; each Gr\mathcal{G}_r is a learnable Tucker tensor.

Integrate CTI into different VQA task

For multiple choice VQA


Figure 1. The model when CTI is applied to MC VQA.

Each input question and each answer are trimmed to a maximum of 12 words which will then be zero-padded if shorter than 12 words. Each word is then represented by a 300-D GloVe word embedding. Each image is represented by a 14×14×204814 \times 14 \times 2048 grid feature (i.e., 196196 cells; each cell is with a 20482048-D feature), extracted from the second last layer of ResNet-152 which is pre-trained on ImageNet.

Input samples are divided into positive samples and negative samples. A positive sample, which is labelled as 11 in binary classification, contains image, question and the right answer. A negative sample, which is labelled as 00 in binary classification, contains image, question, and the wrong answer. These samples are then passed through CTI to get the joint representation zz. The joint representation is passed through a binary classifier to get the prediction. The Binary Cross Entropy loss is used for training the model.

For free-form opened-ended VQA


Figure 2. The model when CTI is applied to FFOE VQA.

Unlike MC VQA, FFOE VQA treats the answering as a classification problem over the set of predefined answers. Hence the set possible answers for each question-image pair is much more than the case of MC VQA. For each question-image input, the model takes every possible answers from its answer list to computed the joint representation, causes high computational cost.

In addition, CTI requires all three V,Q,AV, Q, A inputs to compute the joint representation. However, during the testing, there are no available answer information in FFOE VQA. To overcome these challenges, we propose to use Knowledge Distillation to transfer the learned knowledge from a teacher model to a student model.

The loss function for the student model is defined as:

LKD=αT2LCE(QSτ,QTτ)+(1α)LCE(QS,ytrue)\mathcal{L}_{KD} = \alpha T^2 \mathcal{L}_{CE}(Q^\tau_S, Q^\tau_T) + (1-\alpha)\mathcal{L}_{CE}(Q_S,y_{true})

where LCE\mathcal{L}_{CE} stands for Cross Entropy loss; QSQ_S is the standard softmax output of the student; ytruey_{true} is the ground-truth answer labels;
α\alpha is a hyper-parameter for controlling the importance of each loss component; QSτ,QTτQ^\tau_S, Q^\tau_T are the softened outputs of the student and the teacher using the same temperature parameter TT, which are computed as follows:

Qiτ=exp(li/T)iexp(li/T)Q^\tau_i = \frac{exp(l_i/T)}{\sum_{i} exp(l_i/T)}

where for both teacher and the student models, the logit ll is the predictions outputted by the corresponding classifiers.



Table 1. Performance of CTI and BAN2, SAN in VQA-2.0 validation set and test-dev set. BAN2-CTI and SANCTI are student models trained under the teacher model.

To further evaluate the effectiveness of CTI, we conduct a detailed comparison with the current state of the art. For FFOE VQA, we compare CTI with the recent state-of-the-art methods on TDIUC and VQA-2.0 datasets. For MC VQA, we compare with the state-of-the-art methods on Visual7W dataset.


Table 2. Performance comparison between different approaches with different evaluation metrics on TDIUC validation set. BAN2-CTI and SAN-CTI are the student models trained under compact trilinear interaction teacher model.

Regarding FFOE VQA, Table 1 and Table 2 show comparative results on VQA-2.0 and TDIUC respectively. Specifcaly, Table 1 shows that distilled student BAN2-CTI outperforms all compared methods over all metrics by a large margin, i.e., the model outperforms the current state-of-the-art QTA on TDIUC by 3.4%3.4\% and 5.4%5.4\% on Ari and Har metrics, respectively. The results confirm that trilinear interaction has learned informative representations from the three inputs and the learned information is effectively transferred to student models by distillation.


Table 3. Performance comparison between different approaches on Visual7W test set. Both training set and validation set are used for training. All models but CTIwBoxes are trained with same image and question representations. Both train set and validation set are used for training. Note that CTIwBoxes is the CTI model using Bottom-up features. instead of grid features for image representation.

Regarding MC VQA, Table 3 shows that the CTI outperforms compared methods by a noticeable margin. This model outperforms the current state-of-the-art STL by 1.1%. Again, this validates the effectiveness of the proposed joint presentation learning, which precisely and simultaneously learns interactions between the three inputs. We note that when comparing with other methods on Visual7W, for image representations, we used the grid features extracted from ResNet-512 for a fair comparison. Our proposed model can achieve further improvements by using the object detection-based features used in FFOE VQA. With new features, the model denoted as CTIwBoxes in Table 3 achieve 72.3% accuracy with Acc-MC metric which improves over the current state-of-the-art STL 4.1%.


A novel compact trilinear interaction is introduced to simultaneously learns high level associations between image, question, and answer in both MC VQA and FFOE VQA. In addition, knowledge distillation is the first time applied to FFOE VQA to overcome the computational complexity and memory issue of the interaction. The extensive experimental results show that these models achieve the state-of-the-art results on three benchmarking datasets.

Open Source

Github: https://github.com/aioz-ai/ICCV19_VQA-CTI