/ / 32 min read

Multiple Meta-model Quantifying for Medical Visual Question Answering

A new multiple meta-model quantifying method that effectively learns meta-annotation and leverages meaningful features to the medical VQA task


A medical Visual Question Answering (VQA) system can provide meaningful references for both doctors and patients during the treatment process. Extracting image features is one of the most important steps in a medical VQA framework which outputs essential information to predict answers. Transfer learning, in which the pretrained deep learning models that are trained on the large scale labeled dataset such as ImageNet, is a popular way to initialize the feature extraction process. However, due to the difference in visual concepts between ImageNet images and medical images, finetuning process is not sufficient. Recently, Model Agnostic Meta-Learning (MAML) has been introduced to overcome the aforementioned problem by learning meta-weights that quickly adapt to visual concepts. However, MAML is heavily impacted by the meta-annotation phase for all images in the medical dataset. Different from normal images, transfer learning in medical images is more challenging due to:

  • (i) noisy labels may occur when labeling images in an unsupervised manner;
  • (ii) high-level semantic labels cause uncertainty during learning;
  • (iii) difficulty in scaling up the process to all unlabeled images in medical datasets.

In this paper, we introduce a new Multiple Meta-model Quantifying (MMQ) process to address these aforementioned problems in MAML. Intuitively MMQ is designed to:

  • (i) effectively increase meta-data by auto-annotation;
  • (ii) deal with the noisy labels in the training phase by leveraging the uncertainty of predicted scores during the meta-agnostic process;
  • (iii) output meta-models which contain robust features for down-stream medical VQA task. Note that, compared with the recent approach for meta-learning in medical VQA, our proposed MMQ does not take advantage of additional out-of-dataset images, while achieves superior accuracy in two challenging medical VQA datasets.


Method overview

Our approach comprises two parts: our proposed multiple meta-model quantifying (MMQ - Figure 1) and a VQA framework for integrating meta-models outputted from MMQ (Figure 2). MMQ addresses the meta-annotation problem by outputting multiple meta-models. These models are expected to robust to each other and have high accuracy during the inference phase of model-agnostic tasks. The VQA framework aims to leverage different features extracted from candidate meta-models and then generates predicted answers.

Multiple meta-model quantifying


Figure 1: Multiple Meta-model Quantifying in medical VQA. Dotted lines denote looping steps, the number of loop equals to mm required meta-models (Source).

Multiple meta-model quantifying (Figure 1) contains three modules:

  • (i) Meta-training which trains a specific meta-model for extracting image features used in medical VQA task by following MAML;
  • (ii) Data refinement which increases the training data by auto-annotation and deal with the noisy label by leveraging the uncertainty of predicted scores;
  • (iii) Meta-quantifying which selects meta-models whose robust to each others and have high accuracy during inference phase of model-agnostic tasks.

Unlike MAML where only one meta-model is selected, we develop the following refinement and meta-quantifying steps to select high-quality meta-models for transfer learning to the medical VQA framework later.

Data refinement. After finishing the meta-training phase, the weights of the meta-models are used for refining the dataset. The module aims to expand the meta-data pool for meta-training and removes samples that are expected to be hard-to-learn or have noisy labels (See Algorithm 1 for more details).


Meta-quantifying. This module aims to identify candidate meta-models that are useful for the medical VQA task. A candidate model θ\theta should achieve high performance during the validating process and its features distinct from other features from other candidate models.


To achieve these goals, we design a fuse score SFS_F:

SF=γSP+(1γ)t=1m1Cosine(Fc,Ft)FcFtS_F = \gamma S_P + (1-\gamma)\sum_{t=1}^m 1 - Cosine \left(F_{c}, F_t\right) \forall F_{c} \neq F_t

where SPS_P is the predicted score of the current meta-model over ground-truth label; FcF_{c} is the feature extracted from the aforementioned meta-model that needs to compute the score; FtF_t is the feature extracted from tt-th model of the list of meta-model Θ\Theta; Cosine is using for similarity checking between two features.

Since the predicted score SPS_P at the ground-truth label and diverse score are co-variables, therefore the fuse score SFS_F is also covariate with both aforementioned scores. This means that the larger SFS_F is, the higher chance of the model to be selected for the VQA task. Algorithm 2 describes our meta-quantifying algorithm in details.

Integrate quantified meta-models to medical VQA framework


Figure 2: Our VQA framework is designed to integrate robust image features extracted from multiple meta-models outputted from MMQ (Source).

To leverage robust features extracted from quantified meta-models, we introduce a VQA framework as in Figure 2. Specifically, each input question is trimmed to a 1212-word sentence and then zero-padded if its length is less than 1212. Each word is represented by a 300-D GloVe word embedding. The word embedding is fed into a 1024-D LSTM to produce the question embedding fqf_q.

Each input image is passed through nn quantified meta-models got from the meta-quantifying module, which produce nn vectors. These vectors are concatenated to form an enhanced image feature, denoted as fvf_v in Figure 2. Since this vector contains multiple features extracted from different high-performed meta-models and each model has different views, the VQA framework is expected to be less affected by the bias problem. Image feature fvf_v and question embedding fqf_q are fed into an attention mechanism (BAN or SAN) to produce a joint representation faf_a. This feature faf_a is used as input for a multi-class classifier (over the set of predefined answer classes). To train the proposed model, we use a Cross Entropy loss for the answer classification task. The whole VQA framework is then fine-tuned in an end-to-end manner.

Experimental Results



Table 1: Performance comparison on VQA-RAD and Path-VQA test set. (*) indicates methods used pre-trained model on ImageNet dataset. We refine data 55 times (m=5m=5) and use 33 meta-models (n=3n=3) in our MMQ (Source).

Table 1 presents comparative results between different methods. The results show that our MMQ significantly outperforms other meta-learning methods by a large margin. Besides, the gain in performance of MMQ is stable with different attention mechanisms (BAN or SAN) in the VQA task. It worth noting that, compared with the most recent state-of-the-art method MEVF, we outperform 5.3%5.3\% in free-form questions of the PathVQA dataset and 9.8%9.8\% in the Open-ended questions of the VQA-RAD dataset, respectively. Moreover, no out-of-dataset images are used in MMQ for learning meta-models. The results imply that our proposed MMQ learns essential representative information from the input images and leverage effectively the features from meta-models to deal with challenging questions in medical VQA datasets.

Ablation Study


Table 2: The effectiveness of our MMQ under mm times refining data and nn quantified meta-models on PathVQA test set. BAN is used as the attention method (Source).

Table 2 presents our MMQ accuracy in PathVQA dataset when applying mm times refining data and nn quantified meta-models. The results show that, by using only 11 quantified meta-model outputted from our MMQ, we significantly outperform both MAML and MEVF baselines. This confirms the effectiveness of the proposed MMQ for dealing with the limitation of meta-annotation in medical VQA, i.e., noisy labels and scalability. Besides, leveraging more quantified meta-models also further improves the overall performance.

We note that the improvements of our MMQ are more significant on free-form questions over yes/no questions. This observation implies that the free-form questions/answers which are more challenging and need more information from input images benefits more from our proposed method.

Table 2 also shows that increasing the number of refinement steps and the number of quantified meta-models can improve the overall result, but the gain is smaller after each loop. The training time also increases when the number of meta-models is set higher. However, our testing time and the total number of parameters are only slightly higher than MAML and MEVF. Based on the empirical results, we recommend applying 55 times refinement with a maximum of 33 quantified meta-models to balance the trade-off between the accuracy performance and the computational cost.


In this paper, we proposed a new multiple meta-model quantifying method to effectively leverage meta-annotation and deal with noisy labels in the medical VQA task. The extensively experimental results show that our proposed method outperforms the recent state-of-the-art meta-learning based methods by a large margin in both PathVQA and VQA-RAD datasets. Our implementation and trained models will be released for reproducibility.

Open Source

🐱 Github: https://github.com/aioz-ai/MICCAI21_MMQ

Like What You See?