/ / 75 min read

Compact Trilinear Interaction for Visual Question Answering

A compression method for Visual Question Answering.

In Visual Question Answering (VQA), answers have a great correlation with question meaning and visual contents. Thus, to selectively utilize image, question and answer information, we propose a novel trilinear interaction model which simultaneously learns high level associations between these three inputs. In addition, to overcome the interaction complexity, we introduce a multimodal tensor-based PARALIND decomposition which efficiently parameterizes trilinear interaction between the three inputs. Moreover, knowledge distillation is applied in Free-form Opened-ended VQA. It is not only for reducing the computational cost and required memory but also for transferring knowledge from trilinear interactionmodel to bilinear interaction model. The extensive experiments on benchmarking datasets TDIUC, VQA-2.0, and Visual7W show that the proposed compact trilinear interaction model achieves state-of-the-art results on all three datasets.

For free-form opened-ended VQA task, CTI achieved 67.4 on VQA-2.0 and 87.0 on TDIUC dataset in VQA accuracy metric.

For multiple choice VQA task, CTI achieved 72.3 on Visual7W dataset in MC-VQA accuracy metric.

Compact Trilinear Interaction in VQA.

Let M={M1,M2,M3}M = \{M_1, M_2, M_3\} be the representations of three inputs. MtRnt×dtM_t \in \textbf{R}^{n_t \times d_t}, where ntn_t is the number of channels of the input MtM_t and dtd_t is the dimension of each channel.
For example, if M1M_1 is the region-based representation for an image, then n1n_1 is the number of regions and d1d_1 is the dimension of the feature representation for each region. Let mteR1×dtm_{t_e} \in \textbf{R}^{1 \times d_{t}} be the ethe^{th} row of MtM_t, i.e., the feature representation of ethe^{th} channel in MtM_t, where t{1,2,3}t \in \{1, 2, 3\}.

The input for training VQA is set of (V,Q,A)(V,Q,A) in which VV is an image representation; VRv×dvV \in \textbf{R}^{v \times d_v} where vv is the number of interested regions (or bounding boxes) in the image and dvd_v is the dimension of the representation for a region; QQ is a question representation; QRq×dqQ \in \textbf{R}^{q \times d_q }
where qq is the number of hidden states and dqd_q is the dimension for each hidden state.
AA is an answer representation; ARa×daA \in \textbf{R}^{a \times d_a}
where aa is the number of hidden states and dad_a is the dimension for each hidden state.

We firstly compute the attention map M\mathcal{M} as follows:

M=r=1RGr;VWvr,QWqr,AWar\mathcal{M} = \sum^R_{r=1} {\llbracket \mathcal{G}_r; V W_{v_r}, Q W_{q_r}, A W_{a_r}\rrbracket}

Then the joint representation zz is computed as follows:

zT=i=1vj=1qk=1aMijk(ViWzvQjWzqAkWza)z^T= \sum_{i=1}^{v}\sum_{j=1}^{q}\sum_{k=1}^{a} \mathcal{M}_{ijk}\left( V_{i}W_{z_v} \circ Q_{j}W_{z_q} \circ A_{k}W_{z_a}\right)

where Wvr,Wqr,WarW_{v_r},W_{q_r}, W_{a_r} and Wzv,Wzq,WzaW_{z_v},W_{z_q}, W_{z_a} are learnable factor matrices; each Gr\mathcal{G}_r is a learnable Tucker tensor.

Integrate CTI into different VQA task

For multiple choice VQA


Figure 1. The model when CTI is applied to MC VQA.

Each input question and each answer are trimmed to a maximum of 12 words which will then be zero-padded if shorter than 12 words. Each word is then represented by a 300-D GloVe word embedding. Each image is represented by a 14×14×204814 \times 14 \times 2048 grid feature (i.e., 196196 cells; each cell is with a 20482048-D feature), extracted from the second last layer of ResNet-152 which is pre-trained on ImageNet.

Input samples are divided into positive samples and negative samples. A positive sample, which is labelled as 11 in binary classification, contains image, question and the right answer. A negative sample, which is labelled as 00 in binary classification, contains image, question, and the wrong answer. These samples are then passed through CTI to get the joint representation zz. The joint representation is passed through a binary classifier to get the prediction. The Binary Cross Entropy loss is used for training the model.

For free-form opened-ended VQA


Figure 2. The model when CTI is applied to FFOE VQA.

Unlike MC VQA, FFOE VQA treats the answering as a classification problem over the set of predefined answers. Hence the set possible answers for each question-image pair is much more than the case of MC VQA. For each question-image input, the model takes every possible answers from its answer list to computed the joint representation, causes high computational cost.

In addition, CTI requires all three V,Q,AV, Q, A inputs to compute the joint representation. However, during the testing, there are no available answer information in FFOE VQA. To overcome these challenges, we propose to use Knowledge Distillation to transfer the learned knowledge from a teacher model to a student model.

The loss function for the student model is defined as:

LKD=αT2LCE(QSτ,QTτ)+(1α)LCE(QS,ytrue)\mathcal{L}_{KD} = \alpha T^2 \mathcal{L}_{CE}(Q^\tau_S, Q^\tau_T) + (1-\alpha)\mathcal{L}_{CE}(Q_S,y_{true})

where LCE\mathcal{L}_{CE} stands for Cross Entropy loss; QSQ_S is the standard softmax output of the student; ytruey_{true} is the ground-truth answer labels;
α\alpha is a hyper-parameter for controlling the importance of each loss component; QSτ,QTτQ^\tau_S, Q^\tau_T are the softened outputs of the student and the teacher using the same temperature parameter TT, which are computed as follows:

Qiτ=exp(li/T)iexp(li/T)Q^\tau_i = \frac{exp(l_i/T)}{\sum_{i} exp(l_i/T)}

where for both teacher and the student models, the logit ll is the predictions outputted by the corresponding classifiers.



Table 1. Performance of CTI and BAN2, SAN in VQA-2.0 validation set and test-dev set. BAN2-CTI and SANCTI are student models trained under the teacher model.

To further evaluate the effectiveness of CTI, we conduct a detailed comparison with the current state of the art. For FFOE VQA, we compare CTI with the recent state-of-the-art methods on TDIUC and VQA-2.0 datasets. For MC VQA, we compare with the state-of-the-art methods on Visual7W dataset.


Table 2. Performance comparison between different approaches with different evaluation metrics on TDIUC validation set. BAN2-CTI and SAN-CTI are the student models trained under compact trilinear interaction teacher model.

Regarding FFOE VQA, Table 1 and Table 2 show comparative results on VQA-2.0 and TDIUC respectively. Specifcaly, Table 1 shows that distilled student BAN2-CTI outperforms all compared methods over all metrics by a large margin, i.e., the model outperforms the current state-of-the-art QTA on TDIUC by 3.4%3.4\% and 5.4%5.4\% on Ari and Har metrics, respectively. The results confirm that trilinear interaction has learned informative representations from the three inputs and the learned information is effectively transferred to student models by distillation.


Table 3. Performance comparison between different approaches on Visual7W test set. Both training set and validation set are used for training. All models but CTIwBoxes are trained with same image and question representations. Both train set and validation set are used for training. Note that CTIwBoxes is the CTI model using Bottom-up features. instead of grid features for image representation.

Regarding MC VQA, Table 3 shows that the CTI outperforms compared methods by a noticeable margin. This model outperforms the current state-of-the-art STL by 1.1%. Again, this validates the effectiveness of the proposed joint presentation learning, which precisely and simultaneously learns interactions between the three inputs. We note that when comparing with other methods on Visual7W, for image representations, we used the grid features extracted from ResNet-512 for a fair comparison. Our proposed model can achieve further improvements by using the object detection-based features used in FFOE VQA. With new features, the model denoted as CTIwBoxes in Table 3 achieve 72.3% accuracy with Acc-MC metric which improves over the current state-of-the-art STL 4.1%.


A novel compact trilinear interaction is introduced to simultaneously learns high level associations between image, question, and answer in both MC VQA and FFOE VQA. In addition, knowledge distillation is the first time applied to FFOE VQA to overcome the computational complexity and memory issue of the interaction. The extensive experimental results show that these models achieve the state-of-the-art results on three benchmarking datasets.

Open Source

Github: https://github.com/aioz-ai/ICCV19_VQA-CTI

Like What You See?