# Compact Trilinear Interaction for Visual Question Answering

In Visual Question Answering (VQA), answers have a great correlation with question meaning and visual contents. Thus, to selectively utilize image, question and answer information, we propose a novel trilinear interaction model which simultaneously learns high level associations between these three inputs. In addition, to overcome the interaction complexity, we introduce a multimodal tensor-based PARALIND decomposition which efficiently parameterizes trilinear interaction between the three inputs. Moreover, knowledge distillation is applied in Free-form Opened-ended VQA. It is not only for reducing the computational cost and required memory but also for transferring knowledge from trilinear interactionmodel to bilinear interaction model. The extensive experiments on benchmarking datasets TDIUC, VQA-2.0, and Visual7W show that the proposed compact trilinear interaction model achieves state-of-the-art results on all three datasets.

For free-form opened-ended VQA task, CTI achieved **67.4** on VQA-2.0 and **87.0** on TDIUC dataset in VQA accuracy metric.

For multiple choice VQA task, CTI achieved **72.3** on Visual7W dataset in MC-VQA accuracy metric.

## Compact Trilinear Interaction in VQA.

Let $M = \{M_1, M_2, M_3\}$ be the representations of three inputs. $M_t \in \textbf{R}^{n_t \times d_t}$, where $n_t$ is the number of channels of the input $M_t$ and $d_t$ is the dimension of each channel.

For example, if $M_1$ is the region-based representation for an image, then $n_1$ is the number of regions and $d_1$ is the dimension of the feature representation for each region. Let $m_{t_e} \in \textbf{R}^{1 \times d_{t}}$ be the $e^{th}$ row of $M_t$, i.e., the feature representation of $e^{th}$ channel in $M_t$, where $t \in \{1, 2, 3\}$.

The input for training VQA is set of $(V,Q,A)$ in which $V$ is an image representation; $V \in \textbf{R}^{v \times d_v}$ where $v$ is the number of interested regions (or bounding boxes) in the image and $d_v$ is the dimension of the representation for a region; $Q$ is a question representation; $Q \in \textbf{R}^{q \times d_q }$

where $q$ is the number of hidden states and $d_q$ is the dimension for each hidden state.

$A$ is an answer representation; $A \in \textbf{R}^{a \times d_a}$

where $a$ is the number of hidden states and $d_a$ is the dimension for each hidden state.

We firstly compute the attention map $\mathcal{M}$ as follows:

Then the joint representation $z$ is computed as follows:

where $W_{v_r},W_{q_r}, W_{a_r}$ and $W_{z_v},W_{z_q}, W_{z_a}$ are learnable factor matrices; each $\mathcal{G}_r$ is a learnable Tucker tensor.

## Integrate CTI into different VQA task

### For multiple choice VQA

**Figure 1.**The model when CTI is applied to MC VQA.Each input question and each answer are trimmed to a maximum of 12 words which will then be zero-padded if shorter than 12 words. Each word is then represented by a 300-D GloVe word embedding. Each image is represented by a $14 \times 14 \times 2048$ grid feature (i.e., $196$ cells; each cell is with a $2048$-D feature), extracted from the second last layer of ResNet-152 which is pre-trained on ImageNet.

Input samples are divided into positive samples and negative samples. A positive sample, which is labelled as $1$ in binary classification, contains image, question and the right answer. A negative sample, which is labelled as $0$ in binary classification, contains image, question, and the wrong answer. These samples are then passed through CTI to get the joint representation $z$. The joint representation is passed through a binary classifier to get the prediction. The Binary Cross Entropy loss is used for training the model.

### For free-form opened-ended VQA

**Figure 2.**The model when CTI is applied to FFOE VQA.Unlike MC VQA, FFOE VQA treats the answering as a classification problem over the set of predefined answers. Hence the set possible answers for each question-image pair is much more than the case of MC VQA. For each question-image input, the model takes every possible answers from its answer list to computed the joint representation, causes high computational cost.

In addition, CTI requires all three $V, Q, A$ inputs to compute the joint representation. However, during the testing, there are no available answer information in FFOE VQA. To overcome these challenges, we propose to use Knowledge Distillation to transfer the learned knowledge from a teacher model to a student model.

The loss function for the student model is defined as:

where $\mathcal{L}_{CE}$ stands for Cross Entropy loss; $Q_S$ is the standard softmax output of the student; $y_{true}$ is the ground-truth answer labels;

$\alpha$ is a hyper-parameter for controlling the importance of each loss component; $Q^\tau_S, Q^\tau_T$ are the softened outputs of the student and the teacher using the same temperature parameter $T$, which are computed as follows:

where for both teacher and the student models, the logit $l$ is the predictions outputted by the corresponding classifiers.

## Results

**Table 1.**Performance of CTI and BAN2, SAN in VQA-2.0 validation set and test-dev set. BAN2-CTI and SANCTI are student models trained under the teacher model.To further evaluate the effectiveness of CTI, we conduct a detailed comparison with the current state of the art. For FFOE VQA, we compare CTI with the recent state-of-the-art methods on TDIUC and VQA-2.0 datasets. For MC VQA, we compare with the state-of-the-art methods on Visual7W dataset.

**Table 2.**Performance comparison between different approaches with different evaluation metrics on TDIUC validation set. BAN2-CTI and SAN-CTI are the student models trained under compact trilinear interaction teacher model.Regarding FFOE VQA, Table 1 and Table 2 show comparative results on VQA-2.0 and TDIUC respectively. Specifcaly, Table 1 shows that distilled student BAN2-CTI outperforms all compared methods over all metrics by a large margin, i.e., the model outperforms the current state-of-the-art QTA on TDIUC by $3.4\%$ and $5.4\%$ on Ari and Har metrics, respectively. The results confirm that trilinear interaction has learned informative representations from the three inputs and the learned information is effectively transferred to student models by distillation.

**Table 3.**Performance comparison between different approaches on Visual7W test set. Both training set and validation set are used for training. All models but CTIwBoxes are trained with same image and question representations. Both train set and validation set are used for training. Note that CTIwBoxes is the CTI model using Bottom-up features. instead of grid features for image representation.Regarding MC VQA, Table 3 shows that the CTI outperforms compared methods by a noticeable margin. This model outperforms the current state-of-the-art STL by 1.1%. Again, this validates the effectiveness of the proposed joint presentation learning, which precisely and simultaneously learns interactions between the three inputs. We note that when comparing with other methods on Visual7W, for image representations, we used the grid features extracted from ResNet-512 for a fair comparison. Our proposed model can achieve further improvements by using the object detection-based features used in FFOE VQA. With new features, the model denoted as CTIwBoxes in Table 3 achieve 72.3% accuracy with Acc-MC metric which improves over the current state-of-the-art STL 4.1%.

## Conclusion

A novel compact trilinear interaction is introduced to simultaneously learns high level associations between image, question, and answer in both MC VQA and FFOE VQA. In addition, knowledge distillation is the first time applied to FFOE VQA to overcome the computational complexity and memory issue of the interaction. The extensive experimental results show that these models achieve the state-of-the-art results on three benchmarking datasets.