3D Downstream Tasks for Foundation Model (Part 1).

3D Visual Question Answering.

3D Visual Question Answering (3D VQA) is an emerging interdisciplinary task that combines computer vision, natural language processing, and 3D data analysis. It extends traditional 2D Visual Question Answering (VQA) by incorporating three-dimensional (3D) data, allowing the model to answer questions involving spatial relationships, object depth, volume, surface properties, and occlusions, capabilities that are challenging or impossible with 2D data alone. Answering such questions requires uncovering meaningful relationships between local entities in the scene, and therefore significantly enhances machine understanding of complex scenes. This blog provides an insightful overview of the 3D VQA topic.

Introduction

Given a question about a specific situation, we can easily provide an answer based on the context and available information, such as counting the number of vehicles in a camera image or identifying the position of a chair in a bedroom. However, enabling computers to answer these questions in a human-like manner presents numerous challenges.

Visual Question Answering (VQA) is a multidisciplinary task that lies at the intersection of computer vision, natural language processing, and reasoning. In VQA, a system is presented with an image and a related question in natural language. The goal of the system is to generate an accurate response, typically in the form of a single word or short phrase, based on the content of the image. The question-and-answer descriptions can focus on the whole image or a local entity in the image.

3D Visual Question Answering (3D VQA) is a challenging task at the intersection of computer vision, natural language processing, and 3D scene understanding. It extends the concept of traditional Visual Question Answering (VQA), where a system must answer questions about a 2D image, to three-dimensional environments. In 3D VQA, the system is required to process and interpret 3D data, typically represented by point clouds, meshes, or volumetric data, to accurately answer questions posed in natural language.

The key difference between 3D VQA and its 2D counterpart lies in the dimensionality and complexity of the input data. Unlike 2D images, 3D scenes capture richer spatial information, depth cues, and object geometry. This requires the system to effectively understand and reason about spatial relationships, object interactions, and structural attributes within a 3D context. Questions in 3D VQA can range from object recognition and counting to more complex reasoning tasks such as identifying spatial arrangements, distances, or interactions between objects.

Recent advancements in deep learning, especially in 3D neural networks and natural language processing models, have enabled the development of sophisticated systems capable of tackling 3D VQA tasks. Applications of 3D VQA include autonomous navigation, augmented and virtual reality, robotics, and human-computer interaction, where understanding 3D environments and responding to user queries is crucial. This article delves deeper into 3D VQA, covering the task definition, the commonly used framework, existing methods, and datasets.

Method Overview

Task definition

3D VQA requires the model to predict an answer phrase for a given question about a 3D scene. The input consists of a 3D point cloud, which represents the geometric and appearance characteristics of objects in the scene through point coordinates and optional features such as colors and normal vectors, together with a question about the scene or about local objects in it.

The point cloud input can be represented as a matrix $P \in \mathbb{R}^{N \times F}$, where $N$ denotes the number of points and $F$ is the dimensionality of the point features, including the point coordinates $(x, y, z)$ and any additional features. The final output is an answer to the given question. To better support reasoning and explainability, the output may also contain the IDs of the answer-related objects (in the ScanNet benchmark classes) together with their bounding boxes, which prevents the method from focusing on only one object in the answer.
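To make this input and output format concrete, the sketch below shows one way a single 3D VQA sample could be organized in Python/NumPy. The class name VQASample3D and its fields are purely illustrative and are not taken from any specific dataset loader.

```python
# Minimal sketch of one 3D VQA sample; names and shapes are illustrative.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class VQASample3D:
    # Point cloud P with shape (N, F): xyz coordinates plus optional
    # extras such as RGB color or normal vectors.
    point_cloud: np.ndarray                  # e.g. shape (40000, 6) for xyz + rgb
    question: str                            # natural-language question about the scene
    answers: List[str]                       # ground-truth answer phrase(s)
    object_ids: List[int] = field(default_factory=list)          # ScanNet object ids referenced by the answer
    object_bboxes: List[np.ndarray] = field(default_factory=list)  # boxes, e.g. (cx, cy, cz, dx, dy, dz)

sample = VQASample3D(
    point_cloud=np.random.rand(40000, 6).astype(np.float32),
    question="What color is the chair next to the desk?",
    answers=["brown"],
)
print(sample.point_cloud.shape, sample.question)
```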

Architecture

Many works follow the ScanQA model introduced by Azuma et al., a baseline model for the 3D VQA task. Given the point cloud $P$ and a question $q$, the model outputs an answer that matches the ground truth.

Encoder: The encoder takes the raw input and outputs its features. In the case of 3D VQA, a scene encoder and a text encoder are used.

The scene encoder extracts scene details, such as object-level features, from the scene point cloud. Various 3D backbones and detectors can be used, such as PointNet, PointNet++, PointGroup, VoteNet, and 3DETR, and they achieve notable performance. The output of the scene encoder is the object feature $V \in \mathbb{R}^{n_v \times d}$, where $n_v$ is the number of proposals (set to 256 by default) and $d$ is the feature dimension.
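As a rough illustration of the scene encoder interface, the toy PyTorch module below maps a point cloud of shape (B, N, F) to proposal features V of shape (B, n_v, d). It is a minimal sketch, not a real detector: the strided subsampling stands in for the voting and grouping that a backbone such as VoteNet would perform, and the name ToySceneEncoder is hypothetical.

```python
import torch
import torch.nn as nn

class ToySceneEncoder(nn.Module):
    """Stand-in for a 3D backbone/detector (e.g. VoteNet or PointNet++).

    Maps a point cloud (B, N, F) to object proposal features V of shape
    (B, n_v, d); a real detector would also predict proposal boxes.
    """
    def __init__(self, in_dim: int = 6, d: int = 256, n_proposals: int = 256):
        super().__init__()
        self.n_proposals = n_proposals
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, d), nn.ReLU(),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        x = self.point_mlp(points)                              # (B, N, d) per-point features
        # Crude proposal selection: strided subsampling of points.
        # A real backbone would use furthest point sampling, voting, and grouping.
        stride = max(points.shape[1] // self.n_proposals, 1)
        proposals = x[:, ::stride, :][:, :self.n_proposals, :]  # (B, n_v, d)
        return proposals

V = ToySceneEncoder()(torch.rand(2, 40000, 6))
print(V.shape)  # torch.Size([2, 256, 256])
```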

The text encoder encodes the question into a sequence of word features. Various text encoders can be used, such as sequence models (RNN, LSTM, BiLSTM) or pre-trained encoders (T5, BERT, CLIP, ...). The output of the text encoder is the word feature $Q' \in \mathbb{R}^{n_q \times d}$, where $n_q$ is the length of the question.
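A correspondingly simple text encoder sketch follows, assuming a BiLSTM over token ids; any pre-trained encoder that returns one d-dimensional feature per token could take its place. The module name ToyQuestionEncoder is again illustrative.

```python
import torch
import torch.nn as nn

class ToyQuestionEncoder(nn.Module):
    """Simple BiLSTM question encoder: token ids -> word features Q' (B, n_q, d)."""
    def __init__(self, vocab_size: int = 10000, d: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        # Hidden size d // 2 per direction so the concatenated output has dimension d.
        self.lstm = nn.LSTM(300, d // 2, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)   # (B, n_q, 300) word embeddings
        out, _ = self.lstm(x)       # (B, n_q, d) after concatenating both directions
        return out

Q = ToyQuestionEncoder()(torch.randint(0, 10000, (2, 14)))
print(Q.shape)  # torch.Size([2, 14, 256])
```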

Fusion module: Attention mechanisms (self-attention and cross-attention layers) can be used to represent the relationships between object proposals and between question words. The self-attention layers extract high-level semantic information and encode rich relational information (word-to-word relations for the question and object-to-object relations for the point cloud), while the cross-attention layers model the complex interactions between the two modalities and exchange word-object information. The final output of the fusion module is the fused feature vector $f \in \mathbb{R}^d$.
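This fusion step can be sketched with standard PyTorch attention layers: self-attention within each modality, cross-attention from object proposals to question words, and mean pooling to obtain $f$. The code below is a minimal illustration of the idea under those assumptions, not the exact ScanQA fusion module.

```python
import torch
import torch.nn as nn

class ToyFusion(nn.Module):
    """Self- plus cross-attention fusion of object features V and word features Q'."""
    def __init__(self, d: int = 256, n_heads: int = 8):
        super().__init__()
        self.obj_self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.word_self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, V: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
        # Intra-modal relations: object-to-object and word-to-word.
        V, _ = self.obj_self_attn(V, V, V)
        Q, _ = self.word_self_attn(Q, Q, Q)
        # Inter-modal relations: object proposals attend to the question words.
        fused, _ = self.cross_attn(V, Q, Q)  # (B, n_v, d)
        return fused.mean(dim=1)             # pool proposals -> f with shape (B, d)

f = ToyFusion()(torch.rand(2, 256, 256), torch.rand(2, 14, 256))
print(f.shape)  # torch.Size([2, 256])
```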

Task-specific layers: The fused feature $f$ can be used for various tasks, such as object detection, object classification, or answer classification.
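A sketch of such task-specific heads is shown below, treating answering as classification over a fixed answer vocabulary with an auxiliary object-class head; the vocabulary size and class count are illustrative placeholders.

```python
import torch
import torch.nn as nn

class ToyHeads(nn.Module):
    """Task-specific heads on top of the fused feature f (sizes are illustrative)."""
    def __init__(self, d: int = 256, n_answers: int = 5000, n_obj_classes: int = 18):
        super().__init__()
        self.answer_head = nn.Linear(d, n_answers)      # answer classification logits
        self.object_head = nn.Linear(d, n_obj_classes)  # auxiliary object classification logits

    def forward(self, f: torch.Tensor):
        return self.answer_head(f), self.object_head(f)

answer_logits, object_logits = ToyHeads()(torch.rand(2, 256))
print(answer_logits.shape, object_logits.shape)
```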

Figure 1: Overview of the ScanQA model. It consists of two encoders to extract features from the point cloud and the text question, a fusion module to model the interaction between the different modalities (text and 3D data), and task-specific layers to produce the final outputs.

In the next part, we will discuss state-of-the-art methods, datasets, and evaluation metrics used in 3D VQA.