Minkowski Engine - part 2

4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks - Experimental results
type: Optimization · level: medium

In this part, we will walk through the implementation details of the Minkowski Engine and the corresponding experimental results.

Minkowski Engine

Minkowski Engine is an open-source auto-differentiation library for sparse tensors and the generalized sparse convolution. It is an extensive library with many functions:

1. Sparse Tensor Quantization:

Convert an input into unique coordinates, associated features, and optionally labels using GPU functions.
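
As a concrete illustration, here is a minimal NumPy sketch of the quantization step (the library itself provides this as `ME.utils.sparse_quantize`, with GPU support). The voxel size and the keep-first-point rule for features are assumptions made for this sketch, not necessarily the library's exact behavior:

```python
import numpy as np

def sparse_quantize(coords, feats, voxel_size=0.05):
    """Quantize continuous coordinates into unique voxel coordinates.

    coords: (N, D) float array of point coordinates.
    feats:  (N, C) float array of per-point features.
    Keeps the feature of the first point that falls into each voxel.
    """
    grid = np.floor(coords / voxel_size).astype(np.int32)  # discretize
    unique_grid, idx = np.unique(grid, axis=0, return_index=True)
    return unique_grid, feats[idx]

coords = np.random.rand(1000, 3)   # e.g. a 3D point cloud
feats = np.random.rand(1000, 4)    # e.g. per-point RGB + intensity
uc, uf = sparse_quantize(coords, feats)
print(uc.shape, uf.shape)          # unique voxels and their features
```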

2. Generalized Sparse Convolution:

  • Generate the output coordinates $\mathcal{C}^{out}$ dynamically for the generalized sparse convolution, allowing an arbitrary set of output coordinates given the input coordinates $\mathcal{C}^{in}$.

  • Create a kernel map for convolving the input with the kernel. The kernel map identifies which inputs affect which outputs, and is defined as pairs of lists of integers: the in map $\textbf{I}$ and the out map $\textbf{O}$. An integer $i \in \textbf{I}$ indicates the row index of the coordinate matrix or the feature matrix of the input sparse tensor. Similarly, an integer $o \in \textbf{O}$ indicates the row index of the coordinate matrix of the output sparse tensor.


Figure 2: Convolution Kernel Map.

For example, a $3\times 3$ kernel requires 9 kernel maps, one per kernel offset. Due to the sparsity of the tensor, some kernel maps have no elements. From Figure 2, the following kernel maps are extracted (see the sketch after this list):

  • Kernel map $B: 1 \mapsto 0$
  • Kernel map $B: 0 \mapsto 2$
  • Kernel map $H: 2 \mapsto 3$
  • Kernel map $I: 0 \mapsto 0$
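
To make the in/out maps concrete, here is a plain-Python sketch of how such kernel maps can be built; the coordinates and offsets below are made-up examples, and the actual engine performs this lookup with hashed coordinate indices:

```python
def kernel_maps(in_coords, out_coords, offsets):
    """Build one (in map, out map) pair per kernel offset.

    in_coords / out_coords: lists of integer coordinate tuples;
    a list position is the row index of the corresponding
    coordinate/feature matrix.
    """
    in_index = {c: i for i, c in enumerate(in_coords)}
    maps = {}
    for k in offsets:
        I, O = [], []
        for o, u in enumerate(out_coords):
            # The input at u + k contributes to the output at u.
            src = tuple(ui + ki for ui, ki in zip(u, k))
            if src in in_index:
                I.append(in_index[src])
                O.append(o)
        maps[k] = (I, O)  # empty lists = an empty kernel map
    return maps

# 3x3 kernel in 2D: 9 offsets, one kernel map each.
offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
in_coords = [(0, 0), (2, 1), (3, 3)]    # made-up input coordinates
out_coords = [(1, 1), (2, 2), (3, 2)]   # made-up output coordinates
for k, (I, O) in kernel_maps(in_coords, out_coords, offsets).items():
    if I:
        print(k, "->", list(zip(I, O)))
```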

3. Max Pooling and Global Pooling

The max pooling layer selects the maximum element within a region for each channel. For a sparse tensor input, it is defined as

$$x_{\textbf{u},i}^{\text{out}} = \max_{\textbf{k} \in \mathcal{N}^D(\textbf{u}) \cap \mathcal{C}^{in}} x_{\textbf{u}+\textbf{k},i}^{\text{in}}$$

where $x_{\textbf{u},i}$ indicates the $i$-th channel feature value at $\textbf{u}$. The region to pool features from is defined as $\mathcal{N}^D(\textbf{u}) \cap \mathcal{C}^{in}$. Global pooling is similar to max pooling except that features from all non-zero elements in the sparse tensor are pooled:

$$x_{i}^{\text{out}} = \max_{\textbf{u} \in \mathcal{C}^{in}} \ x_{\textbf{u},i}^{\text{in}}$$
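
Both pooling operations in a compact NumPy sketch, assuming kernel maps like the ones above have already been computed (the library wraps these as `MinkowskiMaxPooling` and `MinkowskiGlobalMaxPooling`); the data layout is an assumption for illustration:

```python
import numpy as np

def sparse_max_pool(feats, kernel_maps, n_out):
    """Max-pool a sparse tensor given precomputed kernel maps.

    feats: (N_in, C) feature matrix.
    kernel_maps: iterable of (I, O) pairs, one per kernel offset.
    Outputs with no contributing input stay at -inf.
    """
    out = np.full((n_out, feats.shape[1]), -np.inf)
    for I, O in kernel_maps:
        for i, o in zip(I, O):
            out[o] = np.maximum(out[o], feats[i])
    return out

def global_max_pool(feats):
    """Global pooling: reduce over all non-zero elements per channel."""
    return feats.max(axis=0)
```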

4. Normalization

Instance normalization computes statistics for each batch element separately and whitens its features. The mean and standard deviation are:

$$\mu_b = \dfrac{1}{|\mathcal{C}^{in}_b|} \sum_{\textbf{u} \in \mathcal{C}^{in}_b} x_{\textbf{u},b}^{\text{in}}$$
$$\sigma_{bi}^2 = \dfrac{1}{|\mathcal{C}^{in}_b|} \sum_{\textbf{u} \in \mathcal{C}^{in}_b} (x_{\textbf{u},bi}^{\text{in}} - \mu_{bi})^2$$

where $x_{\textbf{u},bi}$ indicates the $i$-th channel feature at the coordinate $\textbf{u}$ with batch index $b$. $\mathcal{C}^{in}_b$ is the set of non-zero element coordinates in the $b$-th batch. $\mu_b$ is the feature mean of the $b$-th batch ($\mu_{bi}$ denotes its $i$-th channel), and $\sigma_{bi}$ is the standard deviation of the $i$-th feature channel of the $b$-th batch.

$$x_{\textbf{u},bi}^{\text{out}} = \dfrac{x_{\textbf{u},bi}^{\text{in}} - \mu_{bi}}{\sqrt{\sigma_{bi}^2 + \epsilon}}$$

Batch normalization is similar to instance normalization except that it computes statistics over the whole batch:

$$\mu = \dfrac{1}{|\mathcal{C}^{in}|} \sum_{\textbf{u} \in \mathcal{C}^{in}} x_{\textbf{u}}^{\text{in}}$$
$$\sigma_{i}^2 = \dfrac{1}{|\mathcal{C}^{in}|} \sum_{\textbf{u} \in \mathcal{C}^{in}} (x_{\textbf{u},i}^{\text{in}} - \mu_{i})^2$$
$$x_{\textbf{u},i}^{\text{out}} = \dfrac{x_{\textbf{u},i}^{\text{in}} - \mu_{i}}{\sqrt{\sigma_{i}^2 + \epsilon}}$$
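
The same computations in a short NumPy sketch; the `(features, batch indices)` layout is an assumption for illustration, while the library itself provides `MinkowskiInstanceNorm` and `MinkowskiBatchNorm`:

```python
import numpy as np

def sparse_instance_norm(feats, batch_indices, eps=1e-5):
    """Instance normalization for a sparse tensor.

    feats: (N, C) features; batch_indices: (N,) batch index of each
    non-zero element. Statistics are computed per batch element, per
    channel, over the non-zero elements only.
    """
    out = np.empty_like(feats)
    for b in np.unique(batch_indices):
        mask = batch_indices == b
        mu = feats[mask].mean(axis=0)   # per-channel mean over C_b^in
        var = feats[mask].var(axis=0)   # per-channel variance
        out[mask] = (feats[mask] - mu) / np.sqrt(var + eps)
    return out

def sparse_batch_norm(feats, eps=1e-5):
    """Batch normalization: statistics over all non-zero elements."""
    mu, var = feats.mean(axis=0), feats.var(axis=0)
    return (feats - mu) / np.sqrt(var + eps)
```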

5. Non-linearity Layers

Most of the commonly used non-linearities are applied independently, element-wise. The element-wise function $f(\cdot)$ can be a rectified linear unit (ReLU), leaky ReLU, ELU, SELU, etc.:

$$x_{\textbf{u},i}^{\text{out}} = f(x_{\textbf{u},i}^{\text{in}}) \quad \text{for } \textbf{u} \in \mathcal{C}$$
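
The practical consequence for sparse tensors is that a non-linearity transforms only the feature matrix and leaves the coordinates untouched, as in this minimal sketch (the library exposes wrappers such as `MinkowskiReLU`):

```python
import numpy as np

def sparse_relu(coords, feats):
    """Element-wise non-linearity on a sparse tensor: only the
    feature matrix changes; the coordinates pass through untouched."""
    return coords, np.maximum(feats, 0.0)
```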

Minkowski Convolutional Neural Networks

Problems of high-dimensional convolutions:

  • Computational cost and memory consumption increase exponentially with the dimension, and the increase does not necessarily lead to better performance.
  • With a conventional cross-entropy loss alone, the network has no incentive to make predictions consistent across space and time.

Hybrid kernel: The hybrid kernel is a combination of a cross-shaped kernel and a conventional cubic kernel.

  • Spatial dimensions: Use a cubic kernel to capture the spatial geometry accurately.

  • Temporal dimension: Use a cross-shaped kernel to connect the same point in space across time.

The hybrid kernel experimentally outperforms the tesseract kernel in both speed and accuracy (Figure 1).

Figure 1: Various kernels in space-time. The red arrow indicates the temporal dimension and the other two axes are the spatial dimensions.
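
A short sketch of how the hybrid kernel's offset set could be enumerated; the function name and parameters are hypothetical, but the count at the end shows why the hybrid kernel is cheaper than the tesseract kernel:

```python
from itertools import product

def hybrid_kernel_offsets(spatial_size=3, temporal_size=3, spatial_dims=3):
    """Offsets of a hybrid space-time kernel.

    Cubic (dense) in the spatial dimensions, cross-shaped in time:
    temporal offsets are taken only at the spatial center, so the
    kernel connects the same spatial location across time.
    """
    r = spatial_size // 2
    # Cubic kernel over space at the current time step (dt = 0).
    offsets = [s + (0,) for s in product(range(-r, r + 1), repeat=spatial_dims)]
    # Cross-shaped part: step through time at the spatial center only.
    t = temporal_size // 2
    offsets += [(0,) * spatial_dims + (dt,)
                for dt in range(-t, t + 1) if dt != 0]
    return offsets

offs = hybrid_kernel_offsets()
print(len(offs))   # 27 spatial + 2 temporal = 29, vs 81 for a 3^4 tesseract
```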

Experiment

ScanNet: The ScanNet 3D segmentation benchmark consists of 3D reconstructions of real rooms. It contains 1,500 rooms, some of which are repeated rooms captured with different sensors. They feed an entire room to a MinkowskiNet fully convolutionally, without cropping.


Figure 2: 3D Semantic Label Benchmark on ScanNet.

Visualization: 3D input point cloud | Predictions | Ground truth

Synthia 4D: They use the Synthia dataset to create 3D video sequences of driving. Each sequence consists of 4 stereo RGB-D images taken from the top of a car. They back-project the depth images into 3D space to create 3D videos (a sketch of this step follows below). The Synthia 4D dataset has an order of magnitude more 3D scans than the Synthia 3D dataset.
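
A minimal sketch of the back-projection step, assuming a standard pinhole camera model; the intrinsics `fx, fy, cx, cy` are placeholders (Synthia ships its own calibration):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project a depth image into a 3D point cloud (camera frame).

    depth: (H, W) depth map in meters.
    fx, fy, cx, cy: pinhole camera intrinsics.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grid
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```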


Figure 3: Segmentation results on the 4D Synthia dataset without noise addition for the input point cloud.

Visualization: 3D network | 4D network

Stanford 3D Indoor: The ScanNet and the Stanford Indoor datasets are among the largest non-synthetic datasets, which makes them ideal test beds for 3D segmentation. They achieved +19% mIoU on ScanNet and +7% on Stanford compared with the original works.


Figure 4: Stanford Area 5 Test.

Visualization: RGB input | Predictions | Ground truth

References

[1] High-dimensional Convolutional Neural Networks for 3D Perception. Stanford University. Chapter 4: Sparse Tensor Networks.
[2] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. In CVPR, 2019.