Minkowski Engine - part 2

4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks - Experimental results
type: Optimization | level: medium

In this part, we cover the implementation details of the Minkowski Engine and the corresponding experimental results.

Minkowski Engine

Minkowski Engine is an open-source auto-differentiation library for sparse tensors and the generalized sparse convolution. It is an extensive library with many functions:

1. Sparse Tensor Quantization:

Convert an input into unique coordinates, associated features, and optionally labels using GPU functions.
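The quantization step can be sketched in plain NumPy. This is a minimal illustration of the idea, not the library's GPU implementation; the function name `sparse_quantize` and the choice of averaging duplicate features are assumptions for this sketch (the library exposes a similar utility and supports several reduction modes).

```python
import numpy as np

def sparse_quantize(coords, feats, voxel_size=1.0):
    """Quantize continuous coordinates into unique voxel coordinates.

    Points falling into the same voxel are collapsed and their
    features averaged (one common reduction choice).
    """
    # Discretize: floor-divide each coordinate by the voxel size.
    grid = np.floor(coords / voxel_size).astype(np.int32)
    # Find the unique voxel coordinates and a point -> voxel inverse map.
    unique, inverse = np.unique(grid, axis=0, return_inverse=True)
    # Average the features of all points that landed in the same voxel.
    out_feats = np.zeros((len(unique), feats.shape[1]))
    np.add.at(out_feats, inverse, feats)
    out_feats /= np.bincount(inverse)[:, None]
    return unique, out_feats

coords = np.array([[0.1, 0.2], [0.4, 0.3], [2.5, 2.6]])
feats = np.array([[1.0], [3.0], [5.0]])
c, f = sparse_quantize(coords, feats, voxel_size=1.0)
print(c)  # two unique voxels: (0, 0) and (2, 2)
print(f)  # averaged features: [[2.], [5.]]
```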

2. Generalized Sparse Convolution:

  • Dynamically generate the output coordinates $\mathcal{C}^{out}$ given the input coordinates $\mathcal{C}^{in}$, allowing an arbitrary set of output coordinates for the generalized sparse convolution.

  • Create a kernel map for convolving the input with the kernel. The kernel map identifies which inputs affect which outputs, and is defined as pairs of lists of integers: the in map $\textbf{I}$ and the out map $\textbf{O}$. An integer $i \in \textbf{I}$ indicates the row index of the coordinate matrix or the feature matrix of the input sparse tensor. Similarly, an integer $o \in \textbf{O}$ indicates the row index of the coordinate matrix of the output sparse tensor.


Figure 2: Convolution Kernel Map.

For example, a $3\times 3$ kernel requires 9 kernel maps. Due to the sparsity of the tensor, some kernel maps have no elements. The extracted kernel maps are:

  • Kernel map $B: 1 \mapsto 0$
  • Kernel map $B: 0 \mapsto 2$
  • Kernel map $H: 2 \mapsto 3$
  • Kernel map $I: 0 \mapsto 0$
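The kernel-map construction described above can be sketched as follows. This is an illustrative CPU version under simplifying assumptions (the engine builds these maps with hashed coordinate lookups on GPU); the function name `kernel_maps` is made up for this sketch.

```python
import numpy as np

def kernel_maps(in_coords, out_coords, offsets):
    """Build (in map, out map) index-list pairs, one per kernel offset.

    For kernel offset k, input coordinate u contributes to output
    coordinate u + k whenever both exist in their coordinate sets.
    Offsets with no matching pairs are left out (empty kernel maps).
    """
    out_index = {tuple(c): i for i, c in enumerate(out_coords)}
    maps = {}
    for k, off in enumerate(offsets):
        in_map, out_map = [], []
        for i, u in enumerate(in_coords):
            o = out_index.get(tuple(u + off))
            if o is not None:
                in_map.append(i)   # row in the input feature matrix
                out_map.append(o)  # row in the output feature matrix
        if in_map:  # sparsity leaves many offsets empty
            maps[k] = (in_map, out_map)
    return maps

# 1D example: inputs at coordinates {0, 2}, outputs at {0, 1, 2}
maps = kernel_maps(np.array([[0], [2]]),
                   np.array([[0], [1], [2]]),
                   np.array([[-1], [0], [1]]))
print(maps[1])  # center offset maps input rows [0, 1] to output rows [0, 2]
```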

3. Max Pooling and Global Pooling

The max pooling layer selects the maximum element within a region for each channel. For a sparse tensor input, it is defined as

$$x_{\textbf{u},i}^{\text{out}} = \max_{\textbf{k} \in \mathcal{N}^D(\textbf{u}) \cap \mathcal{C}^{in}} x_{\textbf{u}+\textbf{k},i}^{\text{in}}$$

where $x_{\textbf{u},i}$ denotes the $i$-th channel feature value at $\textbf{u}$. The region to pool features from is defined as $\mathcal{N}^D(\textbf{u}) \cap \mathcal{C}^{in}$. Global pooling is similar to max pooling except that features from all non-zero elements in the sparse tensor are pooled:

$$x_{i}^{\text{out}} = \max_{\textbf{k} \in \mathcal{C}^{in}} \ x_{\textbf{k},i}^{\text{in}}$$
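Both pooling variants can be sketched directly from the definitions above. This is a toy NumPy version, assuming for simplicity that the output coordinates coincide with the input coordinates (stride 1); the real engine reuses the kernel maps for this.

```python
import numpy as np

def sparse_max_pool(coords, feats, offsets):
    """Max-pool each site over its non-empty neighbors N^D(u) ∩ C^in."""
    index = {tuple(c): i for i, c in enumerate(coords)}
    out = np.full_like(feats, -np.inf, dtype=float)
    for i, u in enumerate(coords):
        for off in offsets:
            j = index.get(tuple(u + off))
            if j is not None:  # only non-zero elements participate
                out[i] = np.maximum(out[i], feats[j])
    return out

def global_max_pool(feats):
    """Pool over ALL non-zero elements: one value per channel."""
    return feats.max(axis=0)

# 1D example: sites at {0, 1, 3}, pooling window {-1, 0, 1}
coords = np.array([[0], [1], [3]])
feats = np.array([[1.0], [5.0], [2.0]])
print(sparse_max_pool(coords, feats, np.array([[-1], [0], [1]])))
print(global_max_pool(feats))  # [5.]
```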

4. Normalization

First, instance normalization computes batch-wise statistics and whitens features batch-wise. The mean and standard deviation are:

$$\mu_b = \dfrac{1}{|\mathcal{C}^{in}_b|} \sum_{\textbf{u} \in \mathcal{C}^{in}_b} x_{\textbf{u},b}^{\text{in}}$$
$$\sigma_{bi}^2 = \dfrac{1}{|\mathcal{C}^{in}_b|} \sum_{\textbf{u} \in \mathcal{C}^{in}_b} (x_{\textbf{u},bi}^{\text{in}} - \mu_{bi})^2$$

where $x_{\textbf{u},bi}$ indicates the $i$-th channel feature at the coordinate $\textbf{u}$ with batch index $b$. $\mathcal{C}^{in}_b$ is the set of non-zero element coordinates in the $b$-th batch. $\mu_b$ denotes the batch-wise feature mean of the $b$-th batch and $\sigma_{bi}$ is the standard deviation of the $i$-th feature channel of the $b$-th batch.

$$x_{\textbf{u},bi}^{\text{out}} = \dfrac{x_{\textbf{u},bi}^{\text{in}} - \mu_{bi}}{\sqrt{\sigma_{bi}^2 + \epsilon}}$$
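Instance normalization on a sparse tensor amounts to grouping the $(N, C)$ feature matrix by batch index and whitening each group per channel, exactly as in the equations above. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def sparse_instance_norm(feats, batch_idx, eps=1e-5):
    """Whiten features separately within each batch element.

    feats:     (N, C) features of all non-zero sites across the batch.
    batch_idx: (N,)   batch index b of each row.
    """
    out = np.empty_like(feats, dtype=float)
    for b in np.unique(batch_idx):
        mask = batch_idx == b
        mu = feats[mask].mean(axis=0)   # per-channel mean of batch b
        var = feats[mask].var(axis=0)   # per-channel variance of batch b
        out[mask] = (feats[mask] - mu) / np.sqrt(var + eps)
    return out
```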

Batch normalization is similar to instance normalization except that it computes statistics over the entire batch:

$$\mu = \dfrac{1}{|\mathcal{C}^{in}|} \sum_{\textbf{u} \in \mathcal{C}^{in}} x_{\textbf{u}}^{\text{in}}$$
$$\sigma_{i}^2 = \dfrac{1}{|\mathcal{C}^{in}|} \sum_{\textbf{u} \in \mathcal{C}^{in}} (x_{\textbf{u},i}^{\text{in}} - \mu_{i})^2$$
$$x_{\textbf{u},i}^{\text{out}} = \dfrac{x_{\textbf{u},i}^{\text{in}} - \mu_{i}}{\sqrt{\sigma_{i}^2 + \epsilon}}$$
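Because the statistics run over all non-zero sites regardless of batch index, sparse batch normalization is essentially standard 1D batch norm applied to the $(N, C)$ feature matrix. A sketch (inference-style, without learnable affine parameters):

```python
import numpy as np

def sparse_batch_norm(feats, eps=1e-5):
    """Batch norm on a sparse tensor: per-channel statistics over ALL
    non-zero sites in the batch, applied to the (N, C) feature matrix."""
    mu = feats.mean(axis=0)   # per-channel mean over every site
    var = feats.var(axis=0)   # per-channel variance over every site
    return (feats - mu) / np.sqrt(var + eps)
```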

5. Non-linearity Layers

Most commonly used non-linearity functions are applied independently, element-wise. The element-wise function $f(\cdot)$ can be a rectified linear unit (ReLU), leaky ReLU, ELU, SELU, etc.:

$$x_{\textbf{u},i}^{\text{out}} = f(x_{\textbf{u},i}^{\text{in}}) \ \text{for} \ \textbf{u} \in \mathcal{C}$$

Minkowski Convolutional Neural Networks

Problems of high-dimensional convolutions:

  • Computational cost and memory consumption increase exponentially as the dimension increases, and they do not necessarily lead to better performance.
  • With a conventional cross-entropy loss alone, the networks have no incentive to make predictions consistent throughout space and time.

Hybrid kernel: The hybrid kernel is a combination of a cross-shaped kernel and a conventional cubic kernel.

  • Spatial dimensions: Use a cubic kernel to capture the spatial geometry accurately.

  • Temporal dimension: Use a cross-shaped kernel to connect the same point in space across time.

The hybrid kernel experimentally outperforms the tesseract kernel in both speed and accuracy (Fig. 1).

Figure 1: Various kernels in space-time. The red arrow indicates the temporal dimension and the other two axes are the spatial dimensions.
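Enumerating the hybrid kernel's offsets makes the cost saving concrete: a full cubic kernel at temporal offset 0, plus cross arms that link only the same spatial location across time. A sketch under those assumptions (function name illustrative):

```python
import itertools
import numpy as np

def hybrid_kernel_offsets(spatial_dims=3, spatial_size=3, temporal_size=3):
    """Hybrid kernel: cubic in space, cross-shaped in time."""
    r = spatial_size // 2
    # Full cubic kernel in the spatial dimensions, at temporal offset 0.
    cube = itertools.product(range(-r, r + 1), repeat=spatial_dims)
    offsets = [s + (0,) for s in cube]
    # Cross arms: spatial offset 0, non-zero temporal offsets only.
    t_r = temporal_size // 2
    for t in range(-t_r, t_r + 1):
        if t != 0:
            offsets.append((0,) * spatial_dims + (t,))
    return np.array(offsets)

offs = hybrid_kernel_offsets()
print(len(offs))  # 3^3 cubic + 2 temporal arms = 29 (tesseract: 3^4 = 81)
```

With the default sizes, the hybrid kernel touches 29 offsets instead of the tesseract's 81, which is where the speed advantage comes from.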


ScanNet: The ScanNet 3D segmentation benchmark consists of 3D reconstructions of real rooms. It contains 1,500 rooms, some of which are repeated rooms captured with different sensors. They feed an entire room to a MinkowskiNet fully convolutionally, without cropping.


Figure 2: 3D Semantic Label Benchmark on ScanNet.

Visualization: 3D input point cloud | Predictions | Ground truth

Synthia 4D: They use the Synthia dataset to create 3D video sequences of driving. Each sequence consists of 4 stereo RGB-D images taken from the top of a car. They back-project the depth images into 3D space to create 3D videos. The Synthia 4D dataset has an order of magnitude more 3D scans than the Synthia 3D dataset.
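The back-projection step is the standard pinhole-camera inverse. A minimal sketch, assuming intrinsics $(f_x, f_y, c_x, c_y)$; the actual values must come from the dataset's calibration:

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth image to 3D points in camera space.

    Standard pinhole inverse: X = (u - cx) Z / fx, Y = (v - cy) Z / fy.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3) points
```

The resulting point clouds from consecutive frames, stacked with a time coordinate, form the 4D input to the network.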


Figure 3: Segmentation results on the 4D Synthia dataset without noise addition for the input point cloud.


3D network | 4D network

Stanford 3D Indoor: The ScanNet and Stanford Indoor datasets are among the largest non-synthetic datasets, which makes them ideal test beds for 3D segmentation. They achieve $+19\%$ mIoU on ScanNet and $+7\%$ on Stanford compared with the original works.


Figure 4: Stanford Area 5 Test.


RGB input | Predictions | Ground truth


[1] High-dimensional Convolutional Neural Networks for 3D Perception, Stanford University, Chapter 4: Sparse Tensor Networks.
[2] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. In CVPR, 2019.