In this part, we cover the implementation details of the Minkowski Engine and the corresponding experimental results.
Minkowski Engine
Minkowski Engine is an open-source auto-differentiation library for sparse tensors and the generalized sparse convolution. It is an extensive library with many functions, the most important of which are listed below.
1. Sparse Tensor Quantization:
Convert an input into unique coordinates, associated features, and optionally labels using GPU functions.
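The engine performs this with GPU functions, but the idea can be sketched on the CPU. Below is a minimal NumPy illustration, assuming a simple first-point-wins policy per voxel; the helper name `quantize` and that policy are our simplifications, not the library's actual implementation.

```python
import numpy as np

def quantize(coords, feats, voxel_size=0.05):
    """Map continuous coordinates to unique integer voxel coordinates.

    Keeps the feature of the first point that falls into each voxel.
    """
    grid = np.floor(coords / voxel_size).astype(np.int32)  # discretize
    unique, idx = np.unique(grid, axis=0, return_index=True)
    return unique, feats[idx]

# Toy point cloud: 5 points in 3D with 3 feature channels (e.g. RGB).
coords = np.random.rand(5, 3)
feats = np.random.rand(5, 3)
voxels, voxel_feats = quantize(coords, feats)
```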
2. Generalized Sparse Convolution:
Generate the output coordinates dynamically, which allows arbitrary output coordinates given the input coordinates for the generalized sparse convolution.
Create a kernel map for convolving the input with the kernel. The kernel map identifies which inputs affect which outputs and is defined as pairs of lists of integers: the in map and the out map. An integer in the in map indicates the row index of the coordinate matrix or the feature matrix of the input sparse tensor; similarly, an integer in the out map indicates the row index of the coordinate matrix of the output sparse tensor.
For example, a $3\times 3$ kernel requires 9 kernel maps, one per kernel offset. Due to the sparsity of the tensor, some kernel maps have no elements. The extracted kernel maps for a small example:
- (Figure: the extracted kernel maps of the example, one list of in/out index pairs per non-empty kernel offset)
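To make the in map / out map pairs concrete, here is a small pure-Python sketch for a $3\times 3$ kernel on a 2D sparse tensor. It assumes stride 1, so the output coordinates equal the input coordinates; the helper name `kernel_maps` is ours, not the library's API.

```python
from itertools import product

def kernel_maps(in_coords, kernel_size=3):
    """Build the (in map, out map) integer lists for each kernel offset.

    Assumes stride 1, so the output coordinates equal the input
    coordinates. `in_coords` is a list of integer (x, y) tuples.
    """
    index = {c: i for i, c in enumerate(in_coords)}   # coordinate -> row
    r = kernel_size // 2
    maps = {}
    for off in product(range(-r, r + 1), repeat=2):   # the 9 offsets
        in_map, out_map = [], []
        for out_row, (x, y) in enumerate(in_coords):
            nbr = (x + off[0], y + off[1])
            if nbr in index:              # an input exists at this offset
                in_map.append(index[nbr])
                out_map.append(out_row)
        maps[off] = (in_map, out_map)     # may be empty due to sparsity
    return maps

# Three non-zero elements; most of the 9 kernel maps come out empty.
for off, (im, om) in kernel_maps([(0, 0), (0, 1), (2, 2)]).items():
    print(off, im, om)
```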
3. Max Pooling and Global Pooling
The max pooling layer selects the maximum element within a region for each channel. For a sparse tensor input, we define it as

$$x^{\text{out}}_{\mathbf{u},k} = \max_{\mathbf{v} \in \mathcal{N}(\mathbf{u},\, \mathcal{C}^{\text{in}})} x^{\text{in}}_{\mathbf{v},k}$$

where $x^{\text{in}}_{\mathbf{v},k}$ indicates the $k$-th channel feature value at $\mathbf{v}$. The region to pool features from is $\mathcal{N}(\mathbf{u}, \mathcal{C}^{\text{in}})$, the set of non-zero input coordinates that fall within the kernel region centered at the output coordinate $\mathbf{u}$. Global pooling is similar to max pooling except that features from all non-zero elements in the sparse tensor are pooled:

$$x^{\text{out}}_{k} = \max_{\mathbf{v} \in \mathcal{C}^{\text{in}}} x^{\text{in}}_{\mathbf{v},k}$$
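Given a kernel map (the in map / out map lists from above), both poolings reduce to per-channel maxima over the feature matrix. A minimal PyTorch sketch, with illustrative function names rather than the library's layers:

```python
import torch

def sparse_max_pool(feats, in_map, out_map, num_out):
    """out[j, k] = max over all i with (i, j) in the kernel map."""
    out = torch.full((num_out, feats.shape[1]), float("-inf"))
    for i, j in zip(in_map, out_map):
        out[j] = torch.maximum(out[j], feats[i])
    return out

def global_max_pool(feats):
    """Pool features from all non-zero elements, per channel."""
    return feats.max(dim=0).values
```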
4. Normalization
First, instance normalization computes batch-wise statistics and whitens the features batch-wise. The mean and standard deviation are

$$\mu_{b,k} = \frac{1}{|\mathcal{C}_b|} \sum_{\mathbf{u} \in \mathcal{C}_b} x_{b,\mathbf{u},k}, \qquad \sigma_{b,k} = \sqrt{\frac{1}{|\mathcal{C}_b|} \sum_{\mathbf{u} \in \mathcal{C}_b} \left(x_{b,\mathbf{u},k} - \mu_{b,k}\right)^2}$$

where $x_{b,\mathbf{u},k}$ indicates the $k$-th channel feature at the coordinate $\mathbf{u}$ with batch index $b$, $\mathcal{C}_b$ is the set of non-zero element coordinates in the $b$-th batch, $\mu_{b,k}$ indicates the $b$-th batch's batch-wise feature mean of the $k$-th channel, and $\sigma_{b,k}$ is the $k$-th feature channel standard deviation of the $b$-th batch.
Batch normalization is similar to instance normalization except that it computes the statistics over all batches at once, i.e., over the set $\mathcal{C}$ of all non-zero element coordinates in the batch:

$$\mu_{k} = \frac{1}{|\mathcal{C}|} \sum_{\mathbf{u} \in \mathcal{C}} x_{\mathbf{u},k}, \qquad \sigma_{k} = \sqrt{\frac{1}{|\mathcal{C}|} \sum_{\mathbf{u} \in \mathcal{C}} \left(x_{\mathbf{u},k} - \mu_{k}\right)^2}$$
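Both normalizations act only on the feature matrix. A minimal PyTorch sketch, assuming a `batch_idx` vector that stores the batch index of each non-zero element and omitting the affine parameters and running statistics of the real layers:

```python
import torch

def sparse_instance_norm(feats, batch_idx, eps=1e-5):
    """Whiten features per batch element, using statistics over the
    non-zero coordinates of that batch element only."""
    out = torch.empty_like(feats)
    for b in batch_idx.unique():
        mask = batch_idx == b
        mu = feats[mask].mean(dim=0)                     # mu_{b,k}
        sigma = feats[mask].std(dim=0, unbiased=False)   # sigma_{b,k}
        out[mask] = (feats[mask] - mu) / (sigma + eps)
    return out

def sparse_batch_norm(feats, eps=1e-5):
    """Whiten features using statistics over the whole batch."""
    mu = feats.mean(dim=0)
    sigma = feats.std(dim=0, unbiased=False)
    return (feats - mu) / (sigma + eps)
```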
5. Non-linearity Layers
Most of the commonly used non-linearity functions are applied independently element-wise, so they act only on the feature matrix and leave the coordinates unchanged. The element-wise function $f(\cdot)$ can be a rectified linear unit (ReLU), leaky ReLU, ELU, SELU, etc.:

$$x^{\text{out}}_{\mathbf{u},k} = f\!\left(x^{\text{in}}_{\mathbf{u},k}\right)$$
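Because these functions touch only the feature matrix and leave the sparsity pattern unchanged, a sketch needs nothing beyond plain PyTorch:

```python
import torch
import torch.nn.functional as F

feats = torch.randn(4, 8)   # 4 non-zero elements, 8 channels
relu_out = F.relu(feats)                              # ReLU
leaky_out = F.leaky_relu(feats, negative_slope=0.01)  # leaky ReLU
```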
Minkowski Convolutional Neural Networks
Problems of high-dimensional convolutions:
- Computational cost and memory consumption increase exponentially with the dimension, and the increase does not necessarily lead to better performance.
- With a conventional cross-entropy loss alone, the networks have no incentive to make predictions consistent across space and time.
Hybrid kernel: The hybrid kernel is a combination of a cross-shaped kernel and a conventional cubic kernel.
Spatial dimensions: Use cubic kernel to capture the spatial geometry accurately.
Temporal dimension: Use cross-shaped kernel to connect the same point in space across time.
The hybrid kernel experimentally outperforms the tesseract kernel both in speed and accuracy.
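As an illustration of why the hybrid kernel is cheaper, the sketch below enumerates its offsets for kernel size 3 in 4D: a full cubic neighborhood over the three spatial dimensions at the current time step, plus a cross along the temporal axis at the spatial center. The helper name is ours, not MinkowskiEngine's API.

```python
from itertools import product

def hybrid_kernel_offsets(spatial_size=3, temporal_size=3):
    """Offsets (x, y, z, t) of a 4D hybrid kernel."""
    r, rt = spatial_size // 2, temporal_size // 2
    # Cubic part: full 3x3x3 spatial neighborhood at the current frame.
    offsets = [(x, y, z, 0) for x, y, z in product(range(-r, r + 1), repeat=3)]
    # Cross part: the same spatial point in neighboring frames.
    offsets += [(0, 0, 0, t) for t in range(-rt, rt + 1) if t != 0]
    return offsets

print(len(hybrid_kernel_offsets()))  # 27 + 2 = 29 offsets vs. 81 for a tesseract
```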
Experiment
ScanNet: The ScanNet 3D segmentation benchmark consists of 3D reconstructions of real rooms. It contains 1,500 rooms, some of which are repeated rooms captured with different sensors. They feed an entire room to a MinkowskiNet fully convolutionally, without cropping.
Visualization: (figure: 3D input point cloud, predictions, and ground truth)
Synthia 4D: They use the Synthia dataset to create 3D video sequences of driving. Each sequence consists of 4 stereo RGB-D images taken from the top of a car. They back-project the depth images into 3D space to create 3D videos. The Synthia 4D dataset has an order of magnitude more 3D scans than the Synthia 3D dataset.
Visualization: (figure: predictions of the 3D network vs. the 4D network)
Stanford 3D Indoor: The ScanNet and Stanford Indoor datasets are among the largest non-synthetic datasets, which makes them ideal test beds for 3D segmentation. They report higher mIoU on both ScanNet and Stanford than the original works.
Visualization: (figure: RGB input, predictions, and ground truth)