Populating 3D Scenes by Learning Human Scene Interaction (Part 2).

Learning Human Scene Interaction - The training pipeline.

In the previous post, we studied the POSA representation for human-scene interaction (HSI) and the corresponding dataset for learning it. In this post, we look at how the egocentric feature map is predicted and how the POSA framework is trained for the human-scene interaction task.

Learning to predict the egocentric feature map

Goal: learn a probabilistic function from body pose and shape to the feature space of contact and semantics. Given a body, we want to sample labelings of its vertices that correspond to likely scene contacts, together with their semantic labels.

Training a conditional Variational Autoencoder

POSA trains a conditional Variational Autoencoder (cVAE) that conditions the feature map on the vertex positions $V_b$, which are a function of the body pose and shape parameters.

Figure 1: cVAE architecture.
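To make the conditioning concrete, here is a minimal sketch of how the per-vertex encoder input could be assembled in PyTorch. The tensor names and sizes are hypothetical placeholders, not the authors' code: the vertex coordinates $V_b$ are simply concatenated with the contact label and a one-hot encoding of the semantic label.

```python
import torch
import torch.nn.functional as F

# Placeholder sizes (hypothetical): B bodies, V mesh vertices, C semantic classes.
B, V, C = 4, 655, 8
V_b = torch.randn(B, V, 3)                  # vertex coordinates (x_i, y_i, z_i)
f_c = torch.randint(0, 2, (B, V)).float()   # binary contact labels
f_s = torch.randint(0, C, (B, V))           # semantic class indices

# Per-vertex encoder input: coordinates conditioned on the feature map.
f_s_onehot = F.one_hot(f_s, num_classes=C).float()           # (B, V, C)
x = torch.cat([V_b, f_c.unsqueeze(-1), f_s_onehot], dim=-1)  # (B, V, 3 + 1 + C)
```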

Encoder

The encoder learns to approximate the posterior $Q(z \mid f, V_b)$:

  • Input: vertex coordinates $(x_i, y_i, z_i)$, contact label $f_c$, and semantic label $f_s$
  • Output: latent vector $z \sim \mathcal{N}(0, I)$, so that sampling at test time is simple and controllable

The KL loss $\mathcal{L}_{KL}$ encourages the approximate posterior $Q(z \mid f, V_b)$ to match the prior $p(z)$:

$$\mathcal{L}_{KL} = \mathrm{KL}\big(Q(z \mid f, V_b) \,\|\, p(z)\big)$$
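With a Gaussian approximate posterior $Q(z \mid f, V_b) = \mathcal{N}(\mu, \sigma^2)$ and prior $p(z) = \mathcal{N}(0, I)$, this KL term has a closed form. A minimal sketch, assuming the encoder outputs `mu` and `logvar`, together with the standard reparameterization trick used to sample $z$ during training:

```python
import torch

def reparameterize(mu, logvar):
    # Sample z ~ N(mu, sigma^2) with gradients flowing through mu and logvar.
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def kl_loss(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ),
    # summed over latent dims and averaged over the batch.
    return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()
```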

  • Spiral convolution [3]: since $f$ is defined on the vertices of the body mesh $M_b$, graph convolutions are a natural building block for the VAE. A spiral convolution operates directly on the 3D mesh and efficiently encodes the ordering relationship between vertices; see the sketch after Figure 2 below.

Figure 2: A spiral sequence of vertices around the red-star vertex.
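Below is a minimal sketch of a spiral convolution layer in the spirit of SpiralNet++ [3]. It assumes the spiral index sequences have been precomputed on the template mesh; the class and variable names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpiralConv(nn.Module):
    """Gathers each vertex's spiral neighborhood and mixes it with one linear map."""

    def __init__(self, in_ch, out_ch, spiral_indices):
        super().__init__()
        # spiral_indices: (V, L) long tensor, L vertices along each spiral.
        self.register_buffer("indices", spiral_indices)
        self.linear = nn.Linear(in_ch * spiral_indices.shape[1], out_ch)

    def forward(self, x):                        # x: (B, V, in_ch)
        B, V, C = x.shape
        L = self.indices.shape[1]
        neigh = x[:, self.indices.reshape(-1)]   # (B, V*L, C) gathered neighbors
        neigh = neigh.reshape(B, V, L * C)       # concatenate along the spiral
        return self.linear(neigh)
```

Because the spiral fixes an explicit ordering of each vertex's neighborhood, the whole layer reduces to one gather plus one linear map, which is what makes it efficient compared to generic graph convolutions.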

Decoder

The decoder learns to maximize the log-likelihood, i.e., to reconstruct the original per-vertex features:

  • Input: vertex coordinates $(x_i, y_i, z_i)$ and the latent vector $z$ from the encoder
  • Output: reconstructed contact label $\hat{f}_c$ and reconstructed semantic label $\hat{f}_s$

The reconstruction loss $\mathcal{L}_{rec}$ encourages the reconstructed samples to resemble the input:

$$\mathcal{L}_{rec}(f, \hat{f}) = \lambda_c \sum_i \mathrm{BCE}(f_c^i, \hat{f}_c^i) + \lambda_s \sum_i \mathrm{CCE}(f_s^i, \hat{f}_s^i)$$

where BCE is the binary cross-entropy on the contact labels and CCE is the categorical cross-entropy on the semantic labels.
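A sketch of this loss in PyTorch, assuming the decoder outputs contact probabilities and semantic class logits (the function signature and names are illustrative):

```python
import torch.nn.functional as F

def rec_loss(f_c, f_s, f_c_hat, f_s_logits, lambda_c=1.0, lambda_s=1.0):
    # f_c: (B, V) binary contact labels, f_c_hat: (B, V) predicted probabilities.
    # f_s: (B, V) semantic class indices, f_s_logits: (B, V, C) class logits.
    bce = F.binary_cross_entropy(f_c_hat, f_c, reduction="none").sum(dim=-1).mean()
    cce = F.cross_entropy(f_s_logits.transpose(1, 2), f_s,
                          reduction="none").sum(dim=-1).mean()
    return lambda_c * bce + lambda_s * cce
```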

Training optimizes the encoder and decoder parameters to minimize $\mathcal{L}_{total}$ by gradient descent:

$$\mathcal{L}_{total} = \alpha \, \mathcal{L}_{KL} + \mathcal{L}_{rec}$$
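Putting the pieces together, a single training step could look like the sketch below, reusing the hypothetical helpers from the earlier snippets ($\alpha$ weights the KL term):

```python
def training_step(encoder, decoder, optimizer, V_b, f_c, f_s,
                  num_classes, alpha=1.0):
    # Assemble the conditioned per-vertex input (coordinates + feature map).
    f_s_onehot = F.one_hot(f_s, num_classes).float()
    x = torch.cat([V_b, f_c.unsqueeze(-1), f_s_onehot], dim=-1)

    mu, logvar = encoder(x)                  # approximate posterior Q(z | f, V_b)
    z = reparameterize(mu, logvar)
    f_c_hat, f_s_logits = decoder(V_b, z)    # reconstruct the per-vertex features

    loss = alpha * kl_loss(mu, logvar) + rec_loss(f_c, f_s, f_c_hat, f_s_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```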

Figure 3: Random samples from the trained cVAE.
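At test time the encoder is discarded: feature maps for a new body are generated by sampling $z$ from the prior and decoding, as in this sketch (with hypothetical `K` and `latent_dim`):

```python
# Generate K candidate feature maps for one posed body with vertices V_b: (1, V, 3).
K, latent_dim = 4, 256                # hypothetical values
z = torch.randn(K, latent_dim)        # sample the prior p(z) = N(0, I)
f_c_hat, f_s_logits = decoder(V_b.expand(K, -1, -1), z)
f_s_hat = f_s_logits.argmax(dim=-1)   # per-vertex semantic labels
```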

References

[1] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Computer Vision and Pattern Recognition (CVPR), 2019.
[2] Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J. Black. Populating 3D scenes by learning human-scene interaction. In Computer Vision and Pattern Recognition (CVPR), 2021.
[3] Shunwang Gong, Lei Chen, Michael Bronstein, and Stefanos Zafeiriou. SpiralNet++: A fast and highly efficient mesh convolution operator. In International Conference on Computer Vision Workshops (ICCVW), 2019.
[4] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3D human pose ambiguities with 3D scene constraints. In International Conference on Computer Vision (ICCV), 2019.