In the previous post, we studied the POSA representation for human-scene interaction (HSI) and the corresponding dataset for learning human-scene interaction. In this post, we look at how the egocentric feature map is predicted and how the POSA framework is trained on the human-scene interaction task.
Learning to predict egocentric feature map
Goal: learn a probabilistic function from body pose and shape to the feature space of contact and semantics. Given a body, we sample labelings of the vertices corresponding to likely scene contacts, together with their semantic labels.
Training conditional Variational Autoencoder
Train a conditional Variational Autoencoder (cVAE) that conditions the feature map $f$ on the vertex positions $V_b$, which are a function of the body pose and shape parameters.
Learn the approximate posterior $q(z \mid f, V_b)$:
- Input: vertex coordinates $V_b$, contact labels $f_c$, and semantic labels $f_s$.
- Output: latent vector $z$, which gives simple control over the randomness when sampling.
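Sampling from the latent space typically uses the reparameterization trick so that gradients can flow through the sampling step. A minimal NumPy sketch, assuming the encoder outputs a mean `mu` and log-variance `logvar` (names are illustrative, not from the POSA code):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); the randomness lives in eps,
    # so z stays differentiable with respect to mu and logvar.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# One latent sample per body: batch of 4 bodies, 32-d latent space.
z = reparameterize(np.zeros((4, 32)), np.zeros((4, 32)))
```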
The KL loss encourages the approximate posterior to match the prior distribution $\mathcal{N}(0, I)$:

$\mathcal{L}_{KL} = \mathrm{KL}\big(q(z \mid f, V_b) \,\|\, \mathcal{N}(0, I)\big)$
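For a diagonal Gaussian posterior and a standard normal prior, this KL term has a closed form. A small NumPy sketch, assuming the encoder outputs a mean `mu` and log-variance `logvar`:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent
    # dimensions and averaged over the batch.
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)
    return float(np.mean(kl))

# When the posterior equals the prior (mu = 0, logvar = 0), the KL is zero.
print(kl_to_standard_normal(np.zeros((4, 8)), np.zeros((4, 8))))  # 0.0
```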
- Spiral convolution: since $f$ is defined on the vertices of the body mesh, graph convolution can be used as the building block of the cVAE. Spiral convolution acts directly on the 3D mesh and efficiently encodes the ordering relationship between neighboring vertices.
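The core idea can be sketched as follows: for every vertex, gather the features of the vertices along a precomputed spiral sequence, concatenate them in that fixed order, and apply a shared linear layer. This is a minimal NumPy illustration of the operator, not the POSA implementation:

```python
import numpy as np

def spiral_conv(x, spirals, W, b):
    # x:       (N, C_in)         per-vertex input features
    # spirals: (N, L)            precomputed spiral neighbor indices per vertex
    # W:       (L * C_in, C_out) shared weights, b: (C_out,)
    N, C_in = x.shape
    L = spirals.shape[1]
    gathered = x[spirals].reshape(N, L * C_in)  # fixed ordering along the spiral
    return gathered @ W + b

# Toy mesh with 5 vertices, spiral length 3, 4 input / 8 output channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
spirals = rng.integers(0, 5, size=(5, 3))
out = spiral_conv(x, spirals, rng.standard_normal((12, 8)), np.zeros(8))
print(out.shape)  # (5, 8)
```

Because the spiral fixes a canonical ordering of each vertex's neighborhood, a plain linear layer on the concatenated features suffices, which is what makes the operator efficient at encoding ordering relationships.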
Learn to maximize the log-likelihood $\log p(f \mid z, V_b)$ and reconstruct the original per-vertex features:
- Input: vertex coordinates $V_b$ and the latent vector $z$ from the encoder.
- Output: reconstructed contact labels $\hat{f}_c$ and reconstructed contact semantics $\hat{f}_s$.
The reconstruction loss encourages the reconstructed samples to resemble the input:

$\mathcal{L}_{rec} = \mathcal{L}_{rec}(f_c, \hat{f}_c) + \mathcal{L}_{rec}(f_s, \hat{f}_s)$
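Assuming binary cross-entropy for the contact term and categorical cross-entropy for the semantic term (a standard choice for these output types; POSA's exact weighting may differ), the reconstruction loss can be sketched as:

```python
import numpy as np

def reconstruction_loss(fc, fc_hat, fs, fs_logits, eps=1e-8):
    # fc:        (N,)   ground-truth contact labels in {0, 1}
    # fc_hat:    (N,)   predicted contact probabilities
    # fs:        (N,)   ground-truth semantic class ids
    # fs_logits: (N, K) predicted class scores per vertex
    bce = -np.mean(fc * np.log(fc_hat + eps)
                   + (1 - fc) * np.log(1 - fc_hat + eps))
    log_probs = fs_logits - np.log(np.sum(np.exp(fs_logits), axis=-1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(fs)), fs])
    return float(bce + ce)

# Confident correct predictions give a near-zero loss.
fc = np.array([1.0, 0.0])
fs = np.array([0, 1])
good = reconstruction_loss(fc, np.array([0.99, 0.01]), fs,
                           np.array([[9.0, 0.0], [0.0, 9.0]]))
```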
Training optimizes the encoder and decoder parameters to minimize the total loss using gradient descent:

$\mathcal{L}_{total} = \mathcal{L}_{rec} + \alpha \mathcal{L}_{KL}$

where $\alpha$ weights the KL term against the reconstruction term.
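As a toy illustration of minimizing such a combined objective with gradient descent, here are quadratic stand-ins for the two loss terms (these are not the actual POSA losses, just a sketch of optimizing a weighted sum):

```python
# Toy stand-ins: L_rec(theta) = (theta - 2)^2, L_KL(theta) = theta^2,
# minimized as L_total = L_rec + alpha * L_KL by plain gradient descent.
alpha, lr, theta = 0.5, 0.1, 0.0
for _ in range(200):
    grad = 2.0 * (theta - 2.0) + alpha * 2.0 * theta  # d L_total / d theta
    theta -= lr * grad
# Converges to the analytic minimum 2 / (1 + alpha) = 4/3.
print(round(theta, 4))  # 1.3333
```

The KL weight $\alpha$ pulls the solution toward the prior, exactly as it trades reconstruction fidelity against latent regularity in the cVAE.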