Populating 3D Scenes by Learning Human Scene Interaction (Part 3).

Learning Human Scene Interaction - The inference pipeline and effectiveness analysis.

In this post, we look at the inference phase of the POSA framework for the human-scene interaction task and at how its effectiveness is evaluated.

Inference phase

Putting people into scenes: Given a scene $M_s$, the semantic labels of the objects present, and a body mesh $M_b$, POSA finds where in $M_s$ the given pose is likely to occur:

First, given the posed body, use the decoder of the cVAE to generate a feature map by sampling $P(f_{Gen} \mid z, V_b)$, with $f_{Gen} = [\hat{f_c}, \hat{f_s}]$.

Second, optimize the objective function:

$$E(\tau, \theta_0, \theta) = \mathcal{L}_{\text{afford}} + \mathcal{L}_{\text{pen}} + \mathcal{L}_{\text{reg}}$$

where $\tau$ is the body translation, $\theta_0$ is the global body orientation, $\theta$ is the body pose, and:

  • The affordance loss $\mathcal{L}_{\text{afford}}$ finds positions in the scene where the given pose is likely to occur. Here $f_d$ and $f_s$ are the signed distances and semantic labels that the scene provides at the body vertices:
$$\mathcal{L}_{\text{afford}} = \lambda_1 \|f_{Gen_c} \cdot f_d\|_2^2 + \lambda_2 \sum_i CCE(f_{Gen_s}^i, f_s^i)$$
  • The penetration penalty $\mathcal{L}_{\text{pen}}$ discourages the body from penetrating the scene:
$$\mathcal{L}_{\text{pen}} = \lambda_{\text{pen}} \sum_{f_d^i < 0} (f_d^i)^2$$
  • The regularizer $\mathcal{L}_{\text{reg}}$ encourages the estimated pose to remain close to the initial pose $\theta_{\text{init}}$ of $M_b$:
$$\mathcal{L}_{\text{reg}} = \lambda_{\text{reg}} \|\theta - \theta_{\text{init}}\|_2^2$$
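The three terms of the objective can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names and default weights are hypothetical, and the semantic features are assumed to be per-vertex class probabilities compared against integer scene labels.

```python
import math

def affordance_loss(contact_prob, sem_prob, sdf, sem_labels, lam1=1.0, lam2=1.0):
    """L_afford: contact-weighted squared distance plus semantic cross-entropy.

    contact_prob: generated contact probability per body vertex (f_Gen_c)
    sem_prob:     generated class probabilities per body vertex (f_Gen_s), assumed format
    sdf:          signed distance of each body vertex to the scene (f_d)
    sem_labels:   scene class index observed at each body vertex (f_s), assumed format
    """
    # ||f_Gen_c . f_d||_2^2 with an element-wise product
    contact_term = sum((c * d) ** 2 for c, d in zip(contact_prob, sdf))
    # Categorical cross-entropy against the observed scene label
    eps = 1e-12  # avoids log(0)
    sem_term = sum(-math.log(p[label] + eps) for p, label in zip(sem_prob, sem_labels))
    return lam1 * contact_term + lam2 * sem_term

def penetration_loss(sdf, lam_pen=1.0):
    """L_pen: squared signed distance, summed over penetrating vertices only."""
    return lam_pen * sum(d ** 2 for d in sdf if d < 0)

def pose_regularizer(theta, theta_init, lam_reg=1.0):
    """L_reg: squared L2 distance between the current and the initial pose."""
    return lam_reg * sum((a - b) ** 2 for a, b in zip(theta, theta_init))

def objective(contact_prob, sem_prob, sdf, sem_labels, theta, theta_init):
    """E = L_afford + L_pen + L_reg, using the default (hypothetical) weights."""
    return (affordance_loss(contact_prob, sem_prob, sdf, sem_labels)
            + penetration_loss(sdf)
            + pose_regularizer(theta, theta_init))
```

In a real pipeline these sums would be computed on tensors so that the energy can be minimized with a gradient-based optimizer; plain lists are used here only to keep the sketch dependency-free.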


Figure 1: Putting realistic people in scenes.

Locating Clothed Bodies:


Figure 2: Locate clothed bodies in scenes.
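POSA can also place clothed bodies, using SMPL-X fits to clothed meshes from the AGORA dataset. Because the pose of a clothed body is kept fixed, the pose regularizer is dropped and only the translation and global orientation are optimized:

$$E(\tau, \theta_0) = \mathcal{L}_{\text{afford}} + \mathcal{L}_{\text{pen}}$$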
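With the pose fixed, placement reduces to a search over rigid transforms of the body. The sketch below illustrates that idea under strong simplifying assumptions: the orientation is restricted to a yaw angle about the z (up) axis, the semantic part of the affordance loss is omitted, and a search over a list of candidates stands in for continuous optimization. All function names, the weights, and the `scene_sdf` callable are hypothetical.

```python
import math

def rigid_transform(vertices, tau, yaw):
    """Rotate vertices about the z (up) axis by yaw, then translate by tau."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x - s * y + tau[0], s * x + c * y + tau[1], z + tau[2])
            for x, y, z in vertices]

def placement_energy(vertices, contact_prob, scene_sdf, lam1=1.0, lam_pen=10.0):
    """Contact part of L_afford plus L_pen for already-transformed vertices."""
    dists = [scene_sdf(v) for v in vertices]
    afford = sum((c * d) ** 2 for c, d in zip(contact_prob, dists))
    pen = sum(d ** 2 for d in dists if d < 0)
    return lam1 * afford + lam_pen * pen

def best_placement(vertices, contact_prob, scene_sdf, candidates):
    """Return the (tau, yaw) candidate with the lowest energy."""
    return min(
        candidates,
        key=lambda cand: placement_energy(
            rigid_transform(vertices, cand[0], cand[1]), contact_prob, scene_sdf),
    )

# Toy usage: the scene is a floor at z = 0, so the signed distance is the height.
floor_sdf = lambda v: v[2]
body = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]   # a "foot" vertex and a "head" vertex
contact = [1.0, 0.0]                        # only the foot is expected to touch
candidates = [((0.0, 0.0, dz), 0.0) for dz in (0.5, 0.0, -0.5)]
tau, yaw = best_placement(body, contact, floor_sdf, candidates)
```

In this toy case the candidate that rests the foot on the floor wins: floating leaves the contact vertex away from the surface, and sinking triggers the penetration penalty.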

Monocular Pose Estimation with HSI: SMPL-X is fit to RGB image features such that the contacts are consistent with the 3D scene and its semantics. The starting point is the SMPLify-X objective, which combines several terms: the re-projection error of 2D joints, plus priors and physical constraints on the body:

$$E_{\text{SMPLify-X}}(\beta, \theta, \psi, \tau) = E_J + \lambda_{\theta}E_{\theta} + \lambda_{\alpha}E_{\alpha} + \lambda_{\beta}E_{\beta} + \lambda_{\mathcal{P}}E_{\mathcal{P}}$$

To obtain a pose that matches the image observations and roughly obeys the scene constraints, sample features from $P(f_{Gen} \mid z, V_b)$ conditioned on the body pose, then minimize

$$E(\beta, \theta, \psi, \tau, M_s) = E_{\text{SMPLify-X}} + \|f_{Gen_c} \cdot f_d\| + \mathcal{L}_{\text{pen}}$$


Comparison to PROX ground truth: The authors take 4 real scenes from the PROX test set and 100 SMPL-X bodies from the AGORA dataset, corresponding to 100 different 3D scans from Renderpeople. For each body they sample one feature map using the cVAE, then automatically optimize the placement of each sample in all the scenes, one body per scene. The pose is changed slightly to fit the scene for unclothed bodies and kept fixed for clothed bodies.

For each variant (unclothed and clothed), the optimization results in 400 unique body-scene pairs (100 bodies × 4 scenes). Each 3D human-scene interaction is rendered from 2 views so that subjects can get a good sense of the 3D relationships from the images.


Figure 3: Comparison to PROX ground truth. Subjects are shown pairs of a generated 3D human-scene interaction and PROX ground truth.

Comparison between POSA and PLACE: The authors directly compare POSA and PLACE using the above protocol. Adding semantics to POSA improves realism.


Figure 4: POSA compared to PLACE for 3D human-scene interaction generation.

Physical Plausibility: The authors take 1200 bodies from the AGORA dataset, place all of them in each of the 4 PROX test scenes, and compute the following scores:

  • The non-collision score for each body mesh $M_b$: the number of body mesh vertices with positive scene signed distance field (SDF) values divided by the total number of SMPL-X vertices.
  • The contact score for each $M_b$: 1 if at least one vertex of $M_b$ has a non-positive scene SDF value, and 0 otherwise.
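Both scores follow directly from the per-vertex scene SDF values of a placed body. A minimal sketch, assuming those values have already been looked up and are passed in as a list (the function names are hypothetical):

```python
def non_collision_score(sdf_values):
    """Fraction of body vertices with a strictly positive scene SDF value."""
    if not sdf_values:
        raise ValueError("sdf_values must contain at least one vertex")
    return sum(1 for d in sdf_values if d > 0) / len(sdf_values)

def contact_score(sdf_values):
    """1 if at least one body vertex has a non-positive scene SDF value, else 0."""
    return 1 if any(d <= 0 for d in sdf_values) else 0

# Toy body with four vertices: two free, one on the surface, one inside the scene.
example_sdf = [0.2, 0.1, 0.0, -0.05]
scores = (non_collision_score(example_sdf), contact_score(example_sdf))  # (0.5, 1)
```

The two scores pull in opposite directions: a body floating in free space has a perfect non-collision score but a contact score of 0, so they are only informative when read together.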

The experiments show that POSA and PLACE are comparable under these metrics.


Figure 5: Evaluation of the physical plausibility metric.


[1] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Computer Vision and Pattern Recognition (CVPR), 2019.
[2] Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J. Black. Populating 3D Scenes by Learning Human-Scene Interaction. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[3] Shunwang Gong, Lei Chen, Michael Bronstein, and Stefanos Zafeiriou. SpiralNet++: A fast and highly efficient mesh convolution operator. In International Conference on Computer Vision Workshops (ICCVw), 20
[4] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3D human pose ambiguities with 3D scene constraints. In International Conference on Computer Vision (ICCV), 2019.