Populating 3D Scenes by Learning Human Scene Interaction (Part 1).

Introduction to Learning Human Scene Interaction

People constantly interact with their surroundings, and these interactions carry semantics that can be described as combinations of actions and object instances. The interactions are increasingly diverse. Understanding them is crucial for developing intelligent systems that interact effectively with humans in various contexts, including virtual environments, robotics, and human-computer interfaces. Through the semantics of these interactions, researchers can also study how humans make contact with different environments through their pose and body, and capture meaningful scene semantics.


Humans constantly interact with 3D space, and these interactions involve semantically meaningful physical contact between surfaces. It is therefore important to learn how humans interact with scenes and to study the applications of that knowledge.

Despite the importance of the interactions, existing representations of the human body do not explicitly represent, support, or capture them.

The SMPL-X model [1] represents the shape and pose of people, including the hands and face, and it supports reasoning about contact between the body and the world. However, some challenges remain:

  • SMPL-X does not explicitly model contact.
  • Not all parts of the body surface are equally likely to be in contact with the scene.
  • The poses of body and scene semantics are highly intertwined.

POSA [2] (Pose with prOximitieS and contActs) leverages SMPL-X to capture contact and the semantics of Human-Scene Interactions (HSI) in a body-centric representation. POSA targets two challenging problems:

  • Automatic scene population: Given a 3D scene and a body in a particular pose, where in the scene is this pose most likely?
  • Monocular 3D human pose estimation in a 3D scene.


Training data: The PROX-E [3] dataset was used for this task. It is a set of $n$ pairs of 3D meshes

$$\mathcal{M} = \left\{ \left\{ M_{b,1}, M_{s,1} \right\}, \left\{ M_{b,2}, M_{s,2} \right\}, \ldots, \left\{ M_{b,n}, M_{s,n} \right\} \right\}$$

comprising body meshes $M_{b,i}$ and scene meshes $M_{s,i}$, where $i$ indexes the pairs in $\mathcal{M}$.

  • $M_b = (V_b, F_b)$: the body mesh, which has a fixed topology with $N_b = |V_b| = 10475$ vertices $V_b \in \mathbb{R}^{N_b \times 3}$ and body mesh faces $F_b$.
  • $M_s = (V_s, F_s, L_s)$: the scene mesh, which has a varying number of vertices $N_s = |V_s|$, triangle connectivity $F_s$ to model arbitrary scenes, and per-vertex semantic labels $L_s$ (e.g. chair, bed, sofa, ...).
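To make the two mesh definitions concrete, here is a minimal sketch of how one body/scene pair could be held in memory. The class names `BodyMesh` and `SceneMesh`, the integer encoding of the labels, and the random toy data are assumptions made for illustration; they are not taken from the PROX-E release or the POSA code.

```python
from dataclasses import dataclass

import numpy as np

N_B = 10475  # number of SMPL-X body vertices (N_b in the text)


@dataclass
class BodyMesh:
    """M_b = (V_b, F_b): body mesh with fixed topology (hypothetical container)."""
    vertices: np.ndarray  # V_b, shape (N_B, 3)
    faces: np.ndarray     # F_b, shape (num_faces, 3), vertex indices

    def __post_init__(self):
        if self.vertices.shape != (N_B, 3):
            raise ValueError(f"expected V_b of shape ({N_B}, 3), got {self.vertices.shape}")


@dataclass
class SceneMesh:
    """M_s = (V_s, F_s, L_s): scene mesh with per-vertex semantic labels."""
    vertices: np.ndarray  # V_s, shape (N_s, 3); N_s varies from scene to scene
    faces: np.ndarray     # F_s, shape (num_faces, 3), vertex indices
    labels: np.ndarray    # L_s, shape (N_s,); integer class ids (assumed encoding)

    def __post_init__(self):
        n_s = self.vertices.shape[0]
        if self.vertices.shape != (n_s, 3):
            raise ValueError(f"expected V_s of shape (N_s, 3), got {self.vertices.shape}")
        if self.labels.shape != (n_s,):
            raise ValueError("L_s must hold exactly one label per scene vertex")


# Toy data standing in for one {M_b, M_s} pair of the dataset.
rng = np.random.default_rng(0)
dummy_faces = np.array([[0, 1, 2]])
body = BodyMesh(vertices=rng.random((N_B, 3)), faces=dummy_faces)
scene = SceneMesh(
    vertices=rng.random((500, 3)),
    faces=dummy_faces,
    labels=rng.integers(0, 3, size=500),
)
dataset = [(body, scene)]  # the set M is then a list of such pairs
```

The shape checks encode the one structural difference stated above: the body mesh always has 10475 vertices, while the scene mesh may have any number of vertices as long as each one carries a label.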


Figure 1: PROX-E Dataset.

Human meshes are represented by the SMPL-X model, i.e. a differentiable function $M(\theta, \beta, \psi) : \mathbb{R}^{|\theta| \times |\beta| \times |\psi|} \rightarrow \mathbb{R}^{N_b \times 3}$ parameterized by pose $\theta$, shape $\beta$, and facial expressions $\psi$.

  • The pose vector $\theta = (\theta_b, \theta_f, \theta_{lh}, \theta_{rh})$ comprises body parameters $\theta_b \in \mathbb{R}^{66}$ and face parameters $\theta_f \in \mathbb{R}^{9}$, both in axis-angle representation, and $\theta_{lh}, \theta_{rh} \in \mathbb{R}^{12}$, which parameterize the poses of the left and right hands respectively in a low-dimensional pose space.

  • The shape parameters, $\beta \in \mathbb{R}^{10}$, represent coefficients in a low-dimensional shape space learned from a large corpus of human body scans.

  • The joints, $J(\beta)$, of the body in the canonical pose are regressed from the body shape.
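The parameter layout above can be sketched as a flat vector that is split into its named parts. This is only an illustration of the stated dimensions: the names `POSE_DIMS` and `split_pose` are hypothetical, the ordering of the parts inside the flat vector is an assumption, and the expression vector is left out because its size is not given in the text. It does not call the SMPL-X library.

```python
import numpy as np

# Dimensions as stated in the text; the order inside the flat vector is assumed.
POSE_DIMS = {"theta_b": 66, "theta_f": 9, "theta_lh": 12, "theta_rh": 12}
SHAPE_DIM = 10  # beta


def split_pose(theta):
    """Split a flat pose vector into body, face, left-hand and right-hand parts."""
    theta = np.asarray(theta)
    total = sum(POSE_DIMS.values())
    if theta.shape != (total,):
        raise ValueError(f"expected a pose vector of length {total}, got shape {theta.shape}")
    parts, start = {}, 0
    for name, dim in POSE_DIMS.items():
        parts[name] = theta[start:start + dim]
        start += dim
    return parts


theta = np.zeros(sum(POSE_DIMS.values()))  # all-zero pose, for illustration only
beta = np.zeros(SHAPE_DIM)                 # all-zero shape coefficients
parts = split_pose(theta)

# Axis-angle uses three numbers per rotation, so the body and face parts
# reshape into one row per rotation.
body_rotations = parts["theta_b"].reshape(-1, 3)
face_rotations = parts["theta_f"].reshape(-1, 3)
print(theta.shape, body_rotations.shape, face_rotations.shape)
```

The hand parts are not reshaped, since they live in a low-dimensional pose space rather than in per-joint axis-angle form.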

Methodology: POSA Representation for HSI

POSA encodes the relationship between the human mesh $M_b$ and the scene mesh $M_s$ in an egocentric feature map $f$ that encodes per-vertex features on the SMPL-X mesh $M_b$:

$f : (V_b, M_s) \rightarrow [f_c, f_s]$, where $f_c$ is the contact label, $f_s$ is the semantic label of the contact point, and $N_f$ is the feature dimension.


Figure 2: Illustration of the proposed representation.

For each vertex $V_b^i$ on the body, find its closest scene point:

$$P_s = \underset{P_s \in \mathcal{S}_s}{\operatorname{argmin}} \| P_s - V_b^i \|$$

The distance $f_d$ is then calculated as:

$$f_d = \| P_s - V_b^i \| \in \mathbb{R}$$
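The closest-point search and the distance $f_d$ can be sketched with a brute-force NumPy computation. As a simplifying assumption, the candidate scene points are restricted to the scene mesh vertices rather than arbitrary points on the scene surface, and the function name `closest_scene_vertices` is hypothetical.

```python
import numpy as np


def closest_scene_vertices(body_vertices, scene_vertices):
    """Return, for each body vertex, the index of and distance to its nearest scene vertex.

    body_vertices:  (N_b, 3) array, V_b
    scene_vertices: (N_s, 3) array, V_s (assumed stand-in for the scene surface)
    """
    body_vertices = np.asarray(body_vertices, dtype=float)
    scene_vertices = np.asarray(scene_vertices, dtype=float)
    # Pairwise differences via broadcasting: shape (N_b, N_s, 3).
    diff = body_vertices[:, None, :] - scene_vertices[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # Euclidean distances, shape (N_b, N_s)
    nearest = dist.argmin(axis=1)         # argmin over scene vertices
    f_d = dist[np.arange(len(body_vertices)), nearest]
    return nearest, f_d


# Tiny example: two body vertices and three scene vertices.
V_b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
V_s = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.5], [5.0, 5.0, 5.0]])
nearest, f_d = closest_scene_vertices(V_b, V_s)
print(nearest)  # [0 1]
print(f_d)      # [1.  0.5]
```

The pairwise distance matrix has $N_b \times N_s$ entries, so for full-size scenes a spatial index such as a k-d tree would typically replace the brute-force search.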

Given $f_d$, determine whether vertex $V_b^i$ is in contact with the scene by comparing $f_d$ with a constant threshold:

$$f_c = \begin{cases} 1 & \text{if } f_d \leq \text{Constant Threshold} \\ 0 & \text{if } f_d > \text{Constant Threshold} \end{cases}$$
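The thresholding rule translates directly into a vectorized comparison. The text does not give the threshold value, so the `0.05` used below is a placeholder chosen only for illustration, and `contact_labels` is a hypothetical helper name.

```python
import numpy as np


def contact_labels(f_d, threshold):
    """Binary contact label per vertex: 1 where f_d <= threshold, else 0."""
    f_d = np.asarray(f_d, dtype=float)
    return (f_d <= threshold).astype(np.float32)


# Placeholder threshold; the actual constant is not stated in the text.
EXAMPLE_THRESHOLD = 0.05

f_d = np.array([0.01, 0.05, 0.20])
f_c = contact_labels(f_d, EXAMPLE_THRESHOLD)
print(f_c)  # [1. 1. 0.]
```

A vertex whose distance equals the threshold counts as in contact, matching the $\leq$ in the first case of the definition.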

The semantic label of the contacted surface, $f_s$, is a one-hot encoding of the object class:

$$f_s \in \left\{ 0, 1 \right\}^{N_o}$$

where $N_o$ is the number of object classes.
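Putting the pieces together, the per-vertex feature $[f_c, f_s]$ can be assembled by one-hot encoding the class of each vertex's closest scene point and concatenating it with the contact label. This is a sketch under stated assumptions: class ids are integers in $[0, N_o)$, every vertex (in contact or not) receives the label of its closest scene point, and the helper names `one_hot` and `assemble_features` are hypothetical.

```python
import numpy as np


def one_hot(labels, num_classes):
    """One-hot encode integer class ids into shape (len(labels), num_classes)."""
    labels = np.asarray(labels, dtype=int)
    if labels.min() < 0 or labels.max() >= num_classes:
        raise ValueError("class ids must lie in [0, num_classes)")
    return np.eye(num_classes, dtype=np.float32)[labels]


def assemble_features(f_c, nearest_labels, num_classes):
    """Stack the contact label and one-hot semantics into a per-vertex feature map."""
    f_c = np.asarray(f_c, dtype=np.float32)
    f_s = one_hot(nearest_labels, num_classes)          # (N_b, N_o)
    return np.concatenate([f_c[:, None], f_s], axis=1)  # (N_b, 1 + N_o)


# Two vertices, three object classes (N_o = 3).
f_c = np.array([1.0, 0.0])
nearest_labels = np.array([2, 0])  # class of each vertex's closest scene point
features = assemble_features(f_c, nearest_labels, num_classes=3)
print(features)
# [[1. 0. 0. 1.]
#  [0. 1. 0. 0.]]
```

In this sketch each vertex carries $1 + N_o$ numbers: one contact bit followed by the one-hot class vector.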


[1] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Computer Vision and Pattern Recognition (CVPR), 2019.

[2] Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J. Black. Populating 3D scenes by learning human-scene interaction. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[3] Shunwang Gong, Lei Chen, Michael Bronstein, and Stefanos Zafeiriou. SpiralNet++: A fast and highly efficient mesh convolution operator. In International Conference on Computer Vision Workshops (ICCVw), 20

[4] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3D human pose ambiguities with 3D scene constraints. In International Conference on Computer Vision (ICCV), 2019.