People constantly interact with their surroundings, and these interactions carry semantics: each can be described as a combination of an action and an object instance. These interactions are increasingly diverse. Understanding them is crucial for developing intelligent systems that interact effectively with humans in contexts such as virtual environments, robotics, and human-computer interfaces. Moreover, the semantics of these interactions help researchers understand how humans make contact with different environments through their pose and body, and capture meaningful scene semantics.
Introduction
Humans constantly interact with 3D space, and these interactions involve semantically meaningful physical contact between surfaces. It is therefore important to learn how humans interact with scenes and to study the applications of that knowledge.
Despite the importance of the interactions, existing representations of the human body do not explicitly represent, support, or capture them.
The SMPL-X model [1] can represent the shape and pose of people. It includes the hands and face, and it supports reasoning about contact between the body and the world. However, some challenges remain:
- SMPL-X does not explicitly model contact.
- Not all parts of the body surface are equally likely to be in contact with the scene.
- The poses of body and scene semantics are highly intertwined.
POSA [2] (Pose with prOximitieS and contActs) leverages SMPL-X to capture contact and the semantics of Human-Scene Interactions (HSI) in a body-centric representation. POSA addresses two challenging problems:
- Automatic scene population: Given a 3D scene and a body in a particular pose, where in the scene is this pose most likely?
- Monocular 3D human pose estimation in a 3D scene.
Dataset
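Training data: the PROX-E dataset [3] was used for this task. It is a set of $n$ pairs of 3D meshes
$\mathcal{M} = \{\{M_{b,1}, M_{s,1}\}, \{M_{b,2}, M_{s,2}\}, \ldots, \{M_{b,n}, M_{s,n}\}\}$
comprising body meshes $M_{b,i}$ and scene meshes $M_{s,i}$, where $i$ is the index of the pair in $\mathcal{M}$. The index is dropped below for brevity.
- $M_b = (V_b, F_b)$: body mesh, which has a fixed topology with $N_b$ vertices $V_b \in \mathbb{R}^{N_b \times 3}$ and body mesh faces $F_b$.
- $M_s = (V_s, F_s, L_s)$: scene mesh, which has a varying number of vertices $N_s = |V_s|$, triangle connectivity $F_s$ to model arbitrary scenes, and per-vertex semantic labels $L_s$ (e.g., chair, bed, sofa).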
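The body/scene pairing described above can be pictured as two small containers. This is a minimal sketch, not the dataset's actual loader: the class names `BodyMesh` and `SceneMesh`, the field names, and the toy geometry are all hypothetical and chosen only to illustrate the structure.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BodyMesh:
    """Body mesh with fixed topology (hypothetical container)."""
    vertices: np.ndarray  # (N_b, 3) float vertex positions
    faces: np.ndarray     # (F_b, 3) int indices into `vertices`


@dataclass
class SceneMesh:
    """Scene mesh with per-vertex semantic labels (hypothetical container)."""
    vertices: np.ndarray  # (N_s, 3) float vertex positions; N_s varies per scene
    faces: np.ndarray     # (F_s, 3) int indices into `vertices`
    labels: np.ndarray    # (N_s,) int object-class id per vertex

    def __post_init__(self):
        # Every scene vertex must carry exactly one semantic label.
        if self.labels.shape != (self.vertices.shape[0],):
            raise ValueError("labels must have one entry per scene vertex")


# Toy pair: a single-triangle "body" above a two-triangle "scene".
body = BodyMesh(
    vertices=np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    faces=np.array([[0, 1, 2]]),
)
scene = SceneMesh(
    vertices=np.array([[0.0, 0.0, -0.1], [1.0, 0.0, -0.1],
                       [1.0, 1.0, -0.1], [0.0, 1.0, -0.1]]),
    faces=np.array([[0, 1, 2], [0, 2, 3]]),
    labels=np.array([1, 1, 1, 1]),  # class id 1 is an arbitrary stand-in
)
pairs = [(body, scene)]  # the training set is a list of n such pairs
```

Keeping the labels on the scene vertices, rather than on faces, mirrors the per-vertex labeling described in the text and makes nearest-vertex lookups straightforward.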
Human meshes are represented by the SMPL-X model, i.e., a differentiable function $M_b(\theta, \beta, \psi)$ parameterized by pose $\theta$, shape $\beta$, and facial expressions $\psi$.
The pose vector $\theta = (\theta_b, \theta_f, \theta_{lh}, \theta_{rh})$ comprises the body parameters $\theta_b$ and face parameters $\theta_f$, both in axis-angle representation, and $\theta_{lh}$ and $\theta_{rh}$, which parameterize the poses of the left and right hands respectively in a low-dimensional pose space.
The shape parameters, $\beta$, represent coefficients in a low-dimensional shape space learned from a large corpus of human body scans.
The joints, $J(\beta)$, of the body in the canonical pose are regressed from the body shape.
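To make the layout of the pose vector concrete, the sketch below packs the body, face, and hand components into one flat array and splits it back. The component sizes in `DIMS` and the helper names `pack_pose` and `unpack_pose` are assumptions made for illustration; they are not taken from the text above and are not part of any SMPL-X API.

```python
import numpy as np

# Assumed component sizes, for illustration only (not stated in the text).
DIMS = {"body": 66, "face": 9, "left_hand": 12, "right_hand": 12}


def pack_pose(parts):
    """Concatenate named pose components into one flat pose vector."""
    for name, size in DIMS.items():
        if parts[name].shape != (size,):
            raise ValueError(f"{name} must have shape ({size},)")
    return np.concatenate([parts[name] for name in DIMS])


def unpack_pose(theta):
    """Split a flat pose vector back into its named components."""
    out, start = {}, 0
    for name, size in DIMS.items():
        out[name] = theta[start:start + size]
        start += size
    return out


# Start from the all-zero pose and rotate the first body joint.
parts = {name: np.zeros(size) for name, size in DIMS.items()}
parts["body"][:3] = [0.0, np.pi / 2, 0.0]  # axis-angle: 90 degrees about the y-axis
theta = pack_pose(parts)
```

In axis-angle form each joint takes three numbers: the direction of the vector is the rotation axis and its length is the rotation angle in radians.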
Methodology: POSA Representation for HSI
POSA encodes the relationship between the human mesh $M_b = (V_b, F_b)$ and the scene mesh $M_s = (V_s, F_s, L_s)$ in an egocentric feature map $f$ that encodes per-vertex features on the SMPL-X mesh $M_b$:
$f : (V_b, M_s) \rightarrow [f_c, f_s]$
where $f_c$ is the contact label, $f_s$ is the semantic label of the contact point, and $N_f$ is the feature dimension.
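For each vertex $i$ on the body, $V_b^i$, find its closest scene point $P_s$:
$P_s = \arg\min_{P \in M_s} \lVert P - V_b^i \rVert$
The distance $f_d$ is then calculated:
$f_d = \lVert P_s - V_b^i \rVert \in \mathbb{R}$
Given $f_d$, determine whether vertex $V_b^i$ is in contact with the scene by comparing $f_d$ with a constant threshold $\tau$:
$f_c = \begin{cases} 1, & f_d \le \tau \\ 0, & f_d > \tau \end{cases}$
The semantic label of the contacted surface, $f_s$, is a one-hot encoding of the object class:
$f_s \in \{0, 1\}^{N_o}$
where $N_o$ is the number of object classes.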
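The per-vertex steps above (closest scene point, distance, thresholded contact label, one-hot semantic label) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `posa_features` is hypothetical, the closest scene point is approximated by the closest scene vertex rather than the closest point on the mesh surface, and the default threshold of 0.05 is a placeholder value.

```python
import numpy as np


def posa_features(body_vertices, scene_vertices, scene_labels,
                  num_classes, contact_threshold=0.05):
    """Per-body-vertex distance, contact label, and one-hot semantic label.

    Assumption: the closest scene *vertex* stands in for the closest scene point.
    """
    # Pairwise distances between body and scene vertices: shape (N_b, N_s).
    diff = body_vertices[:, None, :] - scene_vertices[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    nearest = dist.argmin(axis=1)                        # index of closest scene vertex
    f_d = dist[np.arange(len(body_vertices)), nearest]   # distance to it
    f_c = (f_d <= contact_threshold).astype(np.float32)  # 1 = contact, 0 = no contact
    # One-hot class of the closest scene vertex: shape (N_b, num_classes).
    f_s = np.eye(num_classes, dtype=np.float32)[scene_labels[nearest]]
    return f_d, f_c, f_s


# Toy example: two body vertices, three labeled scene vertices.
body_vertices = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
scene_vertices = np.array([[0.0, 0.0, 0.02], [0.0, 0.0, 1.5], [2.0, 0.0, 0.0]])
scene_labels = np.array([2, 0, 1])

f_d, f_c, f_s = posa_features(body_vertices, scene_vertices, scene_labels,
                              num_classes=3)
# Stack contact and semantics into one feature map of shape (N_b, 1 + num_classes).
features = np.concatenate([f_c[:, None], f_s], axis=1)
```

The brute-force distance matrix costs O(N_b * N_s) memory, so for real scenes a spatial index such as a k-d tree would be the usual substitute. Whether the semantic label should be zeroed for vertices that are not in contact is a design choice this sketch leaves open; here every body vertex receives the class of its nearest scene vertex.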