
Scalable Group Choreography via Variational Phase Manifold Learning (Part 2)

Methodology for handling the scalability of group dance generation

In the previous part, we explored the motivation behind scalable group dance generation and introduced the concept of a manifold. In this part, we present our main proposal: Variational Phase Manifold Learning.

Figure 1. We present a new group dance generation method that can generate a large number of dancers within a fixed resource budget. The illustration shows a generated group dance sample with 100 dancers.

Task Definition

Given an input music sequence $\mathbf{a} = \{a_1, a_2, \dots, a_T\}$, where $t \in \{1, \dots, T\}$ indexes the music frames, our goal is to generate the group motion sequences of $N$ arbitrary dancers: $\mathbf{x} = \{x^1_1, \dots, x^1_T; \dots; x^N_1, \dots, x^N_T\}$, where $x^n_t$ is the pose of the $n$-th dancer at frame $t$. We use the 6D continuous rotation representation for every joint, along with 3D joint positions and velocities. Additionally, the corresponding 3D root translation vectors are concatenated into the pose representations to capture the trajectory of the motion. Previous group dance methods generate the whole group at once; due to the vast complexity of their architectures, they cannot handle an increasing number of dancers and can only create group sequences up to a pre-defined group size. In contrast, we aim to generate group dances with an unlimited number of dancers.
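
To make the representation concrete, here is a minimal sketch of how a group motion tensor could be laid out under the definitions above; the joint count, frame count, and dancer count are illustrative placeholders, not values from the paper:

```python
import numpy as np

J = 24          # hypothetical number of skeleton joints
T, N = 150, 10  # frames and dancers (illustrative values)

# Per-joint features: 6D rotation + 3D position + 3D velocity = 12 dims,
# plus one 3D root translation vector appended per frame.
pose_dim = J * (6 + 3 + 3) + 3

# Group motion tensor x: one pose vector x[n, t] per dancer n and frame t.
x = np.zeros((N, T, pose_dim), dtype=np.float32)
print(x.shape)  # (10, 150, 291)
```

The key point is that the per-frame pose dimension is fixed, while the dancer axis $N$ is free to grow.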

Figure 2. An example output.

Phase-conditioned Dance VAE

Our goal is to learn a continuous manifold such that motion can be generated by sampling from this learned manifold. We assume that although different dancers within the same group may present visually distinctive movements, the properties of their motions, such as timing, periodicity, and temporal alignment, are intrinsically similar. We aim to learn a generative phase representation for each group of dancers in order to synthesize their motion indefinitely. Our generative model is built upon the conditional Variational Autoencoder architecture, thanks to its diverse generation capability and fast sampling speed. However, instead of directly encoding the data into a Gaussian latent distribution as in common VAE approaches, we model the latent variational distribution by the phase parameters extracted from the latent motion curves, which we call the variational phase manifold. The latent phase manifold is well-structured and can describe key characteristics of motion (such as its timing, local periodicity, and transitions), which benefits learning motion features.

The overview of our Phase-conditioned Dance VAE is illustrated in Figure 3. Specifically, the model contains three main networks: an encoder $\mathcal{E}$ to capture the approximate posterior distribution conditioned on both motion and music $q_\phi(\mathbf{z}|\mathbf{x},\mathbf{a})$, a prior network $\mathcal{P}$ to learn the conditional prior given only the music $p_\theta(\mathbf{z}|\mathbf{a})$, and a decoder $\mathcal{D}$ to learn to reconstruct the data from the latent distribution $p_\theta(\mathbf{x}|\mathbf{z},\mathbf{a})$. New motion is generated by sampling the frequency-domain parameters predicted by the prior network, which are then passed through the decoder network to reconstruct the motion in the original data space. Furthermore, we adopt a Transformer-based architecture in each network to effectively capture long-range dependencies and the holistic context of the dance sequence.
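
To summarize the data flow, here is a minimal sketch of a single forward pass, assuming PyTorch and treating the three networks as opaque callables whose internals are sketched in later snippets; the concatenation order of the phase parameters in $\mathbf{z}$ is our own assumption:

```python
import torch

def forward_pass(encoder, prior, decoder, x, a):
    """One PDVAE pass: encode to phase parameters, sample, rebuild curves, decode."""
    mu_e, sigma_e = encoder(x, a)                    # q_phi(z | x, a)
    mu_p, sigma_p = prior(a)                         # p_theta(z | a), for the KL term
    z = mu_e + sigma_e * torch.randn_like(sigma_e)   # reparameterization trick
    A, F, B, S = z.chunk(4, dim=-1)                  # assumed layout {A; F; B; S}
    L_hat = build_phase_curves(A, F, B, S)           # see the later manifold snippet
    x_hat = decoder(L_hat, a)                        # p_theta(x | z, a)
    return x_hat, (mu_e, sigma_e), (mu_p, sigma_p)
```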

Figure 3. Overview of our Phase-conditioned Dance VAE (PDVAE) for scalable group dance generation. It consists of an Encoder, a Prior, and a Decoder network. During training, we encode the corresponding motion and music inputs into a latent phase manifold, which is variational and parameterized by the frequency-domain parameters of periodic functions. The latent phases can be sampled from the manifold and then decoded back to the original data space to obtain new motions. The consistency loss $\mathcal{L}_{\text{csc}}$ is further imposed to constrain the manifold to be consistently unified for dancers that belong to the same group. At the inference stage, only the Prior and the Decoder are used to synthesize group dances efficiently.

Encoder

The encoder $\mathcal{E}$ is expected to take both the motion and music feature sequences as input and produce a distribution over possible latent variables capturing the cross-modal relationship between them. To transform the joint input space into a learned phase manifold, we adopt the Transformer decoder architecture, where the Cross-Attention mechanism is utilized to learn the relationship between the motion and the music. Accordingly, the output of the encoder is a batch of latent curves (i.e., the activation sequences per channel) that can capture different spatial and temporal aspects of the motion sequence. However, instead of training the model to directly reconstruct the input motion from the extracted latent curves, we further enforce each channel of the latent space to have a periodic functional form (i.e., sinusoidal). This enables us to effectively learn a compact parameterization for each latent channel from a small set of parameters in the frequency domain.
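
As a concrete reference, here is a minimal sketch of such an encoder in PyTorch; the layer sizes, feature dimensions, and the use of `nn.TransformerDecoder` as the cross-attention stack are our own illustrative choices:

```python
import torch
import torch.nn as nn

class PhaseEncoder(nn.Module):
    """Maps (motion, music) to D latent curves via cross-attention."""
    def __init__(self, motion_dim=291, music_dim=35, d_model=256, n_channels=8):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.xattn = nn.TransformerDecoder(layer, num_layers=4)
        self.to_curves = nn.Linear(d_model, n_channels)

    def forward(self, x, a):
        # x: (B, T, motion_dim) motion, a: (B, T, music_dim) music features
        h = self.xattn(tgt=self.motion_proj(x), memory=self.music_proj(a))
        return self.to_curves(h).transpose(1, 2)  # (B, D, T) latent curves

enc = PhaseEncoder()
L = enc(torch.randn(2, 150, 291), torch.randn(2, 150, 35))
print(L.shape)  # torch.Size([2, 8, 150])
```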

Generative Variational Phase Manifold

Here we focus on learning the periodicity and non-linear temporal alignment of the motion in the latent space. In particular, given the output latent curves from the encoder $\mathbf{L} = \mathcal{E}(\mathbf{x},\mathbf{a}) \in \mathbb{R}^{D \times T}$, where $D$ is the number of desired phase channels to be extracted from the motion, we parameterize each latent curve in $\mathbf{L}$ using a sinusoidal function with amplitude ($\mathbf{A}$), frequency ($\mathbf{F}$), offset ($\mathbf{B}$), and phase shift ($\mathbf{S}$) parameters. To allow for variational phase manifold learning, we opt to predict two sets of parameters, $\mathbf{\mu}_{\mathcal{E}} = \{\mathbf{\mu}^A; \mathbf{\mu}^F; \mathbf{\mu}^B; \mathbf{\mu}^S\}$ and $\mathbf{\sigma}_{\mathcal{E}} = \{\mathbf{\sigma}^A; \mathbf{\sigma}^F; \mathbf{\sigma}^B; \mathbf{\sigma}^S\}$, which correspond to the mean and variance of a Gaussian distribution over $\mathbb{R}^{4D}$:

qϕ(zx,a)=N(z;μE,σE)q_\phi(\mathbf{z}|\mathbf{x},\mathbf{a}) = \mathcal{N}(\mathbf{z};\mathbf{\mu}_{\mathcal{E}}, \mathbf{\sigma}_{\mathcal{E}})

To do so, we first apply a differentiable Fast Fourier Transform (FFT) to each channel of the latent curves $\mathbf{L}$ and create the zero-indexed matrix of Fourier coefficients $\mathbf{c} = \text{FFT}(\mathbf{L})$ with $\mathbf{c} \in \mathbb{C}^{D \times (K+1)}$, $K = \lfloor \frac{T}{2} \rfloor$. We then compute the per-channel power spectrum $\mathbf{p} \in \mathbb{R}^{D \times (K+1)}$ as $\mathbf{p}_{i,j} = \frac{2}{T}|\mathbf{c}_{i,j}|^2$, where $i$ is the channel index and $j$ is the index over frequency bands. The distributional mean parameters of the periodic sinusoidal function are then calculated as follows:

$$\mathbf{\mu}^A_i = \sqrt{\frac{2}{T}\sum_{j=1}^K \mathbf{p}_{i,j}}, \quad \mathbf{\mu}^F_i = \frac{\sum_{j=1}^K \mathbf{f}_j \cdot \mathbf{p}_{i,j}}{\sum_{j=1}^K \mathbf{p}_{i,j}}, \quad \mathbf{\mu}^B_i = \frac{\mathbf{c}_{i,0}}{T},$$

where $\mathbf{f} = (0, \frac{1}{T}, \dots, \frac{K}{T})$ is the vector of frequencies. At the same time, the phase shift $\mathbf{S}$ is predicted using a fully-connected (FC) layer followed by a two-argument $\arctan$ (i.e., $\operatorname{atan2}$) activation:

$$(s_y, s_x) = \text{FC}(\mathbf{L}_i), \quad \mathbf{\mu}^S_i = \arctan(s_y, s_x).$$

To predict the distributional variance of the phase amplitude and phase shift parameters $\{\mathbf{\sigma}^A, \mathbf{\sigma}^S\}$, we additionally apply a separate two-layer MLP network over each channel of the latent curves. The variational latent phase parameters are sampled using the reparameterization trick, i.e., $\mathbf{A} \sim \mathcal{N}(\mathbf{\mu}^A, \mathbf{\sigma}^A)$ and $\mathbf{S} \sim \mathcal{N}(\mathbf{\mu}^S, \mathbf{\sigma}^S)$. In our experiments, we find that sampling the phase frequency $\mathbf{F}$ and offset $\mathbf{B}$ often produces unstable and non-coherent group movements. This might be because the frequencies of the dancers within the same group are likely to follow the rhythmic pattern of the musical beats, while the offsets capture their alignment, and should therefore be consistent with each other. Accordingly, we treat those parameters as deterministic by constraining their variance to zero.
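
A minimal sketch of this frequency-domain parameterization, assuming PyTorch; the variance heads for $\mathbf{\sigma}^A$ and $\mathbf{\sigma}^S$ (the two-layer MLPs) are omitted for brevity, and all tensor shapes are illustrative:

```python
import torch

def phase_parameters(L):
    """Per-channel mean phase parameters from latent curves L of shape (B, D, T)."""
    T = L.shape[-1]
    c = torch.fft.rfft(L, dim=-1)                      # (B, D, K+1), K = T // 2
    p = 2.0 / T * c.abs() ** 2                         # per-channel power spectrum
    f = torch.arange(T // 2 + 1, device=L.device) / T  # f = (0, 1/T, ..., K/T)
    p_pos, f_pos = p[..., 1:], f[1:]                   # sums run over j = 1..K
    mu_A = torch.sqrt(2.0 / T * p_pos.sum(-1))         # amplitude mean
    mu_F = (f_pos * p_pos).sum(-1) / p_pos.sum(-1)     # power-weighted frequency
    mu_B = c[..., 0].real / T                          # offset from the DC coefficient
    return mu_A, mu_F, mu_B                            # each of shape (B, D)

# Phase shift: a separate FC layer predicts (s_y, s_x) per channel,
# and mu_S = atan2(s_y, s_x), mirroring the equation for S above.
L = torch.randn(2, 8, 150)
s = torch.nn.Linear(150, 2)(L)                         # (B, D, 2)
mu_S = torch.atan2(s[..., 0], s[..., 1])               # (B, D)
mu_A, mu_F, mu_B = phase_parameters(L)
```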

Finally, the sampled set of phase parameters $\mathbf{z} = \{\mathbf{A};\mathbf{F};\mathbf{B};\mathbf{S}\}$ is used to reconstruct a parametric latent space consisting of multiple periodic curves, each representing an intrinsic property of the motion:

$$\hat{\mathbf{L}} = \mathbf{A} \cdot \sin(2\pi \cdot (\mathbf{F} \cdot \mathcal{T} - \mathbf{S})) + \mathbf{B}$$

where $\mathcal{T}$ is a known time window series obtained by evenly spacing the timesteps from $0$ to $T$. Intuitively, this curve construction procedure can be viewed as a "quantization" layer that forces the network to represent the motion features in the frequency domain, which is useful for capturing different aspects of human motion such as timing and periodicity. In the last step, a decoder is utilized to reconstruct the original motion signals from the set of parametric latent curves.
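
The reconstruction step maps directly to a few lines of code; a minimal sketch, again assuming PyTorch, with an evenly spaced time grid as described above (the default window length is an illustrative value):

```python
import math
import torch

def build_phase_curves(A, F, B, S, T=150):
    """L_hat = A * sin(2*pi*(F*t - S)) + B for phase parameters of shape (B, D)."""
    t = torch.linspace(0, T, T)             # evenly spaced timesteps from 0 to T
    angle = 2 * math.pi * (F[..., None] * t - S[..., None])
    return A[..., None] * torch.sin(angle) + B[..., None]  # (B, D, T) curves

L_hat = build_phase_curves(*(torch.rand(2, 8) for _ in range(4)))
print(L_hat.shape)  # torch.Size([2, 8, 150])
```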

Figure 4. Manifold construction.

Decoder

To decode the latent space into the original motion space, previous works have to use a sinusoidal positional encoding sequence of duration $T$ as the proxy input to the sequence decoder. This is because their latent space is formed by single latent vectors following a Gaussian distribution, which cannot span the time dimension. However, we observe that this usually results in unstable and inconsistent movements, as the proxy sequence is generic and carries little meaningful information for the decoder. Our method does not suffer from this problem, as our latent space is built on multiple curves that represent the motion information through time, thanks to the phase parameters. Subsequently, our decoder $\mathcal{D}$ is based on the Transformer decoder architecture and takes the constructed parametric latent curves, as well as the music features, as inputs to reconstruct the corresponding dance motions. Here, we also utilize cross-attention, where we consider the sequence of music features as the key and value, and the sampled latent curves as the query. The output of the decoder is a sequence of $T$ vectors in $\mathbb{R}^D$, which is then projected back to the original motion dimensions through a linear layer to obtain the reconstructed outputs $\hat{\mathbf{x}} = p_\theta(\mathbf{x}|\mathbf{z},\mathbf{a})$. We additionally employ a global trajectory predictor to predict the global translation of the root joint based on the generated local motions, in order to avoid intersection problems between dancers.
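
Mirroring the encoder sketch above, here is a minimal decoder sketch with the attention roles swapped (latent curves as query, music as key/value); the dimensions are again our own placeholders:

```python
import torch
import torch.nn as nn

class PhaseDecoder(nn.Module):
    """Latent curves (query) cross-attend to music (key/value) -> motion."""
    def __init__(self, motion_dim=291, music_dim=35, d_model=256, n_channels=8):
        super().__init__()
        self.curve_proj = nn.Linear(n_channels, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.xattn = nn.TransformerDecoder(layer, num_layers=4)
        self.to_motion = nn.Linear(d_model, motion_dim)  # back to pose dims

    def forward(self, L_hat, a):
        # L_hat: (B, D, T) latent curves, a: (B, T, music_dim) music features
        q = self.curve_proj(L_hat.transpose(1, 2))       # (B, T, d_model)
        h = self.xattn(tgt=q, memory=self.music_proj(a))
        return self.to_motion(h)                         # (B, T, motion_dim)

dec = PhaseDecoder()
x_hat = dec(torch.randn(2, 8, 150), torch.randn(2, 150, 35))
print(x_hat.shape)  # torch.Size([2, 150, 291])
```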

Prior Network

Since the ground-truth motion is generally inaccessible at test time (i.e., we only have access to the music), we also need to learn a prior $\mathcal{P}$ to match the posterior distribution of motion from which the latent phases can be sampled. Specifically, we follow the same manifold construction procedure to predict the Gaussian distribution conditioned on the music sequence $\mathbf{a}$, which is then used for sampling the latent phases:

$$p_\theta(\mathbf{z}|\mathbf{a}) = \mathcal{N}(\mathbf{z};\mathbf{\mu}_{\mathcal{P}}, \mathbf{\sigma}_{\mathcal{P}})$$

where a Transformer encoder is used to encode the input conditioning music sequence and predict the corresponding $\mathbf{\mu}_{\mathcal{P}}$ and $\mathbf{\sigma}_{\mathcal{P}}$. We implement the prior network similarly to the encoder network; however, we use a self-attention mechanism to capture the global music context. Learning the conditional prior is crucial for the conditional VAE to generalize to diverse types of music and motion. Intuitively speaking, each latent variable $\mathbf{z}$ is expected to represent possible dance motions $\mathbf{x}$ conforming to the music context $\mathbf{a}$. Therefore, the prior should be able to encode different latent distributions given different music inputs.
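
At inference, only the prior and the decoder are involved; here is a minimal sampling sketch under the same assumptions as the earlier snippets (recall that $\mathbf{F}$ and $\mathbf{B}$ are treated as deterministic, so their predicted variances are zero):

```python
import torch

@torch.no_grad()
def generate_group(prior, decoder, a, n_dancers=100):
    """Sample one latent phase per dancer from the shared music-conditioned
    prior and decode each independently, so cost grows linearly with dancers."""
    mu, sigma = prior(a)                              # (B, 4D) each
    dancers = []
    for _ in range(n_dancers):
        z = mu + sigma * torch.randn_like(sigma)      # sigma is zero for F and B
        A, F, B, S = z.chunk(4, dim=-1)               # assumed layout {A; F; B; S}
        L_hat = build_phase_curves(A, F, B, S)        # from the earlier sketch
        dancers.append(decoder(L_hat, a))
    return torch.stack(dancers, dim=1)                # (B, n_dancers, T, motion_dim)
```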

Next

In the next part, we will explore the training procedure and experimental setups to validate the effectiveness of the proposed method.
