7 posts tagged with "computer graphic"

View All Tags

Scalable Group Choreography via Variational Phase Manifold Learning (Part 4)

In the previous part, we introduce training process and experimental setup. In this part, we validate the effectiveness and efficiency of the proposed method.

Figure 1. We present a new group dance generation method that can generate a large number of dancers within a fixed resource consumption. The illustration shows a generated group dance sample with 100100 dancers.

Experimental Results

Quality Comparison

Table 1 presents a comparison among the baselines FACT, Transflower, EDGE, GDanceR, GCD, and our proposed manifold-based method. The results clearly demonstrate that our model outperforms the baselines across all evaluations on two datasets, AIOZ-GDANCE and AIST-M. We observe that recent diffusion-based dance generation models, such as EDGE or GCD, achieve competitive performance on both single-dance metrics (FID, MMC, GenDiv, and PFC) and group dance metrics (GMR, GMC, and TIF). However, due to limitations in their training procedures, they still struggle with generating multiple dancing motions when faced with a large number of dancers, as indicated by their lower performance compared to our method. This suggests that our approach successfully maintains the quality of dance motions as the number of dancers increases.

Table 1. Performance comparison.
Additionally, Figure 2 illustrates that our proposed method outperforms other state-of-the-art models like GDanceR and GCD in addressing issues such as monotonous, repetitive, sinking, and overlapping dance motions.
Figure 2. Visualization of a dancing sample between different methods. GDanceR displays monotonous, repetitive, or sinking dance motions. GCD exhibits more divergence in dance motions, yet dancers may intersect since their optimization does not address this issue explicitly. Blue boxes mark these issues. In contrast, our manifold-based solution ensures the divergence of dancing motions, while the phase motion path demonstrates its effectiveness in addressing floating and crossing issues in group dances.

Scalable Generation Analysis

Table 2 illustrates the performance of different group dance generation methods (GCD, GDancer, and Ours) when generating dance movements with an increasing number of dancers. When the number of dancers is increased to 10, GCD appears to run out of memory, which is also observed in GDanceR when the number of dancers increases to 100. Regardless of the number of dancers, our method consistently achieves stable and competitive results. This implies that our proposed method successfully addresses the scalability issue in group dance generation without compromising the overall performance of each individual dance motion.

Table 2. Performance of group dance generation methods when we increase the number of generated dancers. The experiments are done with common consumer GPUs with 4GB memory. (N/A means models could not run due to inadequate memory footprint).

Figure 3 illustrates the memory consumption to generate dance motions in groups for each method. Noticeably, our proposal still achieves the highest performance while consuming immensely fewer resources required for generating group dance motions (See Figure 4 for illustrations). This, again, indicates that our method successfully breaks the barrier of limited generated dancers by using the manifold.

Figure 3. Memory usage vs. number of dancers in different dance generators.

Figure 4. Visualization of a dancing sample between different methods. GDanceR displays monotonous, repetitive, or sinking dance motions. GCD exhibits more divergence in dance motions, yet dancers may intersect since their optimization does not address this issue explicitly. Blue boxes mark these issues. In contrast, our manifold-based solution ensures the divergence of dancing motions, while the phase motion path demonstrates its effectiveness in addressing floating and crossing issues in group dances.

Ablation Study

Table 3 presents the performance improvements achieved through the integration of consistency loss and phase manifold. Additionally, we showcase the effectiveness of our proposed approach across three different backbones: Transformer, LSTM, and CNN. Evaluation metrics including FID, GMR, and GMC are utilized. The results indicate that the absence of consistency loss leads to an increase in GMR and a decrease in GMC, suggesting a significant enhancement in the realism and correlation of group dance motions facilitated by the inclusion of the proposed objective. Meanwhile, with out the phase manifold, the model exhibits remarkably higher scores in both the FID and GMR metrics, suggesting that phase manifold can effectively improve the distinction in dance motions while maintaining the realism of group dances, even when the number of dancers in a group is large. In comparing three backbones—Transformer, LSTM, and CNN—we have observed that the chosen Transformer backbone achieved the best results compared to LSTM or CNN.

Table 3. Module contribution and loss analysis.

User Study

User studies are vital for evaluating generative models, as user perception is pivotal for downstream applications; thus, we conducted two studies with around 70 participants each, diverse in background, with experience in music and dance, aged between 20 to 40, consisting of approximately 47\% females and 53\% males, to assess the effectiveness of our approach in group choreography generation.

In the user study, we aim to assess the realism of sample outputs with more and more dancers. Specifically, participants assign scores ranging from 0 to 10 to evaluate the realism of each dance clip with 2 to 10 dancers. Figure 5 shows that, across all methods, the more the number of dancers is increased, the lower the realism is found. However, the drop in realism of our proposed method is the least compared to GCD and GDanceR. The results indicate our method's good performance compared to other baselines when the number of dancers increases.

Figure 5. Realism between different methods when number of dancers is varied.

Discussion

While our approach leverages the VAE as a primary solution for generating a manifold, it is important to acknowledge certain inherent limitations associated with this choice. One notable challenge is the susceptibility to issues such as posterior collapse and unstable sampling within the VAE framework. These challenges can result in generated group dance motions that may not consistently meet performance expectations.

One specific manifestation of this limitation is the potential for false decoding when sampling points that lie too far from the learned distribution. This scenario can lead to unexpected rotations or disruptions in the physics of the generated content. The impact of this problem becomes evident in instances where the generated samples deviate significantly from the anticipated distribution, introducing inaccuracies and distortions.

To address these challenges, we recognize the need for ongoing efforts to mitigate the effects of posterior collapse and unstable sampling. While the problem is acknowledged, our approach incorporates measures to limit its impact. Future research directions could explore alternative generative models or additional techniques to enhance the robustness and reliability of the generated results in the face of these identified limitations.

Scalable Group Choreography via Variational Phase Manifold Learning (Part 3)

In the previous part, we introduct our main proposal Variational Phase Manifold Learning. In this part, we have explore the training process and experimental setup to validate the proposed method.

Figure 1. We present a new group dance generation method that can generate a large number of dancers within a fixed resource consumption. The illustration shows a generated group dance sample with 100100 dancers.

Training

During training, we consider the following variational lower bound to mainly train our dance generation VAE model:

logpθ(xa)Eqϕ[logpθ(xz,a)]DKL(qϕ(zx,a)pθ(za))\log p_\theta(\mathbf{x}|\mathbf{a}) \geq \mathbb{E}_{q_\phi} \left[ \log p_\theta(\mathbf{x}|\mathbf{z},\mathbf{a}) \right] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{a}) \Vert p_\theta(\mathbf{z}|\mathbf{a}))

In practice, we apply the conditional VAE loss, which is defined as the weighted sum Lcvae=Lrec+λKLLKL\mathcal{L}_{\text{cvae}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{KL}}\mathcal{L}_{\text{KL}}. In particular, the reconstruction term Lrec\mathcal{L}_{\text{rec}} measures the motion reconstruction error given the decoder output (via a smooth-L1 loss). The KL divergence term LKL\mathcal{L}_{\text{KL}} compares the divergence DKLD_{\text{KL}} between the posterior and the prior distribution to enforce them to be close to each other.

The conditional VAE objective above is calculated for each dancer separately and cannot capture the correlation between dancers within a group. Therefore, it is essential to impose consistency among dancers and avoid strange effects such as unsynchronized dance. To this end, we propose a group consistency loss by enforcing the latent phase manifold to be similar for the same group, given the input music. Specifically, we first calculate the phase manifold features based on the frequency domain parameters as follows:

P2i1=Aisin(2πSi),P2i=Aicos(2πSi)\mathbf{P}_{2i-1} = \mathbf{A}_i\sin(2\pi \cdot \mathbf{S}_i), \qquad \mathbf{P}_{2i} = \mathbf{A}_i\cos(2\pi\cdot \mathbf{S}_i)

where PR2D\mathbf{P}\in\mathbb{R}^{2D} is the phase manifold vector that encodes the spatial-temporal information of the motion state. The phase feature may look similar to the positional encodings of transformers in the sense that both use multi-resolution sinusoidal functions. However, the phase feature is much richer in terms of representation capacity since it learns to embed the spatial (body joints) and temporal (positions in time) information of the motion curves, whereas the positional encodings only encode the position of words. Finally, our consistency objective is to constrain phase manifold between dancers within a group to be as close as possible to each other, which is formulated as:

Lcsc=DKL(qϕ(zxm,a)(qϕ(zxn,a))+PmPn22\mathbf{\mathcal{L}}_{\text{csc}} = D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}^m,\mathbf{a}) \Vert (q_\phi(\mathbf{z}|\mathbf{x}^n,\mathbf{a}) ) + \Vert \mathbf{P}^m - \mathbf{P}^n\Vert^2_2

where the first term encourages the network to map different dancers belonging to the same group (xm\mathbf{x}^m and xn\mathbf{x}^n) into the same set of distributional phase parameters while the second term penalizes the discrepancy in their corresponding phase manifolds. In general, this loss is applied to ensure every dancer is embedded into a single unified manifold that can effectively represent their corresponding group. To summarize, our total training loss is defined as the combination of the VAE loss and the consistency loss L=Lcvae+λcscLcsc\mathcal{L} = \mathbf{\mathcal{L}}_{\text{cvae}} + \lambda_{\text{csc}}\mathbf{\mathcal{L}}_{\text{csc}}.

For testing, we can efficiently generate motions for an unlimited number of dancers by indefinitely drawing samples from the learned continuous group-consistent phase manifold. It is noteworthy that for inference, we only need to process the prior network once to obtain the latent distribution. To generate a new motion, we can sample from this latent (Gaussian) distribution and use the decoder to decode it back to the motion space. This approach is much more efficient and has significantly higher scalability than previous approaches that is limited by the number of dancers processed simultaneously by the entire network.

Figure 2. An example output.

Experiments

Implementation Details

Our model was trained on 4 NVIDIA V100 GPUs using Adam optimizer with a fixed learning rate of 10410^{-4} and a mini-batch size of 32 per GPU. For training losses, the weights are empirically set to λKL=5×104\lambda_{\text{KL}} = 5\times 10^{-4} and λcsc=104\lambda_{\text{csc}} = 10^{-4}, respectively. The Transformer encoders and decoders consist of 5 layers for both encoder, decoder, and prior Network with 512-dimensional hidden units. Meanwhile, the number of latent phase channels is set to 256. To further capture the periodic nature of the phase feature, we also use Siren activation following the initialization scheme. This can effectively model the periodicity inherent in the motion data, and thus can benefit motion synthesis.

Experimental Settings

Dataset. In our experiments, we utilize the AIOZ-GDance and AIST-M datasets. AIOZ-GDance is the largest music-driven dataset focusing on group dance, encompassing paired music and 3D group motions extracted from in-the-wild videos through a semi-automatic process. This dataset spans 7 dance styles and 16 music genres. For consistency, we adhere to the training and testing split during our experiments.

Evaluation Protocol. We employ several metrics to assess the quality of individual dance motions, including Frechet Inception Distance (FID), Motion-Music Consistency (MMC, and Generation Diversity (GenDiv), along with the Physical Foot Contact score (PFC). Specifically, the FID score gauges the realism of individual dance movements concerning the ground-truth dance. MMC assesses the matching similarity between motion and music beats, reflecting how well-generated dances synchronize with the music's rhythm. GenDiv is computed as the average pairwise distance of kinetic features among motions. PFC evaluates the physical plausibility of foot movements by determining the agreement between the acceleration of the character's center of mass and the foot's velocity.

In assessing the quality of group dance, we adopt three metrics : Group Motion Realism (GMR), Group Motion Correlation (GMC), and Trajectory Intersection Frequency (TIF). Broadly, GMR gauges the realism of generated group motions in comparison to ground-truth data, employing Frechet Inception Distance on extracted group motion features. GMC evaluates the synchronization among dancers within the generated group by computing their cross-correlation. TIF quantifies the frequency of collisions among the generated dancers during their dance movements.

Baselines. Our method is subjected to comparison with various recent techniques in music-driven dance generation, namely FACT, Transflower, and EDGE. These approaches are adapted for benchmarking within the context of group dance generation, considering that their original designs were tailored for single-dance scenarios. Additionally, our evaluation includes a comparison with GDanceR, GCD, and DanY. All of the mentioned works are specifically designed for the generation of group choreography.

Figure 3. Generated Group Dance from GCD baselines

Next

In the next part, we will explore the effectiveness of the proposal through quantitative and qualitative results.

Scalable Group Choreography via Variational Phase Manifold Learning (Part 2)

In the previous part, we have explore the introduction about group dance scalability and the what is manifold. In this part, we introduct our main proposal Variational Phase Manifold Learning.

Figure 1. We present a new group dance generation method that can generate a large number of dancers within a fixed resource consumption. The illustration shows a generated group dance sample with 100100 dancers.

Task Definition

Given an input music sequence a={a1,at,...,aT}\mathbf{a} = \{a_1, a_t, ...,a_T\} with t={1,...,T}t = \{1,..., T\} indicates the index of the music frames, our goal is to generate the group motion sequences of NN arbitrary dancers: x={x11,...,xT1;...;x1N,...,xTN}\mathbf{x} = \{x^1_1,..., x^1_T; ...;x^N_1,...,x^N_T\} where xtnx^n_t is the pose of nn-th dancer at frame tt. We use the 6D continuous rotation for every joint, along with 3D joint positions and velocities. Additionally, the corresponding 3D root translation vectors are concatenated into the pose representations to involve the trajectory of motion. Previous group dance methods, which generate the whole group at once, cannot deal with the increasing number of dancers and can only create group sequences up to a pre-defined number of dancers, due to the vast complexity of the architecture. In contrast, we aim to generate group dance with an unlimited number of dancers.

Figure 2. An example output.

Phase-conditioned Dance VAE

Our goal is to learn a continuous manifold such that the motion can be generated by sampling from this learned manifold. We assume that although different dancers within the same group may present visually distinctive movements, the properties of their motions, such as timing, periodicity, or temporal alignment are intrinsically similar. We aim to learn a generative phase representation for each group of dancer in order to synthesize their motion indefinitely. Our generative model is built upon the conditional Variational Autoencoder architecture, thanks to its diverse generation capability and fast sampling speed. However, instead of directly encoding the data into a Gaussian latent distribution as in common VAE approaches, we model the latent variational distribution by the phase parameters extracted from the latent motion curve, which we call variational phase manifold. The latent phase manifold is well-structured and can well describe key characteristics of motion (such as its timing, local periodicity, and transition), which benefits learning motion features.

The overview of our Phase-conditioned Dance VAE is illustrated in Figure 3. Specifically, the model contains three main networks: an encoder E\mathcal{E} to capture the approximate posterior distribution conditioned on both motion and music qϕ(zx,a)q_\phi(\mathbf{z}|\mathbf{x},\mathbf{a}), a prior network P\mathcal{P} to learn the conditional prior given only the music pθ(za)p_\theta(\mathbf{z}|\mathbf{a}) , and a decoder D\mathcal{D} to learn to reconstruct the data from the latent distribution pθ(xz,a)p_\theta(\mathbf{x}|\mathbf{z},\mathbf{a}). The new motion is generated by sampling the frequency-domain parameters predicted by the prior network, which is then passed through the decoder network to reconstruct the motion in the original data space. Furthermore, we adopt Transformer-based architecture in each network to effectively capture long-range dependencies and holistic context of the dance sequence.

Figure 3. Overview of our Phase-conditioned Dance VAE (PDVAE) for scalable group dance generation. It consists of an Encoder, a Prior, and a Decoder network. During training, we encode the corresponding motion and music inputs into a latent phase manifold, which is variational and parameterized by the frequency domain parameters of periodic functions. The latent phases can be sampled from the manifold and then decoded back to the original data space to obtain new motions. The consistency loss Lcsc\mathcal{L}_{\text{csc}} is further imposed to constrain the manifold to be consistently unified for dancers that belong to the same group. At inference stage, only the Prior and the Decoder are used to synthesize group dances efficiently. .

Encoder

The encoder E\mathcal{E} is expected to take both the motion and music feature sequence as input, and produce a distribution over possible latent variables capturing the cross-modal relationship between them. To transform the joint input space into a learned phase manifold, we adopt the Transformer decoder architecture where the Cross-Attention mechanism is utilized to learn the relationship between the motion and the music. Accordingly, the output of the encoder is a batch of latent curves (i.e., the activation sequences per channel) that can particularly capture different spatial and temporal aspects of the motion sequence. However, instead of training the model to directly reconstruct the input motion from the extracted latent curves, we further enforce each channel of the latent space to have a periodic functional form (i.e., sinusoidal). This enables us to effectively learn a compact parameterization for each latent channel from a small set of parameters in the frequency domain.

Generative Variational Phase Manifold

Here we focus on learning the periodicity and non-linear temporal alignment of the motion in the latent space. In particular, given the output latent curves from the encoder L=E(x,a)RD×T\mathbf{L} = \mathcal{E}(x,a) \in \mathbb{R}^{D \times T} with DD is the number of desired phase channels to be extracted from the motion, we parameterize each latent curve in L\mathbf{L} using a sinusoidal function with amplitude (A\mathbf{A}), frequency (F\mathbf{F}), offset (B\mathbf{B}) and phase shift (S\mathbf{S}) parameters. To allow for variational phase manifold learning, we opt to predict two sets of parameters μE={μA;μF;μB;μS}\mathbf{\mu}_{\mathcal{E}} =\{\mathbf{\mu}^A; \mathbf{\mu}^F; \mathbf{\mu}^B; \mathbf{\mu}^S \} and σE={σA;σF;σB;σS}\mathbf{\sigma}_{\mathcal{E}} =\{\mathbf{\sigma}^A; \mathbf{\sigma}^F; \mathbf{\sigma}^B; \mathbf{\sigma}^S \}, which corresponds to the mean and variance of R4D\mathbb{R}^{4D} dimensional Gaussian distribution:

qϕ(zx,a)=N(z;μE,σE)q_\phi(\mathbf{z}|\mathbf{x},\mathbf{a}) = \mathcal{N}(\mathbf{z};\mathbf{\mu}_{\mathcal{E}}, \mathbf{\sigma}_{\mathcal{E}})

To do so, we first apply differentiable Fast Fourier Transform (FFT) to each channel of the latent curve L\mathbf{L} and create the zero-indexed matrix of Fourier coefficients as c=FFT(L)\mathbf{c}=FFT(\mathbf{L}) with cCD×K+1\mathbf{c} \in \mathbb{C}^{D \times K+1}, K=T2K =\lfloor \frac{T}{2}\rfloor. Correspondingly, we compute the per channel power spectrum pRD×K+1\mathbf{p} \in \mathbb{R}^{D \times K+1} as pi,j=2Nci,j2\mathbf{p}_{i,j} = \frac{2}{N}|\mathbf{c}_{i,j}|^2, where ii is the channel index and jj is the index for the frequency bands. Correspondingly, the distributional mean parameters of the periodic sinusoidal function are then calculated as follows:

μiA=2Tj=1Kpi,j,μiF=j=1Kfjpi,jj=1Kpi,j,μiB=ci,0T,\mathbf{\mu}^A_i = \sqrt{\frac{2}{T}\sum_{j=1}^K \mathbf{p}_{i,j}}, \quad \mathbf{\mu}^F_i = \frac{\sum_{j=1}^K \mathbf{f}_j \cdot \mathbf{p}_{i,j}}{ \sum_{j=1}^K \mathbf{p}_{i,j}}, \quad \mathbf{\mu}^B_i = \frac{\mathbf{c}_{i,0}}{T},

where f=(0,1T,,KT)\mathbf{f} = (0, \frac{1}{T},\dots,\frac{K}{T}) is the frequencies vector. At the same time, the phase shift S\mathbf{S} is predicted using a fully-connected (FC) layer with two arctan\arctan activation as:

(sy,sx)=FC(Li),μiS=arctan(sy,sx),(s_y, s_x) = \text{FC}(\mathbf{L}_i), \quad \mathbf{\mu}^S_i = \arctan(s_y,s_x),

To predict the distributional variance of the phase amplitude and phase shift parameters {σA,σS}\{\mathbf{\sigma}^A, \mathbf{\sigma}^S\}, We additionally apply a separate two-layer MLP network over each channel of the latent curves. The variational latent phase parameters are sampled by utilizing parameterization trick, i.e., AN(μA,σA)\mathbf{A}\sim\mathcal{N}(\mathbf{\mu}^A,\mathbf{\sigma}^A) and SN(μS,σS)\mathbf{S}\sim\mathcal{N}(\mathbf{\mu}^S,\mathbf{\sigma}^S). In our experiments, we find that sampling the phase frequency F\mathbf{F} and offset B\mathbf{B} often produce unstable and non-coherent group movements. This might be because the frequency amplitudes of the dancers within the same group are likely to associate with the rhythmic pattern of the musical beats while the offsets capture their alignment, thereby should be consistent with each other. Therefore, we treat those parameters as deterministic by constraining their variance to zero.

Finally, the sampled set of phase parameters z={A;F;B;S}\mathbf{z} = \{\mathbf{A};\mathbf{F};\mathbf{B};\mathbf{S}\} are used to reconstruct a parametric latent space consisting of multiple periodic curves to represent each intrinsic property of the motion by:

L^=Asin(2π(FTS))+B\hat{\mathbf{L}} = \mathbf{A} \cdot \sin (2\pi \cdot (\mathbf{F}\cdot\mathcal{T} - \mathbf{S})) + \mathbf{B}

where T\mathcal{T} is a known time window series obtained by evenly spacing the timesteps from 00 to TT. Intuitively this curve construction procedure can be viewed as a "quantization" layer to enforce the network to learn to represent the motion features in the frequency domain, which is useful in representing different aspects of human motion such as their timing and periodicity. In the last step, a decoder is utilized to reconstruct the original motion signals from the set of parametric latent curves.

Figure 4. Manifold conduction.

Decoder

To decode the latent space into the original motion space, previous works have to use a sinusoidal positional encoding sequence with duration TT as the proxy input to the sequence decoder. This is because their latent space is only formed by single latent vectors following a Gaussian distribution, which cannot span the time dimension. However, we observe that it usually results in unstable and inconsistent movements, as the proxy sequence is generic and usually contains less meaningful information for the decoder. Meanwhile, our method does not suffer from this problem as our latent space is built on multiple curves that can represent the motion information through time, thanks to the phase parameters. Subsequently, our decoder D\mathcal{D} is based on Transformer decoder architecture that takes the constructed parametric latent curve, as well as the music features as inputs, to reconstruct the corresponding dance motions. Here, we also utilize the cross-attention model where we consider the sequence of and music features as key and value along with the sampled latent curves as the query. The output of the decoder is a sequence of TT vectors in RD\mathbb{R}^D, which is then projected back to the original motion dimensions through a linear layer, to obtain the reconstructed outputs x^=pθ(xz,a)\hat{\mathbf{x}}=p_\theta(\mathbf{x}|\mathbf{z},\mathbf{a}). We additionally employ a global trajectory predictor to predict the global translation of the root joint based on the generated local motions, in order to avoid intersection problems between dancers.

Prior Network.

Since the ground-truth motion is generally inaccessible at test time (i.e., we only have access to the music), we also need to learn a prior P\mathcal{P} to match the posterior distribution of motion from which the latent phase can be sampled. Specifically, We follow the manifold procedure to predict the Gaussian distribution conditioned on the music sequence a\mathbf{a}, which is then used for sampling the latent phases:

pθ(za)=N(z;μP,σP)p_\theta(\mathbf{z}|\mathbf{a}) = \mathcal{N}(\mathbf{z};\mathbf{\mu}_{\mathcal{P}}, \mathbf{\sigma}_{\mathcal{P}})

where a Transformer encoder is used to encode the input conditioning music sequence and predict the corresponding μP\mathbf{\mu}_{\mathcal{P}} and σP\mathbf{\sigma}_{\mathcal{P}}. We implement the prior network similarly to the encoder network, however, we use self-attention mechanism to capture the global music context. Learning the conditional prior is crucial for the conditional VAE to generalize to diverse types of music and motion. Intuitively speaking, each latent variable z\mathbf{z} is expected to represent possible dance motions x\mathbf{x} conforming to the music context a\mathbf{a}. Therefore, the prior should be able to encode different latent distributions given different musics.

Next

In the next part, we will explore the training procedure and experimental setups to validate the effectiveness of the proposed method.

Scalable Group Choreography via Variational Phase Manifold Learning (Part 1)

Generating group dance motion from the music is a challenging task with several industrial applications. Although several methods have been proposed to tackle this problem, most of them prioritize optimizing the fidelity in dancing movement, constrained by predetermined dancer counts in datasets. This limitation impedes adaptability to real-world applications. Our study addresses the scalability problem in group choreography while preserving naturalness and synchronization.

Figure 1. We present a new group dance generation method that can generate a large number of dancers within a fixed resource consumption. The illustration shows a generated group dance sample with 100100 dancers.

In particular, we propose a phase-based variational generative model for group dance generation on learning a generative manifold. Our method achieves high-fidelity group dance motion and enables the generation with an unlimited number of dancers while consuming only a minimal and constant amount of memory. The intensive experiments on two public datasets show that our proposed method outperforms recent state-of-the-art approaches by a large margin and is scalable to a great number of dancers beyond the training data.

Introduction

The proliferation of digital social media platforms has significantly increased the popularity of creating and sharing dance videos. This surge in interest has led to the daily production and consumption of millions of dance videos across various online platforms, attracting attention from the research community in fields such as computer vision and computer graphics. Recent advancements in these areas have focused on generating authentic dance movements in response to music, with broad applications spanning animation, virtual idols, virtual metaverses, and dance education. These technologies provide powerful tools for artists, animators, and educators, enhancing creativity and enriching the dance experience for both performers and audiences.

Figure 2. An example visualization of a virtual group dancer.
While considerable progress has been made in generating solo dance motions, creating synchronized and realistic group dance performances remains a complex and unresolved challenge. Existing methods typically face limitations in scalability, either generating dances for a fixed number of performers or suffering from high memory consumption due to architectural constraints. These approaches, often based on collaborative mechanisms such as cross-entity or global attention, struggle to scale up for larger groups of dancers, limiting their practical applicability. Moreover, the reliance on predefined datasets with a fixed number of dancers further restricts these models from being adapted to real-world scenarios requiring larger group choreographies.

To address these challenges, we propose a novel approach to scalable group dance generation using a phase-based variational generative model, termed Phase-conditioned Dance VAE (PDVAE). Unlike traditional variational autoencoders that operate in high-dimensional motion space and often struggle to capture temporal dynamics, PDVAE leverages phase parameters in the frequency domain to represent the latent motion space. This enables the generation of realistic and synchronized group dance performances without increasing computational and memory costs, even as the number of dancers grows. PDVAE provides a flexible and efficient solution for generating crowd-scale dance animations, with potential applications in diverse fields such as entertainment, virtual reality, education, and media production.

In this work, we aim to advance the state-of-the-art in scalable group dance generation by overcoming the limitations of existing methods and demonstrating the feasibility of generating large-scale, natural, and synchronized dance performances efficiently.

To summarize, our key contributions are as follows: - We introduce PDVAE, a phase-based variational generative model for scalable group dance generation. The method focuses on generating large-scale group dance under limited resources. - To effectively learn the manifold that is group-consistent (i.e., dancers within a group lie upon the same manifold), we propose a group consistency loss that enforces the networks to encode the latent phase manifold to be identical for the same group given the input music. - Extensive experiments along with thorough user study evaluations demonstrate the state-of-the-art performance of our model while achieving effective scalability.

Related Works

Music-driven Choreography

Crafting natural human choreography derived from music poses a complex challenge, encompassing the need for synchronization, coherence, and expressiveness between movement and musical inputs. Earlier approaches often relied on music-motion similarity constraints to ensure alignment between the generated dance and the accompanying music. Many of these methods employ heuristic algorithms to stitch together pre-existing dance segments from limited music-dance databases, successfully producing extended and realistic dance sequences. However, such techniques are constrained when attempting to generate novel dance fragments, as they depend heavily on pre-defined segments rather than innovative motion generation .

Recent advancements have focused on utilizing deep learning architectures to map music into dance motions. Various techniques, including Convolutional Networks (CNN) , Recurrent Networks (RNN) , Graph Neural Networks (GNN) , Generative Adversarial Networks (GAN) , and Transformer models , have been explored for dance generation. These models typically rely on inputs such as the current music and a brief history of previous dance motions to predict the next human poses in a sequence. For instance, innovations in multi-modal feature fusion have enabled the simultaneous integration of music and text, producing dance sequences guided by both musical and textual cues .

Despite the progress made by these methods, generating synchronized and coordinated dance movements for multiple dancers remains a significant challenge. Achieving harmony between multiple dancers requires not only temporal alignment but also the consideration of spatial relationships and interactions between performers. This adds complexity to the generation process, making it difficult for current techniques to manage multiple dancers cohesively. Additionally, these methods are often constrained by the limited number of dancers present in their training datasets .

Figure 3. Multi-person Tracking with GNN.

Several recent works have sought to address these limitations. For instance, Perez et al. propose a multimodal transformer combined with a normalizing-flow-based decoder to predict a distribution of possible future poses, offering improved flexibility in motion generation. Feng et al. introduce a motion repeat constraint for long-term generation, allowing their model to generate future frames while taking into account historical dance motions. Le et al. further explore group dance by examining consistency and diversity among dancers, ensuring that generated movements maintain coherence across multiple performers. However, these methods are still limited by the number of dancers depicted in training datasets, restricting their scalability for larger group performances.

Figure 4. Local Mesh Fitting process for conducting a dance dataset.

Overall, while significant strides have been made in music-driven dance generation, further research is needed to overcome the challenges of scalability and synchronization in group dance synthesis.

Motion Manifold Learning

Motion manifold learning has garnered significant attention in computer vision and artificial intelligence, primarily aiming to understand the fundamental structures underlying human movement and dynamics. This approach offers the capability to generate human motion patterns, providing insights into intrinsic motion dynamics, managing nonlinear relationships within motion data, and learning contextual and hierarchical representations. Consequently, numerous methodologies have emerged, each contributing distinct perspectives to advance the comprehension and synthesis of human motion.

Holden et al. pioneered motion manifold learning by generating character movements from high-level parameters mapped to a motion manifold, eliminating the need for manual preprocessing while enabling natural, smooth post-generation editing of motion sequences. MotionCLIP introduced a 3D human motion autoencoder aligned with CLIP's semantic space, facilitating semantic text-based motion generation, disentangled editing, and abstract language specification. This approach capitalizes on CLIP's rich semantic knowledge, integrating it within the motion manifold for enhanced control and interpretation of human movement. Sun et al. further advanced this field by employing VQ-VAE to learn a low-dimensional motion manifold, effectively refining motion sequences to improve coherence and continuity.

Figure 5. Manifold conduction.
In the context of group dance generation, motion manifold learning presents a promising solution to scalability challenges, particularly the restricted number of dancers present in most datasets. By learning a distribution over dance motions within the manifold, this approach enables the generation of synchronized, cohesive group dance sequences, potentially overcoming dataset limitations. This direction of research highlights the potential of manifold-based methods to enhance scalability, allowing for the synthesis of large-scale, realistic group dance performances that maintain temporal and spatial harmony.

Next

In the next part, we will dive deeply into the main proposal that leveragw manifold learning to handle scalability of group dance generation.

Music-Driven Group Choreography (Part 3)

This is the final part of the series group dance choreography, In this part, we will provide detailed analyses of our proposed group dance generation method.

Experiments

AIOZ-GDANCE Statistics

Figure 1. Distribution (%) of music genres (a) and dance styles (b) in our dataset.

In Figure 1, we show the distribution of music genres and dance styles in our dataset. As illustrated in Figure 1 (Left), Pop and Electronic are popular music genres while other music genres nearly share the same distribution. Meanwhile, on the right of Figure 1, Zumba, Aerobic, and Commercial are the dominant dance styles.

Figure 2. The correlation between dance styles and number of dancers (a); and between dance styles and music genres (b).

Figure 2 (Left) shows the number of dancers in each dance style. Naturally, we see that Zumba, Aerobic, and Commercial have more dancers. On the right of this figure, we illustrate the correlation between music genres and dance styles.

Evaluation Metrics

Similar to prior works on single-dance generation, we evaluate the generated motion quality by calculating the distribution distance between the generated and the ground-truth motions using Frechet Inception Distance (FID)[1, 2]. To evaluate how well the generated 3D motion correlates to the input music, we use the Motion-Music Consistency metric (MMC) [2,3]. We also evaluate our model's ability to generate diverse dance motions when given various input music by measuring Generation Diversity (GenDiv) [2,3].

To evaluate the group dancing quality, we propose three new metrics: Group Motion Realism (GMR), Group Motion Correlation (GMC), and Trajectory Intersection Frequency (TIF). Detailed calculations of these metrics are described as follows:

Group Motion Realism (GMR). To calculate the realism between generated and ground-truth group motion, we need to find a single unified representation for all dancers' motions in the scene. Based on the kinetic features of a single motion sequence [4], we propose to calculate Group Motion Realism (GMR), smaller is better. Specifically, for each entity, we compute the velocity of each element jj of the pose vector: vtn=yt+1nytnΔtv^n_t = \frac{y^n_{t+1} - y^n_t}{\Delta t} where Δt\Delta t is the time period between two consecutive frames. Note that the pose vector of each entity at each frame consists of the root orientation, root position and joint angles. The group kinetic features of a sequence is approximated by taking the logarithm of the total kinetic energy of all group entities as:

ej=log(1+1T1Nt=1Tn=1Nmj(vt,jn)2)e_j = \log \left(1 + \frac{1}{T}\frac{1}{N} \sum_{t=1}^T \sum_{n=1}^N m_j (v^n_{t,j})^2\right)

where mjm_j is the moment of inertia or mass of each joint. We assume that mjm_j is constant with respect to time and entity. Then, we split the sequence into smaller chunks and calculate the features of these chunks. This process is identical for both the generated and ground-truth sequences. Finally, we utilize these sets of features (from generated and ground-truth group dance) to calculate the GMR using the standard FID formulation as in [1].

Group Motion Correlation (GMC). We also evaluate the synchrony and the correlation between dancers within the generated group. We assume that the correlation of movements between individuals is likely to reflect their interaction in the choreography. For every pair of motions within a group, we first align the two motion sequences using Dynamic Time Warping algorithm based on the Euclidean distance in the joint position space (obtained by SMPL joint regressor). We then calculate the mean cross-correlations between the time-aligned motion pairs using the kinetic features [4]. The generated group motion correlation degree is then calculated as the average of all motion pairs.

Trajectory Intersection Frequency (TIF). For the generated group sequences, the intersection rate is calculated over all FF frames as:

TIR=Fi,j:ijI[intersect(M(yi),M(yj))]F,\text{TIR} = \frac{\sum_{F}\sum_{i,j : i\neq j} \mathbb{I}[\text{intersect}(M(y^i),M(y^j))]}{F},

where MM is the SMPL skinning function [5] which can output a 6890-vertices human mesh from the input pose parameters yy. intersect(x,y)\text{intersect}(x,y) is a function that returns 1 if the two meshes are intersect with each other and 0 otherwise. For TIF, smaller value is better and indicates less intersection of the generated group.

Cross-entity Attention Analysis

We compare our method with FACT [2]. FACT is a recent state-of-the-art method designed for single dance generation, thus giving our method an advantage. However, it is still the closest competing method as we propose a new group dance dataset that is not available for benchmarking before. We also analyse our method with and without using Cross-entity Attention. We train all methods with mini-batch containing all dancers within the group instead of sampling each dancer independently as in FACT’s original implementation.

Figure 3. Comparison between FACT and our GDanceR. Our method handles better the consistency and cross-body intersection problem between dancers.

Table 1 shows the method comparison between the baseline FACT[2] and our proposed GDanceR with and without Cross-entity Attention. The results show that GDanceR, especially with the Cross-entity Attention, outperforms the baseline by a large margin in all metrics. In Figure 3, we also visualize the example outputs of FACT and GDanceR. It is clear that FACT does not handle well the intersection problem. This is understandable as FACT is not designed for group dance generation, while our method with the Cross-entity Attention can deal with this problem better.

Table 1. Generation results comparison on AIOZ-GDANCE dataset. w/o CA denotes without using Cross-entity Attention.

Number of Dancers Analysis.

Table 2 demonstrates the generation results of our method when we want to generate different numbers of dancers. In general, the FID, GMR, and GMC metrics do not show much correlation with the numbers of generated dancers since the results are varied. On the other hand, MMC shows its stability among all setups (0.248\sim 0.248), which indicates that our network is robust in generating motion from given music regardless of the changing of initial positions. The generation diversity (GenDiv) decreases while the intersection frequency (TIF) increases when more dancers are generated. These results show that dealing with the collision during the group generation process is worth further investigation.

Table 2. Performance of our proposed method when increasing the number of generated dancers.

Dance Style Analysis

Figure 4. Examples of generated group motions from our method.

Different dance styles exhibit different challenges in group dance generation. As shown in Table 3, Aerobic and Zumba are quite similar for generating choreography as they usually focus on workout and sporty movements. Besides, while Commercial and Irish are easier for the model to reproduce the motions, Bollywood and Samba contain highly skilled movements that are challenging to capture and represent accurately. In Figure 4, we show the generated results of GDanceR with different dance styles. Our Supplementary Material and Demonstration Video also provide more examples.

Table 3. The results of different dance styles. These results are obtained by training the model on each dance style.

Ablation on Latent Motion Fusion.

We investigate different fusion strategies between the local motion hih^i and global-aware motion gig^i to obtain the final motion representation ziz^i. Specifically, we experiment with three settings: (i) No Fusion: the final motion is the global-aware motion obtained from our Cross-entity Attention (zi=giz^i = g^i); (ii) Concatenate: the final motion is the concatenation of the local and global-aware motion (zi=[hi;gi]z^i = [h^i; g^i]); (iii) Add: the final motion is the addition between local and global (zi=hi+giz^i = h^i + g^i). Table 4 summarizes the results. We find that fusing the motion by adding both the local and global motion features achieves the best results. In this strategy, the global information between entities is encoded to the local motion in an effective way so that the final motion retain the comprehensive information of their own past motion as well as the motion of every other entity. While in the concatenation, the model is prone to overfitting due to the redundant information of both the local and global representation. On the other hand, No Fusion can degrade the amount of information of the past motion, leading to insufficient input information and the Decoder may fail to generate the temporally-coherent motion aligned with the music.

Table 4. Ablation study on different fusion strategies for the latent motion representation.

Conclusion

In summary, we have introduced AIOZ-GDANCE, the largest dataset for audio-driven group dance generation. Our dataset contains in-the-wild videos and covers different dance styles and music genres. We then propose a strong baseline along with new evaluation metrics for group dance generation task. We also perform extensive experiments to validate our method on this interesting yet unexplored problem, using our new dataset and evaluation protocols. We hope that the release of our dataset will foster more research on audio-driven group choreography.

References

[1] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NIPS 2017.

[2] Ruilong Li, Shan Yang, David A Ross, and Angjoo Kanazawa. Ai choreographer: Music-conditioned 3d dance generation with aist++. ICCV 2021

[3] Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. Dancing to music. NIPS 2019.

[4] Kensuke Onuma, Christos Faloutsos, and Jessica K Hodgins. Fmdistance: A fast and effective distance function for motion capture data. Eurographics 2008.

[5] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics, 2015

Music-Driven Group Choreography (Part 2)

In the previous post, we have introduced AIOZ-GDANCE, a new largescale in-the-wild dataset for music-driven group dance generation. On the basis of the new dataset, we introduce the first strong baseline for group dance generation that can jointly generate multiple dancing motions expressively and coherently.

Figure 1. The overall architecture of GDanceR. Our model takes in a music sequence and a set of initial positions, and then auto-regressively generates coherent group dance motions that are attuned to the input music.

Music-driven Group Dance Generation Method

Problem Formulation

Given an input music audio sequence {m1,m2,...,mT}\{m_1, m_2, ...,m_T\} with t={1,...,T}t = \{1,..., T\} indicates the index of music segments, and the initial 3D positions of NN dancers {τ01,τ02,...,τ0N}\{\tau^1_0, \tau^2_0, ..., \tau^N_0 \}, τ0iR3\tau^i_0 \in \mathbb{R}^{3}, our goal is to generate the group motion sequences {y11,...,yT1;...;y1n,...,yTn}\{y^1_1,..., y^1_T; ...;y^n_1,...,y^n_T\} where ytiy^i_t is the generated pose of ii-th dancer at time step tt. Specifically, we represent the human pose as a 72-dimensional vector y=[τ;θ]y = [\tau; \theta] where τ\tau, θ\theta represent the root translation and pose parameters of the SMPL model [1], respectively.

In general, the generated group dance motion should meet the two conditions: (i) consistency between the generated dancing motion and the input music in terms of style, rhythm, and beat; (ii) the motions and trajectories of dancers should be coherent without cross-body intersection between dancers. To that end, we propose the first baseline method, for group dance generation that can jointly generate multiple dancing motions expressively and coherently. Figure 1 shows the architecture of our proposed Music-driven 3D Group Dance generatoR (GDanceR), which consists of three main components:

  • Transformer Music Encoder.
  • Initial Pose Generator.
  • Group Motion Generator.

Transformer Music Encoder

From the raw audio signal of the input music, we first extract music features using the available audio processing library Librosa. Concretely, we extract the mel frequency cepstral coefficients (MFCC), MFCC delta, constant-Q chromagram, tempogram, onset strength and one-hot beat, which results in a 438-dimensional feature vector. We then encode the music sequence M={m1,m2,...,mT}M =\{m_1, m_2, ...,m_T\}, mtR438m_t \in \mathbb{R}^{438} into a sequence of hidden representation {a1,a2,...,aT}\{a_1, a_2,..., a_T\}, atRdaa_t \in \mathbb{R}^{d_a}. In practice, we utilize the self-attention mechanism of transformer [2] to effectively encode the multi-scale information and the long-term dependency between music frames. The hidden audio at each time step is expected to contain meaningful structural information to ensure that the generated dancing motion is coherent across the whole sequence.

Specifically, we first embed the music features mtm_t using a Linear layer followed by Positional Encoding to encode the time ordering of the sequence.

U=PE(MWau)U = \text{PE}({M} W^u_a)

where PE\text{PE} denotes the Positional Encoding, and WauR438×daW^u_a \in \mathbb{R}^{438 \times d_a} is the parameters of the linear projection layer. Then, the hidden audio information can be calculated using self-attention mechanism:

A=FF(softmax((UqUk)Tdk)Uv),Uq=UWaq,Uk=UWak,Uv=UWav\\ \mathbb{A} = \text{FF}(\text{softmax}\left(\frac{(U^q U^k)^T}{\sqrt{d_{k}}} \right) U^v ), \\ U^q = U W^q_a, \quad U^k = U W^k_a, \quad U^v = UW^v_a

where Waq,WakRda×dkW^q_a, W^k_a \in \mathbb{R}^{d_a \times d_k}, and WavRda×dvW^v_a \in \mathbb{R}^{d_a \times d_v} are the parameters that transform the linear embedding audio UU into a query UqU^q, a key UkU^k, and a value UvU^v respectively. dad_a is the dimension of the hidden audio representation while dkd_k is the dimension of the query and key, and dvd_v is the dimension of value. FF\text{FF} is a feed-forward neural network.

Initial Pose Generator

Figure 2.The Transformer Music Encoder encodes the acoustic and rhythmic information to generate the initial poses from the input positions.

Given the initial positions of all dancers, we generate the initial poses by combing the audio feature with the starting positions. We aggregate the audio representation by taking an average over the audio sequence. The aggregated audio is then concatenated with the input position and fed to a multilayer perceptron (MLP) to predict the initial pose for each dancer:

y0i=MLP([1Tt=1Tat;τ0i]),y^i_0 = \text{MLP}\left( \left[\frac{1}{T}\sum_{t=1}^T a_t ; \tau^i_0 \right] \right),

where [;][;] is the concatenation operator, τ0i\tau^i_0 is the initial position of the ii-th dancer.

Group Motion Generator

Figure 3. The Group Motion Generator auto-regressively generates coherent group dance motions based on the encoded acoustic information.

To generate the group dance motion, we aim to synthesize the coherent motion of each dancer such that it aligns well with the input music. Furthermore, we also need to maintain global consistency between all dancers. As shown in Figure 3, our Group Generator comprises a Group Encoder to encode the group sequence information and an MLP Decoder to decode the hidden representation back to the human pose space. To effectively extract both the local motion and global information of the group dance sequence through time, we design our Group Encoder based on two factors: Recurrent Neural Network [3] to capture the temporal motion dynamics of each dancer, and Attention mechanism [2] to encode the spatial relationship of all dancers.

Specifically, at each time step, the pose of each dancer in the previous frame yt1iy^i_{t-1} is sent to an LSTM unit to encode the hidden local motion representation htih^i_t:

hti=LSTM(yt1i,ht1i){h^i_t=\text{LSTM}(y^i_{t-1},h^i_{t-1})}

To ensure the motions of all dancers have global coherency and discourage strange effects such as cross-body intersection, we introduce the Cross-entity Attention mechanism. In particular, each individual motion representation is first linearly projected into a key vector kik^i, a query vector qiq^i and a value vector viv^i as follows: \begin{equation} k^i = h^i W^{k}, \quad q^i = h^i W^{q}, \quad v^i = h^i W^{v}, \end{equation} where Wq,WkRdh×dkW^q, W^k \in \mathbb{R}^{d_h \times d_k}, and WvRdh×dvW^v \in \mathbb{R}^{d_h \times d_v} are parameters that transform the hidden motion hh into a query, a key, and a value, respectively. dkd_k is the dimension of the query and key while dvd_v is the dimension of the value vector. To encode the relationship between dancers in the scene, our Cross-entity Attention also utilizes the Scaled Dot-Product Attention as in the Transformer [3].

Figure 4. The Group Encoder learns to encode the relations among dancers through our proposed Cross-entity Attention mechanism.

In practice, we find that people having closer positions to each other tend to have higher correlation in their movement. Therefore, we adopt Spacial Encoding strategy to encode the spacial relationship between each pair of dancers. The Spacial Encoding between two entities based on their distance in the 3D space is defined as follows:

eij=exp(τiτj2dτ),e_{ij} = \exp\left(-\frac{\Vert \tau^i - \tau^j \Vert^2}{\sqrt{d_{\tau}}}\right),

where dτd_{\tau} is the dimension of the position vector τ\tau. Considering the query qiq^i, which represents the current entity information, and the key kjk^j, which represents other entity information, we inject the spatial relation information between these two entities onto their cross attention coefficient:

αij=softmax((qi)kjdk+eij).\alpha_{ij} = \text{softmax}\left(\frac{(q^i)^\top k^j}{\sqrt{d_k}} + e_{ij}\right).

To preserve the spatial relative information in the attentive representation, we also embed them into the hidden value vector and obtain the global-aware representation gig^i of the ii-th entity as follows:

gi=j=1Nαij(vj+eijγ),g^i = \sum_{j=1}^N\alpha_{ij}(v^j + e_{ij}\gamma),

where γRdv\gamma \in \mathbb{R}^{d_v} is the learnable bias and scaled by the Spacial Encoding. Intuitively, the Spacial Encoding acts as the bias in the attention weight, encouraging the interactivity and awareness to be higher between closer entities. Our attention mechanism can adaptively attend to each dancer and others in both temporal and spatial manner, thanks to the encoded motion as well as the spatial information.

We then fuse both the local and global motion representation by adding hih^i and gig^i to obtain the final latent motion ziz^i. Our final global-local representation of each entity is expected to carry the comprehensive information of their own past motion as well as the motion of every other entity, enabling the MLP Decoder to generate coherent group dancing sequences. Finally, we generate the next movement yti{y}^i_t based on the final motion representation ztiz^i_t as well as the hidden audio representation ata_t, and thus can capture the fine-grained correspondence between music feature sequence and dance movement sequence:

yti=MLP([zti;at]).y^i_t = \text{MLP}([z^i_t; a_t]).

Built upon these components, our model can effectively learn and generate coherent group dance animation given several pieces of music. In the next part, we will go through the experiments and detailed studies of the method.

References

[1] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multiperson linear model. ACM Trans. Graphics, 2015

[2] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. NIPS 2017.

[3] Hochreiter S, Schmidhuber J. Long short-term memory. Neural computation. 1997 Nov 15;9(8):1735-80.

Music-Driven Group Choreography (Part 1)

Dancing is an important part of human culture and remains one of the most expressive physical art and communication forms. With the rapid development of digital social media platforms, creating dancing videos has gained significant attention from social communities. As a result, millions of dancing videos are created and watched daily on online platforms. Recently, studies of how to create natural dancing motion from music have attracted great attention in the research community.

Figure 1. We demonstrate the AIOZ-GDANCE dataset with in-the-wild videos, music audio, and 3D group dance motion.

Nevertheless, generating dance motion for a group of dancers remains an open problem and have not been well-investigated by the community yet. Motivated by these shortcomings and to foster research on group choreography, we establish AIOZ-GDANCE, a new largescale in-the-wild dataset for music-driven group dance generation. Unlike existing datasets that only support single dance, our new dataset contains group dance videos as shown in Figure 1, hence supporting the study of group choreography. On the basis of the new dataset, we propose the first strong baseline for group dance generation that can jointly generate multiple dancing motions expressively and coherently.

Dataset Construction

Figure 2. The pipeline of making our AIOZ-GDANCE dataset.

In this section, we will elaborate and describe the process to build our dataset from a large variety of videos available on the internet. Because our main goal is to develop a large-scale dataset with in-the-wild videos, setting up a MoCap system as in many classical approaches is not feasible. However, manually creating 3D groundtruth for millions of frames from dancing videos is also an extremely costly and tedious job. To that end, we propose a semi-automatic labeling method with humans in the loop to obtain the 3D ground truth for our dataset. The process to construct the data includes the five following key steps:

  1. Video collection
  2. Human Tracking
  3. Human Pose Estimation
  4. Local Fitting for Invidual Motions
  5. Global Scene Optimization

Data Collection and Preprocessing

Figure 3. Human Tracking.

Video Collection. We collect the in-the-wild, public domain group dancing videos along with the music from Youtube, Tiktok, and Facebook. All group dance videos are processed at 1920 × 1080 resolution and 30FPS.

Human Tracking. We perform tracking for all humans in the videos using the state-of-the-art multi-object tracker [1] to obtain the tracking bounding boxes. Note that although the tracker can produce reasonable results, there are failure cases in some frames. Therefore, we manually correct the bounding box of the incorrect cases. This tracking correction is crucial since we want the trajectory of each person to be accurately tracked in order to reconstruct their motion in latter stages.

Pose Estimation. Given the bounding boxes of each person in the video, we leverage a state-of-the-art 2D pose estimation method [2] to generate the initial 2D poses for each person. In practice, there exist some inaccurately detected keypoints due to motion blur and partial occlusion. We manually fix the incorrect cases to obtain the 2D keypoints of each human bounding box.

Local Mesh Fitting

Figure 4. Local Mesh Fitting.

To construct 3D group dance motion, we first reconstruct the full body motion for each dancer by fitting the 3D mesh. We then jointly optimize all dancer motions to construct the globally-coherent group motion. Finally, we post-process and remove wrong cases from the optimization results.

We use SMPL model [3] to represent the 3D human. The SMPL model is a differentiable function that maps the pose parameters θ\mathbf{\theta}, the shape parameters β\mathbf{\beta}, and the root translation τ\mathbf{\tau} into a set of 3D human body mesh vertices VR6890×3\mathbf{V}\in \mathbb{R}^{6890\times3} and 3D joints XRJ×3\mathbf{X}\in \mathbb{R}^{J\times3}, where JJ is the number of body joints.

Our optimizing motion variables for each individual dancer consist of a sequence of SMPL joint angles {θt}t=1T\{\mathbf{\theta}_t\}_{t=1}^T, a sequence of the root translation {τt}t=1T\{\mathbf{\tau}_t\}_{t=1}^T, and a single SMPL shape parameter β\mathbf{\beta}. We fit the sequence of SMPL motion variables to the tracked 2D keypoints by extending SMPLify-X framework [4] across the whole video sequence:

Elocal=EJ+λθEθ+λβEβ+λSES+λFEFE_{\rm local} = E_{\rm J} + \lambda_{\theta}E_{\theta} + \lambda_{\beta} E_{\beta} + \lambda_{\rm S}E_{\rm S} + \lambda_{\rm F}E_{\rm F}

where:

  • EJE_{\rm J} is the 2D reprojection term between the 2D keypoints and the 2D projection of the corresponding 3D poses.
  • EθE_{\theta} is the pose prior term from the latent space of the VPoser model [4] to encourage plausible human pose.
  • EβE_{\beta} is the shape prior term to regularize the body shape towards the mean shape of the SMPL body model.
  • ES=t=1T1θt+1θt2+j=1Jt=1T1Xj,t+1Xj,t2E_{\rm S} = \sum_{t=1}^{T-1}\Vert \mathbf{\theta}_{t+1} - \mathbf{\theta}_{t} \Vert^2 + \sum_{j=1}^J\sum_{t=1}^{T-1}\Vert \mathbf{X}_{j,t+1} - \mathbf{X}_{j,t} \Vert^2 is the smoothness term to encourage the temporal smoothness of the motion.
  • EF=t=1T1jFcj,tXj,t+1Xj,t2E_{\rm F} = \sum_{t=1}^{T-1} \sum_{j \in \mathcal{F}} c_{j,t}\Vert \mathbf{X}_{j,t+1} - \mathbf{X}_{j,t} \Vert^2 is to ensure feet joints to stay stationary when in contact (zero velocity). Where F\mathcal{F} is the set of feet joint indexes, cj,tc_{j,t} is the feet contact of joint jj at time tt.

Global Optimization

Figure 5. Global Optimization.

Given the 3D motion sequence of each dancer pp: {θtp,τtp}\{\mathbf{\theta}^p_t, \mathbf{\tau}^p_t\}, we further resolve the motion trajectory problems in group dance by solving the following objective:

Eglobal=EJ+λpenEpen+λregpEreg(p)+λdepp,p,tEdep(p,p,t)+λgcpEgc(p)E_{\rm global} = E_{\rm J} + \lambda_{\rm pen}E_{\rm pen} + \lambda_{\rm reg}\sum_{p}E_{\rm reg}(p) + \lambda_{\rm dep}\sum_{p,p',t}E_{\rm dep}(p,p',t) + \lambda_{\rm gc}\sum_{p}E_{\rm gc}(p)

EpenE_{\rm pen} is the Signed Distance Function penetration term to prevent the overlapping of reconstructed motions between dancers.

Ereg(p)=t=1Tθtpθ^tp2{E_{\rm reg}(p) =\sum_{t=1}^T\Vert \mathbf{\theta}^p_t - \hat{\mathbf{\theta}}^p_t\Vert^2} is the regularization term that prevents the motion from deviating too much from the prior optimized individual motion {θ^tp}\{\hat{\mathbf{\theta}}^p_t\} obtained by optimizing the local mesh for dancer pp.

In practice, we find that the relative depth ordering of dancers in the scene can be inconsistent due to the ambiguity of the 2D projection. To ensure the group motion quality, we watch the videos and manually provide the ordinal depth relation information of all dancers in the scene at each frame tt as follows:

rt(p,p)={1,if dancer p is closer than p1,if dancer p is farther than p0,if their depths are roughly equalr_t(p,p') = \begin{cases} 1, &\text{if dancer } p \text{ is closer than } p' \\ -1, &\text{if dancer } p \text{ is farther than } p' \\ 0, &\text{if their depths are roughly equal} \end{cases}

Given the relative depth information, we derive the depth relation term EdepE_{\rm dep}. This term encourages consistent ordinal depth relation between the motion trajectories of multiple dancers, especially when dancers partially occlude each other:

Edep(p,p,t)={log(1+exp(ztpztp)),rt(p,p)=1log(1+exp(ztp+ztp)),rt(p,p)=1(ztpztp)2,rt(p,p)=0E_{\rm dep}(p,p',t) = \begin{cases} \log(1+\exp(z^p_t - z^{p'}_t)), &r_t(p,p')=1 \\ \log(1+\exp(-z^p_t + z^{p'}_t)), &r_t(p,p')=-1 \\ (z^p_t - z^{p'}_t)^2, &r_t(p,p')=0 \\ \end{cases}

where ztpz^p_t is the depth component of the root translation τtp\mathbf{\tau}^p_t of the person pp at frame tt. Intuitively, for r(p,p)=1r(p,p')=1, zpz_p should be smaller than zpz_{p'} and otherwise.

Finally, we apply the global ground contact constraint EgcE_{\rm gc} to further ensure consistency between the motion of every person and the environment based on the ground contact information. This contact term is also needed to reduce the artifacts such as foot-skating, jittering, and penetration under the ground.

Egc(p)=t=1T1jFcj,tpXj,t+1pXj,tp2+cj,tp(Xj,tpf)n2E_{\rm gc}(p) = \sum_{t=1}^{T-1} \sum_{j \in \mathcal{F}} c^p_{j,t}\Vert \mathbf{X}^p_{j,t+1} - \mathbf{X}^p_{j,t} \Vert^2 + c^p_{j,t} \Vert (\mathbf{X}^p_{j,t} - \mathbf{f})^\top \mathbf{n}^* \Vert^2

where F\mathcal{F} is the set of feet joint indexes, n\mathbf{n}^* is the estimated plane normal and f\mathbf{f} is a 3D fixed point on the ground plane. The first term in Equation~\ref{eq_Egc} is the zero velocity constraint when the feet are in contact with the ground, while the second term encourages the feet position to stay near the ground when in contact. To obtain the ground plane parameters, we initialize the plane point f\mathbf{f} as the weighted median of all contact feet positions. The plane normal n\mathbf{n}^* is obtained by optimizing a robust Huber objective:

n=argminnXfeetH((Xfeetf)nn)+nn12,\mathbf{n}^* = \arg\min_{\mathbf{n}} \sum_{\mathbf{X}_{\rm feet}} \mathcal{H}\left((\mathbf{X}_{\rm feet} - \mathbf{f})^\top \frac{\mathbf{n}}{\Vert\mathbf{n}\Vert}\right) + \Vert \mathbf{n}^\top\mathbf{n} - 1 \Vert^2,

where H\mathcal{H} is the Huber loss function, Xfeet\mathbf{X}_{\rm feet} is the 3D feet positions of all dancers across the whole sequence that are labelled as in contact (i.e., cj,tp=1c^p_{j,t} = 1) .

How will AIOZ-GDANCE be useful to the community?

We bring up some interesting research directions that can be benefited from our dataset:

  • Group Dance Generation
  • Human Pose Tracking
  • Dance Education
  • Dance style transfer
  • Human behavior analysis

While single-person choreography is a hot research topic recently, group dance generation has not yet well investigated. We hope that the release of our dataset will foster more this research direction.

References

[1] Peize Sun, Jinkun Cao, Yi Jiang, Zehuan Yuan, Song Bai, Kris Kitani, and Ping Luo. Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In CVPR, 2022

[2] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.

[3] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multiperson linear model. ACM Trans. Graphics, 2015

[4] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In CVPR, 2019.