
Controllable Group Choreography using Contrastive Diffusion (Part 4)

Dive into the experimental details of controllable group choreography

In the previous part, we explored the theory of the Music-Motion Transformer. In this part, let us investigate the experimental setup used to verify its effectiveness, along with the quality comparison results.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Experiments

1.1 Implementation Details

The hidden layer of all the MLPs consists of 512 units followed by GELU activation. The hidden dimension of all attention layers is set to $d=512$, and the attention adopts a multi-head scheme with 8 attention heads. We also use a feature-wise linear modulation (FiLM) after each attention layer to strengthen the influence of the conditioning context. At the end of each attention block, we append a 2-layer feed-forward network with a feed-forward size of 1024 to enhance the expressivity of the learned features. We extract the features from the raw audio signal by leveraging the representations from the frozen Jukebox, a pre-trained generative model for music, to enhance the model's generalization ability to several kinds of in-the-wild music. In total, the Group Diffusion Denoising Network is comprised of $L=5$ stacked Music-Motion Transformer and Group Global Attention blocks, along with 2 transformer encoder layers to encode the music features. We implement the architecture of the Contrastive Encoder similarly to the Denoising Network but without cross attention since it does not take the music as input. The output sequence of the Contrastive Encoder is then averaged out and fed into an output layer with one unit. We also make the Contrastive Encoder aware of the current step in the diffusion chain by appending the diffusion timestep embedding to the motion sequence so that it can provide correct guidance signals in the sampling process. Overall, our model has approximately 62M trainable parameters.
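To make the block structure concrete, below is a minimal PyTorch sketch of one attention block following the description above (self-attention, cross-attention to the music features, FiLM modulation, and a 2-layer feed-forward network). The class and argument names are illustrative assumptions, not the actual GCD implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation conditioned on a context vector."""
    def __init__(self, d_model: int, d_cond: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(d_cond, 2 * d_model)

    def forward(self, h, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return h * (1 + scale) + shift

class MusicMotionBlock(nn.Module):
    """One simplified attention block: self-attn + cross-attn + FiLM + FFN."""
    def __init__(self, d_model=512, n_heads=8, d_ff=1024, d_cond=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.film1 = FiLM(d_model, d_cond)
        self.film2 = FiLM(d_model, d_cond)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, music, cond):
        # x: (B, T, d) motion tokens, music: (B, T_m, d) music tokens, cond: (B, d_cond)
        h, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.film1(h, cond.unsqueeze(1)))
        h, _ = self.cross_attn(x, music, music)
        x = self.norm2(x + self.film2(h, cond.unsqueeze(1)))
        return self.norm3(x + self.ff(x))
```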

1.2 Training

To train the denoising diffusion network, we use the "simple" objective as:

$$\mathcal{L}_{\rm simple} = \mathbb{E}_{x_0\sim q(x_0|c),\, m\sim [1,M]} \left[\Vert x_0 - \mathcal{G}_\theta(x_m,m,c) \Vert^2_2\right]$$
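In code, this objective amounts to noising a clean motion to a random timestep and regressing the clean motion back. The sketch below assumes hypothetical helpers (`diffusion.q_sample` for forward noising, a `denoiser` callable for $\mathcal{G}_\theta$) purely for illustration.

```python
import torch
import torch.nn.functional as F

def simple_loss(denoiser, diffusion, x0, cond, music):
    # x0: (B, T, D) clean group motion, cond: conditioning context c, music: music features
    B = x0.shape[0]
    m = torch.randint(0, diffusion.num_steps, (B,), device=x0.device)  # random timestep (0-indexed)
    x_m = diffusion.q_sample(x0, m)             # forward-noise x_0 to step m
    x0_hat = denoiser(x_m, m, cond, music)      # G_theta(x_m, m, c): predict the clean motion
    return F.mse_loss(x0_hat, x0)               # || x_0 - G_theta(x_m, m, c) ||_2^2
```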

To improve the physical plausibility and prevent artifacts of the generated motion, we also utilize auxiliary geometric losses.

$$\mathcal{L}_{\rm geo} = \lambda_{\rm pos}\mathcal{L}_{\rm pos} + \lambda_{\rm vel}\mathcal{L}_{\rm vel} + \lambda_{\rm foot}\mathcal{L}_{\rm foot}$$

In particular, the geometric losses consist of (i) a joint position loss $\mathcal{L}_{\rm pos}$ to better constrain the global joint hierarchy via forward kinematics; (ii) a velocity loss $\mathcal{L}_{\rm vel}$ to increase the smoothness and naturalness of the motion by penalizing the difference between the velocities of the ground-truth and the predicted motions; and (iii) a foot contact loss $\mathcal{L}_{\rm foot}$ to mitigate foot skating artifacts and improve the realism of the generated motions by ensuring that the feet stay stationary when ground contact occurs.
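The following sketch illustrates one common way to compute these three terms. The forward-kinematics function `fk`, the foot joint indices, and the binary ground-contact mask are assumptions for illustration rather than the exact GCD implementation.

```python
import torch
import torch.nn.functional as F

FOOT_IDX = [7, 8, 10, 11]  # hypothetical indices of the ankle/toe joints

def geometric_losses(pred, target, fk, contact_mask):
    # pred, target: (B, T, D) pose parameters; contact_mask: (B, T, len(FOOT_IDX)) ground-truth contacts
    joints_pred, joints_gt = fk(pred), fk(target)           # forward kinematics -> (B, T, J, 3)
    l_pos = F.mse_loss(joints_pred, joints_gt)               # joint position loss

    vel_pred = joints_pred[:, 1:] - joints_pred[:, :-1]      # frame-wise velocities
    vel_gt = joints_gt[:, 1:] - joints_gt[:, :-1]
    l_vel = F.mse_loss(vel_pred, vel_gt)                     # velocity loss

    # foot contact loss: penalize foot velocity whenever the foot is marked as in contact
    foot_vel = vel_pred[:, :, FOOT_IDX]                      # (B, T-1, 4, 3)
    l_foot = (foot_vel.pow(2).sum(-1) * contact_mask[:, 1:]).mean()
    return l_pos, l_vel, l_foot
```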

Our total training objective is the combination of the "simple" diffusion objective, the auxiliary geometric losses, and the contrastive loss:

$$\mathcal{L} = \mathcal{L}_{\rm simple} + \mathcal{L}_{\rm geo} + \lambda_{\rm nce}\mathcal{L}_{\rm nce}$$

We train our model on 4 NVIDIA V100 GPUs using the Adam optimizer with a learning rate of $1\mathrm{e}{-4}$ and a batch size of 64 per GPU, which took about 7 days for 500k iterations. The models are trained with $M=1000$ diffusion noising steps and a cosine noise schedule. During training, group dance motions are randomly sampled with sequence length $T=150$ at 30 Hz, which corresponds to 5-second pieces of music. For the geometric losses, the loss weights are empirically set to $\lambda_{\rm pos}=1.0$, $\lambda_{\rm vel}=1.0$, and $\lambda_{\rm foot}=0.005$, respectively. For the contrastive loss $\mathcal{L}_{\rm nce}$, its weight is $\lambda_{\rm nce} = 0.001$, the probability of replacing dancers to form negative sequences is 0.5, and the number of negative samples is empirically set to 10.
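For reference, the hyper-parameters above can be collected into a single configuration, and the cosine noise schedule can be implemented in the standard way (Nichol & Dhariwal, 2021). The numbers below come from the text; the code itself is an illustrative assumption.

```python
import math
import torch

CONFIG = dict(
    lr=1e-4, batch_size_per_gpu=64, gpus=4, iterations=500_000,
    diffusion_steps=1000, seq_len=150, fps=30,
    lambda_pos=1.0, lambda_vel=1.0, lambda_foot=0.005,
    lambda_nce=0.001, neg_replace_prob=0.5, num_negatives=10,
)

def cosine_betas(M: int, s: float = 0.008) -> torch.Tensor:
    """Standard cosine noise schedule returning the per-step betas."""
    t = torch.linspace(0, M, M + 1) / M
    alphas_bar = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    alphas_bar = alphas_bar / alphas_bar[0]
    betas = 1 - alphas_bar[1:] / alphas_bar[:-1]
    return betas.clamp(max=0.999)
```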

The hidden motion features $h$ are additionally normalized and modulated with learned affine parameters:

$$\tilde{h} = S(w) \cdot \frac{h-\mu(h)}{\sigma(h)} + b(w)$$

where each channel of the whole activation sequence is first normalized separately by calculating the mean $\mu(h)$ and standard deviation $\sigma(h)$, and then scaled and biased using the predicted affine parameters $S(w)$ and $b(w)$. Intuitively, this operation shifts the activated hidden motion features of each individual motion towards a unified group representation to further encourage the association between them. Finally, the output features are projected back to the original motion dimensions via a linear layer to obtain the predicted outputs $\hat{x}_0$.
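A minimal sketch of this modulation is shown below: the hidden sequence is normalized per channel over the time axis, then scaled and shifted with affine parameters predicted from the conditioning vector $w$. The module name and the shape of $w$ are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GroupModulation(nn.Module):
    def __init__(self, d_model: int, d_w: int, eps: float = 1e-5):
        super().__init__()
        self.to_affine = nn.Linear(d_w, 2 * d_model)   # predicts S(w) and b(w)
        self.eps = eps

    def forward(self, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d_model) hidden motion features, w: (B, d_w) conditioning vector
        S, b = self.to_affine(w).unsqueeze(1).chunk(2, dim=-1)
        mu = h.mean(dim=1, keepdim=True)               # per-channel mean over the sequence
        sigma = h.std(dim=1, keepdim=True)             # per-channel std over the sequence
        return S * (h - mu) / (sigma + self.eps) + b   # h_tilde = S(w) * (h - mu) / sigma + b(w)
```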

1.3 Testing

At test time, we use the DDIM sampling technique with 50 steps to accelerate the sampling speed of the reverse diffusion process. Accordingly, our model can achieve real-time generation at 30 Hz on a single RTX 2080Ti GPU (excluding the music feature extraction step), thanks to the parallelization of the Transformer architecture.
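The sketch below shows the standard deterministic DDIM update with an x0-predicting denoiser, run over a strided subset of the 1000 training timesteps (50 here). The function signature and schedule handling are assumptions, not the released code.

```python
import torch

@torch.no_grad()
def ddim_sample(denoiser, alphas_bar, shape, cond, music, steps=50, device="cuda"):
    # alphas_bar: (M,) cumulative products of (1 - beta) from the training noise schedule
    alphas_bar = alphas_bar.to(device)
    M = alphas_bar.shape[0]
    timesteps = torch.linspace(M - 1, 0, steps, device=device).long()
    x = torch.randn(shape, device=device)                        # start from pure noise
    for i, m in enumerate(timesteps):
        a_t = alphas_bar[m]
        a_prev = alphas_bar[timesteps[i + 1]] if i + 1 < steps else torch.ones((), device=device)
        x0_hat = denoiser(x, m.expand(shape[0]), cond, music)     # predict the clean motion x_0
        eps = (x - a_t.sqrt() * x0_hat) / (1 - a_t).sqrt()        # noise implied by the prediction
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps    # deterministic DDIM update (eta = 0)
    return x
```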

To enable long-term generation, we divide the input music sequence into multiple overlapping chunks, with each chunk having a maximum window size of 5 seconds and overlapped by half with the adjacent chunk. The group dance motions are then generated for each chunk along with the corresponding audio. Subsequently, we merge the outputs by blending the overlapped region between two consecutive chunks using spherical linear interpolation, with the interpolation weight gradually decaying from the current chunk to the next chunk. However, for group choreography synthesis, our model generates dance motions for each dancer in random order. Therefore, we need to establish correspondences between dancers across the chunks (i.e., identifying which one of the $N$ dancers in the next chunk corresponds to a dancer in the current chunk). To accomplish this, we organize all dancers in the current chunk into one set and the dancers in the next chunk into another set, forming a bipartite graph between the two chunks. We can then utilize the Hungarian algorithm to find the optimal matching, where the Euclidean distance between the two pose sequences serves as the matching weights. Our blending technique is applied at each step of the diffusion sampling process, starting from pure noise, thus it allows the model to gradually denoise the chunks to make them compatible for blending.
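The dancer matching step can be written compactly with `scipy.optimize.linear_sum_assignment`. The sketch below is an illustrative assumption of the interface: it matches dancers between two overlapping chunks by Euclidean distance, and shows a simple linear cross-fade over the overlap (in practice spherical linear interpolation is used for the rotation components, as described above).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_dancers(chunk_a: np.ndarray, chunk_b: np.ndarray) -> np.ndarray:
    # chunk_a, chunk_b: (N, T_overlap, D) pose sequences over the overlapped region
    N = chunk_a.shape[0]
    cost = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            cost[i, j] = np.linalg.norm(chunk_a[i] - chunk_b[j])  # Euclidean distance as matching weight
    _, col_ind = linear_sum_assignment(cost)   # Hungarian algorithm: optimal bipartite matching
    return col_ind                             # col_ind[i] = index of the dancer in chunk_b matched to dancer i

def blend_overlap(chunk_a: np.ndarray, chunk_b: np.ndarray, order: np.ndarray) -> np.ndarray:
    # cross-fade the overlapped region, weight decaying from the current chunk to the next one
    T = chunk_a.shape[1]
    w = np.linspace(1.0, 0.0, T).reshape(1, T, 1)
    return w * chunk_a + (1 - w) * chunk_b[order]
```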

2. Experimental Settings

2.1 Dataset

We use the AIOZ-GDance dataset in our experiments. AIOZ-GDance is a large-scale group dance dataset consisting of paired music and 3D group motions captured from in-the-wild videos using a semi-automatic method, covering 7 dance styles and 16 music genres.

2.2 Evaluation Protocol

We use the following metrics to evaluate the quality of individual dance motions: Frechet Inception Distance (FID), Motion-Music Consistency (MMC), Generation Diversity (GenDiv), and Physical Foot Contact score (PFC). Concretely, the FID score measures the realism of individual dance movements against the ground-truth dances. The MMC evaluates the matching similarity between the motion and the music beats, i.e., how well the generated dances follow the beat of the music. The Generation Diversity (GenDiv) is evaluated as the average pairwise distance of the kinetic features of the motions. The PFC evaluates the physical plausibility of the foot movements by calculating the agreement between the acceleration of the character's center of mass and the foot's velocity.
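As a small illustration of the GenDiv definition above (average pairwise distance of kinetic features), a rough sketch is given below; the kinetic feature extractor itself is assumed to be provided elsewhere.

```python
import numpy as np

def generation_diversity(kinetic_feats: np.ndarray) -> float:
    # kinetic_feats: (S, F) one kinetic feature vector per generated dance sequence
    S = kinetic_feats.shape[0]
    dists = [
        np.linalg.norm(kinetic_feats[i] - kinetic_feats[j])
        for i in range(S) for j in range(i + 1, S)
    ]
    return float(np.mean(dists))   # average pairwise distance = diversity score
```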

To evaluate the group dance quality, we follow three metrics: Group Motion Realism (GMR), Group Motion Correlation (GMC), and Trajectory Intersection Frequency (TIF). In general, the GMR measures the realism between generated and ground-truth group motions by calculating the Frechet Inception Distance on the extracted group motion features. The GMC evaluates the synchrony between dancers within the generated group by calculating their cross-correlation. The TIF measures how often the generated dancers collide with each other in their dance movements.
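To give an intuition for the GMC, the sketch below computes a cross-correlation style synchrony score between the dancers of one generated group; the exact features and normalization used in the benchmark may differ, so treat this only as a conceptual example.

```python
import numpy as np

def group_correlation(motions: np.ndarray) -> float:
    # motions: (N, T, D) generated poses for the N dancers of one group
    N = motions.shape[0]
    flat = motions.reshape(N, -1)
    flat = (flat - flat.mean(axis=1, keepdims=True)) / (flat.std(axis=1, keepdims=True) + 1e-8)
    corrs = [
        float(np.dot(flat[i], flat[j]) / flat.shape[1])   # normalized cross-correlation of a pair
        for i in range(N) for j in range(i + 1, N)
    ]
    return float(np.mean(corrs))   # higher = more synchronized dancers
```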

2.3 Baselines

We compare our GCD method with several recent approaches to music-driven dance generation: FACT, Transflower, and EDGE, all of which are adapted for benchmarking in the context of group dance generation since the original methods were designed for single-dancer generation. We also evaluate against GDanceR, a recent model specifically designed for generating group choreography.

3. Quality Comparison

Table 1 shows a comparison among the baselines FACT, Transflower, EDGE, GDanceR, and our proposed GCD. The results clearly demonstrate that our default model setting with "neutral" mode outperforms the baselines significantly across all evaluations. We also observe that EDGE, a recent diffusion dance generation model, can yield very competitive performance on single-dance metrics (FID, MMC, GenDiv, and PFC). This suggests the advantages of diffusion approaches in motion generation tasks. However, it is still inferior to our model under several group dance metrics, showing the limitations of single dance methods in the context of group dance creation. Experimental results highlight the effectiveness of our approach in generating high-quality group dance motions.

Table 1. Performance comparison. High Consistency: $\gamma=1$; High Diversity: $\gamma=-1$; Neutral: $\gamma=0$.

To complement the quantitative analysis, we present qualitative examples from FACT, GDanceR, and our GCD method in Figure 1. Notably, FACT struggles to deal with the intersection problem, which is reasonable given that it was not originally designed for group dance generation. As a result, the generated motions from FACT lack coordination and synchronization in most cases. While GDanceR shows improvements in motion quality compared to FACT, its generated motions often appear floating, unnatural, and sometimes unsynchronized. These drawbacks indicate that GDanceR's approach to group choreography still requires further refinement to produce consistent and cohesive movements among the dancers.

Figure 1. Comparison between different dance generation methods on group dance generation.

In contrast, our method excels in both controlling consistency and promoting diversity among the generated group dance motions. The outputs from our method demonstrate well-coordinated and realistic movements, implying that it can resolve the challenges of maintaining group coherence while delivering visually appealing results more effectively.

Overall, the quantitative analysis and visual comparisons reaffirm the superior performance of our proposed GCD in generating high-quality, synchronized, and visually pleasing group dance motions.

Next

In the next post, we will present our Consistency and Diversity Analysis.
