6 posts tagged with "dance generation"

View All Tags

Controllable Group Choreography using Contrastive Diffusion (Part 6)

In previous part, we have discovered the qualitative and quantitative expriments. In this part, let investigate ablation studies and user studies.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Ablation Analysis

Loss Terms. The contribution of the geometric loss Lgeo\mathcal{L}_{\rm geo} and contrastive loss Lnce\mathcal{L}_{\rm nce} in GCD is thoroughly analyzed and presented in Table 1. The results demonstrate that both losses play a crucial role in enhancing the overall performance across all four evaluation protocols. In particular, it can be seen that the effect of Lgeo\mathcal{L}_{\rm geo} on realism metrics (FID, GMR, and PFC) is significant. This observation can be attributed to the fact that this loss improves the physical plausibility and naturalness of the dance motions, empirically mitigating common artifacts such as jittery motion or foot skating. By enforcing the geometric constraints, GCD can generate faithful motions that are on par with real dances. Moreover, the contrastive loss Lnce\mathcal{L}_{\rm nce} contributes positively to the favorable results in the synchrony measures (GMR and GMC). This loss term encourages the model to synchronize the movements of multiple dancers within a group, thus improving the harmonious coordination and cohesion of the generated choreographies. In general, the results of Lgeo\mathcal{L}_{\rm geo} and Lnce\mathcal{L}_{\rm nce} validate their importance across various evaluation metrics.

Table. 1. Global module contribution and loss analysis. Experiments are conducted on GCD with γ=0\gamma = 0 (neutral mode).

Group Global Attention. Results presented in first two lines of Table 1 demonstrate substantial improvements obtained by incorporating Group Global Attention into GCD. It clearly shows that without the Group Global Attention, the performance on the group dance metrics (GMR and GMC) is significantly degraded. We also observe that the removal of this block resulted in inconsistent movements across dancers, where they seem to dance in freestyle without any group choreographic rules and collide with each other in many cases, although they may still follow the rhythm of the music. Results suggest the vital importance of ensuring coherency and regulating collisions for visually appealing group dance animations.

2. User Study

Qualitative user studies are important for evaluating generative models as the perception of users tends to be the most relevant metric for many downstream applications. Therefore, we conduct user studies to evaluate our approach in terms of group choreography generation. We organized two separate studies and enlisted roughly 50 individuals with diverse backgrounds to participate in our experiment. Each participant should have some relevant experience in music and dance (at least 1 month of studying or working in dance-related professions). The age of participants varied between 20 and 50, with approximately 55\% female and 45\% male.

In the initial study, we requested the participants to evaluate the dancing animations based on three criteria: the naturalness of the dancing motions (Realism), how well the movements match the music (Music-Motion Correspondence), and how well the dancers interact or synchronize with each other (Synchronization between Dancers). Participants were asked to rate scores from 0 to 10 for each criterion, ranging from 0-very poor, 5-acceptable, to 10-very good. The collected scores were then normalized to range [0,1][0,1].

This user study encompassed a total of 1893189 * 3 samples with songs that are not present in the train set, including those generated from GDanceR, real dance clips from the dataset, and generated results from our proposed method in neutral mode. Figure.1 shows average scores for all mentioned targets across three experiments. Notably, the ratings of our method are significantly higher than GDanceR across all three criteria. We also perform Tukey honest significance tests to determine the significant differences among the three methods. For the first two criteria (Realism and Music-Motion Correspondence), we observe that the mean scores of all methods are significantly different with p<0.05p < 0.05. For synchronization critera, the differences are significant except for the scores between our method and real dances (p0.07p\approx0.07). This highlights that our method can even achieve comparable scores with real dances, especially in the synchronization evaluation. This can be attributed to the proposed contrastive diffusion strategy, which can effectively maintain a balance between the consistency of the movements and the group/audio context, as well as diversity in generated dances.

Figure. 1. User study results in three criteria: Realism; Music-Motion Correspondence; and Synchronization between Dancers

In the second study, we aim to assess the diversity and consistency of the generated dance outputs and determine if they met the expectations of the users. Specifically, participants were asked to assign scores ranging from 0 to 10 to evaluate the consistency and diversity of each dance clip, i.e., how synchronized or how distinctive movements between dancers does the group dance present. A lower score indicated higher consistency, while a higher score indicated greater diversity. These scores were subsequently normalized to [1,1][-1,1] to align with the studying range of the control parameter employed in our proposed method.

Figure 2 depicts a scatter plot illustrating the relationship between the scores provided by the participants and the γ\gamma parameter that was used to generate the dance samples. The parameter values were randomly drawn from a uniform distribution with range [1,1][-1,1] to create the animations along with randomly sampled musical pieces. The survey shows a strong correlation between the user scores and the control parameter, in which we calculated the correlation coefficient to be approximately 0.88. The results indicate that the diversity and consistency level of the generated group choreography samples is mostly in agreement with the user evaluation, as indicated by the scores obtained.

Figure. 2. Correlation between the controlling consistency/diversity and the scores provided by the users.

3. DISCUSSION AND CONCLUSION

While controlling consistency and diversity in group dance generation by using our proposed GCD has numerous advantages and potentials, there are certain limitations. Firstly, it requires tuning the parameters and a complex system that is not trivial to train, to ensure that the generated dance motions can produce the desired level of similarity among dancers while still presenting enough variation to avoid repetitive or monotonous movements. This may involve long inference processes and may require significant computational resources in both the training and testing phases.

Secondly, over-controlling consistency and diversity may introduce constraints on the creative freedom of generated dances. While enforcing consistency can lead to synchronized and harmonious group movements, it may limit the possibility of exploring unconventional or new experimental dance styles. On the other hand, promoting diversity results in unique and innovative dance sequences, but it may sacrifice coherence and coordination among dancers.

{Although our model can synthesize semantically faithful group dance animation with effective coordination among dancers, it does not capture clear physical contact between dancers such as hand touching. This is because the data we used in training does not contain such detailed hand motion information. We think that exploring group dance with realistic physical hand interactions is a promising area for future work. Additionally, while our method offers a trade-off between diversity and consistency, achieving perfect alignment between high-diversity movements and music remains a challenging task. The diversity level among dancers and the alignment with music are also heavily influenced by the training data. We believe further efforts are required to reach this.}

Lastly, the subjective nature of evaluating consistency and diversity poses a challenge. Metrics for measuring these aspects may not be best fitted. We believe it is essential to consider diverse perspectives and demand domain experts to validate the effectiveness and quality of the generated dance motions.

To conclude, we have introduced GCD, a new method for audio-driven group dance generation that effectively controls the consistency and diversity of generated choreographies. By using contrastive diffusion along with the guidance technique, our approach enables the generation of a flexible number of dancers and long-term group dances without compromising fidelity. Through our experiments, we have demonstrated the capability of GCD to produce visually appealing and synchronized group dance motions. The results of our evaluation, including comparisons with existing methods, highlight the superior performance of our method across various metrics including realism and synchronization. By enabling control over the desired levels of consistency and diversity while preserving fidelity, our work has the potential for applications in entertainment, virtual performances, and artistic expression, advancing the effectiveness of deep learning in generative choreography.

Controllable Group Choreography using Contrastive Diffusion (Part 5)

In previous part, we have discovered the experimental setups and quality analysis. In this part, let investigate consistency and diversity factors in dance generation.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Diversity and Consistency Analysis

Table.1 presents an in-depth analysis of our method's performance across seven evaluation metrics by adjusting the parameter γ\gamma to control the consistency and diversity of the generated group choreographies. The findings reveal that our GCD with high consistency setting (γ=1\gamma = 1), performs better than other settings in terms of MMC, GMC, and TIR metrics, whereas the high diversity setting (γ=1\gamma=-1) achieves better results in the GenDiv metric. Meanwhile, the default model shows the best performance in both realism metrics (FID and GMR). {It can also be seen that the model is relatively robust to the physical plausibility score (PFC) as there are no noticeable differences among the metric in the three settings. This implies that our model is able to create group animation with different consistency or diversity levels without compromising the plausibility of the movements too much.} More interestingly, we found that there are indeed positive correlations between the two measures MMC, GMR, and the trade-off parameter. It is clear that these metrics are better when the consistency level increases. This is reasonable as we expect higher correspondence between the motion and the music (MMC) or higher correlation of the group motions (GMR) when the consistency level grows, which also agrees with the definition of these metrics.

Table. 1. Performance comparison. High Consistency: parameter γ=1\gamma=1; High Diversity: parameter γ=1\gamma=-1; Neutral: parameter γ=0\gamma=0

Consistency setups in GCD lead to more similar movements between dancers. As a result, this similarity contributes to high scores in MMC, GMR, and TIR metrics. In contrast, the diversity setups can synthesize more complex motions with greater variation between dancers as measured by the GenDiv metric, but this also makes it more challenging to reach high values of FID, MMC, and TIR, compared with other setups. In addition, Figure 1 shows an example of correlations between motion beats and music beats under high consistency and diversity settings. The music beats are extracted using the beat tracking algorithm from the Librosa library. Notably, the velocity curves in high consistency setting display relatively similar shapes, whereas in the high diversity scenario, the curves are clearly distinguished among dancers.

Figure. 1. Correlation between the motion and music beats. The solid curve represents the kinetic velocity of each dancer over time and the vertical dashed line depicts the music beats of the sequence. The motion beats can be detected as the local extrema from the kinetic velocity curve.

Despite the greater variations in high diversity setting, we can observe that the generated motions are matched with the music as the music beats are mostly located near the extrema of the motion curves in both settings. The experiment indicates that our model can faithfully capture different aspects of group choreography with different settings, including diversity and synchrony of the motions. This demonstrates the potential of our method towards various dance applications such as dance training or composing. Furthermore, our method can also produce distinctively different animation sequences under the same setting while adhering to the input music.
It is also important to note that all three setups of GCD significantly outperform other baseline models. This verifies the effectiveness of our proposed approach and shows that it can create high-fidelity group dance animations in any setting. For a more detailed visualization of the results, please refer to Figure 2.

Figure. 2. Consistency and diversity trade-off. High Consistency: parameter γ=1\gamma=1; High Diversity: parameter γ=1\gamma=-1; Neutral: parameter γ=0\gamma=0.

2. Number of Dancers Analysis

Table 2 provides insights into results obtained when generating arbitrarily different numbers of dancers using our proposed GCD in the neutral setting. In general, FID, GMR, and GMC metrics do not exhibit a clearly strong correlation with the number of generated dancers but display diverse and varied results. The MMC metric consistently shows its stability across all setups.

Table. 2. Performance of group dance generation methods when we increase the number of generated dancers, compared with GDanceR. In GCD setup, Neutral mode with γ=0\gamma = 0 is used.

As the number of generated dancers increases, the generation diversity (GenDiv) decreases while the trajectory intersection frequency (TIF) increases. However, it is worth noting that the differences observed in these metrics are relatively minor compared to those produced by GDanceR. This implies that our method can effectively control consistency and diversity, significantly reducing the chances of collisions between dancers and maintaining the overall quality of generated group dance motions.

For a detailed visualization of results, please refer to Figure 3. These results underscore the robustness and flexibility of our method in generating group dance motions across varying numbers of dancers while ensuring consistency, diversity, and avoiding collisions between performers.

Figure. 3. Group dance generation results of GCD in terms of different numbers of dancers.

3. Long-term Analysis

To evaluate the efficacy of the guidance signal in GCD for creating long-term group dance sequences, we conducted a comparative analysis between GCD and the baseline model GDanceR. The experiment involved musical pieces of different durations: 15 seconds, 30 seconds and 60 seconds. We show the results with the guidance parameter γ=0.5\gamma = 0.5 to enforce consistency with the music over a long duration. For more detailed information, please refer to Figure 5 and our supplementary video.

Figure. 5. Long-term results of the 60-second clip. For clearer visualization, please visit our demo video.

While both methods produce satisfactory results in the first few seconds of the animations (e.g., about 5-6 seconds), GDanceR starts to exhibit floating and unrealistic movements or freeze into a mean pose in the later period of the sequence. Figure 4 shows the Motion changes comparison between our GCD method and GdanceR. The motion change magnitudes are calculated as average differences of the kinetic features between consecutive frames. It is evident that the motion change magnitude of GDanceR is gradually lower and approaching zero in the later half of the 60-second music piece, whereas our method can preserve high magnitudes and variations over time. This is because GDanceR generates almost frozen dance choreographies during this period. In contrast, the group dance motions produced by GCD remain natural with diverse movements throughout the entire duration of all music samples.

Figure. 4. Motion changes comparison between our proposed GCD and GDanceR. The experiment is conducted on generated group dance results of 60-second music pieces..

These findings confirm that our approach can effectively address the problem of motion generation in long-horizon group dance scenarios. It maintains the motion quality and dynamics of the dance motions, ensuring that the created animations remain visually appealing throughout extended periods. This highlights the advantage of the contrastive strategy to enhance the consistency of the movements of dancers with their group and the music, resulting in significant improvements for long-term dance sequence generation compared to the baseline GDanceR.

Next

In the next post, we will consider Ablation Studies and User Studies.

Controllable Group Choreography using Contrastive Diffusion (Part 4)

In previous part, we have discovered the theory of Music-Motion Transformer. In this part, let investigate the experiment setups for its effectiveness verification, as well as qualyty experimental results.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Experiments

1.1 Implementation Details

The hidden layer of all the MLPs consists of 512512 units followed by GELU activation. The hidden dimension of all attention layers is set to d=512d=512, and the attention adopts a multi-head scheme with 88 attention heads. We also use a feature-wise linear modulation (FiLM) after each attention layer to strengthen the influence of the conditioning context. At the end of each attention block, we append a 22-layer feed-forward network with a feed-forward size of 10241024 to enhance the expressivity of the learned features. We extract the features from the raw audio signal by leveraging the representations from the frozen Jukebox, a pre-trained generative model for music, to enhance the model's generalization ability to several kinds of in-the-wild music. In total, the Group Diffusion Denoising Network is comprised of L=5L=5 stacked Music-Motion Transformer and Group Global Attention blocks, along with 22 transformer encoder layers to encode the music features. We implement the architecture of the Contrastive Encoder similarly to the Denoising Network but without cross attention since it does not take the music as input. The output sequence of the Contrastive Encoder is then averaged out and fed into an output layer with one unit. We also make the Contrastive Encoder aware of the current step in the diffusion chain by appending the diffusion timestep embedding to the motion sequence so that it can provide correct guidance signals in the sampling process. Overall, our model has approximately 62M62M trainable parameters.

1.2 Training

To train the denoising diffusion network, we use the "simple" objective as:

Lsimple=Ex0q(x0c),m[1,M][x0Gθ(xm,m,c)22]\mathcal{L}_{\rm simple} = \mathbb{E}_{x_0\sim q(x_0|c), m\sim [1,M]} \left[\Vert x_0 - \mathcal{G}_\theta(x_m,m,c) \Vert^2_2\right]

To improve the physical plausibility and prevent artifacts of the generated motion, we also utilize auxiliary geometric losses.

Lgeo=λposLpos+λvelLvel+λfootLfoot\mathcal{L}_{\rm geo} = \lambda_{\rm pos}\mathcal{L}_{\rm pos} + \lambda_{vel}\mathcal{L}_{\rm vel} + \lambda_{\rm foot}\mathcal{L}_{\rm foot}

In particular, geometric losses mainly consist of (i) a joint position loss Lpos\mathcal{L}_{\rm pos} to better constrain the global joint hierarchy via forward kinematics; (ii) a velocity loss Lvel\mathcal{L}_{\rm vel} to increase the smoothness and naturalness of the motion by penalizing the difference between the differences between the velocities of the ground-truth and predicted motions; and (iii) a foot contact loss Lfoot\mathcal{L}_{\rm foot} to mitigate foot skating artifacts and improve the realism of the generated motions by ensuring the feet to stay stationary when ground contact occurs.

Our total training objective is the combination of the "simple" diffusion objective, the auxiliary geometric losses, and the contrastive loss:

L=Lsimple+Lgeo+λnceLnce\mathcal{L} = \mathcal{L}_{\rm simple} + \mathcal{L}_{\rm geo} + \lambda_{\rm nce}\mathcal{L}_{\rm nce}

We train our model on 44 NVIDIA V100 GPUs using Adam optimizer with a learning rate of 1e41\mathrm{e}{-4} and a batch size of 6464 per GPU, which took about 77 days for 500500k iterations. The models are trained with M=1000M=1000 diffusion noising steps and a cosine noise schedule. During training, group dance motions are randomly sampled with sequence length T=150T=150 at 3030 Hz, which corresponds to 55-second pieces of music. For geometric losses, the loss weights are empirically set to λpos=1.0\lambda_{\rm pos}=1.0, λsmooth=1.0\lambda_{\rm smooth}=1.0, and λfoot=0.005\lambda_{\rm foot}=0.005, respectively. For the contrastive loss Lnce\mathcal{L}_{\rm nce}, its weight is λnce=0.001\lambda_{\rm nce} = 0.001, the probability of replacing dancers for negative sequences is 0.50.5, and the number of negative samples empirically is selected to 1010.

h~=S(w)hμ(h)σ(h)+b(w)\tilde{h} = S(w) * \frac{h-\mu(h)}{\sigma(h)}+ b(w)

where each channel of the whole activation sequence is first normalized separately by calculating the mean μ\mu and σ\sigma, and then scaled and biased using the outputted affine parameters S(w)S(w) and b(w)b(w). Intuitively, this operation shifts the activated hidden motion features of each individual motion towards a unified group representation to further encourage the association between them. Finally, the output features are then projected back to the original motion dimensions via a linear layer, to obtain the predicted outputs x^0\hat{x}_0.

1.3 Testing

At test time, we use the DDIM sampling technique with 5050 steps to accelerate the sampling speed of the reverse diffusion process. Accordingly, our model can achieve real-time generation at 3030 Hz on a single RTX 2080Ti GPU (excluding the music features extracting step), thanks to the parallelization of the Transformer architecture.

To enable long-term generation, we divide the input music sequence into multiple overlapping chunks, with each chunk having a maximum window size of 55 seconds and overlapped by half with the adjacent chunk. The group dance motions are then generated for each chunk along with the corresponding audio. Subsequently, we merge the outputs by blending the overlapped region between two consecutive chunks using spherical linear interpolation, with the interpolation weight gradually decaying from the current chunk to the next chunk. However, for group choreography synthesis, our model generates dance motions for each dancer in random order. Therefore, we need to establish correspondences between dancers across the chunks (i.e., identifying which one of the NN dancers in the next chunk corresponds to a dancer in the current chunk). To accomplish this, we organize all dancers in the current chunk into one set and the dancers in the next chunk into another set, forming a bipartite graph between the two chunks. We can then utilize the Hungarian algorithm to find the optimal matching, where the Euclidean distance between the two pose sequences serves as the matching weights. Our blending technique is applied at each step of the diffusion sampling process, starting from pure noise, thus it allows the model to gradually denoise the chunks to make them compatible for blending.

2. Experimental Settings

2.1 Dataset

We use AIOZ-GDance dataset in our experiments. AIOZ-GDance is a large-scale group dance dataset including paired music and 3D group motions captured from in-the-wild videos using a semi-automatic method, covering 7 dance styles and 16 music genres.

2.2 Evaluation Protocol

We use the following metrics to evaluate the quality of single dancing motion: Frechet Inception Distance (FID), Motion-Music Consistency (MMC), Generation Diversity (GenDiv), Physical Foot Contact score (PFC). Concretely, FID score measures the realism of individual dance movements against the ground-truth dance. The MMC evaluates the matching similarity between the motion and the music beats, i.e., how well generated dances follow the beat of the music. The generation diversity (GenDiv) is evaluated as the average pairwise distance of the kinetic features of the motions. The PFC evaluates the physical plausibility of the foot movements by calculating the agreement between the acceleration of the character's center of mass and the foot's velocity.

To evaluate the group dance quality, we follow three metrics: Group Motion Realism (GMR), Group Motion Correlation (GMC), and Trajectory Intersection Frequency (TIF).
In general, the GMR measures the realism between generated and ground-truth group motions by calculating Frechet Inception Distance on the extract group motion features. The GMC evaluates the synchrony between dancers within the generated group by calculating their cross-correlation. The TIF measures how often the generated dancers collide with each other in their dance movements.

2.3 Baselines

We compare our GCD method with several recent approaches on music-driven dance generation: FACT, Transflower, and EDGE, all of which are adapted for benchmarking in the context of group dance generation since the original methods were specifically designed for single-dance. We also evaluate against GDanceR, a recent model specifically designed for generating group choreography.

3. Quality Comparison

Table 1 shows a comparison among the baselines FACT, Transflower, EDGE, GDanceR, and our proposed GCD. The results clearly demonstrate that our default model setting with "neutral" mode outperforms the baselines significantly across all evaluations. We also observe that EDGE, a recent diffusion dance generation model, can yield very competitive performance on single-dance metrics (FID, MMC, GenDiv, and PFC). This suggests the advantages of diffusion approaches in motion generation tasks. However, it is still inferior to our model under several group dance metrics, showing the limitations of single dance methods in the context of group dance creation. Experimental results highlight the effectiveness of our approach in generating high-quality group dance motions.

Table. 1. Performance comparison. High Consistency: parameter γ=1\gamma=1; High Diversity: parameter γ=1\gamma=-1; Neutral: parameter γ=0\gamma=0

To complement the quantitative analysis, we present qualitative examples from FACT, GDanceR, and our GCD method in Figure 1. Notably, FACT struggles to deal with the intersection problem, which is reasonable given that it was not originally designed for group dance generation. As a result, the generated motions from FACT lack coordination and synchronization in most cases. While GDanceR shows improvements in terms of motion quality compared to FACT, the generated motions appear floating, unnatural, and sometimes unsynchronized in many cases. These drawbacks indicate that GDanceR's effort on generating group choreography would still require more refinement to produce consistent and cohesive movements among the dancers.

Figure. 1 Comparison between different dance generation methods when generating dancing in groups.

In contrast, our method excels in both controlling consistency and promoting diversity among the generated group dance motions. The outputs from our method demonstrate well-coordinated and realistic movements, implying that it can resolve the challenges of maintaining group coherence while delivering visually appealing results more effectively.

Overall, the conducted quantitative analysis and visual comparisons reaffirm the superior performance of our proposed GCD to generate high-quality, synchronized, and visually pleasing group dance motions.

Next

In the next post, we will mention our Consistency and Diversity Analysis.

Controllable Group Choreography using Contrastive Diffusion (Part 3)

In previous part, we have discovered Music-Motion Transformer. In this part, let investigate the Group Modulation and Contrastive Divergence Loss.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Group Modulation

To better apply the group information constraints to the learned hidden features of the dancers, we adopt a Group Modulation layer that learns to adaptively influence the output of the transformer attention block by applying an affine transformation to the intermediate features based on the group embedding ww. More specifically, we utilize two separate linear layers to learn the affine transformation parameters {S(w);b(w)}Rd\{S(w); b(w)\} \in \mathbb{R}^d from the group embedding ww. The predicted affine parameters are then used to modulate the activations sequence h={h11hT1;;h1NhTN}h = \{h^1_1\dots h^1_T;\dots;h^N_1\dots h^N_T\} as follows:

h~=S(w)hμ(h)σ(h)+b(w)\tilde{h} = S(w) * \frac{h-\mu(h)}{\sigma(h)}+ b(w)

where each channel of the whole activation sequence is first normalized separately by calculating the mean μ\mu and σ\sigma, and then scaled and biased using the outputted affine parameters S(w)S(w) and b(w)b(w). Intuitively, this operation shifts the activated hidden motion features of each individual motion towards a unified group representation to further encourage the association between them. Finally, the output features are then projected back to the original motion dimensions via a linear layer, to obtain the predicted outputs x^0\hat{x}_0.

2.Contrastive Diffusion

We learn the representations that encode the underlying shared information between the group embedding information ww and the group sequence xx. Specifically, we model a density ratio that preserves the mutual information between xx and ww as:

f(x,w)p(xw)p(x)f(x, w) \propto \frac{p({x}|{w})}{p({x})}

f()f(\cdot) is a model (i.e., a neural network) to predict a positive score (how well xx is related to ww) for a pair of (x,w)({x}, {w}).

To enhance the association between the generated group dance (data) and the group embedding (context), we aim to maximize their mutual information with a Contrastive Encoder f(x^,w)f(\hat{x},w) via the contrastive learning objective as in Equation below. The encoder takes both the generated group dance sequence x^\hat{x} and a group embedding ww as inputs, and it outputs a score indicating the correspondence between these two.

Lnce=E[logf(x^,w)f(x^,w)+ΣxjXf(x^j,w)]\mathcal{L}_{\rm nce} = - \mathbb{E} \left[ \log\frac{f(\hat{x},w)}{f(\hat{x},w) + \Sigma_{x^j \in X'}f(\hat{x}^j, w)}\right]

where XX' is a set of randomly constructed negative sequences. In general, this loss is similar to the cross-entropy loss for classifying the positive sample, and optimizing it leads to the maximization of the mutual information between the learned context representation and the data. Using the contrastive objective, we expect the Contrastive Encoder to learn to distinguish between the two quantities: consistency (the positive sequence) and diversity (the negative sequence). This is the key factor that enables the ability to control diversity and consistency in our framework.

Here, we will describe our strategy to construct contrastive samples to achieve our target. Recall that we use reverse distribution pθ(xt1xt)p_\theta(x_{t-1} | x_t) of Gaussian Diffusion with the mean as the prediction of the model while the variance is fixed to a scheduler (Equation~\ref{eq:approximateposterior}). To obtain the contrastive samples, given the true pair is (x0,w)(x_0,w), we first leverage forward diffusion process q(xmx0)q(x_m|x_0) to obtain the noised sample xmx_m. Then, our positive sample is $\theta(x{m-1} |xm, w).Subsequently,weconstructthenegativesamplefromthepositivepairbyrandomlyreplacingdancersfromothergroupdancesequences(. Subsequently, we construct the negative sample from the positive pair by randomly replacing dancers from other group dance sequences (x^j_0 \neq x_0) with some probabilities, feeding it through the forward process to obtain $x^j_m, then our negative sample is $\theta(x^j{m-1} | x^j_m, w). By constructing contrastive samples this way, the positive pair $(x_0,w) represents a group sequence with high consistency, whereas the negative one represents a high diversity sample. This is because mixing a sample with dancers from different groups is likely to result in substantially distinctive movements between each dancer, making it a group dance sample with high degree of diversity. Note that negative sequences should also match the music because they are motions generated by the network whose inputs are manipulated to increase diversity. Particularly, negative samples are acquired from outputs of the denoising network whose inputs are both the current music and the noised mixed group with some replaced dancers. As the network is trained to reconstruct only positive samples, its outputs will likely follow the music. Therefore, negative samples are not just random bad samples but are the valid group dance generated from the network that is trained to generate group dance conditioned on the music. This is because our main diffusion training objective is calculated only for ground-truth dances (positive samples) that are consistent with the music. Our proposed strategy also allows us to learn a more powerful group representation as it directly affects the reverse process, which is beneficial to maintaining consistency in long-term synthesis.

3. Diversity vs. Consistency

Using the Contrastive Encoder f(xm,w)f(x_m,w), we extend the classifier guidance to control the generation process. Accordingly, we incorporate f(xm,w)f(x_m,w) in the contrastive framework to replace the guiding classifier in the original formula, since it provides a score of how consistent the sample is with the group information. In particular, we shift the mean of the reverse diffusion process with the log gradient of the Contrastive Encoder with respect to the generated data as follows:

μ^θ(xm,m)=μθ(xm,m)+γΣθ(xm,m)xmlogf(xm,w)\hat{\mu}_\theta(x_m,m) = \mu_\theta(x_m,m) + \gamma \cdot \Sigma_{\theta}(x_m,m) \nabla_{x_m}\log f(x_m,w)

where γ\gamma is the control parameter that uses the encoder to enforce consistency and connection with the group embedding. Since the Contrastive Encoder is trained to classify between high-consistency and high-diversity samples, its gradients yield meaningful guidance signals to control the trade-off process. Intuitively, a positive value of γ\gamma encourages more consistency between dancers while a negative value (which corresponds to shifting the distribution with a negative gradient step) boosts the diversity between each individual dancer.

Next

In the next post, we will mention Experimental Setups and Analysis.

Controllable Group Choreography using Contrastive Diffusion (Part 2)

In previous part, we have discover the motivation and recent works that focus on generating dances and group dances. We also focus on analyze the pros and cons of each methods. We then dive into the detailed algorithm of our proposed GCD. Our methodology can be splited into three parts: Music-Motion Transformer and Group Global Attention, and Contrastive Diffusion Loss for Controllable Motions. In this part, let investigate the Music-Motion Transformer first.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Group Diffusion Denoising Network

Our model architecture is illustrated in Figure.1. We utilize a transformer based architecture to generate the whole sequence in one go. Compared with recent auto-regressive approach~\cite{le2023music}, our method does not suffer from the error accumulation problem (i.e., the prediction error accumulates over time since the current-frame outputs are used as inputs to the next frame in the auto-regressive fashion) and thus can generate arbitrary long motion dance sequences without freezing effects. The input of our network at each diffusion step mm is the noisy group sequence xm={xm,11,...,xm,T1;...;xm,1N,...,xm,TN}x_m =\{x^1_{m,1},..., x^1_{m,T}; ...;x^N_{m,1},...,x^N_{m,T}\} however, we skip the mm index for ease of notation from now on.

2. Music-Motion Transformer:

Given an input extracted audio sequence a={a1,a2,...,aT}a = \{a_1, a_2, ...,a_T\}, we employ a transformer encoder architecture to encode the music into the sequence of hidden audio representation {c1,c2,...,cT}\{c_1, c_2, ...,c_T\}, which will be used as the conditioning context to the diffusion denoising network. Specifically, we follow the encoder layer which consists of multi-head self-attention layers and feed-forward layers to effectively encode the multi-scale rhythmic patterns and long-term dependencies between music frames. The diffusion time-step mm is also projected to the transformer dimension through a separate Multi-layer Perceptron (MLP) with 3 hidden layers to get the embedding τemb\tau_{emb}, then concatenated with the music feature sequence to obtain the final conditioning context c={c1,c2,...,cT,τemb}c = \{c_1, c_2, ...,c_T, \tau_{emb}\}.

Figure 1. Detailed illustration of our method for group choreography generation.

Although group choreography incorporates the problem of learning the interaction between dancers, we still need to learn the correlation between the dance movements and the accompanying music audio for each dancer. Therefore, we design the Music-Motion Transformer to essentially focus on learning the direct connection between the motion and the music of each individual dancer (and not considering the interconnection among dancers yet). Each frame of the noised input motion xtix^i_t is projected into the transformer dimension by a linear layer followed by an additive positional encoding. Given the whole group sequence including all dancers {x11,...,xT1;...;x1n,...,xTn}\{x^1_1,..., x^1_T; ...;x^n_1,...,x^n_T\}, we separately encode the motion features of each individual dancer by utilizing the multi-head self-attention with masking strategy. We implement the masked self-attention (MSA) mechanism as follows:

MSA(Q,K,V)=softmax(QKdk+mlocal)V,Q=xWQ,K=xWK,V=xWV\text{MSA}(Q,K,V) = \text{softmax}\left(\frac{{Q}{K}^\top}{\sqrt{d_k}} + {m}_{local} \right) {V}, \\ {Q} = x {W}^{Q}, \quad {K} = x {W}^{K}, \quad {V} = x {W}^{V}

where WQ,WKRd×dk{W}^{Q}, {W}^{K} \in \mathbb{R}^{d \times d_k} and WVRd×dv{W}^{V} \in \mathbb{R}^{d \times d_v} are learnable projection matrices to transform the input to query, key, and value, respectively. mlocal{m}_{local} is the local attention mask illustrated in Figure2.a. This mask ensures each individual can only attend to their own motion. Subsequently, to incorporate the music conditioning context cc into each individual motion features, we adopt a transformer decoder architecture with cross-attention mechanism (CA), where the motion is the query and the music is the key/value.

CA(Q~,K~,V~)=softmax(Q~K~dk)V~,Q~=x~W~Q,K~=cW~K,V~=cW~V\text{CA}(\tilde{Q}, \tilde{K}, \tilde{V}) = \text{softmax}\left(\frac{{\tilde{Q}}{\tilde{K}}^\top}{\sqrt{d_k}} \right) {\tilde{V}}, \\ {\tilde{Q}} = \tilde{x} {\tilde{W}}^{Q}, \quad {\tilde{K}} = c {\tilde{W}}^{K}, \quad {\tilde{V}} = c {\tilde{W}}^{V}

where x~\tilde{x} is the output activation of the MSA block, and W~Q{\tilde{W}}^{Q}, W~K{\tilde{W}}^{K}, W~V{\tilde{W}}^{V} are the learnable projection matrices that have similar behavior to the MSA mechanism.

3. Group Global Attention

To ensure the coherency and non-collision in the movements of all dancers within the group, such that their dances should correlate with each other under the music condition instead of dancing asynchronously , we first perform global attention via a masked attention mechanism with a full masking strategy mglobalm_{global}. The attention mask is illustrated in Figure2.b. It allows a dancer to fully attend to all other dancers under the global receptive field. Then, we propose the Group Modulation to enforce the group constraints within the group embedding information.

Figure 2. The local attention mask mlocalm_{local} (a) and global attention mask mglobalm_{global} (b). The blue cell indicates where frames can attend to each other. Blue color represents zero value of the mask while gray color represents minus infinity. x[1:T]ix^i_{[1:T]} indicate motion sequence of ii-th dancer.

Since the synthesized image can be manipulated via a latent style vector, we aim to learn a group embedding information from the input music in order to control the group dance generation process. We first apply temporal average pooling to the encoded music feature sequence to obtain a compact representation of the input music cˉ=1Tt=1Tct\bar{c} = \frac{1}{T}\sum_{t=1}^T c_t. To increase the variation and diversity of the group information (i.e., avoid limiting the group embedding to only one style of the input music), we inject a random noise drawn from a standard gaussian distribution zN(0,I)z \sim \mathcal{N}(0,I) into cˉ\bar{c}. We use an 8-layer MLP to learn a mapping from the audio representation to the group embedding. We also add a learnable embedding token ene_{n} from a variable-size lookup table ERN×DE \in \mathbb{R}^{N\times D} up to NN maximum dancers, to represent the variation of dancers in the sequence since each sequence may contain different number of dancers. In summary, the process can be written as follows:

w=MLP(z+1Tt=1Tct)+en,zN(0,I)w = \text{MLP}\left(z + \frac{1}{T}\sum_{t=1}^T c_t\right) + e_{n}, \quad z \sim \mathcal{N}(0,I)

Next

In the next post, we will mention Group Global Attention.

Controllable Group Choreography using Contrastive Diffusion (Part 1)

Music-driven group choreography poses a considerable challenge but holds significant potential for a wide range of industrial applications. The ability to generate synchronized and visually appealing group dance motions that are aligned with music opens up opportunities in many fields such as entertainment, advertising, and virtual performances. However, most of the recent works are not able to generate high-fidelity long-term motions, or fail to enable controllable experience. In this work, we aim to address the demand for high-quality and customizable group dance generation by effectively governing the consistency and diversity of group choreographies. In particular, we utilize a diffusion-based generative approach to enable the synthesis of flexible {number of dancers} and long-term group dances, while ensuring coherence to the input music. Ultimately, we introduce a Group Contrastive Diffusion (GCD) strategy to enhance the connection between dancers and their group, presenting the ability to control the consistency or diversity level of the synthesized group animation via the classifier-guidance sampling technique. Through intensive experiments and evaluation, we demonstrate the effectiveness of our approach in producing visually captivating and consistent group dance motions. The experimental results show the capability of our method to achieve the desired levels of consistency and diversity, while maintaining the overall quality of the generated group choreography.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Introduction

With the widespread presence of digital social media platforms, the act of creating and editing dance videos has gained immense popularity among social communities. This surge in interest has resulted in the daily production and watching of millions of dancing videos across online platforms. Recently, researchers from computer vision, computer graphics, and machine learning communities have devoted considerable attention to developing techniques that can generate natural dance movements from music. These advancements have far-reaching implications and find applications in various domains, such as animation, the creation of virtual idols, the development of virtual meta-verse, and dance education. These techniques empower artists, animators, and educators alike, providing them with powerful tools to enhance their creative endeavors and enrich the dance experience for both performers and audiences.

While significant progress has been made in generating dancing motions for single dancer, the task of producing cohesive and expressive choreography for a group of dancers has received limited attention. The generation of synchronized group dance motions that are both realistic and aligned with music remains a challenging problem in the field of computer animation and motion synthesis. This is primarily due to the complex relationship between music and human motion, the diverse range of motions required for group performances, and the insufficient of a suitable dataset. At present, AIOZ-GDance stands as the most recent extensive dataset available to facilitate the task of generating group choreography. Besides, while current algorithms can generate individual movements and choreographic sequences, ensuring that these elements align seamlessly with the overall group performance is also paramount.

Different from solo dance, group dance involves coordination and interaction between dancers, making it crucial and challenging to establish correlations between motion series within a group. Besides, group dance can involve complex and diverse choreographies among participating dancers while still maintaining a semantic relationship between the motion and input music. Exploring the consistency and diversity between the movements of dancers of the synthesized group choreography is of vital importance to create a natural and expressive performance. The ability to control the consistency and diversity in group dance generation holds great potential across various applications. One such application is in the realm of entertainment and performance. Choreographers and creative teams can leverage this control ability to design captivating group dance routines that seamlessly blend synchronized movements with moments of individual expression. Second, in the context of animation and virtual metaverses, the control over consistency and diversity allows for the creation of visually stunning and immersive virtual dance performances. By balancing the synchronization of dancers' movements, while also introducing variations and unique flourishes, the generated group dances can captivate audiences and evoke a sense of realism and authenticity. Last but not least, in dance education and training, the ability to regulate consistency and diversity in group dance generation can be invaluable. It enables instructors to provide students with a diverse range of generated dance routines and samples that challenge their abilities, promote collaboration, and foster creativity. By dynamically adjusting the level of consistency and diversity, educators can cater to the unique need and skill level of each individual dancer, creating more inclusive instructions and enriching the learning environment. Although plenty of applications can be listed, due to some limitations of data establishment, investigating the consistency and diversity in group choreography has not been carefully explored.

In this paper, our goal is to develop a controllable technique for group dance generation. We present a Group Contrastive Diffusion (GCD) strategy that learns an encoder to capture the key targets between group dance movements. Diffusion modeling provides a flexible framework for manipulating the dance distribution, which allows us to modulate the degree of diversity and consistency in the generated dances. By using denoising diffusion probabilistic model as a key technique, we can effectively control the trade-off between diversity and consistency during the group dance generation, thanks to the guided sampling process. With this approach, we can guide the generation process toward a desired balance between diversity and consistency levels. Moreover, incorporating the encoder, which learns the association between the dancers and their group, can help to maintain the generated dance moves so that they are consistent with a specific dance style, music genre, or any long-term chorus. We empirically show that this approach has the potential to enhance the quality and naturalness of generated group dance performances, making it more appealing for various applications.

Figure 1. We present a contrastive diffusion method that controls the consistency (top row) and diversity (second row) in group choreography.

2. Overview

Music-driven Choreography. Creating natural and authentic human choreography from music is a complex task. One commonly employed technique involves using a motion graph derived from a vast motion database to generate new motions. This involves combining various motion segments and optimizing transition costs along the graph path. Alternatively, there are other methods that incorporate music-motion similarity matching constraints to ensure consistency between the motion and the accompanying music. Previous studies have extensively explored these methodologies. However, most of these approaches relied on heuristic algorithms to stitch together pre-existing dance segments sourced from a limited music-dance database. While these methods are successful in generating extended and realistic dance sequences, they face limitations when trying to create entirely novel dance fragments.

In recent years, several signs of progress have been made in the field of music-to-dance motion generation using Convolutional Network (CNN), Recurrent Network (RNN), Graph Neural Network (GNN), Generative Adversarial Network (GAN), or Transformer. Typically, these methods rely on multiple inputs such as the current music and a brief history of past dance movements to predict the future sequence of human poses. Recently, Gong et. al. propose an interesting task of generating dance by simultaneously utilizing both music and text instruction. A music-text feature fusion module was designed to fuse the inputs into a motion decoder to generate dance conditioned on both music and text. However, although these methods have the potential to produce natural and realistic dancing motion, they are often unable to create synchronized and harmonious movements between multiple dancers. Ensuring coordination and synchronization between dancers is a complicated problem, as it involves not only individual pose predictions but also the seamless integration of these poses within the context of a group. Achieving synchronized and harmonious group movements requires considering spatial and temporal relationships among dancers, their interactions, and the overall choreographic structure. Thus, further advances in the field are considered to address these challenges, including works that use deep learning approaches such as Variational Autoencoder (VAE), GAN, and Normalising Flow.

Unfortunately, most of these networks are limited by their ability to model long-term dance sequences ce may freeze or drift towards the end of the music. To facilitate long-term generation, many researchers apply a motion repeat constraint to predict future frames by attending to the historical motions. Nevertheless, this would limit the flexibility of the model by forcing it to always look into the past.

Group Choreography Group choreography and its related problem, multi-person motion prediction, have been an active research area with numerous studies addressing the challenges of predicting the behaviors of multiple individuals. Alahi et.al. utilize a Markov chain model to jointly analyze the trajectories of several pedestrians and predict their destinations in a given scene. Adel et. al. integrate social interactions and the visual context of the environment to forecast the future motion of multiple individuals. Multi-Range Transformers has proposed to predict the movements of groups with more than ten people engaging in social interactions. These aforementioned methods leverage various techniques to capture social interaction, spatial dependencies, and temporal dynamics, generally aiming to predict accurate and socially plausible future motions for multiple individuals in different scenarios. However, despite the notable advancements achieved, there remains a demand for further investigation of the correlation between the consistency and diversity of motions within group context. A deeper understanding of how to attain the optimal balance between consistency and diversity holds the potential to unlock new possibilities for creating group choreographies that benefit the users in many circumstances.

Diffusion for Music-driven Choreography.Recently, diffusion-based approaches have shown remarkable results on several generative tasks ranging from image generation, audio synthesis, pose estimation, natural language generation, and motion synthesis, to point cloud generation, 3D object synthesis, and scene creation. Diffusion models have shown that they can achieve high mode coverage, unlike GANs, while still maintaining high sample quality. Most existing diffusion-based approaches for human motion/dance synthesis only focus on generating motion sequences for a single character, conditioned on information such as text, audio, or both audio and text. Different from these prior works, we aim to create group of dancing motions from music, which includes coordinating multiple characters, avoiding collisions, and maintaining coherence between them. In addition to the vanilla diffusion loss term used for training in previous works, our method employs a contrastive learning strategy that directly influences the training of the diffusion reverse process, enhancing the association within the dance group.

A prominent issue of the diffusion approach for motion synthesis is that although it is highly effective in generating diverse samples, injecting a large amount of noise during the sampling process can lead to inconsistent results. This issue is particularly problematic for group dance paradigm. Therefore, our desideratum is to design a group dance generation model with the ability to address both diversity and consistency problem.

3. Preliminaries

3.1 Background

Given an input music sequence {a1,at,...,aT}\{a_1, a_t, ...,a_T\} with t={1,...,T}t = \{1,..., T\} indicates the index of the music frames, our goal is to generate the group motion sequences of NN dancers: {x11,...,xT1;...;x1N,...,xTN}\{x^1_1,..., x^1_T; ...;x^N_1,...,x^N_T\} where xtix^i_t is the pose of ii-th dancer at frame tt. We represent dances as sequences of poses in the 24-joint of the SMPL model, using the 6D continuous rotation for every joint, along with a single 3D root translation. This rotation representation ensures the uniqueness and continuity of the rotation vector, which is more beneficial to the training of deep neural networks. We tackle the group dance generation task by using a diffusion-based framework to synthesize the motions from a random noise distribution, given the music conditioning. Thanks to the sampling process of the diffusion model, we can effectively control the consistency and diversity in the generated sequences.

3.2 Forward Process of Diffusion Model.

Given an original sample from the real data distribution x0q(x0){x_0} \sim q({x_0}), following \cite{ho2020ddpm}, the forward diffusion process is defined as a Markov process that gradually adds Gaussian noise to the data under a pre-defined noise schedule up to MM steps.

q(xmxm1)=N(xm;1βmxm1,βmI),m{1,2,...,M}q(x_m | x_{m-1}) = \mathcal{N}(x_m; \sqrt{1-\beta_m} {x_{m-1}}, \beta_mI), \forall m \in \{1,2, ... ,M\}

If the noise variance schedule βm\beta_m is small and the number of diffusion step MM is large enough, the distribution q(xM)q(x_M) at the end of the process is well-approximated by a standard normal distribution N(0,I)\mathcal{N}(0,I), which is easy to sample from. Thanks to the nice property of the forward diffusion, we can directly obtain the noised sample at any arbitrary step mm without traversing through the whole chain:

q(xmx0)=N(xm;αˉmx0,(1αˉm)I),xm=αˉmx0+1αˉmϵ,ϵN(0,I)q(x_m | x_{0}) = \mathcal{N}(x_m; \sqrt{\bar{\alpha}_m} x_{0} , (1-\bar{\alpha}_m)I), x_m = \sqrt{\bar{\alpha}_m} x_{0} + \sqrt{1-\bar{\alpha}_m} \epsilon, \epsilon \sim \mathcal{N}(0, I)

where αm=1βm\alpha_m = 1-\beta_m and αˉm=s=0mαs\bar{\alpha}_m = \prod_{s=0}^m \alpha_s.

3.3 Reverse Process.

By additionally conditioning on x0x_0, the posterior of the reverse process is tractable and becomes a Gaussian distribution:

q(xm1xm,x0)=N(xm1;μ~m,β~mI),q(x_{m-1} | x_m, x_0) = \mathcal{N} (x_{m-1}; \tilde{\mu}_m, \tilde{\beta}_mI),

where μ~m\tilde{\mu}_m and β~m\tilde{\beta}_m are the posterior mean and variance that depend on both xmx_m and x0x_0, respectively. To obtain a sample from the original data distribution, we start by sampling from the noise distribution q(xM)q(x_M) and then gradually remove the noise until we reach x0x_0, following the reverse process. Therefore, our goal is to train a neural network to approximate the posterior q(xm1xm)q(x_{m-1} | x_m) of the reverse process as:

pθ(xm1xm)=N(xm1;μθ(xm,m),Σθ(xm,m))p_{\theta}(x_{m-1} | x_m) = \mathcal{N}(x_{m-1}; \mu_{\theta}(x_m, m), \Sigma_{\theta}(x_m,m) )

We model only the mean μθ(xm,m)\mu_{\theta}(x_m, m) of the reverse distribution while keeping the variance Σθ(xm,m)\Sigma_{\theta}(x_m,m) fixed according to the noise schedule. However, instead of predicting the noise ϵm\epsilon_m at any arbitrary step mm as in their approach, we train the network to learn to predict the original noiseless signal x0x_0. The sample at the previous step m1m-1 can be obtained by noising back the predicted x0x_0. For conditional generation setting, the network is additionally conditioned with the conditioning signal cc as x0Gθ(xm,m,c)x_0 \approx \mathcal{G}_\theta(x_m,m,c) with model parameters θ\theta.

Next

In the next post, we will mention our proposal in details.