
Controllable Group Choreography using Contrastive Diffusion (Part 2)

Dive into the methodology of controllable group choreography

In the previous part, we discussed the motivation and recent works on generating solo and group dances, and analyzed the pros and cons of each method. We now dive into the detailed algorithm of our proposed GCD. Our methodology can be split into three parts: the Music-Motion Transformer, the Group Global Attention, and the Contrastive Diffusion Loss for Controllable Motions. In this part, let us first investigate the Music-Motion Transformer.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Group Diffusion Denoising Network

Our model architecture is illustrated in Figure 1. We utilize a transformer-based architecture to generate the whole sequence in one go. Compared with recent auto-regressive approaches [le2023music], our method does not suffer from the error accumulation problem (i.e., the prediction error accumulates over time since the current-frame outputs are used as inputs to the next frame in the auto-regressive fashion) and thus can generate arbitrarily long dance sequences without freezing effects. The input of our network at each diffusion step $m$ is the noisy group sequence $x_m = \{x^1_{m,1},..., x^1_{m,T}; ...; x^N_{m,1},..., x^N_{m,T}\}$; however, we omit the $m$ index for ease of notation from now on.
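For concreteness, the noisy group input can be pictured as a single tensor holding every dancer's pose at every frame. The sketch below uses PyTorch; the batch size, number of dancers, sequence length, and pose dimension are illustrative assumptions, not the exact layout of our released code.

```python
import torch

# Illustrative shapes only (not the repository's exact configuration).
batch_size, num_dancers, num_frames, pose_dim = 2, 3, 150, 147

# x_m: noisy poses of every dancer at every frame for diffusion step m.
x_m = torch.randn(batch_size, num_dancers, num_frames, pose_dim)

# One diffusion time-step per sample; the whole sequence is denoised in a
# single forward pass, so there is no frame-by-frame feedback loop and hence
# no error accumulation.
m = torch.randint(0, 1000, (batch_size,))
```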

2. Music-Motion Transformer

Given an input extracted audio sequence $a = \{a_1, a_2, ..., a_T\}$, we employ a transformer encoder architecture to encode the music into a sequence of hidden audio representations $\{c_1, c_2, ..., c_T\}$, which will be used as the conditioning context for the diffusion denoising network. Specifically, we follow the standard encoder layer, which consists of multi-head self-attention layers and feed-forward layers, to effectively encode the multi-scale rhythmic patterns and long-term dependencies between music frames. The diffusion time-step $m$ is also projected to the transformer dimension through a separate Multi-layer Perceptron (MLP) with 3 hidden layers to obtain the embedding $\tau_{emb}$, which is then concatenated with the music feature sequence to form the final conditioning context $c = \{c_1, c_2, ..., c_T, \tau_{emb}\}$.
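A minimal sketch of this conditioning branch is given below, assuming PyTorch; the class name, the audio feature dimension, and the layer sizes are illustrative assumptions rather than the exact configuration of our released code.

```python
import torch
import torch.nn as nn

class MusicEncoder(nn.Module):
    """Sketch: encode the music sequence and append the diffusion-step embedding."""
    def __init__(self, audio_dim=35, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # MLP with 3 hidden layers that embeds the diffusion time-step m.
        self.time_mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(),
            nn.Linear(d_model, d_model), nn.SiLU(),
            nn.Linear(d_model, d_model), nn.SiLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, audio, m):
        # audio: [B, T, audio_dim], m: [B] integer diffusion step
        c = self.encoder(self.audio_proj(audio))                    # [B, T, d_model]
        tau = self.time_mlp(m.float().unsqueeze(-1)).unsqueeze(1)   # [B, 1, d_model]
        return torch.cat([c, tau], dim=1)                           # [B, T+1, d_model]
```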

Figure 1. Detailed illustration of our method for group choreography generation.

Although group choreography incorporates the problem of learning the interactions between dancers, we still need to learn the correlation between the dance movements and the accompanying music audio for each dancer. Therefore, we design the Music-Motion Transformer to focus on learning the direct connection between the motion and the music of each individual dancer (without considering the interconnection among dancers yet). Each frame of the noised input motion $x^i_t$ is projected into the transformer dimension by a linear layer followed by an additive positional encoding. Given the whole group sequence including all dancers $\{x^1_1,..., x^1_T; ...; x^n_1,..., x^n_T\}$, we separately encode the motion features of each individual dancer by utilizing multi-head self-attention with a masking strategy. We implement the masked self-attention (MSA) mechanism as follows:

$$\text{MSA}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + m_{local}\right)V, \quad Q = xW^Q, \quad K = xW^K, \quad V = xW^V$$

where $W^Q, W^K \in \mathbb{R}^{d \times d_k}$ and $W^V \in \mathbb{R}^{d \times d_v}$ are learnable projection matrices that transform the input into query, key, and value, respectively. $m_{local}$ is the local attention mask illustrated in Figure 2a. This mask ensures that each individual can only attend to their own motion. Subsequently, to incorporate the music conditioning context $c$ into each individual's motion features, we adopt a transformer decoder architecture with a cross-attention mechanism (CA), where the motion is the query and the music is the key/value.

$$\text{CA}(\tilde{Q}, \tilde{K}, \tilde{V}) = \text{softmax}\left(\frac{\tilde{Q}\tilde{K}^\top}{\sqrt{d_k}}\right)\tilde{V}, \quad \tilde{Q} = \tilde{x}\tilde{W}^Q, \quad \tilde{K} = c\tilde{W}^K, \quad \tilde{V} = c\tilde{W}^V$$

where $\tilde{x}$ is the output activation of the MSA block, and $\tilde{W}^Q$, $\tilde{W}^K$, $\tilde{W}^V$ are learnable projection matrices that behave similarly to those in the MSA mechanism.
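To make the two attention steps concrete, here is a minimal single-head sketch in PyTorch. The real model uses multi-head attention inside full transformer encoder/decoder layers; all names and dimensions below are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MusicMotionAttention(nn.Module):
    """Sketch: masked self-attention over motion, then cross-attention to music."""
    def __init__(self, d_model, d_k, d_v):
        super().__init__()
        # MSA projections (motion attends to motion under the local mask).
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_v, bias=False)
        # CA projections (motion is the query, music context is the key/value).
        self.W_q_ca = nn.Linear(d_v, d_k, bias=False)
        self.W_k_ca = nn.Linear(d_model, d_k, bias=False)
        self.W_v_ca = nn.Linear(d_model, d_v, bias=False)
        self.d_k = d_k

    def attend(self, Q, K, V, mask=None):
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores + mask          # 0 = attend, -inf = blocked
        return torch.softmax(scores, dim=-1) @ V

    def forward(self, x, c, m_local):
        # x: [B, N*T, d_model] flattened group motion; c: [B, T+1, d_model] music context
        x_tilde = self.attend(self.W_q(x), self.W_k(x), self.W_v(x), m_local)
        return self.attend(self.W_q_ca(x_tilde), self.W_k_ca(c), self.W_v_ca(c))
```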

3. Group Global Attention

To ensure coherency and non-collision in the movements of all dancers within the group, so that their dances correlate with each other under the music condition instead of being performed asynchronously, we first perform global attention via a masked attention mechanism with a full masking strategy $m_{global}$. The attention mask is illustrated in Figure 2b. It allows a dancer to fully attend to all other dancers under the global receptive field. Then, we propose the Group Modulation to enforce the group constraints within the group embedding information.

Figure 2. The local attention mask $m_{local}$ (a) and the global attention mask $m_{global}$ (b). The blue cells indicate where frames can attend to each other: blue represents a zero value of the mask, while gray represents minus infinity. $x^i_{[1:T]}$ indicates the motion sequence of the $i$-th dancer.
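The two masks can be constructed directly from the description above. The helper below is a minimal sketch (the function name is our own) that builds the block-diagonal local mask and the all-zeros global mask for N dancers with T frames each.

```python
import torch

def build_masks(N, T):
    """Sketch: additive attention masks with 0 = attend, -inf = blocked."""
    size = N * T
    # Local mask: block-diagonal, each dancer only attends to its own frames.
    m_local = torch.full((size, size), float('-inf'))
    for i in range(N):
        m_local[i * T:(i + 1) * T, i * T:(i + 1) * T] = 0.0
    # Global mask: every frame of every dancer is visible, so the mask adds nothing.
    m_global = torch.zeros(size, size)
    return m_local, m_global
```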

Analogous to style-based generative models, in which the synthesized image can be manipulated via a latent style vector, we aim to learn a group embedding from the input music in order to control the group dance generation process. We first apply temporal average pooling to the encoded music feature sequence to obtain a compact representation of the input music, $\bar{c} = \frac{1}{T}\sum_{t=1}^T c_t$. To increase the variation and diversity of the group information (i.e., to avoid limiting the group embedding to only one style of the input music), we inject a random noise drawn from a standard Gaussian distribution $z \sim \mathcal{N}(0,I)$ into $\bar{c}$. We use an 8-layer MLP to learn a mapping from the audio representation to the group embedding. We also add a learnable embedding token $e_n$ from a variable-size lookup table $E \in \mathbb{R}^{N \times D}$ of up to $N$ maximum dancers, to represent the varying number of dancers, since each sequence may contain a different number of dancers. In summary, the process can be written as follows:

$$w = \text{MLP}\left(z + \frac{1}{T}\sum_{t=1}^T c_t\right) + e_n, \quad z \sim \mathcal{N}(0,I)$$
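A minimal sketch of this group embedding step, again in PyTorch with illustrative dimensions and names (e.g., `max_dancers`), could look as follows.

```python
import torch
import torch.nn as nn

class GroupEmbedding(nn.Module):
    """Sketch: map pooled music features plus noise to the group embedding w."""
    def __init__(self, d_model=512, max_dancers=10, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):                       # 8-layer mapping MLP
            layers += [nn.Linear(d_model, d_model), nn.LeakyReLU(0.2)]
        self.mlp = nn.Sequential(*layers)
        self.dancer_tokens = nn.Embedding(max_dancers, d_model)  # lookup table E

    def forward(self, c, dancer_idx):
        # c: [B, T, d_model] encoded music features
        # dancer_idx: [B] long tensor indexing the dancer-count token (e.g., number of dancers - 1)
        c_bar = c.mean(dim=1)                           # temporal average pooling
        z = torch.randn_like(c_bar)                     # z ~ N(0, I) for diversity
        return self.mlp(z + c_bar) + self.dancer_tokens(dancer_idx)   # group embedding w
```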

Next

In the next post, we will continue with the remaining components of our methodology, including the Contrastive Diffusion Loss for controllable motions.
