19 posts tagged with "type: research"


Controllable Group Choreography using Contrastive Diffusion (Part 6)

In the previous part, we explored the qualitative and quantitative experiments. In this part, let us investigate the ablation and user studies.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Ablation Analysis

Loss Terms. The contribution of the geometric loss \mathcal{L}_{\rm geo} and contrastive loss \mathcal{L}_{\rm nce} in GCD is thoroughly analyzed and presented in Table 1. The results demonstrate that both losses play a crucial role in enhancing the overall performance across all four evaluation protocols. In particular, it can be seen that the effect of \mathcal{L}_{\rm geo} on realism metrics (FID, GMR, and PFC) is significant. This observation can be attributed to the fact that this loss improves the physical plausibility and naturalness of the dance motions, empirically mitigating common artifacts such as jittery motion or foot skating. By enforcing the geometric constraints, GCD can generate faithful motions that are on par with real dances. Moreover, the contrastive loss \mathcal{L}_{\rm nce} contributes positively to the favorable results in the synchrony measures (GMR and GMC). This loss term encourages the model to synchronize the movements of multiple dancers within a group, thus improving the harmonious coordination and cohesion of the generated choreographies. In general, the results of \mathcal{L}_{\rm geo} and \mathcal{L}_{\rm nce} validate their importance across various evaluation metrics.

Table. 1. Global module contribution and loss analysis. Experiments are conducted on GCD with \gamma = 0 (neutral mode).

Group Global Attention. Results presented in the first two rows of Table 1 demonstrate the substantial improvements obtained by incorporating Group Global Attention into GCD. They clearly show that without Group Global Attention, performance on the group dance metrics (GMR and GMC) degrades significantly. We also observe that removing this block results in inconsistent movements across dancers: they appear to dance freestyle without any group choreographic rules and collide with each other in many cases, although they may still follow the rhythm of the music. These results suggest the vital importance of ensuring coherency and regulating collisions for visually appealing group dance animations.

2. User Study

Qualitative user studies are important for evaluating generative models, as user perception tends to be the most relevant metric for many downstream applications. Therefore, we conduct user studies to evaluate our approach in terms of group choreography generation. We organized two separate studies and enlisted roughly 50 individuals with diverse backgrounds to participate in our experiment. Each participant was required to have some relevant experience in music and dance (at least 1 month of studying or working in dance-related professions). The ages of participants varied between 20 and 50, with approximately 55% female and 45% male.

In the initial study, we asked the participants to evaluate the dancing animations based on three criteria: the naturalness of the dancing motions (Realism), how well the movements match the music (Music-Motion Correspondence), and how well the dancers interact or synchronize with each other (Synchronization between Dancers). Participants were asked to rate each criterion on a scale from 0 to 10, ranging from 0 (very poor) through 5 (acceptable) to 10 (very good). The collected scores were then normalized to the range [0,1].

This user study encompassed a total of 189 × 3 samples with songs that are not present in the training set, including clips generated by GDanceR, real dance clips from the dataset, and results generated by our proposed method in neutral mode. Figure 1 shows the average scores for all mentioned targets across the three experiments. Notably, the ratings of our method are significantly higher than those of GDanceR across all three criteria. We also perform Tukey honest significance tests to determine the significant differences among the three methods. For the first two criteria (Realism and Music-Motion Correspondence), we observe that the mean scores of all methods are significantly different with p < 0.05. For the synchronization criterion, the differences are significant except for the scores between our method and real dances (p ≈ 0.07). This highlights that our method can even achieve comparable scores with real dances, especially in the synchronization evaluation. This can be attributed to the proposed contrastive diffusion strategy, which effectively maintains a balance between the consistency of the movements and the group/audio context, as well as diversity in the generated dances.

Figure. 1. User study results in three criteria: Realism; Music-Motion Correspondence; and Synchronization between Dancers
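For reference, a post-hoc comparison of this kind can be run with statsmodels' Tukey HSD implementation. The sketch below uses randomly generated stand-in ratings purely for illustration; the score values and group sizes are not our study data.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Stand-in example: normalized ratings for one criterion, three methods.
rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.normal(0.75, 0.10, 200),   # hypothetical ratings for GCD (ours)
    rng.normal(0.60, 0.12, 200),   # hypothetical ratings for GDanceR
    rng.normal(0.80, 0.09, 200),   # hypothetical ratings for real dances
])
labels = ["GCD"] * 200 + ["GDanceR"] * 200 + ["Real"] * 200

# Tukey's honest significance test over all pairwise method differences.
result = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(result)  # mean differences, confidence intervals, and reject flags
```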

In the second study, we aim to assess the diversity and consistency of the generated dance outputs and determine whether they meet the expectations of the users. Specifically, participants were asked to assign scores ranging from 0 to 10 to evaluate the consistency and diversity of each dance clip, i.e., how synchronized or how distinctive the movements between dancers in the group dance are. A lower score indicated higher consistency, while a higher score indicated greater diversity. These scores were subsequently normalized to [-1,1] to align with the range of the control parameter employed in our proposed method.

Figure 2 depicts a scatter plot illustrating the relationship between the scores provided by the participants and the \gamma parameter used to generate the dance samples. The parameter values were randomly drawn from a uniform distribution over [-1,1] to create the animations, along with randomly sampled musical pieces. The survey shows a strong correlation between the user scores and the control parameter, with a correlation coefficient of approximately 0.88. The results indicate that the diversity and consistency level of the generated group choreography samples is largely in agreement with the user evaluation.

Figure. 2. Correlation between the controlling consistency/diversity and the scores provided by the users.

3. Discussion and Conclusion

While controlling consistency and diversity in group dance generation with our proposed GCD has numerous advantages and potentials, there are certain limitations. Firstly, it requires tuning the parameters of a complex system that is not trivial to train in order to ensure that the generated dance motions achieve the desired level of similarity among dancers while still presenting enough variation to avoid repetitive or monotonous movements. This may involve long inference processes and may require significant computational resources in both the training and testing phases.

Secondly, over-controlling consistency and diversity may introduce constraints on the creative freedom of generated dances. While enforcing consistency can lead to synchronized and harmonious group movements, it may limit the possibility of exploring unconventional or new experimental dance styles. On the other hand, promoting diversity results in unique and innovative dance sequences, but it may sacrifice coherence and coordination among dancers.

Although our model can synthesize semantically faithful group dance animations with effective coordination among dancers, it does not capture clear physical contact between dancers, such as hand touching. This is because the data we used for training does not contain such detailed hand motion information. We think that exploring group dance with realistic physical hand interactions is a promising area for future work. Additionally, while our method offers a trade-off between diversity and consistency, achieving perfect alignment between high-diversity movements and music remains a challenging task. The diversity level among dancers and the alignment with music are also heavily influenced by the training data. We believe further efforts are required to achieve this alignment.

Lastly, the subjective nature of evaluating consistency and diversity poses a challenge. Metrics for measuring these aspects may not be perfectly suited. We believe it is essential to consider diverse perspectives and involve domain experts to validate the effectiveness and quality of the generated dance motions.

To conclude, we have introduced GCD, a new method for audio-driven group dance generation that effectively controls the consistency and diversity of generated choreographies. By using contrastive diffusion along with the guidance technique, our approach enables the generation of a flexible number of dancers and long-term group dances without compromising fidelity. Through our experiments, we have demonstrated the capability of GCD to produce visually appealing and synchronized group dance motions. The results of our evaluation, including comparisons with existing methods, highlight the superior performance of our method across various metrics including realism and synchronization. By enabling control over the desired levels of consistency and diversity while preserving fidelity, our work has the potential for applications in entertainment, virtual performances, and artistic expression, advancing the effectiveness of deep learning in generative choreography.

Controllable Group Choreography using Contrastive Diffusion (Part 5)

In the previous part, we explored the experimental setups and quality analysis. In this part, let us investigate the consistency and diversity factors in dance generation.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Diversity and Consistency Analysis

Table 1 presents an in-depth analysis of our method's performance across seven evaluation metrics when adjusting the parameter \gamma to control the consistency and diversity of the generated group choreographies. The findings reveal that our GCD with the high consistency setting (\gamma = 1) performs better than the other settings in terms of the MMC, GMC, and TIR metrics, whereas the high diversity setting (\gamma = -1) achieves better results on the GenDiv metric. Meanwhile, the default model shows the best performance on both realism metrics (FID and GMR). It can also be seen that the model is relatively robust in terms of the physical plausibility score (PFC), as there are no noticeable differences in this metric across the three settings. This implies that our model is able to create group animations with different consistency or diversity levels without compromising the plausibility of the movements too much. More interestingly, we found that there are indeed positive correlations between the two measures MMC and GMR and the trade-off parameter: these metrics improve as the consistency level increases. This is reasonable, as we expect higher correspondence between the motion and the music (MMC) or higher correlation of the group motions (GMR) when the consistency level grows, which also agrees with the definition of these metrics.

Table. 1. Performance comparison. High Consistency: parameter \gamma = 1; High Diversity: parameter \gamma = -1; Neutral: parameter \gamma = 0.

Consistency setups in GCD lead to more similar movements between dancers. As a result, this similarity contributes to high scores in the MMC, GMR, and TIR metrics. In contrast, the diversity setups can synthesize more complex motions with greater variation between dancers, as measured by the GenDiv metric, but this also makes it more challenging to reach high values of FID, MMC, and TIR compared with the other setups. In addition, Figure 1 shows an example of the correlations between motion beats and music beats under the high consistency and high diversity settings. The music beats are extracted using the beat tracking algorithm from the Librosa library. Notably, the velocity curves in the high consistency setting display relatively similar shapes, whereas in the high diversity scenario, the curves are clearly distinguished among dancers.

Figure. 1. Correlation between the motion and music beats. The solid curve represents the kinetic velocity of each dancer over time and the vertical dashed line depicts the music beats of the sequence. The motion beats can be detected as the local extrema from the kinetic velocity curve.

Despite the greater variations in high diversity setting, we can observe that the generated motions are matched with the music as the music beats are mostly located near the extrema of the motion curves in both settings. The experiment indicates that our model can faithfully capture different aspects of group choreography with different settings, including diversity and synchrony of the motions. This demonstrates the potential of our method towards various dance applications such as dance training or composing. Furthermore, our method can also produce distinctively different animation sequences under the same setting while adhering to the input music.
It is also important to note that all three setups of GCD significantly outperform other baseline models. This verifies the effectiveness of our proposed approach and shows that it can create high-fidelity group dance animations in any setting. For a more detailed visualization of the results, please refer to Figure 2.
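For reference, the music beats used in this comparison are extracted with Librosa's beat tracker; a minimal sketch (with a hypothetical audio path) looks like this:

```python
import librosa

# Load a music clip (path is hypothetical) and run Librosa's beat tracker.
audio, sr = librosa.load("example_song.wav", sr=None)
tempo, beat_frames = librosa.beat.beat_track(y=audio, sr=sr)

# Convert beat frames to timestamps (seconds) for aligning with motion beats.
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print(len(beat_times), "beats detected, tempo estimate:", tempo)
```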

Figure. 2. Consistency and diversity trade-off. High Consistency: parameter \gamma = 1; High Diversity: parameter \gamma = -1; Neutral: parameter \gamma = 0.

2. Number of Dancers Analysis

Table 2 provides insights into the results obtained when generating arbitrarily different numbers of dancers using our proposed GCD in the neutral setting. In general, the FID, GMR, and GMC metrics do not exhibit a clear correlation with the number of generated dancers and instead display varied results. The MMC metric consistently shows its stability across all setups.

Table. 2. Performance of group dance generation methods when we increase the number of generated dancers, compared with GDanceR. In the GCD setup, Neutral mode with \gamma = 0 is used.

As the number of generated dancers increases, the generation diversity (GenDiv) decreases while the trajectory intersection frequency (TIF) increases. However, it is worth noting that the differences observed in these metrics are relatively minor compared to those produced by GDanceR. This implies that our method can effectively control consistency and diversity, significantly reducing the chances of collisions between dancers and maintaining the overall quality of generated group dance motions.

For a detailed visualization of results, please refer to Figure 3. These results underscore the robustness and flexibility of our method in generating group dance motions across varying numbers of dancers while ensuring consistency, diversity, and avoiding collisions between performers.

Figure. 3. Group dance generation results of GCD in terms of different numbers of dancers.

3. Long-term Analysis

To evaluate the efficacy of the guidance signal in GCD for creating long-term group dance sequences, we conducted a comparative analysis between GCD and the baseline model GDanceR. The experiment involved musical pieces of different durations: 15 seconds, 30 seconds, and 60 seconds. We show the results with the guidance parameter \gamma = 0.5 to enforce consistency with the music over a long duration. For more detailed information, please refer to Figure 5 and our supplementary video.

Figure. 5. Long-term results of the 60-second clip. For clearer visualization, please visit our demo video.

While both methods produce satisfactory results in the first few seconds of the animations (e.g., about 5-6 seconds), GDanceR starts to exhibit floating and unrealistic movements or freezes into a mean pose in the later part of the sequence. Figure 4 shows the motion change comparison between our GCD method and GDanceR. The motion change magnitudes are calculated as the average differences of the kinetic features between consecutive frames. It is evident that the motion change magnitude of GDanceR gradually decreases and approaches zero in the latter half of the 60-second music piece, whereas our method preserves high magnitudes and variations over time. This is because GDanceR generates almost frozen dance choreographies during this period. In contrast, the group dance motions produced by GCD remain natural, with diverse movements throughout the entire duration of all music samples.

Figure. 4. Motion change comparison between our proposed GCD and GDanceR. The experiment is conducted on generated group dance results of 60-second music pieces.
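As a rough sketch of how such a motion-change curve could be computed from per-frame kinetic feature vectors (array shapes here are assumptions, not the exact features used in the paper):

```python
import numpy as np

def motion_change_magnitude(kinetic_feats: np.ndarray) -> np.ndarray:
    """Average change of kinetic features between consecutive frames.

    kinetic_feats: (T, D) array with one kinetic feature vector per frame.
    Returns a (T-1,) curve of mean absolute frame-to-frame differences.
    """
    diffs = np.abs(np.diff(kinetic_feats, axis=0))  # (T-1, D)
    return diffs.mean(axis=1)

# Example with random stand-in features for a 60-second clip at 30 Hz.
feats = np.random.rand(60 * 30, 72)
curve = motion_change_magnitude(feats)
print(curve.shape)  # (1799,)
```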

These findings confirm that our approach can effectively address the problem of motion generation in long-horizon group dance scenarios. It maintains the motion quality and dynamics of the dance motions, ensuring that the created animations remain visually appealing throughout extended periods. This highlights the advantage of the contrastive strategy to enhance the consistency of the movements of dancers with their group and the music, resulting in significant improvements for long-term dance sequence generation compared to the baseline GDanceR.

Next

In the next post, we will consider Ablation Studies and User Studies.

Controllable Group Choreography using Contrastive Diffusion (Part 4)

In the previous part, we explored the theory of the Music-Motion Transformer. In this part, let us investigate the experimental setups used to verify its effectiveness, as well as the quality experimental results.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Experiments

1.1 Implementation Details

The hidden layer of all the MLPs consists of 512 units followed by GELU activation. The hidden dimension of all attention layers is set to d=512, and the attention adopts a multi-head scheme with 8 attention heads. We also use a feature-wise linear modulation (FiLM) after each attention layer to strengthen the influence of the conditioning context. At the end of each attention block, we append a 2-layer feed-forward network with a feed-forward size of 1024 to enhance the expressivity of the learned features. We extract the features from the raw audio signal by leveraging the representations from the frozen Jukebox, a pre-trained generative model for music, to enhance the model's generalization ability to several kinds of in-the-wild music. In total, the Group Diffusion Denoising Network is comprised of L=5 stacked Music-Motion Transformer and Group Global Attention blocks, along with 2 transformer encoder layers to encode the music features. We implement the architecture of the Contrastive Encoder similarly to the Denoising Network but without cross-attention since it does not take the music as input. The output sequence of the Contrastive Encoder is then averaged out and fed into an output layer with one unit. We also make the Contrastive Encoder aware of the current step in the diffusion chain by appending the diffusion timestep embedding to the motion sequence so that it can provide correct guidance signals in the sampling process. Overall, our model has approximately 62M trainable parameters.
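A minimal PyTorch-style sketch of one such attention block, using the hyperparameters quoted above (the class and layer names are our own illustration, not the actual repository code):

```python
import torch.nn as nn

class AttentionBlockSketch(nn.Module):
    """Illustrative block: multi-head self-attention -> FiLM -> 2-layer feed-forward."""

    def __init__(self, d_model=512, n_heads=8, d_ff=1024, d_cond=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # FiLM: predict a per-channel scale and shift from the conditioning context.
        self.film = nn.Linear(d_cond, 2 * d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, cond):
        # x: (B, T, d_model) motion tokens, cond: (B, d_cond) pooled conditioning context.
        h, _ = self.attn(x, x, x)
        h = self.norm1(x + h)
        scale, shift = self.film(cond).chunk(2, dim=-1)
        h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # FiLM modulation
        return self.norm2(h + self.ffn(h))
```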

1.2 Training

To train the denoising diffusion network, we use the "simple" objective as:

\mathcal{L}_{\rm simple} = \mathbb{E}_{x_0\sim q(x_0|c), m\sim [1,M]} \left[\Vert x_0 - \mathcal{G}_\theta(x_m,m,c) \Vert^2_2\right]

To improve the physical plausibility and prevent artifacts of the generated motion, we also utilize auxiliary geometric losses.

\mathcal{L}_{\rm geo} = \lambda_{\rm pos}\mathcal{L}_{\rm pos} + \lambda_{\rm vel}\mathcal{L}_{\rm vel} + \lambda_{\rm foot}\mathcal{L}_{\rm foot}

In particular, the geometric losses mainly consist of (i) a joint position loss \mathcal{L}_{\rm pos} to better constrain the global joint hierarchy via forward kinematics; (ii) a velocity loss \mathcal{L}_{\rm vel} to increase the smoothness and naturalness of the motion by penalizing the difference between the velocities of the ground-truth and predicted motions; and (iii) a foot contact loss \mathcal{L}_{\rm foot} to mitigate foot skating artifacts and improve the realism of the generated motions by ensuring that the feet stay stationary when ground contact occurs.

Our total training objective is the combination of the "simple" diffusion objective, the auxiliary geometric losses, and the contrastive loss:

\mathcal{L} = \mathcal{L}_{\rm simple} + \mathcal{L}_{\rm geo} + \lambda_{\rm nce}\mathcal{L}_{\rm nce}

We train our model on 4 NVIDIA V100 GPUs using the Adam optimizer with a learning rate of 1e-4 and a batch size of 64 per GPU, which took about 7 days for 500k iterations. The models are trained with M=1000 diffusion noising steps and a cosine noise schedule. During training, group dance motions are randomly sampled with sequence length T=150 at 30 Hz, which corresponds to 5-second pieces of music. For the geometric losses, the loss weights are empirically set to \lambda_{\rm pos}=1.0, \lambda_{\rm smooth}=1.0, and \lambda_{\rm foot}=0.005, respectively. For the contrastive loss \mathcal{L}_{\rm nce}, its weight is \lambda_{\rm nce} = 0.001, the probability of replacing dancers for negative sequences is 0.5, and the number of negative samples is empirically set to 10.
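The sketch below shows how such a combined objective might be assembled in PyTorch. The individual terms are simplified placeholders standing in for the actual forward-kinematics, velocity, and foot-contact computations; only the weighting follows the values quoted above.

```python
import torch.nn.functional as F

def total_loss(x0, x0_pred, pos_gt, pos_pred, vel_gt, vel_pred,
               foot_vel, foot_contact, nce_loss,
               w_pos=1.0, w_vel=1.0, w_foot=0.005, w_nce=0.001):
    """Illustrative combination of the diffusion, geometric, and contrastive terms.

    pos_* are joint positions from forward kinematics, vel_* are frame differences,
    foot_vel/foot_contact are foot velocities and binary contact labels; all
    tensor shapes are assumptions for this sketch.
    """
    l_simple = F.mse_loss(x0_pred, x0)               # "simple" objective: reconstruct x_0
    l_pos = F.mse_loss(pos_pred, pos_gt)             # joint position loss (FK space)
    l_vel = F.mse_loss(vel_pred, vel_gt)             # velocity loss for smoothness
    l_foot = (foot_vel * foot_contact).abs().mean()  # keep feet still during contact
    l_geo = w_pos * l_pos + w_vel * l_vel + w_foot * l_foot
    return l_simple + l_geo + w_nce * nce_loss
```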


1.3 Testing

At test time, we use the DDIM sampling technique with 50 steps to accelerate the sampling speed of the reverse diffusion process. Accordingly, our model can achieve real-time generation at 30 Hz on a single RTX 2080Ti GPU (excluding the music feature extraction step), thanks to the parallelization of the Transformer architecture.

To enable long-term generation, we divide the input music sequence into multiple overlapping chunks, with each chunk having a maximum window size of 5 seconds and overlapping the adjacent chunk by half. The group dance motions are then generated for each chunk along with the corresponding audio. Subsequently, we merge the outputs by blending the overlapped region between two consecutive chunks using spherical linear interpolation, with the interpolation weight gradually decaying from the current chunk to the next chunk. However, for group choreography synthesis, our model generates dance motions for each dancer in random order. Therefore, we need to establish correspondences between dancers across the chunks (i.e., identifying which one of the N dancers in the next chunk corresponds to a dancer in the current chunk). To accomplish this, we organize all dancers in the current chunk into one set and the dancers in the next chunk into another set, forming a bipartite graph between the two chunks. We can then utilize the Hungarian algorithm to find the optimal matching, where the Euclidean distance between the two pose sequences serves as the matching weight. Our blending technique is applied at each step of the diffusion sampling process, starting from pure noise, thus allowing the model to gradually denoise the chunks so that they are compatible for blending.
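A minimal sketch of the dancer-matching step with SciPy's Hungarian solver, applied to the pose segments inside the overlapped region (array shapes and names are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_dancers(overlap_prev: np.ndarray, overlap_next: np.ndarray) -> np.ndarray:
    """Match dancers between two consecutive chunks.

    overlap_prev, overlap_next: (N, T_overlap, D) pose sequences of the N dancers
    inside the overlapped region of each chunk.
    Returns, for each dancer in the previous chunk, the index of the
    corresponding dancer in the next chunk.
    """
    n = overlap_prev.shape[0]
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Euclidean distance between the two pose sequences as the matching weight.
            cost[i, j] = np.linalg.norm(overlap_prev[i] - overlap_next[j])
    _, col_ind = linear_sum_assignment(cost)  # Hungarian algorithm
    return col_ind
```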

2. Experimental Settings

2.1 Dataset

We use AIOZ-GDance dataset in our experiments. AIOZ-GDance is a large-scale group dance dataset including paired music and 3D group motions captured from in-the-wild videos using a semi-automatic method, covering 7 dance styles and 16 music genres.

2.2 Evaluation Protocol

We use the following metrics to evaluate the quality of single-dance motions: Frechet Inception Distance (FID), Motion-Music Consistency (MMC), Generation Diversity (GenDiv), and the Physical Foot Contact score (PFC). Concretely, the FID score measures the realism of individual dance movements against the ground-truth dance. The MMC evaluates the matching similarity between the motion and the music beats, i.e., how well the generated dances follow the beat of the music. The generation diversity (GenDiv) is evaluated as the average pairwise distance of the kinetic features of the motions. The PFC evaluates the physical plausibility of the foot movements by calculating the agreement between the acceleration of the character's center of mass and the foot velocity.

To evaluate the group dance quality, we follow three metrics: Group Motion Realism (GMR), Group Motion Correlation (GMC), and Trajectory Intersection Frequency (TIF).
In general, the GMR measures the realism between generated and ground-truth group motions by calculating the Frechet Inception Distance on the extracted group motion features. The GMC evaluates the synchrony between dancers within the generated group by calculating their cross-correlation. The TIF measures how often the generated dancers collide with each other in their dance movements.

2.3 Baselines

We compare our GCD method with several recent approaches on music-driven dance generation: FACT, Transflower, and EDGE, all of which are adapted for benchmarking in the context of group dance generation since the original methods were specifically designed for single-dance. We also evaluate against GDanceR, a recent model specifically designed for generating group choreography.

3. Quality Comparison

Table 1 shows a comparison among the baselines FACT, Transflower, EDGE, GDanceR, and our proposed GCD. The results clearly demonstrate that our default model setting with "neutral" mode outperforms the baselines significantly across all evaluations. We also observe that EDGE, a recent diffusion dance generation model, can yield very competitive performance on single-dance metrics (FID, MMC, GenDiv, and PFC). This suggests the advantages of diffusion approaches in motion generation tasks. However, it is still inferior to our model under several group dance metrics, showing the limitations of single dance methods in the context of group dance creation. Experimental results highlight the effectiveness of our approach in generating high-quality group dance motions.

Table. 1. Performance comparison. High Consistency: parameter \gamma = 1; High Diversity: parameter \gamma = -1; Neutral: parameter \gamma = 0.

To complement the quantitative analysis, we present qualitative examples from FACT, GDanceR, and our GCD method in Figure 1. Notably, FACT struggles to deal with the intersection problem, which is reasonable given that it was not originally designed for group dance generation. As a result, the generated motions from FACT lack coordination and synchronization in most cases. While GDanceR shows improvements in terms of motion quality compared to FACT, the generated motions appear floating, unnatural, and sometimes unsynchronized in many cases. These drawbacks indicate that GDanceR's approach to generating group choreography still requires further refinement to produce consistent and cohesive movements among the dancers.

Figure. 1. Comparison between different dance generation methods when generating dances in groups.

In contrast, our method excels in both controlling consistency and promoting diversity among the generated group dance motions. The outputs from our method demonstrate well-coordinated and realistic movements, implying that it can resolve the challenges of maintaining group coherence while delivering visually appealing results more effectively.

Overall, the conducted quantitative analysis and visual comparisons reaffirm the superior performance of our proposed GCD to generate high-quality, synchronized, and visually pleasing group dance motions.

Next

In the next post, we will mention our Consistency and Diversity Analysis.

Controllable Group Choreography using Contrastive Diffusion (Part 3)

In the previous part, we explored the Music-Motion Transformer. In this part, let us investigate the Group Modulation and Contrastive Divergence Loss.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Group Modulation

To better apply the group information constraints to the learned hidden features of the dancers, we adopt a Group Modulation layer that learns to adaptively influence the output of the transformer attention block by applying an affine transformation to the intermediate features based on the group embedding w. More specifically, we utilize two separate linear layers to learn the affine transformation parameters \{S(w); b(w)\} \in \mathbb{R}^d from the group embedding w. The predicted affine parameters are then used to modulate the activation sequence h = \{h^1_1\dots h^1_T;\dots;h^N_1\dots h^N_T\} as follows:

\tilde{h} = S(w) * \frac{h-\mu(h)}{\sigma(h)} + b(w)

where each channel of the whole activation sequence is first normalized separately by calculating the mean \mu and standard deviation \sigma, and then scaled and biased using the outputted affine parameters S(w) and b(w). Intuitively, this operation shifts the activated hidden motion features of each individual motion towards a unified group representation to further encourage the association between them. Finally, the output features are projected back to the original motion dimensions via a linear layer to obtain the predicted outputs \hat{x}_0.
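A small PyTorch-style sketch of such a modulation layer (the naming and shapes are our own illustration rather than the repository implementation):

```python
import torch
import torch.nn as nn

class GroupModulationSketch(nn.Module):
    """Normalize the activation sequence, then scale and shift it with affine
    parameters predicted from the group embedding w (illustrative only)."""

    def __init__(self, d_model=512, d_group=512):
        super().__init__()
        self.to_scale = nn.Linear(d_group, d_model)
        self.to_bias = nn.Linear(d_group, d_model)

    def forward(self, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # h: (B, N*T, d_model) concatenated activations of all dancers,
        # w: (B, d_group) group embedding.
        mu = h.mean(dim=1, keepdim=True)
        sigma = h.std(dim=1, keepdim=True) + 1e-6     # avoid division by zero
        h_norm = (h - mu) / sigma
        S = self.to_scale(w).unsqueeze(1)             # (B, 1, d_model) scale
        b = self.to_bias(w).unsqueeze(1)              # (B, 1, d_model) bias
        return S * h_norm + b
```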

2. Contrastive Diffusion

We learn the representations that encode the underlying shared information between the group embedding information w and the group sequence x. Specifically, we model a density ratio that preserves the mutual information between x and w as:

f(x, w) \propto \frac{p({x}|{w})}{p({x})}

f(\cdot) is a model (i.e., a neural network) to predict a positive score (how well x is related to w) for a pair of (x, w).

To enhance the association between the generated group dance (data) and the group embedding (context), we aim to maximize their mutual information with a Contrastive Encoder f(\hat{x},w) via the contrastive learning objective in the equation below. The encoder takes both the generated group dance sequence \hat{x} and a group embedding w as inputs, and it outputs a score indicating the correspondence between the two.

\mathcal{L}_{\rm nce} = - \mathbb{E} \left[ \log\frac{f(\hat{x},w)}{f(\hat{x},w) + \Sigma_{x^j \in X'}f(\hat{x}^j, w)}\right]

where X' is a set of randomly constructed negative sequences. In general, this loss is similar to the cross-entropy loss for classifying the positive sample, and optimizing it leads to the maximization of the mutual information between the learned context representation and the data. Using the contrastive objective, we expect the Contrastive Encoder to learn to distinguish between the two quantities: consistency (the positive sequence) and diversity (the negative sequence). This is the key factor that enables the ability to control diversity and consistency in our framework.

Here, we will describe our strategy to construct contrastive samples to achieve our target. Recall that we use the reverse distribution p_\theta(x_{m-1} | x_m) of Gaussian Diffusion with the mean as the prediction of the model, while the variance is fixed to a scheduler (as defined in the reverse process above). To obtain the contrastive samples, given the true pair (x_0, w), we first leverage the forward diffusion process q(x_m|x_0) to obtain the noised sample x_m. Then, our positive sample is p_\theta(x_{m-1} | x_m, w). Subsequently, we construct the negative sample from the positive pair by randomly replacing dancers from other group dance sequences (x^j_0 \neq x_0) with some probability, feeding it through the forward process to obtain x^j_m; our negative sample is then p_\theta(x^j_{m-1} | x^j_m, w). By constructing contrastive samples this way, the positive pair (x_0, w) represents a group sequence with high consistency, whereas the negative one represents a high-diversity sample. This is because mixing a sample with dancers from different groups is likely to result in substantially distinctive movements between each dancer, making it a group dance sample with a high degree of diversity. Note that negative sequences should also match the music because they are motions generated by the network whose inputs are manipulated to increase diversity. Particularly, negative samples are acquired from outputs of the denoising network whose inputs are both the current music and the noised mixed group with some replaced dancers. As the network is trained to reconstruct only positive samples, its outputs will likely follow the music. Therefore, negative samples are not just random bad samples but valid group dances generated by the network that is trained to generate group dance conditioned on the music. This is because our main diffusion training objective is calculated only for ground-truth dances (positive samples) that are consistent with the music. Our proposed strategy also allows us to learn a more powerful group representation, as it directly affects the reverse process, which is beneficial to maintaining consistency in long-term synthesis.
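A rough sketch of the dancer-replacement strategy for negatives and the contrastive objective above (tensor shapes, the score interface, and the mixing routine are assumptions for illustration):

```python
import torch

def make_negative(x0: torch.Tensor, other: torch.Tensor, p_replace: float = 0.5) -> torch.Tensor:
    """Randomly replace dancers in x0 (shape (N, T, D)) with dancers from another
    group sequence of the same shape, producing a high-diversity sample."""
    mask = torch.rand(x0.shape[0]) < p_replace   # which dancers to swap
    x_neg = x0.clone()
    x_neg[mask] = other[mask]
    return x_neg

def contrastive_nce_loss(f_pos: torch.Tensor, f_negs: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style loss given positive scores f(x, w) of shape (B,) and the
    scores of K negative samples of shape (B, K); scores are assumed positive
    (e.g., exponentiated outputs of the Contrastive Encoder)."""
    denom = f_pos + f_negs.sum(dim=-1)
    return -(f_pos / denom).log().mean()
```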

3. Diversity vs. Consistency

Using the Contrastive Encoder f(x_m,w), we extend the classifier guidance to control the generation process. Accordingly, we incorporate f(x_m,w) in the contrastive framework to replace the guiding classifier in the original formula, since it provides a score of how consistent the sample is with the group information. In particular, we shift the mean of the reverse diffusion process with the log gradient of the Contrastive Encoder with respect to the generated data as follows:

\hat{\mu}_\theta(x_m,m) = \mu_\theta(x_m,m) + \gamma \cdot \Sigma_{\theta}(x_m,m) \nabla_{x_m}\log f(x_m,w)

where \gamma is the control parameter that uses the encoder to enforce consistency and connection with the group embedding. Since the Contrastive Encoder is trained to classify between high-consistency and high-diversity samples, its gradients yield meaningful guidance signals to control the trade-off process. Intuitively, a positive value of \gamma encourages more consistency between dancers, while a negative value (which corresponds to shifting the distribution with a negative gradient step) boosts the diversity between each individual dancer.
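Sketched in PyTorch, the guided mean shift could look roughly like this (the contrastive encoder interface and tensor shapes are assumptions):

```python
import torch

def guided_mean(mu, sigma, x_m, w, contrastive_encoder, gamma):
    """Shift the reverse-process mean with the gradient of the contrastive score.

    mu, sigma: predicted mean and (diagonal) variance at step m,
    x_m: current noisy group sequence, w: group embedding,
    gamma > 0 pushes toward consistency, gamma < 0 toward diversity.
    """
    x = x_m.detach().requires_grad_(True)
    score = contrastive_encoder(x, w).log().sum()  # log f(x_m, w)
    grad = torch.autograd.grad(score, x)[0]        # gradient w.r.t. the noisy sample
    return mu + gamma * sigma * grad
```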

Next

In the next post, we will mention Experimental Setups and Analysis.

Controllable Group Choreography using Contrastive Diffusion (Part 2)

In the previous part, we discovered the motivation and recent works that focus on generating dances and group dances, and we analyzed the pros and cons of each method. We then dive into the detailed algorithm of our proposed GCD. Our methodology can be split into three parts: the Music-Motion Transformer, Group Global Attention, and the Contrastive Diffusion Loss for Controllable Motions. In this part, let us investigate the Music-Motion Transformer first.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Group Diffusion Denoising Network

Our model architecture is illustrated in Figure 1. We utilize a transformer-based architecture to generate the whole sequence in one go. Compared with the recent auto-regressive approach GDanceR, our method does not suffer from the error accumulation problem (i.e., the prediction error accumulates over time since the current-frame outputs are used as inputs to the next frame in the auto-regressive fashion) and thus can generate arbitrarily long motion dance sequences without freezing effects. The input of our network at each diffusion step m is the noisy group sequence x_m = \{x^1_{m,1},..., x^1_{m,T}; ...; x^N_{m,1},...,x^N_{m,T}\}; however, we omit the index m for ease of notation from now on.

2. Music-Motion Transformer

Given an input extracted audio sequence a = \{a_1, a_2, ...,a_T\}, we employ a transformer encoder architecture to encode the music into the sequence of hidden audio representations \{c_1, c_2, ...,c_T\}, which will be used as the conditioning context for the diffusion denoising network. Specifically, we follow the encoder layer, which consists of multi-head self-attention layers and feed-forward layers, to effectively encode the multi-scale rhythmic patterns and long-term dependencies between music frames. The diffusion time-step m is also projected to the transformer dimension through a separate Multi-Layer Perceptron (MLP) with 3 hidden layers to get the embedding \tau_{emb}, which is then concatenated with the music feature sequence to obtain the final conditioning context c = \{c_1, c_2, ...,c_T, \tau_{emb}\}.

Figure 1. Detailed illustration of our method for group choreography generation.

Although group choreography incorporates the problem of learning the interaction between dancers, we still need to learn the correlation between the dance movements and the accompanying music audio for each dancer. Therefore, we design the Music-Motion Transformer to essentially focus on learning the direct connection between the motion and the music of each individual dancer (without considering the interconnection among dancers yet). Each frame of the noised input motion x^i_t is projected into the transformer dimension by a linear layer followed by an additive positional encoding. Given the whole group sequence including all dancers \{x^1_1,..., x^1_T; ...; x^n_1,...,x^n_T\}, we separately encode the motion features of each individual dancer by utilizing multi-head self-attention with a masking strategy. We implement the masked self-attention (MSA) mechanism as follows:

\text{MSA}(Q,K,V) = \text{softmax}\left(\frac{{Q}{K}^\top}{\sqrt{d_k}} + {m}_{local} \right) {V}, \quad {Q} = x {W}^{Q}, \quad {K} = x {W}^{K}, \quad {V} = x {W}^{V}

where {W}^{Q}, {W}^{K} \in \mathbb{R}^{d \times d_k} and {W}^{V} \in \mathbb{R}^{d \times d_v} are learnable projection matrices that transform the input to the query, key, and value, respectively. {m}_{local} is the local attention mask illustrated in Figure 2a. This mask ensures that each individual can only attend to their own motion. Subsequently, to incorporate the music conditioning context c into each individual's motion features, we adopt a transformer decoder architecture with a cross-attention mechanism (CA), where the motion is the query and the music is the key/value.

\text{CA}(\tilde{Q}, \tilde{K}, \tilde{V}) = \text{softmax}\left(\frac{{\tilde{Q}}{\tilde{K}}^\top}{\sqrt{d_k}} \right) {\tilde{V}}, \quad {\tilde{Q}} = \tilde{x} {\tilde{W}}^{Q}, \quad {\tilde{K}} = c {\tilde{W}}^{K}, \quad {\tilde{V}} = c {\tilde{W}}^{V}

where \tilde{x} is the output activation of the MSA block, and {\tilde{W}}^{Q}, {\tilde{W}}^{K}, {\tilde{W}}^{V} are the learnable projection matrices, which behave similarly to those in the MSA mechanism.

3. Group Global Attention

To ensure coherency and non-collision in the movements of all dancers within the group, such that their dances correlate with each other under the music condition instead of being danced asynchronously, we first perform global attention via a masked attention mechanism with a full masking strategy m_{global}. The attention mask is illustrated in Figure 2b. It allows a dancer to fully attend to all other dancers under the global receptive field. Then, we propose Group Modulation to enforce the group constraints within the group embedding information.

Figure 2. The local attention mask m_{local} (a) and the global attention mask m_{global} (b). The blue cells indicate where frames can attend to each other: blue represents a zero value of the mask, while gray represents minus infinity. x^i_{[1:T]} indicates the motion sequence of the i-th dancer.
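A small sketch of how such additive masks could be built for N dancers of T frames each (shapes and naming are our own illustration):

```python
import torch

def build_attention_masks(n_dancers: int, n_frames: int):
    """Return additive attention masks of shape (N*T, N*T).

    local: a frame may only attend to frames of the same dancer (block-diagonal);
    global: every frame may attend to every other frame.
    Masked positions hold -inf so they vanish after the softmax.
    """
    size = n_dancers * n_frames
    local = torch.full((size, size), float("-inf"))
    for i in range(n_dancers):
        s, e = i * n_frames, (i + 1) * n_frames
        local[s:e, s:e] = 0.0                  # attention within the same dancer only
    global_mask = torch.zeros(size, size)      # no restriction across dancers
    return local, global_mask

local_mask, global_mask = build_attention_masks(n_dancers=3, n_frames=4)
```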

Inspired by methods that manipulate a synthesized image via a latent style vector, we aim to learn group embedding information from the input music in order to control the group dance generation process. We first apply temporal average pooling to the encoded music feature sequence to obtain a compact representation of the input music, \bar{c} = \frac{1}{T}\sum_{t=1}^T c_t. To increase the variation and diversity of the group information (i.e., to avoid limiting the group embedding to only one style of the input music), we inject a random noise drawn from a standard Gaussian distribution z \sim \mathcal{N}(0,I) into \bar{c}. We use an 8-layer MLP to learn a mapping from the audio representation to the group embedding. We also add a learnable embedding token e_{n} from a variable-size lookup table E \in \mathbb{R}^{N\times D}, supporting up to N maximum dancers, to represent the variation in the number of dancers, since each sequence may contain a different number of dancers. In summary, the process can be written as follows:

w = \text{MLP}\left(z + \frac{1}{T}\sum_{t=1}^T c_t\right) + e_{n}, \quad z \sim \mathcal{N}(0,I)
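As a rough PyTorch sketch of this mapping (layer sizes, the dancer-count indexing, and the module names are assumptions):

```python
import torch
import torch.nn as nn

class GroupEmbeddingSketch(nn.Module):
    """Map pooled music features plus Gaussian noise to a group embedding w and
    add a learnable token for the number of dancers (illustrative only)."""

    def __init__(self, d_music=512, d_group=512, max_dancers=10):
        super().__init__()
        layers = []
        for _ in range(7):                       # 8 linear layers in total
            layers += [nn.Linear(d_music, d_music), nn.GELU()]
        self.mlp = nn.Sequential(*layers, nn.Linear(d_music, d_group))
        self.dancer_tokens = nn.Embedding(max_dancers, d_group)

    def forward(self, c: torch.Tensor, n_dancers: torch.Tensor) -> torch.Tensor:
        # c: (B, T, d_music) encoded music features, n_dancers: (B,) dancer counts (>= 1).
        c_bar = c.mean(dim=1)                    # temporal average pooling
        z = torch.randn_like(c_bar)              # standard Gaussian noise injection
        w = self.mlp(c_bar + z)
        return w + self.dancer_tokens(n_dancers - 1)
```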

Next

In the next post, we will mention the Group Modulation and Contrastive Divergence Loss.

Controllable Group Choreography using Contrastive Diffusion (Part 1)

Music-driven group choreography poses a considerable challenge but holds significant potential for a wide range of industrial applications. The ability to generate synchronized and visually appealing group dance motions that are aligned with music opens up opportunities in many fields such as entertainment, advertising, and virtual performances. However, most of the recent works are not able to generate high-fidelity long-term motions or fail to enable a controllable experience. In this work, we aim to address the demand for high-quality and customizable group dance generation by effectively governing the consistency and diversity of group choreographies. In particular, we utilize a diffusion-based generative approach to enable the synthesis of a flexible number of dancers and long-term group dances, while ensuring coherence with the input music. Ultimately, we introduce a Group Contrastive Diffusion (GCD) strategy to enhance the connection between dancers and their group, presenting the ability to control the consistency or diversity level of the synthesized group animation via the classifier-guidance sampling technique. Through intensive experiments and evaluation, we demonstrate the effectiveness of our approach in producing visually captivating and consistent group dance motions. The experimental results show the capability of our method to achieve the desired levels of consistency and diversity, while maintaining the overall quality of the generated group choreography.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Introduction

With the widespread presence of digital social media platforms, the act of creating and editing dance videos has gained immense popularity among social communities. This surge in interest has resulted in the daily production and watching of millions of dancing videos across online platforms. Recently, researchers from computer vision, computer graphics, and machine learning communities have devoted considerable attention to developing techniques that can generate natural dance movements from music. These advancements have far-reaching implications and find applications in various domains, such as animation, the creation of virtual idols, the development of virtual meta-verse, and dance education. These techniques empower artists, animators, and educators alike, providing them with powerful tools to enhance their creative endeavors and enrich the dance experience for both performers and audiences.

While significant progress has been made in generating dancing motions for a single dancer, the task of producing cohesive and expressive choreography for a group of dancers has received limited attention. The generation of synchronized group dance motions that are both realistic and aligned with music remains a challenging problem in the field of computer animation and motion synthesis. This is primarily due to the complex relationship between music and human motion, the diverse range of motions required for group performances, and the lack of a suitable dataset. At present, AIOZ-GDance stands as the most recent extensive dataset available to facilitate the task of generating group choreography. Besides, while current algorithms can generate individual movements and choreographic sequences, ensuring that these elements align seamlessly with the overall group performance is also paramount.

Different from solo dance, group dance involves coordination and interaction between dancers, making it crucial and challenging to establish correlations between motion series within a group. Besides, group dance can involve complex and diverse choreographies among participating dancers while still maintaining a semantic relationship between the motion and the input music. Exploring the consistency and diversity between the movements of dancers in the synthesized group choreography is of vital importance for creating a natural and expressive performance. The ability to control consistency and diversity in group dance generation holds great potential across various applications. One such application is in the realm of entertainment and performance. Choreographers and creative teams can leverage this control ability to design captivating group dance routines that seamlessly blend synchronized movements with moments of individual expression. Second, in the context of animation and virtual metaverses, control over consistency and diversity allows for the creation of visually stunning and immersive virtual dance performances. By balancing the synchronization of dancers' movements while also introducing variations and unique flourishes, the generated group dances can captivate audiences and evoke a sense of realism and authenticity. Last but not least, in dance education and training, the ability to regulate consistency and diversity in group dance generation can be invaluable. It enables instructors to provide students with a diverse range of generated dance routines and samples that challenge their abilities, promote collaboration, and foster creativity. By dynamically adjusting the level of consistency and diversity, educators can cater to the unique needs and skill levels of each individual dancer, creating more inclusive instruction and enriching the learning environment. Although plenty of applications can be listed, due to limitations in data availability, the consistency and diversity in group choreography have not been carefully explored.

In this paper, our goal is to develop a controllable technique for group dance generation. We present a Group Contrastive Diffusion (GCD) strategy that learns an encoder to capture the key targets between group dance movements. Diffusion modeling provides a flexible framework for manipulating the dance distribution, which allows us to modulate the degree of diversity and consistency in the generated dances. By using a denoising diffusion probabilistic model as the key technique, we can effectively control the trade-off between diversity and consistency during group dance generation, thanks to the guided sampling process. With this approach, we can guide the generation process toward a desired balance between diversity and consistency levels. Moreover, incorporating the encoder, which learns the association between the dancers and their group, helps keep the generated dance moves consistent with a specific dance style, music genre, or any long-term chorus. We empirically show that this approach has the potential to enhance the quality and naturalness of generated group dance performances, making it more appealing for various applications.

Figure 1. We present a contrastive diffusion method that controls the consistency (top row) and diversity (second row) in group choreography.

2. Overview

Music-driven Choreography. Creating natural and authentic human choreography from music is a complex task. One commonly employed technique involves using a motion graph derived from a vast motion database to generate new motions. This involves combining various motion segments and optimizing transition costs along the graph path. Alternatively, there are other methods that incorporate music-motion similarity matching constraints to ensure consistency between the motion and the accompanying music. Previous studies have extensively explored these methodologies. However, most of these approaches relied on heuristic algorithms to stitch together pre-existing dance segments sourced from a limited music-dance database. While these methods are successful in generating extended and realistic dance sequences, they face limitations when trying to create entirely novel dance fragments.

In recent years, several signs of progress have been made in the field of music-to-dance motion generation using Convolutional Networks (CNN), Recurrent Networks (RNN), Graph Neural Networks (GNN), Generative Adversarial Networks (GAN), or Transformers. Typically, these methods rely on multiple inputs such as the current music and a brief history of past dance movements to predict the future sequence of human poses. Recently, Gong et al. proposed an interesting task of generating dance by simultaneously utilizing both music and text instructions. A music-text feature fusion module was designed to fuse the inputs into a motion decoder to generate dance conditioned on both music and text. However, although these methods have the potential to produce natural and realistic dancing motion, they are often unable to create synchronized and harmonious movements between multiple dancers. Ensuring coordination and synchronization between dancers is a complicated problem, as it involves not only individual pose predictions but also the seamless integration of these poses within the context of a group. Achieving synchronized and harmonious group movements requires considering spatial and temporal relationships among dancers, their interactions, and the overall choreographic structure. Thus, further advances in the field are needed to address these challenges, including works that use deep learning approaches such as Variational Autoencoders (VAE), GANs, and Normalising Flows.

Unfortunately, most of these networks are limited in their ability to model long-term dance sequences, which may freeze or drift towards the end of the music. To facilitate long-term generation, many researchers apply a motion repeat constraint to predict future frames by attending to the historical motions. Nevertheless, this would limit the flexibility of the model by forcing it to always look into the past.

Group Choreography. Group choreography and its related problem, multi-person motion prediction, have been an active research area with numerous studies addressing the challenges of predicting the behaviors of multiple individuals. Alahi et al. utilize a Markov chain model to jointly analyze the trajectories of several pedestrians and predict their destinations in a given scene. Adel et al. integrate social interactions and the visual context of the environment to forecast the future motion of multiple individuals. Multi-Range Transformers have been proposed to predict the movements of groups with more than ten people engaging in social interactions. These aforementioned methods leverage various techniques to capture social interactions, spatial dependencies, and temporal dynamics, generally aiming to predict accurate and socially plausible future motions for multiple individuals in different scenarios. However, despite the notable advancements achieved, there remains a demand for further investigation of the correlation between the consistency and diversity of motions within a group context. A deeper understanding of how to attain the optimal balance between consistency and diversity holds the potential to unlock new possibilities for creating group choreographies that benefit users in many circumstances.

Diffusion for Music-driven Choreography. Recently, diffusion-based approaches have shown remarkable results on several generative tasks ranging from image generation, audio synthesis, pose estimation, natural language generation, and motion synthesis, to point cloud generation, 3D object synthesis, and scene creation. Diffusion models have shown that they can achieve high mode coverage, unlike GANs, while still maintaining high sample quality. Most existing diffusion-based approaches for human motion/dance synthesis only focus on generating motion sequences for a single character, conditioned on information such as text, audio, or both audio and text. Different from these prior works, we aim to create a group of dancing motions from music, which includes coordinating multiple characters, avoiding collisions, and maintaining coherence between them. In addition to the vanilla diffusion loss term used for training in previous works, our method employs a contrastive learning strategy that directly influences the training of the diffusion reverse process, enhancing the association within the dance group.

A prominent issue of the diffusion approach for motion synthesis is that, although it is highly effective in generating diverse samples, injecting a large amount of noise during the sampling process can lead to inconsistent results. This issue is particularly problematic for the group dance paradigm. Therefore, our desideratum is to design a group dance generation model with the ability to address both the diversity and consistency problems.

3. Preliminaries

3.1 Background

Given an input music sequence \{a_1, a_2, ..., a_T\}, where t \in \{1,..., T\} indicates the index of the music frames, our goal is to generate the group motion sequences of N dancers: \{x^1_1,..., x^1_T; ...; x^N_1,...,x^N_T\}, where x^i_t is the pose of the i-th dancer at frame t. We represent dances as sequences of poses over the 24 joints of the SMPL model, using the 6D continuous rotation representation for every joint, along with a single 3D root translation. This rotation representation ensures the uniqueness and continuity of the rotation vector, which is more beneficial for the training of deep neural networks. We tackle the group dance generation task by using a diffusion-based framework to synthesize the motions from a random noise distribution, given the music conditioning. Thanks to the sampling process of the diffusion model, we can effectively control the consistency and diversity of the generated sequences.
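Under this representation, the per-frame pose of each dancer concatenates 24 six-dimensional joint rotations with one 3D root translation, so the per-frame pose dimension (call it D_{\rm pose}, our shorthand) works out to:

D_{\rm pose} = 24 \times 6 + 3 = 147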

3.2 Forward Process of Diffusion Model.

Given an original sample from the real data distribution x_0 \sim q(x_0), following DDPM (Ho et al., 2020), the forward diffusion process is defined as a Markov process that gradually adds Gaussian noise to the data under a pre-defined noise schedule up to M steps.

q(x_m | x_{m-1}) = \mathcal{N}(x_m; \sqrt{1-\beta_m} x_{m-1}, \beta_m I), \quad \forall m \in \{1,2, ... ,M\}

If the noise variance schedule \beta_m is small and the number of diffusion steps M is large enough, the distribution q(x_M) at the end of the process is well-approximated by a standard normal distribution \mathcal{N}(0,I), which is easy to sample from. Thanks to this nice property of the forward diffusion, we can directly obtain the noised sample at any arbitrary step m without traversing the whole chain:

q(x_m | x_{0}) = \mathcal{N}(x_m; \sqrt{\bar{\alpha}_m} x_{0} , (1-\bar{\alpha}_m)I), \quad x_m = \sqrt{\bar{\alpha}_m} x_{0} + \sqrt{1-\bar{\alpha}_m} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

where \alpha_m = 1-\beta_m and \bar{\alpha}_m = \prod_{s=0}^m \alpha_s.
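A short sketch of this closed-form noising step under a cosine noise schedule (a sketch in the notation above, not the repository implementation):

```python
import torch

def cosine_alpha_bar(m: torch.Tensor, M: int = 1000, s: float = 0.008) -> torch.Tensor:
    """Cumulative product alpha_bar_m under a cosine noise schedule."""
    f = torch.cos(((m / M) + s) / (1 + s) * torch.pi / 2) ** 2
    f0 = torch.cos(torch.tensor(s / (1 + s) * torch.pi / 2)) ** 2
    return f / f0

def q_sample(x0: torch.Tensor, m: torch.Tensor, M: int = 1000) -> torch.Tensor:
    """Draw x_m ~ q(x_m | x_0) in closed form: sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps."""
    a_bar = cosine_alpha_bar(m.float(), M).view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
```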

3.3 Reverse Process.

By additionally conditioning on x_0, the posterior of the reverse process is tractable and becomes a Gaussian distribution:

q(xm1xm,x0)=N(xm1;μ~m,β~mI),q(x_{m-1} | x_m, x_0) = \mathcal{N} (x_{m-1}; \tilde{\mu}_m, \tilde{\beta}_mI),

where μ~m\tilde{\mu}_m and β~m\tilde{\beta}_m are the posterior mean and variance that depend on both xmx_m and x0x_0, respectively. To obtain a sample from the original data distribution, we start by sampling from the noise distribution q(xM)q(x_M) and then gradually remove the noise until we reach x0x_0, following the reverse process. Therefore, our goal is to train a neural network to approximate the posterior q(xm1xm)q(x_{m-1} | x_m) of the reverse process as:

pθ(xm1xm)=N(xm1;μθ(xm,m),Σθ(xm,m))p_{\theta}(x_{m-1} | x_m) = \mathcal{N}(x_{m-1}; \mu_{\theta}(x_m, m), \Sigma_{\theta}(x_m,m) )

We model only the mean μθ(xm,m)\mu_{\theta}(x_m, m) of the reverse distribution while keeping the variance Σθ(xm,m)\Sigma_{\theta}(x_m,m) fixed according to the noise schedule. However, instead of predicting the noise ϵm\epsilon_m at any arbitrary step mm as in their approach, we train the network to learn to predict the original noiseless signal x0x_0. The sample at the previous step m1m-1 can be obtained by noising back the predicted x0x_0. For conditional generation setting, the network is additionally conditioned with the conditioning signal cc as x0Gθ(xm,m,c)x_0 \approx \mathcal{G}_\theta(x_m,m,c) with model parameters θ\theta.
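
As a hedged illustration of this x_0-prediction strategy (the model standing in for \mathcal{G}_\theta is a placeholder callable, m is a plain Python integer, and the posterior coefficients follow the standard DDPM formulas), a single reverse step could look like:

```python
import torch

def p_sample_step(model, x_m, m, cond, betas, alphas, alpha_bars):
    """One reverse step: predict x0, then noise it back to x_{m-1}.

    model(x_m, m, cond) is assumed to return the predicted clean signal x0.
    """
    x0_pred = model(x_m, m, cond)

    beta_m = betas[m]
    a_bar_m = alpha_bars[m]
    a_bar_prev = alpha_bars[m - 1] if m > 0 else torch.tensor(1.0)

    # Posterior q(x_{m-1} | x_m, x0) with the standard DDPM mean and variance.
    coef_x0 = torch.sqrt(a_bar_prev) * beta_m / (1.0 - a_bar_m)
    coef_xm = torch.sqrt(alphas[m]) * (1.0 - a_bar_prev) / (1.0 - a_bar_m)
    mean = coef_x0 * x0_pred + coef_xm * x_m
    var = beta_m * (1.0 - a_bar_prev) / (1.0 - a_bar_m)

    if m == 0:
        return x0_pred                      # last step: return the clean sample
    return mean + torch.sqrt(var) * torch.randn_like(x_m)
```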

Next

In the next post, we will present our proposal in detail.

Reducing Training Time in Cross-Silo Federated Learning using Multigraph Topology (Part 4)

In the previous post, we described the multigraph parsing process, how to train a multigraph under decentralized federated learning, and the experimental setups for Multigraph. In this post, we discuss the effectiveness and efficiency of the multigraph topology design under different configurations.

Our code can be found at: https://github.com/aioz-ai/MultigraphFL

1. Cycle Time Comparison

Table 1 shows the cycle time of our method in comparison with other recent approaches. This table illustrates that our proposed method significantly reduces the cycle time in all setups with different networks and datasets. In particular, compared to the state-of-the-art RING, our method reduces the cycle time by 2.18, 1.5, and 1.74 times on average on the FEMNIST, iNaturalist, and Sentiment140 datasets, respectively. Our method also clearly outperforms MATCHA, MATCHA(+), and MST by a large margin. The results confirm that our multigraph with isolated nodes helps reduce the cycle time in federated learning.

From Table 1, our multigraph achieves the smallest improvement under the Amazon network in all three datasets. This can be explained by the fact that, under the Amazon network, our proposed topology does not generate many isolated nodes; hence, the improvement is limited. Intuitively, when there are no isolated nodes, our multigraph becomes the overlay, and its cycle time equals the cycle time of the overlay used in RING.

Tab-1

2. Isolated Node Analysis

Isolated Nodes vs. Network Configuration. The number of isolated nodes varies based on the network configuration (Amazon, Gaia, Exodus, etc.). The parameter t (the maximum number of edges between two nodes) and the delay time, which is determined by many factors (geographic distance, model size, task-dependent computational cost, bandwidth, etc.), also affect the process of generating isolated nodes. Table 2 illustrates the effectiveness of isolated nodes in different network configurations. Specifically, we conduct experiments on the FEMNIST dataset using five network configurations (Gaia, Amazon, Geant, Exodus, Ebone). We can see that our cycle time, compared with RING, is reduced significantly when more communication rounds or graph states have isolated nodes. Tab-2

Table 2: The effectiveness of isolated nodes under different network configurations. All experiments are trained with 6,400 communication rounds on the FEMNIST dataset. We then record the number of states and rounds in which isolated nodes appear and compare our cycle time with RING.

Isolated Nodes vs. RING vs. Random Strategy. Isolated nodes play an important role in our method since we can skip the model aggregation step at the isolated nodes. In practice, a trivial way to create isolated nodes is to randomly remove some nodes from the overlay of RING. Table 3 shows the experimental results of two scenarios on the FEMNIST dataset and the Exodus network: i) randomly removing some silos from the overlay of RING, and ii) removing the most inefficient silos (i.e., silos with the longest delay) from the overlay of RING. From Table 3, the cycle time reduces significantly when the two aforementioned scenarios are applied. However, the accuracy of the model also drops significantly. This experiment shows that although randomly removing some nodes from the overlay of RING is a trivial solution, it cannot maintain model accuracy. On the other hand, our multigraph not only reduces the cycle time of the model but also preserves the accuracy. This is because our multigraph can skip the aggregation step of the isolated nodes in a communication round; in the next round, the delay time of these isolated nodes is updated, and they can become normal nodes and contribute to the final model.

Tab-3

Table 3: The cycle time and accuracy of our multigraph vs. RING with different criteria.

Isolated Nodes Illustration. The figure below shows a detailed illustration of our algorithm with isolated nodes in a real-world training scenario. The experiment is conducted on the Gaia network geometry and its corresponding hardware to support link-latency computation. The image classification task is chosen for this benchmark, using the FEMNIST dataset and the CNN backbone provided by Marfoq et al. We keep the transmitted model size at 4.62 Mb, all access links have 10 Gbps traffic capacity, the number of local updates is set to 1, and the maximum number of edges t is set to 3. As shown in this figure, although there are no isolated nodes in the initialized state, the number of isolated nodes increases considerably in the subsequent states (up to 4 nodes). This leads to a \sim 4 times reduction in cycle time compared to the initialized state. The appearance of isolated nodes also greatly reduces the number of connections between silos by \sim 3.6 times, from 11 down to 3 connections, and the discarded connections all have high latency.

Fig-1

3. Multigraph Ablation Study

Accuracy Analysis. In federated learning, improving the model accuracy is not the main focus of topology design methods. However, preserving the accuracy is also important to ensure model convergence. Table 4 shows the accuracy of different topologies after 6,400 communication training rounds on the FEMNIST dataset. This table illustrates that our proposed method achieves accuracy competitive with other topology designs. This confirms that our topology can maintain the accuracy of the model while significantly reducing the training time.

Tab-4

Table 4: Accuracy comparison between different topologies. The experiment is conducted using the FEMNIST dataset. The accuracy is reported after 6,400 communication rounds in all methods.

Convergence Analysis. The figure below shows the training loss versus the number of communication rounds and the wall-clock time under the Exodus network using the FEMNIST dataset. This figure illustrates that our proposed topology converges faster than other methods while maintaining the model accuracy. We observe the same results on other datasets and network setups.

Cycle Time and Accuracy Trade-off. In our method, the maximum number of edges between two nodes t mainly affects the number of isolated nodes. This leads to a trade-off between model accuracy and cycle time. Table 5 illustrates the effect of this parameter. When t = 1, we technically consider that there are no weak connections and no isolated nodes; therefore, our method uses the original overlay from RING. When t is set higher, we can increase the number of isolated nodes, hence decreasing the cycle time. In practice, too many isolated nodes limit the model weights that can be exchanged between silos. Therefore, models at isolated nodes become biased towards their local data, which consequently affects the final accuracy.

Tab-5

Table 5: Cycle time and accuracy trade-off with different values of t, i.e., the maximum number of edges between two nodes.

4. Conclusion

We proposed a new multigraph topology for cross-silo federated learning. Our method first constructs the multigraph using the overlay. Different graph states are then parsed from the multigraph and used in each communication round. Our method significantly reduces the cycle time by allowing the isolated nodes in the multigraph to do model aggregation without waiting for other nodes. The intensive experiments on three datasets show that our proposed topology achieves new state-of-the-art results in all network and dataset setups.

Reducing Training Time in Cross-Silo Federated Learning using Multigraph Topology (Part 3)

In the previous part, we investigated how the delay time and cycle time are affected by modifications of the multigraph in the topology design, and how the multigraph can be constructed. In this post, we describe the multigraph parsing process, how to train a multigraph under decentralized federated learning, and the experimental setups for Multigraph.

Our code can be found at: https://github.com/aioz-ai/MultigraphFL

1. Multigraph Parsing

In our state-parsing algorithm, we parse the multigraph \mathcal{G}_m into multiple graph states \mathcal{G}_m^s. Graph states are essential to identify the connection status of silos in a specific communication round to perform model aggregation. In each graph state, our goal is to identify the isolated nodes. During training, isolated nodes update their weights internally and ignore all weakly-connected edges that connect to them.

To parse the multigraph into graph states, we first identify the maximum number of states in a multigraph s_{\max} by using the least common multiple (LCM) of the per-pair edge counts. We then parse the multigraph into s_{\max} states. The first state is always the overlay, since we want to make sure all silos have a reliable topology at the beginning to ease the training. The remaining states are parsed so that there is only one connection between two nodes. Using our algorithm, some states will contain isolated nodes. During the training process, only one graph state is used in a communication round; a small sketch of this parsing step is given below. The figure below illustrates the training process in each communication round using multiple graph states.
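
The following rough sketch illustrates this parsing step. It assumes that the number of edges per silo pair is given, that edge index 0 of every pair is its strongly-connected edge (so that state 0 reduces to the overlay), and that s_max is the LCM of the per-pair edge counts; the released code may organize this differently.

```python
from math import lcm  # Python 3.9+

def parse_multigraph(edge_counts):
    """Parse a multigraph into s_max graph states.

    edge_counts: dict mapping a silo pair (i, j) to its number of edges in the
                 multigraph (one strongly-connected edge plus possibly several
                 weakly-connected ones).
    Returns a list of states; each state keeps exactly one edge per pair.
    """
    s_max = lcm(*edge_counts.values())          # maximum number of states
    states = []
    for s in range(s_max):
        # State 0 is the overlay: every pair uses its strongly-connected edge.
        state = {pair: (0 if s == 0 else s % n) for pair, n in edge_counts.items()}
        states.append(state)
    return states

states = parse_multigraph({("A", "B"): 2, ("B", "C"): 3, ("A", "C"): 1})
print(len(states))  # lcm(2, 3, 1) = 6 states
```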

2. Multigraph Training

In each communication round, a graph state \mathcal{G}_m^s is selected in sequence, and it identifies the topology design used for training. We then collect all strongly-connected edges in the graph state \mathcal{G}_m^s in such a way that nodes with strongly-connected edges need to wait for their neighbors, while the isolated ones can update their models immediately. We train our multigraph with the DPASGD algorithm:

w_{i}\left(k + 1\right) = \begin{cases} \sum_{j \in \mathcal{N}_{i}^{++} \cup \{i\}} A_{i,j} w_{j}\left(k - h\right), & \text{if } k \equiv 0 \pmod{u + 1} \text{ and } \left|\mathcal{N}_{i}^{++}\right| > 1,\\ w_{i}\left(k\right)-\alpha_{k}\frac{1}{b}\sum^b_{h=1}\nabla L_i\left(w_{i}\left(k\right),\xi_i^{\left(h\right)}\left(k\right)\right), & \text{otherwise,} \end{cases}

where (k - h) is the index of the considered weights; h is initialized to 0 and incremented as h = h + 1 for every e_{k-h}(i,j) = 0. Through the equation above, at each state, if a silo is not an isolated node, it must wait for the models from its neighbors to update its weights. If a silo is an isolated node, it can immediately use the model of its neighbors from the (k-h)-th round to update its weights. A simplified sketch of one such communication round is given below.
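
The sketch below only illustrates the control flow of one communication round; the silo and graph-state objects are hypothetical stand-ins for whatever bookkeeping the actual implementation uses, and staleness handling is omitted for brevity.

```python
def communication_round(silos, graph_state, consensus_A, lr, local_steps):
    """One round of multigraph training (simplified DPASGD-style sketch).

    silos: list of objects exposing .weights and .local_sgd(lr, steps).
    graph_state: exposes is_isolated(i) and strong_in_neighbors(i).
    consensus_A: row-stochastic consensus matrix A.
    """
    new_weights = {}
    for i, silo in enumerate(silos):
        if graph_state.is_isolated(i):
            # Isolated node: skip aggregation and just run local updates,
            # reusing the most recent (possibly stale) neighbour models.
            new_weights[i] = silo.local_sgd(lr=lr, steps=local_steps)
        else:
            # Strongly-connected node: wait for in-neighbours, then average
            # their weights with the consensus matrix A.
            neigh = graph_state.strong_in_neighbors(i) + [i]
            new_weights[i] = sum(consensus_A[i][j] * silos[j].weights for j in neigh)
    for i, silo in enumerate(silos):
        silo.weights = new_weights[i]
```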

3. Algorithm Complexity

It is trivial to see that the complexity of the training procedure is \mathcal{O}(n^2). In practice, since the cross-silo federated learning setting has only a few hundred silos (n < 500), the time to execute our algorithms is just a tiny fraction of the training time. Therefore, our proposed topology can still significantly reduce the overall wall-clock training time.

4. Experimental Setups

Datasets. We use three datasets in our experiments: Sentiment140, iNaturalist, and FEMNIST. All datasets and their pre-processing follow recent works. The table below shows the dataset setups in detail.

Network. We consider five distributed networks in our experiments: Exodus, Ebone, Géant, Amazon, and Gaia. Exodus, Ebone, and Géant are from the Internet Topology Zoo. The Amazon and Gaia networks are synthetic and are constructed using the geographical locations of the data centers.

Baselines. We compare our multigraph topology with recent state-of-the-art topology designs for federated learning: STAR, MATCHA, MATCHA(+), MST, and RING.

Hardware Setup. Since measuring the cycle time is crucial to compare the effectiveness of all topologies in practice, we test and report the cycle time of all baselines and our method on the same NVIDIA Tesla P100 16GB GPUs. No overclocking is used.

Time Simulator. We adapted PyTorch with the MPI backend to run DPASGD and DPASGD++ on a GPU cluster. We take advantage of a network simulator, the Time Simulator, which takes an arbitrary topology and the computation times of silos as input to calculate the time instants at which local models are computed. Reconstructing the wall-clock time with this simulator requires a thorough understanding of the topology, including all the factors mentioned in the delay equations for each network configuration. The related configuration information is already provided for the Gaia network, and the simulator is created by Marfoq et al.

Next

In the next post, we will mention the effectiveness and efficiency of multigraph topology design under different configurations.

Reducing Training Time in Cross-Silo Federated Learning using Multigraph Topology (Part 1)

Federated learning is an active research topic since it enables several participants to jointly train a model without sharing local data. Currently, cross-silo federated learning is a popular training setting that utilizes a few hundred reliable data silos with high-speed access links to train a model. While this approach has been widely applied in real-world scenarios, designing a robust topology to reduce the training time remains an open problem. In this paper, we present a new multigraph topology for cross-silo federated learning. We first construct the multigraph using the overlay graph. We then parse this multigraph into different simple graphs with isolated nodes. The existence of isolated nodes allows us to perform model aggregation without waiting for other nodes, hence effectively reducing the training time. Intensive experiments on three public datasets show that our proposed method significantly reduces the training time compared with recent state-of-the-art topologies while maintaining the accuracy of the learned model.

Our code can be found at: https://github.com/aioz-ai/MultigraphFL

1. Introduction

Federated learning involves training models using remote devices or isolated data centers while keeping the data localized to respect user privacy policies. According to available literature, there are two prominent training scenarios: the "cross-device" scenario, which includes numerous unreliable edge devices with limited computational capacity and slow connection speeds, and the "cross-silo" scenario, which features a smaller number of reliable data silos with powerful computing resources and high-speed access links. Recently, the cross-silo scenario has gained traction in various federated learning applications.

In practical terms, federated learning represents a promising research avenue that allows us to harness the capabilities of machine learning techniques while upholding user privacy. Key obstacles in federated learning encompass issues like model convergence, communication bottlenecks, and disparities in data distributions across different silos. A commonly employed federated training approach involves establishing a central node responsible for overseeing the training process and aggregating contributions from all clients. However, a drawback of this client-server approach is the potential for communication bottlenecks, especially when dealing with a large number of clients. To mitigate this limitation, recent research has explored the concept of decentralized or peer-to-peer federated learning, where communication occurs via a peer-to-peer network topology, eliminating the need for a central node. Nevertheless, a major challenge in decentralized federated learning remains achieving rapid training while ensuring model convergence and preserving model accuracy.

In federated learning, the structure of communication networks holds significant importance. Specifically, an efficient network design contributes to quicker convergence, resulting in reduced training duration and energy consumption, as measured by worst-case convergence bounds within the topology's framework. Additionally, the topology's design has direct implications for other training-related challenges, including network congestion, overall model accuracy, and energy efficiency. The development of a resilient network structure capable of minimizing training time while preserving model accuracy remains an ongoing challenge in federated learning. Our paper is dedicated to devising a novel network design tailored for cross-silo federated learning, a prevalent scenario in practical applications.

Figure 1. We conducted a comparative analysis of various network structures using the FEMNIST dataset and the Exodus network. After completing 6,400 communication rounds, we measured and reported both the accuracy and the total wall-clock training time (or overhead time). Notably, our approach resulted in a substantial reduction in training duration while upholding model accuracy.

Lately, various network configurations have emerged for cross-silo federated learning. For instance, the STAR topology involves an orchestrator averaging all models during each communication round. Another approach, known as MATCHA, divides potential communications into pairs of clients, with random selection for model transmission in each round. Additionally, the RING topology employs max-plus linear systems. Despite progress in this field, challenges persist, including access link congestion, straggler effects, and the establishment of diverse topologies across communication rounds.

In this paper, we introduce a novel multigraph topology inspired by recent advancements in federated learning. Our aim is to enhance the efficiency of cross-silo federated learning. Our approach involves constructing a multigraph based on the overlay of existing network topologies. Subsequently, we decompose this multigraph into simpler graphs, each featuring only a single edge connecting two nodes. These individual graphs are referred to as "states" within the multigraph. Importantly, each state can involve isolated nodes that perform model aggregation independently, reducing the cycle time in each communication round significantly. Our intensive experiments demonstrate that our proposed topology outperforms existing state-of-the-art methods by a wide margin in terms of training time for cross-silo federated learning, as illustrated in Figure.1.

2. Overview

Federated Learning is recognized for its capacity to safeguard data privacy. In its modern incarnation, federated learning adopts a centralized network design, where a central node collects gradients from client nodes to update a global model. Early contributions in federated learning research include pioneering work and seminal papers by various researchers. Subsequent extensions and developments in federated learning and related distributed optimization algorithms have been proposed. Federated Averaging (FedAvg), initially introduced by one group, has inspired variations and other recent state-of-the-art model aggregation techniques, addressing convergence and the non-IID (non-identically and independently distributed) data challenge. Despite its simplicity, the client-server approach faces communication and computational bottlenecks at the central node, particularly when dealing with a large number of clients.

Decentralized Federated Learning flips the traditional federated learning model, enabling direct interactions between siloed data nodes, eliminating the necessity for a central coordinating node. While this approach mitigates communication congestion at a central point, optimizing a fully peer-to-peer network presents substantial challenges. The decentralized periodic averaging stochastic gradient descent method has demonstrated convergence rates comparable to centralized algorithms, making large-scale model training feasible. Furthermore, previous research has conducted systematic analyses of decentralized federated learning. A recent advancement involves leveraging a knowledge distillation mechanism to facilitate collaboration among silos in decentralized federated scenarios while preserving privacy among neighboring nodes.

Communication Topology plays a fundamental role in influencing the complexity and convergence behavior of federated learning. Numerous efforts have been dedicated to improving the efficiency of communication topologies, including star-shaped topologies and optimized-shaped topologies. In particular, a spanning tree topology has been introduced to reduce training time.

The STAR topology is designed for orchestrating the averaging of model updates in each communication round. Meanwhile, the MATCHA approach focuses on accelerating the training process through decomposition sampling. Recognizing the impact of straggler effects on communication round duration, methods for selecting the degree of a regular topology have been explored.

The RING topology is tailored for cross-silo federated learning and leverages the principles of max-plus linear systems. A sample-induced topology has been introduced, capable of effectively recovering the performance of existing SGD-based algorithms and their corresponding convergence rates. In a recent comprehensive survey, various models, frameworks, and algorithms related to network topologies in federated learning have been explored.

Multigraph is a concept that originates from traditional mathematics. In conventional terms, a "graph" typically denotes a simple graph without loops or multiple edges between two nodes. In contrast, a multigraph allows for the presence of multiple edges between two nodes. In the realm of deep learning, multigraphs have found utility across various domains, including clustering, medical image processing, traffic flow prediction, activity recognition, recommendation systems, and cross-domain adaptation. In this research, we employ a multigraph construction to facilitate isolated nodes and expedite training in cross-silo federated learning.

3. Preliminaries

3.1 Federated Learning

In federated learning, silos do not share their local data, but still periodically transmit model updates between them. Given N siloed data centers, the objective function for federated learning is:

\min_{\textbf{w} \in \mathbb R^d} \sum^{N}_{i=1}p_i E_{\xi_i}\left[ L_{i}\left(\textbf{w}, \xi_i\right)\right],

where L_{i}(\textbf{w}, \xi_i) is the loss of the model parameterized by the weight \textbf{w} \in \mathbb R^d, \xi_i is an input sample drawn from the data at silo i, and the coefficient p_i>0 specifies the relative importance of each silo. Recently, different distributed algorithms have been proposed to optimize this objective. In this work, DPASGD is used to update the weights of silo i in each training round as follows:

\textbf{w}_{i}\left(k + 1\right) = \begin{cases} \sum_{j \in \mathcal{N}_i^{+} \cup \{i\}}\textbf{A}_{i,j}\textbf{w}_{j}\left(k\right), & \text{if } k \equiv 0 \pmod{u + 1},\\ \textbf{w}_{i}\left(k\right)-\alpha_{k}\frac{1}{b}\sum^b_{h=1}\nabla L_i\left(\textbf{w}_{i}\left(k\right),\xi_i^{\left(h\right)}\left(k\right)\right), & \text{otherwise,} \end{cases}

where b is the batch size, i,j denote silos, u is the number of local updates, \alpha_k > 0 is a potentially varying learning rate at the k-th round, \textbf{A} \in R^{N \times N} is a consensus matrix with non-negative weights, and \mathcal{N}_i^{+} is the set of in-neighbors that silo i is connected to.

3.2 Multigraph for Federated Learning

Connectivity and Overlay. We consider the \textit{connectivity} \mathcal{G}_c = (\mathcal{V}, \mathcal{E}_c) as a graph that captures the possible direct communications among silos. Based on its definition, the connectivity is often a fully connected graph and is also a directed graph. The \textit{overlay} \mathcal{G}_o is a connected subgraph of the connectivity graph, i.e., \mathcal{G}_o = (\mathcal{V}, \mathcal{E}_o), where \mathcal E_o \subset \mathcal E_c. Only nodes directly connected in the overlay graph \mathcal{G}_o exchange messages during training.

Multigraph. While the connectivity and overlay graphs can represent different topologies for federated learning, one of their drawbacks is that there is only one connection between two nodes. In our work, we construct a \textit{multigraph} \mathcal{G}_m = (\mathcal{V}, \mathcal{E}_m) from the overlay \mathcal{G}_o. The multigraph can contain multiple edges between two nodes. In practice, we parse this multigraph into different graph states, where each state is a simple graph with only one edge between two nodes.

In the multigraph \mathcal{G}_m, the connection edge between two nodes has two types: a \textit{strongly-connected} edge and a \textit{weakly-connected} edge. Under both strong and weak connections, the participating nodes can transmit their trained models to their out-neighbours \mathcal{N}_i^{-} or download models from their in-neighbours \mathcal{N}_i^{+}. However, with a strongly-connected edge, two nodes in the graph must wait until all upload and download processes between them are finished before doing model aggregation. On the other hand, with a weakly-connected edge, the model aggregation process in each node can start whenever the previous training process is finished, by leveraging up-to-date models from the in-neighbours of that node which have not been used before.

State of Multigraph. Given a multigraph \mathcal{G}_m, we can parse this multigraph into different simple graphs with only one connection between two nodes (either strongly-connected or weakly-connected). We denote each simple graph as a state \mathcal{G}_m^s of the multigraph.

Isolated Node. A node is called isolated when all of its connections to other nodes are weakly-connected edges. Figure 2 shows the graph concepts and an isolated node.

Figure 2. Example of connectivity, overlay, multigraph, and a state of our multigraph. Blue node is an isolated node. Dotted line denotes a weakly-connected edge.

Next

In the next post, we will discuss the delay time and cycle time, and how the multigraph can be constructed.

Reducing Training Time in Cross-Silo Federated Learning using Multigraph Topology (Part 2)

In the previous part, we explored decentralized federated learning, and how and why the multigraph is proposed to improve the training process. In this part, we will investigate how the delay time and cycle time are affected by modifications of the multigraph in the topology design. We will also explore how the multigraph can be constructed.

Our code can be found at: https://github.com/aioz-ai/MultigraphFL

1. Delay time in multigraph

The delay of an edge e(i, j) is the time interval until node j receives the weights sent by node i, which can be defined by:

d(i,j) = u \times T_c(i) + l(i,j) + \frac{M}{O(i,j)},

where T_{c}(i) denotes the time to compute one local update of the model; u is the number of local updates; l(i,j) is the link latency; M is the model size; and O(i, j) is the total network traffic capacity. However, unlike other communication infrastructures, the multigraph only contains connections between silos, without other nodes such as routers or amplifiers. Thus, the total network traffic capacity is O(i,j) = \text{min}\left(\frac{C_{\rm{UP}}(i)}{\left|{\mathcal{N}_{i}^{-}}\right|}, \frac{C_{\rm{DN}}(j)}{\left|\mathcal{N}_{i}^{+}\right|}\right), where C_{\rm{UP}} and C_{\rm{DN}} denote the upload and download link capacity. Note that the upload and download processes can happen in parallel.
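
Under these definitions, the per-edge delay can be computed directly. The helper below is a small sketch in which all quantities are assumed to be given in consistent units (seconds for times, bits for the model size, bits per second for capacities):

```python
def delay(u, T_c_i, link_latency, model_size, c_up_i, c_dn_j, n_up, n_dn):
    """d(i, j) = u * T_c(i) + l(i, j) + M / O(i, j), with
    O(i, j) = min(C_UP(i) / n_up, C_DN(j) / n_dn) as defined above."""
    O_ij = min(c_up_i / n_up, c_dn_j / n_dn)
    return u * T_c_i + link_latency + model_size / O_ij

# Example: 1 local update taking 0.5s, 20ms latency, 4.62Mb model, 10Gbps links.
d = delay(u=1, T_c_i=0.5, link_latency=0.02, model_size=4.62e6,
          c_up_i=1e10, c_dn_j=1e10, n_up=3, n_dn=3)
```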

Since the multigraph can contain multiple edges between two nodes, we extend the delay definition in the previous equation to d_k(i,j), where k is the k-th communication round during the training process:

d_{k+1}(i,j) = \begin{cases} d_k(i,j), & \text{if } e_{k+1}(i,j) = 1 \text{ and } e_{k}(i,j) = 1,\\ \text{max}\left( u \times T_c(j),\; d_{k}(i,j) - d_{k-1}(i,j)\right), & \text{if } e_{k+1}(i,j) = 1 \text{ and } e_{k}(i,j) = 0,\\ \tau_k(\mathcal{G}_m) + d_{k-1}(i,j), & \text{if } e_{k+1}(i,j) = 0 \text{ and } e_{k}(i,j) = 1,\\ \tau_k(\mathcal{G}_m), & \text{otherwise,} \end{cases}

where e(i,j) = 0 denotes a weakly-connected edge, e(i,j) = 1 denotes a strongly-connected edge, and \tau_k(\mathcal{G}_m) is the cycle time at the k-th computation round during the training process.

In general, using the introduced equation, \textit{the delay of the next communication round d_{k+1} is updated based on the delays of the previous rounds} and other factors, depending on the type of edge connection.
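
A direct transcription of this piecewise update is given below; it is only a sketch and assumes that the edge indicators and the delays of the two previous rounds are stored per silo pair:

```python
def next_delay(e_next, e_curr, d_k, d_k_minus_1, u, T_c_j, cycle_time_k):
    """Update d_{k+1}(i, j) from the previous delays, following the cases above."""
    if e_next == 1 and e_curr == 1:
        return d_k
    if e_next == 1 and e_curr == 0:
        return max(u * T_c_j, d_k - d_k_minus_1)
    if e_next == 0 and e_curr == 1:
        return cycle_time_k + d_k_minus_1
    return cycle_time_k
```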

2. Cycle time in multigraph

The cycle time per round is the time required to complete a communication round. In this work, we define the cycle time per round as the maximum delay among all silo pairs with strongly-connected edges. Therefore, the average cycle time over the entire training is:

\tau(\mathcal{G}_m) =\frac{1}{K}\sum^{K-1}_{k=0} \left(\underset{j \in \mathcal{N}^{++}_{i} \cup\{i\}, \forall i \in \mathcal{V}}{\text{max}} \left(d_k\left(j,i\right)\right)\right),

where K is the total number of communication rounds and \mathcal{N}_{i}^{++} is the set of in-neighbor silos of i whose edges are strongly-connected.
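
Given the per-round delays, the average cycle time is just the mean over rounds of the worst-case strongly-connected delay. A small sketch (data structures are assumptions; self-delays are taken as zero):

```python
def average_cycle_time(delays, strong_in_neighbors):
    """delays[k][(j, i)] -> d_k(j, i); strong_in_neighbors[i] -> N_i^{++}.

    Returns the average over rounds of the maximum delay among
    strongly-connected in-edges (plus the node itself, with zero self-delay).
    """
    total = 0.0
    for round_delays in delays:
        worst = 0.0
        for i, neigh in strong_in_neighbors.items():
            for j in list(neigh) + [i]:
                worst = max(worst, round_delays.get((j, i), 0.0))
        total += worst
    return total / len(delays)
```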

3. Multigraph Construction

Algorithm 1 describes our method to generate the multigraph \mathcal{G}_m with multiple edges between silos. The algorithm takes the overlay \mathcal{G}_o as input. Similar to~\cite{marfoq2020throughput}, we use the Christofides algorithm to obtain the overlay. In Algorithm 1, we establish multiple edges that indicate different statuses (strongly-connected or weakly-connected). To identify the total number of edges between a silo pair, we divide the delay d(i,j) by the smallest delay d_{\min} over all silo pairs and compare it with the maximum-number-of-edges parameter t (t=5 in our experiments). \textit{We assume that silo pairs with longer delay will have more weakly-connected edges, hence potentially becoming isolated nodes.} Overall, we aim to increase the number of weakly-connected edges, which generates more isolated nodes to speed up the training process; a simplified sketch is shown below. Note that, following Algorithm 1, each silo pair in the multigraph has one strongly-connected edge and possibly multiple weakly-connected edges. The role of the strongly-connected edge is to make sure that two silos have a good connection in at least one communication round.
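
The sketch below is a simplified reading of Algorithm 1 (our own approximation, not the released implementation): each silo pair gets a number of edges proportional to its delay relative to the fastest pair, capped at t.

```python
def build_multigraph(overlay_edges, delay, t=5):
    """overlay_edges: iterable of silo pairs (i, j) from the overlay.
    delay: dict mapping (i, j) to d(i, j).
    Returns a dict mapping (i, j) to a list of edge types:
    one 'strong' edge plus extra 'weak' edges for slower pairs.
    """
    d_min = min(delay[e] for e in overlay_edges)
    multigraph = {}
    for e in overlay_edges:
        n_edges = min(t, max(1, round(delay[e] / d_min)))
        # Pairs with longer delay get more weakly-connected edges,
        # so they are more likely to become isolated in some states.
        multigraph[e] = ["strong"] + ["weak"] * (n_edges - 1)
    return multigraph
```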

Next

In the next post, we will describe the multigraph parsing process and how to train a multigraph under decentralized federated learning.

Deep Federated Learning for Autonomous Driving (Part 2)

In the previous part, we discussed the autonomous driving network, FADNet. In this post, we will verify its effectiveness and efficiency.

Our source code can be found at: https://github.com/aioz-ai/FADNet

1. Experimental Setup

Udacity. We use the popular Udacity dataset to evaluate our results. We only use the front-facing camera images of this dataset in our experiments. We use 5 sequences for training and 1 for testing. The training sequences are assigned randomly to different silos depending on the federated topology (i.e., Gaia or NWS).

Carla. Since the Udacity dataset is collected in a real-world environment, changing the weather or lighting conditions is not easy. To this end, we collect more simulated data with the Carla simulator. We applied different lighting conditions (morning, noon, night, sunrise, sunset) and weather conditions (cloudy, rain, heavy rain, wet streets, windy, snowy) when collecting the data. We generated 73,235 samples distributed over 11 sequences of scenes.

Gazebo. Since both the Udacity and Carla datasets are collected in outdoor environments, we also employ Gazebo to collect data for autonomous navigation in indoor scenes. We use a simulated mobile robot and the built-in scenes to collect data. Table 1 shows the statistics of the three datasets. We use 80% of the collected Gazebo and Carla data for training and the remaining 20% for testing.

Figure 1. Visualization of sample images in three datasets: Udacity (first row), Gazebo (second row), and Carla (third row).

| Dataset | Total samples | Average samples in each silo (Gaia) | Average samples in each silo (NWS) |
|---|---|---|---|
| Udacity | 39,087 | 3,553 | 1,777 |
| Gazebo | 66,806 | 6,073 | 3,037 |
| Carla | 73,235 | 6,658 | 3,329 |

Table 1. The Statistics of the Datasets in Our Experiments.

Network Topology. We conduct experiments on two topologies: the Internet Topology Zoo (Gaia), and the North America data centers (NWS). We use Gaia topology in our main experiment and provide the comparison of two topologies in our ablation study.

Training. The model in each silo is trained with a batch size of 32 and a learning rate of 0.001 using the Adam optimizer. We follow the training process to obtain a global weight across all silos. The training is conducted for 3,000 communication rounds, and each silo has one NVIDIA 1080 11GB GPU for training. Note that one communication round is counted each time all silos have finished updating their model weights.

Baselines. We compare our results with various recent methods, including Random baseline and Constant baseline, Inception-V3, MobileNet-V2, VGG-16, and Dronet. All these methods use the Centralized Local Learning (CLL) strategy (i.e., the data are collected and trained in one local machine.) For distributed learning, we compare our Deep Federated Learning (DFL) approach with the Server-based Federated Learning (SFL) strategy. As the standard practice, we use the root-mean-square error (RMSE) metric to evaluate the results.

2. Results

Table 2 summarises the performance of our method and recent state-of-the-art approaches. Note that our FADNet is trained using the proposed peer-to-peer DFL on the Gaia topology with 11 silos. This table clearly shows that our FADNet + DFL outperforms other methods by a fair margin. In particular, our FADNet + DFL significantly reduces the RMSE on the Gazebo and Carla datasets, while slightly outperforming DroNet on the Udacity dataset. These results validate the robustness of our FADNet while it is being trained in a fully decentralized setting. Table 3 also shows that with a proper deep architecture such as our FADNet, we can achieve state-of-the-art accuracy when training the deep model in FL. Fig. 2 illustrates the spatial support regions when our FADNet makes its predictions. In particular, we can see that FADNet focuses on the "line-like" patterns in the input frame, which guide the driving direction.

| Architecture | Learning Method | Udacity | Gazebo | Carla | #Params |
|---|---|---|---|---|---|
| Random | - | 0.301 | 0.117 | 0.464 | - |
| Constant | - | 0.209 | 0.092 | 0.348 | - |
| Inception | CLL | 0.154 | 0.085 | 0.297 | 21,787,617 |
| MobileNet | CLL | 0.142 | 0.083 | 0.286 | 2,225,153 |
| VGG-16 | CLL | 0.121 | 0.083 | 0.316 | 7,501,587 |
| DroNet | CLL | 0.110 | 0.082 | 0.333 | 314,657 |
| FADNet (ours) | DFL | 0.107 | 0.069 | 0.203 | 317,729 |

Table 2. Performance comparison of different architectures on the Udacity, Gazebo, and Carla datasets. The number of parameters (#Params) is also provided.

Figure 2. Spatial support regions for predicting steering angle in three datasets. In most cases, we can observe that our FADNet focuses on ``line-like” patterns to predict the driving direction.

3. Ablation Studies

Effectiveness of our DFL.

Table 3 summarises the accuracy of DroNet and our FADNet when we train them using different learning methods: CLL, SFL, and our peer-to-peer DFL. From this table, we can see that training both DroNet and FADNet with our peer-to-peer DFL clearly improves the accuracy compared with the SFL approach. This confirms the robustness of our fully decentralized approach and removes the need for a central server when training a deep network with FL. Compared with the traditional CLL approach, our DFL also shows competitive performance. However, we note that training a deep architecture using CLL is less complicated than with SFL or DFL. Furthermore, CLL is not a federated learning approach and does not take into account the privacy of the user data.

| Architecture | Learning Method | Udacity | Gazebo | Carla |
|---|---|---|---|---|
| DroNet | CLL | 0.110 | 0.082 | 0.333 |
| DroNet | SFL | 0.176 | 0.081 | 0.297 |
| DroNet | DFL (ours) | 0.152 | 0.073 | 0.244 |
| FADNet (ours) | CLL | 0.142 | 0.081 | 0.303 |
| FADNet (ours) | SFL | 0.151 | 0.071 | 0.211 |
| FADNet (ours) | DFL (ours) | 0.107 | 0.069 | 0.203 |

Table 3. Performance comparison of different methods.

Effectiveness of our FADNet.

Table 3 shows that, apart from the learning method, the deep architecture also affects the final results. This table illustrates that our FADNet combined with DFL outperforms DroNet in all configurations. We notice that DroNet achieves competitive results when trained with CLL. However, DroNet is not designed for federated training, hence it does not achieve good accuracy when trained with SFL or DFL. On the other hand, our FADNet is specifically designed with dedicated layers to handle the data imbalance and model convergence problems in federated training. Therefore, FADNet achieves new state-of-the-art results on all three datasets.

Network Topology Analysis.

Table 4 illustrates the performance of DroNet and our FADNet when we train them using DFL under two distributed network topologies: Gaia and NWS. This table shows that the results of DroNet and FADNet under DFL are stable in both the Gaia and NWS distributed networks. We note that the NWS topology has 22 silos while the Gaia topology has only 11 silos. This result validates that our FADNet and DFL do not depend on the distributed network topology. Therefore, we can potentially use them in practice with more silo data.

| Network Topology | Architecture | Udacity | Gazebo | Carla |
|---|---|---|---|---|
| Gaia (11 silos) | DroNet | 0.152 | 0.073 | 0.244 |
| Gaia (11 silos) | FADNet (ours) | 0.107 | 0.069 | 0.203 |
| NWS (22 silos) | DroNet | 0.157 | 0.075 | 0.239 |
| NWS (22 silos) | FADNet (ours) | 0.109 | 0.070 | 0.200 |

Table 4. Performance comparison of different network topologies.

Convergence Analysis.

The effectiveness of federated learning algorithms is identified through their convergence ability, including accuracy and training speed, especially when dealing with an increasing number of silos in practice. Fig. 3 shows the convergence ability of our FADNet with DFL using two topologies: Gaia with 11 silos and NWS with 22 silos. This figure shows that our proposed DFL achieves the best results on the Gaia and NWS topologies and converges faster than the SFL approach on both the Gazebo and Carla datasets. We also notice that the performance of our DFL is stable when the number of silos increases. Specifically, training our FADNet with DFL reaches the convergence point after approximately 150s and 180s on the NWS and Gaia topologies, respectively. Fig. 3 validates the convergence ability of our FADNet and DFL, especially when dealing with an increasing number of silos.

In practice, compared with the traditional CLL approach, federated learning methods such as SFL or DFL can leverage more GPUs remotely. Therefore, we can reduce the total training time significantly. However, the drawback of federated learning is we would need more GPUs in total (ideally one for each silo), and deep architecture also should be carefully designed to ensure model convergence.

Figure 3. The convergence ability of our FADNet and DFL under the Gaia and NWS topologies. Wall-clock time or elapsed real-time is the actual time taken from the start of the whole training process to the end, including the synchronization time of the weight aggregation process. All experiments are conducted with 3,000 communication rounds.

Deployment

To verify the effectiveness of our FADNet in practice, we deploy the model trained on the Gazebo dataset on a mobile robot. The robot is equipped with a RealSense camera to capture the front RGB images. Our FADNet is deployed on a Qualcomm RB5 board to predict the steering angle for the robot, running at approximately 12 frames per second. Overall, we observe that the robot can navigate smoothly in an indoor environment without colliding with obstacles. More qualitative results can be found in our supplementary material.

Conclusion

We propose a new approach to learn an autonomous driving policy from sensory data without violating the user's privacy. We introduce a peer-to-peer deep federated learning (DFL) method that effectively utilizes the user data in a fully distributed manner. Furthermore, we develop a new deep architecture - FADNet that is well suitable for distributed training. The intensive experimental results on three datasets show that our FADNet with DFL outperforms recent state-of-the-art methods by a fair margin. Currently, our deployment experiment is limited to a mobile robot in an indoor environment. In the future, we would like to test our approach with more silos and deploy the trained model using an autonomous car on man-made roads.

Deep Federated Learning for Autonomous Driving (Part 1)

Autonomous driving is an active research topic in both academia and industry. However, most existing solutions focus on improving accuracy by training learnable models with centralized large-scale data. Therefore, these methods do not take into account the user's privacy. In this paper, we present a new approach to learn an autonomous driving policy while respecting privacy concerns. We propose a peer-to-peer Deep Federated Learning (DFL) approach to train deep architectures in a fully decentralized manner and remove the need for central orchestration. We design a new Federated Autonomous Driving network (FADNet) that can improve model stability, ensure convergence, and handle imbalanced data distribution problems while being trained with federated learning methods. Intensive experimental results on three datasets show that our approach with FADNet and DFL achieves superior accuracy compared with other recent methods. Furthermore, our approach can maintain privacy by not collecting user data on a central server.

Our source code can be found at: https://github.com/aioz-ai/FADNet

1. Introduction

In this paper, our goal is to develop an end-to-end driving policy from sensory data while maintaining the user's privacy by utilizing FL. We address the key challenges in FL to make sure our deep network can achieve competitive performance when being trained in a fully decentralized manner. Fig. 1 shows an overview of different learning approaches for autonomous driving. In Centralized Local Learning (CLL), the data are collected and trained on one local machine. Hence, the CLL approach does not take into account the user's privacy. The Server-based Federated Learning (SFL) strategy requires a central server to orchestrate the training process and receive the contributions of all clients. The main limitation of SFL is communication congestion when the number of clients is large. Therefore, we follow the peer-to-peer federated learning setup for training. Our peer-to-peer Deep Federated Learning (DFL) is fully decentralized and can reduce communication congestion during training. We also propose a new Federated Autonomous Driving network (FADNet) to address the problems of model convergence and imbalanced data distribution. By training our FADNet using DFL, our approach outperforms recent state-of-the-art methods by a fair margin while maintaining user data privacy.

Figure 1. An overview of different learning methods for autonomous driving. (a) Centralized Local Learning, (b) Server-based Federated Learning, and (c) our peer-to-peer Deep Federated Learning. Red arrows denote the aggregation process between silos. Yellow lines with a red cross indicate the non-sharing data between silos.

Our contributions can be summarized as follows:

  • We propose a fully decentralized, peer-to-peer Deep Federated Learning framework for training autonomous driving solutions.
  • We introduce a Federated Autonomous Driving network that is well suitable for federated training.
  • We introduce two new datasets and conduct intensive experiments to validate our results.

2. Problem Formulation

We consider a federated network with N siloed data centers (e.g., autonomous cars) \mathcal{D}_{i}, with i \in [1,N]. Our goal is to collaboratively train a global driving policy \theta by aggregating all local learnable weights \theta_i of each silo. Note that, unlike the popular centralized local training setup, in FL training, each silo does not share its local data, but periodically transmits model updates to other silos.

In practice, each silo has the training loss \mathcal{L}_i(\xi_i, \theta_i), where \xi_i is the ground-truth in each silo i. \mathcal{L}_i(\xi_i, \theta_i) is calculated as the regression loss. This regression loss is modeled by a deep network that takes RGB images as inputs and predicts the associated steering angles.

3. Deep Federated Learning for Autonomous Driving

A popular training method in FL is to set up a central server that orchestrates the training process and receives the contributions of all clients (Server-based Federated Learning - SFL). The limitation of SFL is the server potentially represents a single point of failure in the system. We also may have communication congestion between the server and clients when the number of clients is massive. Therefore, in this work, we utilize the peer-to-peer FL to set up the training scenario. In peer-to-peer FL, there is no centralized orchestration, and the communication is via peer-to-peer topology. However, the main challenge of peer-to-peer FL is to assure model convergence and maintain accuracy in a fully decentralized training setting.

Figure 2. An overview of our peer-to-peer Deep Federated Learning method. (a) A simplified version of an overlay graph. (b) The training methodology in the overlay graph. Note that blue arrows denote the local training process in each silo; red arrows denote the aggregation process between silos controlled by the overlay graph; yellow lines with a red cross indicate the non-sharing data between silos; the arrow indicates that the process is parallel.

Fig.2 illustrates our Deep Federated Learning (DFL) method. Our DFL follows the peer-to-peer FL setup with the goal to integrate a deep architecture into a fully decentralized setting that ensures convergence while achieving competitive results compared to the traditional Centralized Local Learning or SFL approach. In practice, we can consider a silo as an autonomous car. Each silo maintains a local learnable model and does not share its data with other silos. We represent the silos as vertices of a communication graph and the FL is performed on an overlay, which is a sub-graph of this communication graph.

Designing the Overlay

Let \mathcal{G}_c = (\mathcal{V}, \mathcal{E}_c) be the connectivity graph that captures the possible direct communications among N silos. \mathcal{V} is the set of vertices (silos), while \mathcal{E}_c is the set of communication links between vertices. \mathcal{N}_i^{+} and \mathcal{N}_i^{-} are the in-neighbors and out-neighbors of a silo i, respectively. As in~\cite{marfoq2020throughput}, we note that it is unnecessary to use all the connections of the connectivity graph for FL. Indeed, a sub-graph called an overlay, \mathcal{G}_o = (\mathcal{V}, \mathcal{E}_o), can be generated from \mathcal{G}_c. In our work, \mathcal{G}_o is the result of Christofides' Algorithm~\cite{monnot2003approximation}, which yields a strong spanning sub-graph of \mathcal{G}_c with minimal cycle time. One cycle time, or time per communication round in general, is the time that a vertex waits for messages from the other vertices to do a computational update.

In practice, one block cycle time of an overlay \mathcal{G}_o depends on the delay of each link (i, j), denoted as d_o(i, j), which is the time interval between the beginning of a local computation at node i and the receipt of i's messages by j. Furthermore, without concerns about access-link delays between vertices, our graph is treated as an edge-capacitated network with:

d_o(i,j) = s \times T_c(i) + l(i,j) + \frac{M}{B(i,j)}

where T_c(i) is the time to compute one local update of the model; s is the number of local computational steps; l(i,j) is the link latency; M is the model size; and B(i,j) is the available bandwidth of the path (i,j). As in~\cite{marfoq2020throughput}, we set s=1.

Training Algorithm

At each silo ii, the optimization problem to be solved is:

\theta_i^{*} = \underset{\theta_i}{\arg\min} \underset{\xi \sim \mathcal{D}_i}{\mathbb{E}}[\mathcal{L}(\xi_i, \theta_i)]

We apply the distributed federated learning algorithm DPASGD to solve the optimization problems of all the silos. In fact, after waiting one cycle time, each silo i receives the parameters \theta_j from its in-neighbors \mathcal{N}_i^{+} and accumulates these parameters, multiplied by a non-negative coefficient from the consensus matrix \mathbf{A}. It then performs s mini-batch gradient updates before sending \theta_i to its out-neighbors \mathcal{N}_i^{-}, and the algorithm keeps repeating. Formally, at each iteration k, the updates are described as:

\theta_{i}\left(k + 1\right) = \begin{cases} \sum_{j \in \mathcal{N}_i^{+} \cup \{i\}}\textbf{A}_{i,j}{\theta}_{j}\left(k\right), & \text{if } k \equiv 0 \pmod{s + 1},\\ {\theta}_{i}\left(k\right)-\alpha_{k}\frac{1}{m}\sum^m_{h=1}\nabla \mathcal{L}\left({\theta}_{i}\left(k\right),\xi_i^{\left(h\right)}\left(k\right)\right), & \text{otherwise,} \end{cases}

where m is the mini-batch size and \alpha_k > 0 is a potentially varying learning rate.

Federated Averaging

To compute the predictions of the models in all silos, we compute the average model \theta using weight aggregation over all the local models \theta_i. The federated averaging process is conducted as follows:

\theta = \frac{1}{\sum^N_{i=0}{\lambda_i}} \sum^N_{i=0}\lambda_{{i}} \theta_{{i}}

where N is the number of silos and \lambda_i \in \{0,1\}. Note that \lambda_i = 1 indicates that silo i joins the inference process and \lambda_i = 0 indicates that it does not. The aggregated weight \theta is then used for evaluation on the testing set \mathcal{D}_{test}.
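
The averaging step itself is straightforward. Below is a small sketch using PyTorch state dicts; the binary lambda mask follows the equation above, and the dummy parameter names are only for illustration:

```python
import torch

def federated_average(state_dicts, lambdas):
    """Average silo models theta_i weighted by lambda_i in {0, 1}.

    Assumes at least one silo has lambda_i = 1.
    """
    total = sum(lambdas)
    return {
        name: sum(l * sd[name] for l, sd in zip(lambdas, state_dicts)) / total
        for name in state_dicts[0]
    }

# Example with two dummy one-parameter "models"; the second silo is excluded.
silos = [{"fc.weight": torch.ones(2, 2)}, {"fc.weight": torch.zeros(2, 2)}]
theta = federated_average(silos, lambdas=[1, 0])
```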

4. Network Architecture

One of the main challenges when training a deep network in FL is the imbalanced and non-IID (identically and independently distributed) problem in data partitioning across silos. To overcome this problem, the learning architecture should have an appropriate design to balance the trade-off between convergence ability and accuracy performance. In practice, the deep architecture has to deal with the high variance between silo weights when the accumulation process for all silos is conducted. To this end, we design a new Federated Autonomous Driving Network, which is based on ResNet8, as shown in Fig.3.

Figure 3. An overview of our FADNet architecture.

In particular, our proposed FADNet first comprises an input layer normalization to improve the stability of the abstract layer. This layer aims to handle the different distributions of input images in each silo. Then, a convolution layer followed by a max-pooling layer is added to encode the input. To handle the vanishing gradient problem, three residual blocks are appended, followed by an FC layer to extract the ResBlock feature. However, using residual blocks increases the variance of silo weights during the aggregation process and affects the convergence ability of the model. To address this problem, we add a Global Average Pooling (GAP) layer associated with each residual block. GAP is a non-weight pooling layer which sums out the spatial information from each residual block. Thus, it is not affected by the weight variance problem. The output of each GAP layer is passed through an Accumulation layer to accrue the Support feature. The ResBlock feature and the Support feature from the GAP layers are fed into the Aggregation layer to calculate the model loss in each silo.

In our design, the Accumulation and Aggregation layers aim to reduce the variance of the global model, since we need to combine multiple model weights produced by different silos. In particular, the Accumulation layer is a variant of the fully connected (FC) layer. Instead of weighting the contribution of input nodes as in FC, the Accumulation layer weights the contribution of multiple features from input layers. The Accumulation layer has a learnable weight matrix w \in \mathbb{R}^\text{n}. Its number of nodes is equal to the number \text{n} of input layers. Note that the support feature from the Accumulation layer has the same size as the input. Let F = \{f_\text{1}, f_\text{2}, ..., f_\text{n}\}, \forall f_\text{h} \in \mathbb{R}^\text{d} be the collection of \text{n} features extracted from the \text{n} input GAP layers, where \text{d} is the unified dimension. The Accumulation layer outputs a feature f_\text{c} \in \mathbb{R}^\text{d} in each silo i, computed as:

f_\text{c} = Accumulation(F)_i = \sum^{\text{n}}_{\text{h}=1}(w_\text{h}f_\text{h})_i

The Aggregation layer is a fusion between the ResBlock feature extracted from the backbone and the support feature from the Accumulation layer. For simplicity, we use the Hadamard product to compute the aggregated feature. This feature is then averaged to predict the steering angle. Let f_\text{s} \in \mathbb{R}^\text{d} be the ResBlock feature extracted from the backbone. The output driving policy \theta_i of silo i can be calculated as:

\theta_i = Aggregation(f_\text{s}, f_\text{c})_i = \bar{(f_\text{s} \odot f_\text{c})}_i

where \odot denotes the Hadamard product, \bar{(*)} denotes the mean, and we set \text{d} = 6,272.
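
To make the two layers concrete, here is a minimal PyTorch sketch of the Accumulation and Aggregation operations as we read them from the equations above; the class and variable names are ours, and the sketch ignores batching details of the actual network.

```python
import torch
import torch.nn as nn

class Accumulation(nn.Module):
    """Weights the contribution of n input features (one per GAP layer)."""
    def __init__(self, n_inputs):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))   # learnable w in R^n

    def forward(self, features):                      # features: (n, batch, d)
        # f_c = sum_h w_h * f_h, same dimension d as each input feature
        return (self.w.view(-1, 1, 1) * features).sum(dim=0)

def aggregation(f_s, f_c):
    """Mean over d of the Hadamard product of ResBlock and support features."""
    return (f_s * f_c).mean(dim=-1)                   # (batch,) steering prediction

n, batch, d = 3, 8, 6272
gap_feats = torch.randn(n, batch, d)
f_c = Accumulation(n)(gap_feats)
steer = aggregation(torch.randn(batch, d), f_c)
```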

Next

In the next post, we will show the effectiveness and efficiency of FADNet during the federated learning process.

Music-Driven Group Choreography (Part 3)

This is the final part of the group dance choreography series. In this part, we will provide detailed analyses of our proposed group dance generation method.

Experiments

AIOZ-GDANCE Statistics

Figure 1. Distribution (%) of music genres (a) and dance styles (b) in our dataset.

In Figure 1, we show the distribution of music genres and dance styles in our dataset. As illustrated in Figure 1 (Left), Pop and Electronic are popular music genres while other music genres nearly share the same distribution. Meanwhile, on the right of Figure 1, Zumba, Aerobic, and Commercial are the dominant dance styles.

Figure 2. The correlation between dance styles and number of dancers (a); and between dance styles and music genres (b).

Figure 2 (Left) shows the number of dancers in each dance style. Naturally, we see that Zumba, Aerobic, and Commercial have more dancers. On the right of this figure, we illustrate the correlation between music genres and dance styles.

Evaluation Metrics

Similar to prior works on single-dance generation, we evaluate the generated motion quality by calculating the distribution distance between the generated and the ground-truth motions using Frechet Inception Distance (FID)[1, 2]. To evaluate how well the generated 3D motion correlates to the input music, we use the Motion-Music Consistency metric (MMC) [2,3]. We also evaluate our model's ability to generate diverse dance motions when given various input music by measuring Generation Diversity (GenDiv) [2,3].

To evaluate the group dancing quality, we propose three new metrics: Group Motion Realism (GMR), Group Motion Correlation (GMC), and Trajectory Intersection Frequency (TIF). Detailed calculations of these metrics are described as follows:

Group Motion Realism (GMR). To calculate the realism between generated and ground-truth group motion, we need a single unified representation for all dancers' motions in the scene. Based on the kinetic features of a single motion sequence [4], we propose to calculate Group Motion Realism (GMR), where smaller is better. Specifically, for each entity, we compute the velocity of each element j of the pose vector: v^n_{t,j} = \frac{y^n_{t+1,j} - y^n_{t,j}}{\Delta t}, where \Delta t is the time period between two consecutive frames. Note that the pose vector of each entity at each frame consists of the root orientation, root position, and joint angles. The group kinetic features of a sequence are approximated by taking the logarithm of the total kinetic energy of all group entities as:

e_j = \log \left(1 + \frac{1}{T}\frac{1}{N} \sum_{t=1}^T \sum_{n=1}^N m_j (v^n_{t,j})^2\right)

where mjm_j is the moment of inertia or mass of each joint. We assume that mjm_j is constant with respect to time and entity. Then, we split the sequence into smaller chunks and calculate the features of these chunks. This process is identical for both the generated and ground-truth sequences. Finally, we utilize these sets of features (from generated and ground-truth group dance) to calculate the GMR using the standard FID formulation as in [1].
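As a reference, the following NumPy sketch computes the group kinetic feature e_j from the pose sequences of all dancers. The function name and array layout are our own assumptions, and the per-element masses m_j are taken as a given input.

```python
import numpy as np

def group_kinetic_features(poses: np.ndarray, masses: np.ndarray, fps: float = 30.0) -> np.ndarray:
    """Group kinetic feature e_j for a (chunk of a) group dance sequence.

    poses:  (N, T, D) pose vectors of N dancers over T frames
            (root orientation, root position and joint angles).
    masses: (D,) per-element mass / moment of inertia m_j.
    Returns a D-dimensional feature vector (one entry per pose element j).
    """
    N, T, D = poses.shape
    vel = (poses[:, 1:, :] - poses[:, :-1, :]) * fps               # v^n_{t,j}, using Delta t = 1 / fps
    energy = (masses * vel ** 2).sum(axis=(0, 1)) / (N * (T - 1))  # average kinetic energy per element
    return np.log1p(energy)                                        # e_j = log(1 + energy_j)

# The sequence is split into chunks, e_j is computed per chunk for both generated and
# ground-truth groups, and GMR is the FID between the two resulting feature sets.
```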

Group Motion Correlation (GMC). We also evaluate the synchrony and the correlation between dancers within the generated group. We assume that the correlation of movements between individuals is likely to reflect their interaction in the choreography. For every pair of motions within a group, we first align the two motion sequences using the Dynamic Time Warping algorithm based on the Euclidean distance in the joint position space (obtained by the SMPL joint regressor). We then calculate the mean cross-correlations between the time-aligned motion pairs using the kinetic features [4]. The generated group motion correlation degree is then calculated as the average over all motion pairs.

Trajectory Intersection Frequency (TIF). For the generated group sequences, the intersection frequency is calculated over all F frames as:

\text{TIF} = \frac{\sum_{F}\sum_{i,j : i\neq j} \mathbb{I}[\text{intersect}(M(y^i),M(y^j))]}{F},

where M is the SMPL skinning function [5], which outputs a 6890-vertex human mesh from the input pose parameters y. \text{intersect}(x,y) is a function that returns 1 if the two meshes intersect each other and 0 otherwise. For TIF, a smaller value is better, indicating fewer intersections within the generated group.

Cross-entity Attention Analysis

We compare our method with FACT [2]. FACT is a recent state-of-the-art method designed for single dance generation, thus giving our method an advantage. However, it is still the closest competing method, as our proposed group dance dataset was not available for benchmarking before. We also analyse our method with and without the Cross-entity Attention. We train all methods with mini-batches containing all dancers within the group, instead of sampling each dancer independently as in FACT's original implementation.

Figure 3. Comparison between FACT and our GDanceR. Our method handles better the consistency and cross-body intersection problem between dancers.

Table 1 shows the comparison between the baseline FACT [2] and our proposed GDanceR with and without Cross-entity Attention. The results show that GDanceR, especially with the Cross-entity Attention, outperforms the baseline by a large margin in all metrics. In Figure 3, we also visualize example outputs of FACT and GDanceR. It is clear that FACT does not handle the intersection problem well. This is understandable as FACT is not designed for group dance generation, while our method with the Cross-entity Attention can deal with this problem better.

Table 1. Generation results comparison on AIOZ-GDANCE dataset. w/o CA denotes without using Cross-entity Attention.

Number of Dancers Analysis.

Table 2 demonstrates the generation results of our method when generating different numbers of dancers. In general, the FID, GMR, and GMC metrics do not show much correlation with the number of generated dancers since the results vary. On the other hand, MMC remains stable across all setups (\sim 0.248), which indicates that our network is robust in generating motion from the given music regardless of changes in the initial positions. The generation diversity (GenDiv) decreases while the intersection frequency (TIF) increases when more dancers are generated. These results show that dealing with collisions during the group generation process is worth further investigation.

Table 2. Performance of our proposed method when increasing the number of generated dancers.

Dance Style Analysis

Figure 4. Examples of generated group motions from our method.

Different dance styles exhibit different challenges in group dance generation. As shown in Table 3, Aerobic and Zumba are quite similar for generating choreography, as they usually focus on workout and sporty movements. While the motions of Commercial and Irish are easier for the model to reproduce, Bollywood and Samba contain highly skilled movements that are challenging to capture and represent accurately. In Figure 4, we show the generated results of GDanceR for different dance styles. Our Supplementary Material and Demonstration Video provide more examples.

Table 3. The results of different dance styles. These results are obtained by training the model on each dance style.

Ablation on Latent Motion Fusion.

We investigate different fusion strategies between the local motion h^i and global-aware motion g^i to obtain the final motion representation z^i. Specifically, we experiment with three settings: (i) No Fusion: the final motion is the global-aware motion obtained from our Cross-entity Attention (z^i = g^i); (ii) Concatenate: the final motion is the concatenation of the local and global-aware motion (z^i = [h^i; g^i]); (iii) Add: the final motion is the addition of the local and global motion (z^i = h^i + g^i). Table 4 summarizes the results. We find that fusing the motion by adding the local and global motion features achieves the best results. In this strategy, the global information between entities is encoded into the local motion effectively, so that the final motion retains comprehensive information about each dancer's own past motion as well as the motion of every other entity. With concatenation, the model is prone to overfitting due to the redundant information in the combined local and global representation. On the other hand, No Fusion discards information about the past motion, leading to insufficient input information, and the Decoder may fail to generate temporally-coherent motion aligned with the music.

Table 4. Ablation study on different fusion strategies for the latent motion representation.

Conclusion

In summary, we have introduced AIOZ-GDANCE, the largest dataset for audio-driven group dance generation. Our dataset contains in-the-wild videos and covers different dance styles and music genres. We then propose a strong baseline along with new evaluation metrics for the group dance generation task. We also perform extensive experiments to validate our method on this interesting yet unexplored problem, using our new dataset and evaluation protocols. We hope that the release of our dataset will foster more research on audio-driven group choreography.

References

[1] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NIPS 2017.

[2] Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI Choreographer: Music-conditioned 3D dance generation with AIST++. ICCV 2021.

[3] Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. Dancing to music. NIPS 2019.

[4] Kensuke Onuma, Christos Faloutsos, and Jessica K Hodgins. Fmdistance: A fast and effective distance function for motion capture data. Eurographics 2008.

[5] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics, 2015

Music-Driven Group Choreography (Part 2)

In the previous post, we introduced AIOZ-GDANCE, a new large-scale in-the-wild dataset for music-driven group dance generation. Building on this dataset, we introduce the first strong baseline for group dance generation that can jointly generate multiple dancing motions expressively and coherently.

Figure 1. The overall architecture of GDanceR. Our model takes in a music sequence and a set of initial positions, and then auto-regressively generates coherent group dance motions that are attuned to the input music.

Music-driven Group Dance Generation Method

Problem Formulation

Given an input music audio sequence \{m_1, m_2, ...,m_T\}, where t \in \{1,..., T\} indicates the index of the music segments, and the initial 3D positions of N dancers \{\tau^1_0, \tau^2_0, ..., \tau^N_0 \}, \tau^i_0 \in \mathbb{R}^{3}, our goal is to generate the group motion sequences \{y^1_1,..., y^1_T; ...;y^N_1,...,y^N_T\}, where y^i_t is the generated pose of the i-th dancer at time step t. Specifically, we represent the human pose as a 72-dimensional vector y = [\tau; \theta], where \tau and \theta represent the root translation and pose parameters of the SMPL model [1], respectively.

In general, the generated group dance motion should meet two conditions: (i) consistency between the generated dancing motion and the input music in terms of style, rhythm, and beat; (ii) the motions and trajectories of dancers should be coherent without cross-body intersection between dancers. To that end, we propose the first baseline method for group dance generation that can jointly generate multiple dancing motions expressively and coherently. Figure 1 shows the architecture of our proposed Music-driven 3D Group Dance generatoR (GDanceR), which consists of three main components:

  • Transformer Music Encoder.
  • Initial Pose Generator.
  • Group Motion Generator.

Transformer Music Encoder

From the raw audio signal of the input music, we first extract music features using the available audio processing library Librosa. Concretely, we extract the mel frequency cepstral coefficients (MFCC), MFCC delta, constant-Q chromagram, tempogram, onset strength and one-hot beat, which results in a 438-dimensional feature vector. We then encode the music sequence M={m1,m2,...,mT}M =\{m_1, m_2, ...,m_T\}, mtR438m_t \in \mathbb{R}^{438} into a sequence of hidden representation {a1,a2,...,aT}\{a_1, a_2,..., a_T\}, atRdaa_t \in \mathbb{R}^{d_a}. In practice, we utilize the self-attention mechanism of transformer [2] to effectively encode the multi-scale information and the long-term dependency between music frames. The hidden audio at each time step is expected to contain meaningful structural information to ensure that the generated dancing motion is coherent across the whole sequence.

Specifically, we first embed the music features mtm_t using a Linear layer followed by Positional Encoding to encode the time ordering of the sequence.

U = \text{PE}({M} W^u_a)

where \text{PE} denotes the Positional Encoding, and W^u_a \in \mathbb{R}^{438 \times d_a} are the parameters of the linear projection layer. Then, the hidden audio information can be calculated using the self-attention mechanism:

\mathbb{A} = \text{FF}\left(\text{softmax}\left(\frac{U^q (U^k)^\top}{\sqrt{d_{k}}} \right) U^v \right), \\ U^q = U W^q_a, \quad U^k = U W^k_a, \quad U^v = UW^v_a

where Waq,WakRda×dkW^q_a, W^k_a \in \mathbb{R}^{d_a \times d_k}, and WavRda×dvW^v_a \in \mathbb{R}^{d_a \times d_v} are the parameters that transform the linear embedding audio UU into a query UqU^q, a key UkU^k, and a value UvU^v respectively. dad_a is the dimension of the hidden audio representation while dkd_k is the dimension of the query and key, and dvd_v is the dimension of value. FF\text{FF} is a feed-forward neural network.
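To make the encoder concrete, here is a simplified single-head PyTorch sketch of the Transformer Music Encoder. The hidden sizes (d_a, d_k, d_v) and the sinusoidal positional encoding are illustrative assumptions; the actual model may stack multiple layers and attention heads.

```python
import math
import torch
import torch.nn as nn

class MusicEncoder(nn.Module):
    """Single-head sketch of the Transformer Music Encoder."""

    def __init__(self, in_dim: int = 438, d_a: int = 256, d_k: int = 64, d_v: int = 64):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_a)          # linear projection W^u_a
        self.W_q = nn.Linear(d_a, d_k, bias=False)   # W^q_a
        self.W_k = nn.Linear(d_a, d_k, bias=False)   # W^k_a
        self.W_v = nn.Linear(d_a, d_v, bias=False)   # W^v_a
        self.ff = nn.Sequential(nn.Linear(d_v, d_a), nn.ReLU(), nn.Linear(d_a, d_a))  # FF
        self.d_k = d_k

    @staticmethod
    def positional_encoding(T: int, d: int) -> torch.Tensor:
        # standard sinusoidal positional encoding, shape (T, d)
        pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)
        i = torch.arange(0, d, 2, dtype=torch.float32)
        angles = pos / torch.pow(10000.0, i / d)
        pe = torch.zeros(T, d)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe

    def forward(self, music: torch.Tensor) -> torch.Tensor:
        # music: (batch, T, 438) Librosa features (MFCC, chroma, tempogram, beats, ...)
        pe = self.positional_encoding(music.size(1), self.embed.out_features).to(music.device)
        U = self.embed(music) + pe
        Q, K, V = self.W_q(U), self.W_k(U), self.W_v(U)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return self.ff(attn @ V)  # hidden audio sequence {a_t}, shape (batch, T, d_a)
```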

Initial Pose Generator

Figure 2. The Transformer Music Encoder encodes the acoustic and rhythmic information to generate the initial poses from the input positions.

Given the initial positions of all dancers, we generate the initial poses by combining the audio feature with the starting positions. We aggregate the audio representation by taking an average over the audio sequence. The aggregated audio is then concatenated with the input position and fed to a multilayer perceptron (MLP) to predict the initial pose for each dancer:

y^i_0 = \text{MLP}\left( \left[\frac{1}{T}\sum_{t=1}^T a_t ; \tau^i_0 \right] \right),

where [;][;] is the concatenation operator, τ0i\tau^i_0 is the initial position of the ii-th dancer.
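A minimal sketch of the Initial Pose Generator is given below, assuming the 72-dimensional SMPL pose output and a single hidden layer for the MLP (both simplifications on our part):

```python
import torch
import torch.nn as nn

class InitialPoseGenerator(nn.Module):
    """Averaged audio features + initial position -> initial 72-D SMPL pose for each dancer."""

    def __init__(self, d_a: int = 256, pose_dim: int = 72, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_a + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, audio_seq: torch.Tensor, init_pos: torch.Tensor) -> torch.Tensor:
        # audio_seq: (batch, T, d_a) encoded music {a_t}; init_pos: (batch, N, 3) initial positions
        a_mean = audio_seq.mean(dim=1, keepdim=True).expand(-1, init_pos.size(1), -1)
        return self.mlp(torch.cat([a_mean, init_pos], dim=-1))  # (batch, N, 72) initial poses y^i_0
```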

Group Motion Generator

Figure 3. The Group Motion Generator auto-regressively generates coherent group dance motions based on the encoded acoustic information.

To generate the group dance motion, we aim to synthesize the coherent motion of each dancer such that it aligns well with the input music. Furthermore, we also need to maintain global consistency between all dancers. As shown in Figure 3, our Group Generator comprises a Group Encoder to encode the group sequence information and an MLP Decoder to decode the hidden representation back to the human pose space. To effectively extract both the local motion and global information of the group dance sequence through time, we design our Group Encoder based on two factors: Recurrent Neural Network [3] to capture the temporal motion dynamics of each dancer, and Attention mechanism [2] to encode the spatial relationship of all dancers.

Specifically, at each time step, the pose of each dancer in the previous frame yt1iy^i_{t-1} is sent to an LSTM unit to encode the hidden local motion representation htih^i_t:

h^i_t=\text{LSTM}(y^i_{t-1},h^i_{t-1})

To ensure the motions of all dancers have global coherency and discourage strange effects such as cross-body intersection, we introduce the Cross-entity Attention mechanism. In particular, each individual motion representation is first linearly projected into a key vector k^i, a query vector q^i and a value vector v^i as follows:

k^i = h^i W^{k}, \quad q^i = h^i W^{q}, \quad v^i = h^i W^{v},

where W^q, W^k \in \mathbb{R}^{d_h \times d_k}, and W^v \in \mathbb{R}^{d_h \times d_v} are parameters that transform the hidden motion h into a query, a key, and a value, respectively. d_k is the dimension of the query and key while d_v is the dimension of the value vector. To encode the relationship between dancers in the scene, our Cross-entity Attention also utilizes the Scaled Dot-Product Attention as in the Transformer [2].

Figure 4. The Group Encoder learns to encode the relations among dancers through our proposed Cross-entity Attention mechanism.

In practice, we find that people at closer positions to each other tend to have a higher correlation in their movement. Therefore, we adopt a Spatial Encoding strategy to encode the spatial relationship between each pair of dancers. The Spatial Encoding between two entities based on their distance in the 3D space is defined as follows:

e_{ij} = \exp\left(-\frac{\Vert \tau^i - \tau^j \Vert^2}{\sqrt{d_{\tau}}}\right),

where dτd_{\tau} is the dimension of the position vector τ\tau. Considering the query qiq^i, which represents the current entity information, and the key kjk^j, which represents other entity information, we inject the spatial relation information between these two entities onto their cross attention coefficient:

\alpha_{ij} = \text{softmax}\left(\frac{(q^i)^\top k^j}{\sqrt{d_k}} + e_{ij}\right).

To preserve the spatial relative information in the attentive representation, we also embed them into the hidden value vector and obtain the global-aware representation gig^i of the ii-th entity as follows:

g^i = \sum_{j=1}^N\alpha_{ij}(v^j + e_{ij}\gamma),

where \gamma \in \mathbb{R}^{d_v} is a learnable bias scaled by the Spatial Encoding. Intuitively, the Spatial Encoding acts as a bias in the attention weight, encouraging the interactivity and awareness to be higher between closer entities. Our attention mechanism can adaptively attend to each dancer and the others in both a temporal and spatial manner, thanks to the encoded motion as well as the spatial information.
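The sketch below illustrates one step of the Cross-entity Attention with Spatial Encoding over N dancers; the hidden dimensions and the module name are our own illustrative choices.

```python
import math
import torch
import torch.nn as nn

class CrossEntityAttention(nn.Module):
    """One step of Cross-entity Attention with Spatial Encoding over N dancers."""

    def __init__(self, d_h: int = 512, d_k: int = 64, d_v: int = 64):
        super().__init__()
        self.W_q = nn.Linear(d_h, d_k, bias=False)
        self.W_k = nn.Linear(d_h, d_k, bias=False)
        self.W_v = nn.Linear(d_h, d_v, bias=False)
        self.gamma = nn.Parameter(torch.zeros(d_v))  # learnable bias scaled by the spatial encoding
        self.d_k = d_k

    def forward(self, h: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
        # h: (N, d_h) LSTM hidden motion of the N dancers; tau: (N, 3) their root positions
        q, k, v = self.W_q(h), self.W_k(h), self.W_v(h)
        e = torch.exp(-torch.cdist(tau, tau) ** 2 / math.sqrt(tau.size(-1)))  # spatial encoding e_ij
        alpha = torch.softmax(q @ k.t() / math.sqrt(self.d_k) + e, dim=-1)    # attention weights
        g = alpha @ v + (alpha * e).sum(dim=-1, keepdim=True) * self.gamma    # g^i = sum_j a_ij (v^j + e_ij * gamma)
        return g  # (N, d_v) global-aware representations
```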

We then fuse both the local and global motion representation by adding hih^i and gig^i to obtain the final latent motion ziz^i. Our final global-local representation of each entity is expected to carry the comprehensive information of their own past motion as well as the motion of every other entity, enabling the MLP Decoder to generate coherent group dancing sequences. Finally, we generate the next movement yti{y}^i_t based on the final motion representation ztiz^i_t as well as the hidden audio representation ata_t, and thus can capture the fine-grained correspondence between music feature sequence and dance movement sequence:

y^i_t = \text{MLP}([z^i_t; a_t]).

Built upon these components, our model can effectively learn and generate coherent group dance animation given several pieces of music. In the next part, we will go through the experiments and detailed studies of the method.

References

[1] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multiperson linear model. ACM Trans. Graphics, 2015

[2] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. NIPS 2017.

[3] Hochreiter S, Schmidhuber J. Long short-term memory. Neural computation. 1997 Nov 15;9(8):1735-80.

Music-Driven Group Choreography (Part 1)

Dancing is an important part of human culture and remains one of the most expressive physical art and communication forms. With the rapid development of digital social media platforms, creating dancing videos has gained significant attention from social communities. As a result, millions of dancing videos are created and watched daily on online platforms. Recently, studies of how to create natural dancing motion from music have attracted great attention in the research community.

Figure 1. We demonstrate the AIOZ-GDANCE dataset with in-the-wild videos, music audio, and 3D group dance motion.

Nevertheless, generating dance motion for a group of dancers remains an open problem and has not been well investigated by the community yet. Motivated by these shortcomings and to foster research on group choreography, we establish AIOZ-GDANCE, a new large-scale in-the-wild dataset for music-driven group dance generation. Unlike existing datasets that only support single dance, our new dataset contains group dance videos as shown in Figure 1, hence supporting the study of group choreography. On the basis of the new dataset, we propose the first strong baseline for group dance generation that can jointly generate multiple dancing motions expressively and coherently.

Dataset Construction

Figure 2. The pipeline of making our AIOZ-GDANCE dataset.

In this section, we describe the process of building our dataset from a large variety of videos available on the internet. Because our main goal is to develop a large-scale dataset with in-the-wild videos, setting up a MoCap system as in many classical approaches is not feasible. However, manually creating 3D ground truth for millions of frames from dancing videos is also an extremely costly and tedious job. To that end, we propose a semi-automatic labeling method with humans in the loop to obtain the 3D ground truth for our dataset. The process to construct the data includes the following five key steps:

  1. Video collection
  2. Human Tracking
  3. Human Pose Estimation
  4. Local Fitting for Individual Motions
  5. Global Scene Optimization

Data Collection and Preprocessing

Figure 3. Human Tracking.

Video Collection. We collect in-the-wild, public domain group dancing videos along with the music from YouTube, TikTok, and Facebook. All group dance videos are processed at 1920 × 1080 resolution and 30 FPS.

Human Tracking. We perform tracking for all humans in the videos using a state-of-the-art multi-object tracker [1] to obtain the tracking bounding boxes. Note that although the tracker can produce reasonable results, there are failure cases in some frames. Therefore, we manually correct the bounding boxes in the incorrect cases. This tracking correction is crucial since we want the trajectory of each person to be accurately tracked in order to reconstruct their motion in later stages.

Pose Estimation. Given the bounding boxes of each person in the video, we leverage a state-of-the-art 2D pose estimation method [2] to generate the initial 2D poses for each person. In practice, there exist some inaccurately detected keypoints due to motion blur and partial occlusion. We manually fix the incorrect cases to obtain the 2D keypoints of each human bounding box.

Local Mesh Fitting

Figure 4. Local Mesh Fitting.

To construct 3D group dance motion, we first reconstruct the full body motion for each dancer by fitting the 3D mesh. We then jointly optimize all dancer motions to construct the globally-coherent group motion. Finally, we post-process and remove wrong cases from the optimization results.

We use SMPL model [3] to represent the 3D human. The SMPL model is a differentiable function that maps the pose parameters θ\mathbf{\theta}, the shape parameters β\mathbf{\beta}, and the root translation τ\mathbf{\tau} into a set of 3D human body mesh vertices VR6890×3\mathbf{V}\in \mathbb{R}^{6890\times3} and 3D joints XRJ×3\mathbf{X}\in \mathbb{R}^{J\times3}, where JJ is the number of body joints.

Our optimizing motion variables for each individual dancer consist of a sequence of SMPL joint angles {θt}t=1T\{\mathbf{\theta}_t\}_{t=1}^T, a sequence of the root translation {τt}t=1T\{\mathbf{\tau}_t\}_{t=1}^T, and a single SMPL shape parameter β\mathbf{\beta}. We fit the sequence of SMPL motion variables to the tracked 2D keypoints by extending SMPLify-X framework [4] across the whole video sequence:

E_{\rm local} = E_{\rm J} + \lambda_{\theta}E_{\theta} + \lambda_{\beta} E_{\beta} + \lambda_{\rm S}E_{\rm S} + \lambda_{\rm F}E_{\rm F}

where:

  • EJE_{\rm J} is the 2D reprojection term between the 2D keypoints and the 2D projection of the corresponding 3D poses.
  • EθE_{\theta} is the pose prior term from the latent space of the VPoser model [4] to encourage plausible human pose.
  • EβE_{\beta} is the shape prior term to regularize the body shape towards the mean shape of the SMPL body model.
  • ES=t=1T1θt+1θt2+j=1Jt=1T1Xj,t+1Xj,t2E_{\rm S} = \sum_{t=1}^{T-1}\Vert \mathbf{\theta}_{t+1} - \mathbf{\theta}_{t} \Vert^2 + \sum_{j=1}^J\sum_{t=1}^{T-1}\Vert \mathbf{X}_{j,t+1} - \mathbf{X}_{j,t} \Vert^2 is the smoothness term to encourage the temporal smoothness of the motion.
  • E_{\rm F} = \sum_{t=1}^{T-1} \sum_{j \in \mathcal{F}} c_{j,t}\Vert \mathbf{X}_{j,t+1} - \mathbf{X}_{j,t} \Vert^2 ensures the feet joints stay stationary (zero velocity) when in contact, where \mathcal{F} is the set of feet joint indices and c_{j,t} is the feet contact label of joint j at time t. A minimal sketch of the two temporal terms E_{\rm S} and E_{\rm F} is given right after this list.
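Below is a minimal PyTorch sketch of the two temporal terms E_{\rm S} and E_{\rm F}. The tensor layout and the choice of which frame's contact label weights each velocity are our own assumptions.

```python
import torch

def smoothness_and_foot_terms(theta: torch.Tensor, joints: torch.Tensor,
                              contact: torch.Tensor, feet_idx: list):
    """Temporal terms E_S (smoothness) and E_F (foot contact) of the local fitting objective.

    theta:    (T, 72) SMPL pose parameters over T frames
    joints:   (T, J, 3) 3D joint positions from the SMPL layer
    contact:  (T, F) binary contact labels c_{j,t} for the feet joints
    feet_idx: indices of the feet joints (the set F)
    """
    e_smooth = ((theta[1:] - theta[:-1]) ** 2).sum() + ((joints[1:] - joints[:-1]) ** 2).sum()
    feet = joints[:, feet_idx]                                                 # (T, F, 3)
    e_feet = (contact[:-1].unsqueeze(-1) * (feet[1:] - feet[:-1]) ** 2).sum()  # zero velocity when in contact
    return e_smooth, e_feet
```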

Global Optimization

Figure 5. Global Optimization.

Given the 3D motion sequence of each dancer pp: {θtp,τtp}\{\mathbf{\theta}^p_t, \mathbf{\tau}^p_t\}, we further resolve the motion trajectory problems in group dance by solving the following objective:

E_{\rm global} = E_{\rm J} + \lambda_{\rm pen}E_{\rm pen} + \lambda_{\rm reg}\sum_{p}E_{\rm reg}(p) + \lambda_{\rm dep}\sum_{p,p',t}E_{\rm dep}(p,p',t) + \lambda_{\rm gc}\sum_{p}E_{\rm gc}(p)

EpenE_{\rm pen} is the Signed Distance Function penetration term to prevent the overlapping of reconstructed motions between dancers.

Ereg(p)=t=1Tθtpθ^tp2{E_{\rm reg}(p) =\sum_{t=1}^T\Vert \mathbf{\theta}^p_t - \hat{\mathbf{\theta}}^p_t\Vert^2} is the regularization term that prevents the motion from deviating too much from the prior optimized individual motion {θ^tp}\{\hat{\mathbf{\theta}}^p_t\} obtained by optimizing the local mesh for dancer pp.

In practice, we find that the relative depth ordering of dancers in the scene can be inconsistent due to the ambiguity of the 2D projection. To ensure the group motion quality, we watch the videos and manually provide the ordinal depth relation information of all dancers in the scene at each frame tt as follows:

r_t(p,p') = \begin{cases} 1, &\text{if dancer } p \text{ is closer than } p' \\ -1, &\text{if dancer } p \text{ is farther than } p' \\ 0, &\text{if their depths are roughly equal} \end{cases}

Given the relative depth information, we derive the depth relation term EdepE_{\rm dep}. This term encourages consistent ordinal depth relation between the motion trajectories of multiple dancers, especially when dancers partially occlude each other:

E_{\rm dep}(p,p',t) = \begin{cases} \log(1+\exp(z^p_t - z^{p'}_t)), &r_t(p,p')=1 \\ \log(1+\exp(-z^p_t + z^{p'}_t)), &r_t(p,p')=-1 \\ (z^p_t - z^{p'}_t)^2, &r_t(p,p')=0 \end{cases}

where z^p_t is the depth component of the root translation \mathbf{\tau}^p_t of person p at frame t. Intuitively, for r_t(p,p')=1, z^p_t should be smaller than z^{p'}_t, and vice versa.
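For clarity, here is a small sketch of the depth relation term for a single dancer pair at one frame (the function name is our own):

```python
import torch
import torch.nn.functional as F

def depth_order_term(z_p: torch.Tensor, z_q: torch.Tensor, r: int) -> torch.Tensor:
    """Ordinal depth term E_dep for one dancer pair (p, p') at one frame.

    z_p, z_q: depth components of the two root translations.
    r: annotated relation (1: p closer, -1: p farther, 0: roughly equal).
    """
    if r == 1:
        return F.softplus(z_p - z_q)   # log(1 + exp(z_p - z_q)), penalizes p being farther
    if r == -1:
        return F.softplus(z_q - z_p)
    return (z_p - z_q) ** 2            # depths should be roughly equal
```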

Finally, we apply the global ground contact constraint EgcE_{\rm gc} to further ensure consistency between the motion of every person and the environment based on the ground contact information. This contact term is also needed to reduce the artifacts such as foot-skating, jittering, and penetration under the ground.

E_{\rm gc}(p) = \sum_{t=1}^{T-1} \sum_{j \in \mathcal{F}} c^p_{j,t}\Vert \mathbf{X}^p_{j,t+1} - \mathbf{X}^p_{j,t} \Vert^2 + c^p_{j,t} \Vert (\mathbf{X}^p_{j,t} - \mathbf{f})^\top \mathbf{n}^* \Vert^2

where \mathcal{F} is the set of feet joint indices, \mathbf{n}^* is the estimated plane normal, and \mathbf{f} is a 3D fixed point on the ground plane. The first term in the equation above is the zero-velocity constraint when the feet are in contact with the ground, while the second term encourages the feet position to stay near the ground when in contact. To obtain the ground plane parameters, we initialize the plane point \mathbf{f} as the weighted median of all contact feet positions. The plane normal \mathbf{n}^* is obtained by optimizing a robust Huber objective:

\mathbf{n}^* = \arg\min_{\mathbf{n}} \sum_{\mathbf{X}_{\rm feet}} \mathcal{H}\left((\mathbf{X}_{\rm feet} - \mathbf{f})^\top \frac{\mathbf{n}}{\Vert\mathbf{n}\Vert}\right) + \Vert \mathbf{n}^\top\mathbf{n} - 1 \Vert^2,

where \mathcal{H} is the Huber loss function, and \mathbf{X}_{\rm feet} denotes the 3D feet positions of all dancers across the whole sequence that are labelled as in contact (i.e., c^p_{j,t} = 1).

How will AIOZ-GDANCE be useful to the community?

We bring up some interesting research directions that can benefit from our dataset:

  • Group Dance Generation
  • Human Pose Tracking
  • Dance Education
  • Dance style transfer
  • Human behavior analysis

While single-person choreography has recently been a hot research topic, group dance generation has not yet been well investigated. We hope that the release of our dataset will foster more research in this direction.

References

[1] Peize Sun, Jinkun Cao, Yi Jiang, Zehuan Yuan, Song Bai, Kris Kitani, and Ping Luo. Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In CVPR, 2022

[2] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.

[3] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multiperson linear model. ACM Trans. Graphics, 2015

[4] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In CVPR, 2019.

Global-Local Attention for Context-aware Emotion Recognition (Part 2)

In this part, we conduct experiments to validate the effectiveness of our proposed Global-Local Attention for Context-aware Emotion Recognition. Here, we only focus on static images with background context as our input. Therefore, we choose the static CAER (CAER-S) dataset [2] to validate our method. However, while experimenting with the CAER-S dataset, we observe that there is a correlation between images in the training and the test sets, which can make the model less robust to changes in data and may not generalize well on unseen samples. More specifically, many images in the training and the test set of the CAER-S dataset are extracted from the same video, hence making them look very similar to each other. To properly evaluate the models, we propose a new way to extract static frames from the CAER video clips to create a new static image dataset called Novel CAER-S (NCAER-S). In particular, for each video in the original CAER dataset, we split the video into multiple parts. Then we randomly select one frame from each part to include in the new NCAER-S dataset. Any original video that provides frames for the training set is removed from the testing set. This process ensures the new dataset is novel, and the training and testing frames never come from the same original input video.

Context-aware emotion recognition results

Table 1. Comparison with recent methods on the CAER-S dataset.

Table 1 summarizes the results of our network and other recent state-of-the-art methods on the CAER-S dataset [2]. This table clearly shows that integrating our GLA module can significantly improve the accuracy of the recent CAER-Net. In particular, our GLAMOR-Net (original) achieves 77.90% accuracy, which is a +4.38% improvement over CAER-Net-S. When compared with other recent state-of-the-art approaches, the table clearly demonstrates that our GLAMOR-Net (ResNet-18) outperforms all those methods and achieves a new state-of-the-art performance with an accuracy of 89.88%. This result confirms that our global-local attention mechanism can effectively encode both facial and context information to improve the human emotion classification results.

Component Analysis

To further analyze the contribution of each component in our proposed method, we experiment with four different input settings on the NCAER-S dataset: (i) face only, (ii) context only with the facial region masked, (iii) context only with the facial region visible, and (iv) both face and context (with masked face). When the context information is used, we compare the performance of the model with different context attention approaches. Note that to compute the saliency map with the proposed GLA in settings (ii) and (iii), we extract facial features using the Facial Encoding Module; however, these features are only used as the input of the GLA module to guide the context attention map learning process, and not as the input of the Fusion Network to predict the emotion category. The results of these settings are summarized in Table 2.

Table 2. Ablation study of our proposed method on the NCAER-S dataset. w/F, w/mC, w/fC, w/CA, w/GLA denote using the output of the Facial Encoding Module, the Context Encoding Module with masked faces as input, the Context Encoding Module with visible faces as input, the standard attention in [2] and our proposed GLA, respectively, as input to the Fusion Network.

The results clearly show that our GLA consistently helps improve performance in all settings. Specifically, in setting (ii), using our GLA achieves an improvement of 1.06\% over the method without attention. Our GLA also improves the performance of the model when both facial and context information is used to predict emotion. Specifically, our model with GLA achieves the best result with an accuracy of 46.91\%, which is 3.72\% higher than the method with no attention. The results in Table 2 show the effectiveness of our Global-Local Attention module for the task of emotion recognition. They also verify that using both the local face region and the global context information is essential for improving emotion recognition accuracy.

Qualitative Analysis

Figure 5 shows the qualitative visualization of the learned attention maps obtained by our GLAMOR-Net in comparison with CAER-Net-S. It can be seen that our Global-Local Attention mechanism produces better saliency maps than CAER-Net-S and helps the model attend to the right discriminative regions in the surrounding background. As we can see, our model is able to focus on the gesture of the person (Figure 5f) and also the faces of surrounding people (Figure 5c and 5d) to infer the emotion accurately.

Figure 5. Visualization of the attention maps. From top to bottom: original image in the NCAER-S dataset, image with masked face, attention map of CAER-Net-S, and attention map of our GLAMOR-Net.

Figure 6 shows some emotion recognition results of different approaches on the CAER-S dataset. More specifically, the first two rows (i) and (ii) contain predictions of CAER-Net-S while the last two rows (iii) and (iv) show the results of our GLAMOR-Net. In some cases, our model was able to exploit the context effectively to perform inference accurately. For instance, with the same sad image input (shown in rows (i) and (iii)), CAER-Net-S misclassified it as neutral while GLAMOR-Net correctly recognized the true emotion category. This might be because our model was able to identify that the man was hugging and appeasing the woman and inferred that they were sad. Another example is shown in rows (i) and (iii) of the fear column. Our model classified the input accurately, while CAER-Net-S was confused between the facial expression and the wedding surroundings, thus incorrectly predicting the emotion as happy.

Figure 6. Example predictions on the test set. The first two rows (i) and (ii) show the results of CAER-Net-S while the last two rows (iii) and (iv) demonstrate predictions of our GLAMOR-Net. The column names from (a) to (g) denote the ground-truth emotion of the images.

Conclusion

We have presented a novel method to exploit context information more efficiently by using the proposed global-local attention model. We have shown that our approach can noticeably improve the emotion classification accuracy compared to the current state-of-the-art results on the context-aware emotion recognition task. The results on the context-aware emotion recognition datasets consistently demonstrate the effectiveness and robustness of our method.

References

[1] Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In NIPS, 2015.

[2] Lee, J., Kim, S., Kim, S., Park, J., Sohn, K.: Context-aware emotion recognition networks. In ICCV, 2019.

Global-Local Attention for Context-aware Emotion Recognition (Part 1)

Automatic emotion recognition has been a longstanding problem in both academia and industry. It enables a wide range of applications in various domains, ranging from healthcare, surveillance to robotics and human-computer interaction. Recently, significant progress has been made in the field and many methods have demonstrated promising results. However, recent works mainly focus on facial regions while ignoring the surrounding context, which is shown to play an important role in the understanding of the perceived emotion, especially when the emotions on the face are ambiguous or weakly expressed (see the examples in Figure 1).

Figure 1. Facial expression information is not always sufficient to infer people's emotions, especially when facial regions can not be seen clearly or are occluded.

We hypothesize that the local information (i.e., facial region) and global information (i.e., context background) have a correlative relationship, and by simultaneously learning the attention using both of them, the accuracy of the network can be improved. This is based on the fact that the emotion of one person can be indicated by not only the face’s emotion (i.e., local information) but also other context information such as the gesture, pose, or emotion/pose of a nearby person. To that end, we propose a new deep network, namely, Global-Local Attention for Emotion Recognition Network (GLAMOR-Net), to effectively recognize human emotions using a novel global-local attention mechanism. Our network is designed to extract features from both facial and context regions independently, then learn them together using the attention module. In this way, both the facial and contextual information is used altogether to infer human emotions.

Overview

Figure 2. The architecture of our proposed network. The whole process includes three steps. We extract the facial information (local) and context information (global) using two Encoding Modules. We then perform attention inference on the global context using the Global-Local Attention mechanism. Lastly, we fuse both features to determine the emotion.

Figure 2 shows an overview of our method. Specifically, we assume that emotions can be recognized by understanding the context components of the scene together with the facial expression. Our method aims to do emotion recognition in the wild by incorporating both facial information of the person’s face and contextual information surrounding that person. Our model consists of three components: Encoding Module, Global-Local Attention (GLA) Module, and Fusion Module. Our key design is the novel GLA module, which utilizes facial features as the local information to attend better to salient locations in the global context.

Face and Context Encoding

Our Encoding Module comprises the Facial Encoding Module to learn the face-specific features, and the Context Encoding Module to learn the context-specific features. Specifically, both the Facial Encoding and Context Encoding Modules are built on several convolutional layers to extract meaningful features from the corresponding input. Each module comprises five convolutional layers, each followed by a Batch Normalization layer and a ReLU activation function. The number of filters starts at 32 in the first layer and increases by a factor of 2 at each subsequent layer except the last one. Our network ends up with a 256-channel feature map, which is the embedded representation of the input image. In practice, we also mask the facial regions in the raw input to prevent the attention module from only focusing on the facial region while omitting the context information in other parts of the image.

Global-Local Attention

Inspired by the attention mechanism [1], to model the associative relationship between the local information (i.e., the facial region in our work) and the global information (i.e., the surrounding context background), we propose the Global-Local Attention Module to guide the network to focus on meaningful regions (Figure 3). In particular, our attention mechanism models the hidden correlation between the face and different regions in the context by capturing their similarity.

Figure 3. The proposed Global-Local Attention module takes the extracted face feature vector and the context feature map as the input to perform context attention inference.

We first reduce the facial feature map \mathbf{F}_f into a vector representation using the Global Pooling operator, denoted as \mathbf{v}_f. The context feature map can be viewed as a set of W_c \times H_c vectors with D_c dimensions, where the vector at cell (i,j) represents the embedded features of the corresponding patch in the input image. Therefore, at each region (i,j) in the context feature map, we have \mathbf{F}_c^{(i,j)} = \mathbf{v}_{i,j}.

We then concatenate [\mathbf{v}_f; \mathbf{v}_{i,j}] into a holistic vector \bar{\mathbf{v}}_{i,j}, which contains information about both the face and a small region of the scene. We then employ a feed-forward neural network to compute the score corresponding to that region by feeding \bar{\mathbf{v}}_{i,j} into the network. By applying the same process for all regions, each region (i,j) outputs a raw score value s_{i,j}. We then spatially apply the Softmax function to produce the attention map a_{i,j} = \text{Softmax}(s_{i,j}). To obtain the final context representation vector, we squish the feature maps by taking the average over all the regions weighted by a_{i,j} as follows:

\mathbf{v}_c = \Sigma_i\Sigma_j(a_{i,j} \odot \mathbf{v}_{i,j})

where \mathbf{v}_c \in \mathbb{R}^{D_c} is the final single vector encoding the context information. Intuitively, \mathbf{v}_c mainly contains information from regions that have high attention, while other nonessential parts of the context are mostly ignored. With this design, our attention module can guide the network to focus on important areas based on both the facial and context information of the image.
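A simplified PyTorch sketch of the Global-Local Attention module is given below; the feature dimensions and the two-layer score network are our own assumptions and only illustrate the computation described above.

```python
import torch
import torch.nn as nn

class GlobalLocalAttention(nn.Module):
    """Pooled face vector guides a score network over every spatial cell of the context map."""

    def __init__(self, d_face: int = 256, d_ctx: int = 256, hidden: int = 128):
        super().__init__()
        self.score_net = nn.Sequential(
            nn.Linear(d_face + d_ctx, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, face_feat: torch.Tensor, ctx_feat: torch.Tensor) -> torch.Tensor:
        # face_feat: (B, d_face, Hf, Wf) facial feature map; ctx_feat: (B, d_ctx, Hc, Wc)
        B, Dc, Hc, Wc = ctx_feat.shape
        v_f = face_feat.mean(dim=(2, 3))                            # global pooling -> v_f, (B, d_face)
        v_ij = ctx_feat.flatten(2).transpose(1, 2)                  # (B, Hc*Wc, d_ctx) region vectors
        pair = torch.cat([v_f.unsqueeze(1).expand(-1, Hc * Wc, -1), v_ij], dim=-1)
        scores = self.score_net(pair)                               # raw scores s_ij, (B, Hc*Wc, 1)
        attn = torch.softmax(scores, dim=1)                         # spatial softmax -> a_ij
        return (attn * v_ij).sum(dim=1)                             # weighted context vector v_c, (B, d_ctx)
```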

Face and Context Fusion

Figure 4. Detailed illustration of the Adaptive Fusion.

The Fusion Module takes the face representation \mathbf{v}_f and the context representation \mathbf{v}_c as inputs, then the face score and context score are produced separately by two neural networks:

s_f = \mathcal{F}(\mathbf{v}_f; \phi_f), \quad\quad s_c = \mathcal{F}(\mathbf{v}_c; \phi_c)

where \phi_f and \phi_c are the network parameters of the face branch and context branch, respectively. Next, we normalize these scores with the Softmax function to produce the weights for the face and context branches:

w_f = \frac{\exp(s_f)}{\exp(s_f)+\exp(s_c)}, \quad w_c = \frac{\exp(s_c)}{\exp(s_f)+\exp(s_c)}

In this way, we let the two networks competitively determine which branch is more useful than the other. Then we amplify the more useful branch and lower the effect of the other by multiplying the extracted features with the corresponding weight:

\mathbf{v}_f \leftarrow \mathbf{v}_f \odot w_f , \quad\quad \mathbf{v}_c \leftarrow \mathbf{v}_c \odot w_c

Finally, we use these vectors to estimate the emotion category. Specifically, in our experiments, after multiplying both vf\mathbf{v}_f and vc\mathbf{v}_c by their corresponding weights, we concatenate them together as the input for a network to make final predictions. Figure 4 shows our fusion procedure in detail.
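For illustration, here is a minimal sketch of the Adaptive Fusion; the single-linear score networks, the feature dimensions, and the seven-class output are simplifying assumptions on our part.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Two score networks weight the face and context vectors before the final classifier."""

    def __init__(self, d_face: int = 256, d_ctx: int = 256, n_classes: int = 7):
        super().__init__()
        self.face_score = nn.Linear(d_face, 1)   # produces s_f
        self.ctx_score = nn.Linear(d_ctx, 1)     # produces s_c
        self.classifier = nn.Linear(d_face + d_ctx, n_classes)

    def forward(self, v_f: torch.Tensor, v_c: torch.Tensor) -> torch.Tensor:
        scores = torch.cat([self.face_score(v_f), self.ctx_score(v_c)], dim=-1)  # (B, 2)
        w = torch.softmax(scores, dim=-1)                                        # w_f, w_c
        fused = torch.cat([v_f * w[:, :1], v_c * w[:, 1:]], dim=-1)              # re-weighted features
        return self.classifier(fused)                                            # emotion logits
```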

References

[1] Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In NIPS, 2015.

[2] Lee, J., Kim, S., Kim, S., Park, J., Sohn, K.: Context-aware emotion recognition networks. In ICCV, 2019.

Uncertainty-aware Label Distribution Learning for Facial Expression Recognition (Part 1)

Facial expression recognition (FER) plays an important role in understanding people's feelings and interactions between humans. Recently, automatic emotion recognition has gained a lot of attention from the research community due to its tremendous applications in education, healthcare, human analysis, surveillance, and human-robot interaction. Recent FER methods are mostly based on deep learning and can achieve impressive results. The success of deep models can be attributed to large-scale FER datasets [1][2]. However, the ambiguity of facial expressions is still a key challenge in FER. Specifically, people with different backgrounds might perceive and interpret facial expressions differently, which can lead to noisy and inconsistent annotations. In addition, real-life facial expressions usually manifest a mixture of feelings rather than only a single emotion.

Motivation and Proposed Solution

Figure 1. Examples of real-world ambiguous facial expressions that can lead to noisy and inconsistent annotation.

As an example, Figure 1 shows that people may have different opinions about the expressed emotion, particularly in ambiguous images. Consequently, a distribution over emotion categories is better than a single label because it takes all sentiment classes into account and can cover various interpretations, thus mitigating the effect of ambiguity. However, existing large-scale FER datasets only provide a single label for each sample instead of a label distribution, which means we do not have a comprehensive description for each facial expression. This can lead to insufficient supervision during training and pose a big challenge for many FER systems.

To overcome the ambiguity problem in FER, we propose a new uncertainty-aware label distribution learning method that constructs emotion distributions for training samples. Specifically, we leverage the neighborhood information of samples that have similar expressions to construct the emotion distributions from single labels and utilize them as the training supervision signal.

Methodology

Preliminaries

We denote xX\mathbf{x} \in \mathcal{X} as the instance variable in the input space X\mathcal{X} and xi\mathbf{x}^{i} as the particular ii-th instance. The label set is denoted as Y={y1,y2,...,ym}\mathcal{Y} = \{y_1, y_2,..., y_m\} where mm is the number of classes and yjy_j is the label value of the jj-th class. The logical label vector of xi\mathbf{x}^{i} is indicated by li\mathbf{l}^{i} = (ly1i,ly2i,...,lymi)(l^{i}_{y_1}, l^{i}_{y_2}, ..., l^{i}_{y_m}) with lyji{0,1}\mathbf{l}^{i}_{y_j} \in \{0, 1\} and l1=1\| \mathbf{l} \| _1 = 1. We define the label distribution of xi\mathbf{x}^{i} as di\mathbf{d}^{i} = (dy1i,dy2i,...,dymi)(d^{i}_{y_1}, d^{i}_{y_2}, ..., d^{i}_{y_m}) with d1=1\| \mathbf{d} \| _1 = 1 and dyji[0,1]d^{i}_{y_j} \in [0, 1] representing the relative degree that xi\mathbf{x}^{i} belongs to the class yjy_j.

Most existing FER datasets assign only a single class or equivalently, a logical label li\mathbf{l}^{i} for each training sample xi\mathbf{x}^{i}. In particular, the given training dataset is a collection of nn samples with logical labels DlD_l = {(xi,li)1in}\{ (\mathbf{x}^{i}, \mathbf{l}^{i}) \vert 1 \le i \le n\}. However, we find that a label distribution di\mathbf{d}^i is a more comprehensive and suitable annotation for the image than a single label.

Inspired by the recent success of label distribution learning (LDL) in addressing label ambiguity [3], we aim to construct an emotion distribution di\mathbf{d}^i for each training sample xi\mathbf{x}^i, thus transform the training set DlD_l into DdD_d = {(xi,di)1in}\{ (\mathbf{x}^{i}, \mathbf{d}^{i}) \vert 1 \le i \le n\}, which can provide richer supervision information and help mitigate the ambiguity issue. We use cross-entropy to measure the discrepancy between the model's prediction and the constructed target distribution. Hence, the model can be trained by minimizing the following classification loss:

\mathcal{L}_{cls} = \sum_{i=1}^n \text{CE}\left(\mathbf{d}^i, f(\mathbf{x}^i; \theta)\right) = -\sum_{i=1}^n \sum_{j=1}^m \mathbf{d}_j^{i} \log f_j(\mathbf{x}^{i};\theta).

where f(\mathbf{x}; \theta) is a neural network with parameters \theta followed by a softmax layer that maps the input image \mathbf{x} into an emotion distribution.
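A minimal sketch of this classification loss with soft targets is shown below (averaged over a batch rather than summed, which is a common implementation choice):

```python
import torch
import torch.nn.functional as F

def distribution_cross_entropy(logits: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the constructed label distribution d^i and the model prediction.

    logits:      (batch, m) raw network outputs
    target_dist: (batch, m) constructed label distributions d^i
    """
    log_probs = F.log_softmax(logits, dim=-1)             # log f_j(x^i; theta)
    return -(target_dist * log_probs).sum(dim=-1).mean()  # averaged over the batch
```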

Overview

Figure 2. An overview of our Label Distribution Learning with Valence-Arousal (LDLVA) for facial expression recognition under ambiguity.

An overview of our method is presented in Figure 2. To construct the label distribution for each training instance xi\mathbf{x}^i, we leverage its neighborhood information in the valence-arousal space. Particularly, we identify KK neighbor instances for each training sample xi\mathbf{x}^i and utilize our adaptive similarity mechanism to determine their contribution degrees to the target distribution di\mathbf{d}^i. Then, we combine the neighbors' predictions and their corresponding contribution degrees with the provided label li\mathbf{l}^i and li\mathbf{l}^i's uncertainty factor to obtain the label distribution di\mathbf{d}^i. The constructed distribution di\mathbf{d}^i will be used as supervision information to train the model via label distribution learning.

Adaptive Similarity

We assume that the label distribution of the main instance xi\mathbf{x}^i can be computed as a linear combination of its neighbors' distributions. To determine the contribution of each neighbor, we propose an adaptive similarity mechanism that not only leverages the relationships between xi\mathbf{x}^i and its neighbors in the auxiliary space but also utilizes their feature vectors extracted from the backbone. We choose the valence-arousal [4] as the auxiliary space to construct the target label distribution. We use the KK-Nearest Neighbor algorithm to identify KK closest points for each training sample xi\mathbf{x}^i, denoted as N(i)N(i). We calculate the adaptive contribution degrees of neighbor instances as the product of the local similarity skis^i_k and the calibration score ζki\zeta^i_k as follows:

c^i_k = \begin{cases} \zeta^i_k s^i_k, &\text{for } \mathbf{x}^k \in N(i), \\ 0, &\text{otherwise}. \end{cases}

where the local similarity skis^i_k is defined based on the distance between the instance and its neighbor in the valence-arousal space ai\mathbf{a}^i and ak\mathbf{a}^k

s^i_k = \exp\left(-\frac{\| \mathbf{a}^i - \mathbf{a}^k \|^2_2}{\delta^2}\right), \quad \forall \mathbf{x}^k \in N(i)

We utilize a multilayer perceptron (MLP) gg with parameter ϕ\phi to calculate the adaptive calibration score from the extracted features of the two instances vi\mathbf{v}^i and vk\mathbf{v}^k obtained from the backbone.

\zeta^i_k = \text{Sigmoid}\left(g([\mathbf{v}^i,\mathbf{v}^{k}];\phi)\right)

The proposed adaptive similarity can correct similarity errors in the valence-arousal space, since the valence-arousal values are not always available in practice and we leverage an existing method to generate pseudo-valence-arousal annotations.

Uncertainty-aware Label Distribution Construction

After obtaining the contribution degree of each neighbor xkN(i)\mathbf{x}^k \in N(i), we can now generate the target label distribution di\mathbf{d}^i for the main instance xi\mathbf{x}^i. The target label distribution is calculated using the logical label li\mathbf{l}^i and the aggregated distribution d~i\tilde{\mathbf{d}}^i defined as follows:

\tilde{\mathbf{d}^i} = \frac{\sum_k c^i_k f(\mathbf{x}^{k};\theta)}{\sum_k c^i_k}, \\ \mathbf{d}^i = (1-\lambda^i) \mathbf{l}^i + \lambda^i \tilde{\mathbf{d}^i}

where λi[0,1]\lambda^i \in [0,1] is the uncertainty factor for the logical label. It controls the balance between the provided label li\mathbf{l}^i and the aggregated distribution di~\tilde{\mathbf{d}^i} from the local neighborhood.

Intuitively, a high value of λi\lambda^i indicates that the logical label is highly uncertain, which can be caused by ambiguous expression or low-quality input images, thus we should put more weight towards neighborhood information di~\tilde{\mathbf{d}^i}. Conversely, when λi\lambda^i is small, the label distribution di\mathbf{d}^i should be close to li\mathbf{l}^i since we are certain about the provided manual label. In our implementation, λi\lambda^i is a trainable parameter for each instance and will be optimized jointly with the model's parameters using gradient descent.
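Putting the pieces together, the sketch below constructs the target distribution d^i for one sample from its neighbors' predictions; the function name, the bandwidth value for \delta, and the tensor layout are our own illustrative assumptions.

```python
import torch

def build_label_distribution(neighbor_preds: torch.Tensor, va_dists: torch.Tensor,
                             calib: torch.Tensor, one_hot: torch.Tensor,
                             lam: torch.Tensor, delta: float = 0.2) -> torch.Tensor:
    """Uncertainty-aware target distribution d^i for one training sample.

    neighbor_preds: (K, m) softmax predictions f(x^k; theta) of the K neighbors
    va_dists:       (K,)  squared valence-arousal distances ||a^i - a^k||^2
    calib:          (K,)  calibration scores zeta^i_k from the MLP g
    one_hot:        (m,)  logical label l^i
    lam:            scalar uncertainty factor lambda^i in [0, 1] (trainable in practice)
    """
    s = torch.exp(-va_dists / delta ** 2)               # local similarity s^i_k
    c = calib * s                                       # contribution degrees c^i_k
    d_tilde = (c.unsqueeze(1) * neighbor_preds).sum(0) / c.sum().clamp_min(1e-8)
    return (1.0 - lam) * one_hot + lam * d_tilde        # d^i
```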

Loss Function

To enhance the model's ability to discriminate between ambiguous emotions, we also propose a discriminative loss to reduce the intra-class variations of the learned facial representations. We incorporate the label uncertainty factor λi\lambda^i to adaptively penalize the distance between the sample and its corresponding class center. For instances with high uncertainty, the network can effectively tolerate their features in the optimization process. Furthermore, we also add pairwise distances between class centers to encourage large margins between different classes, thus enhancing the discriminative power. Our discriminative loss is calculated as follows:

\mathcal{L}_D = \frac{1}{2}\sum_{i=1}^n (1-\lambda^i)\Vert \mathbf{v}^i - \mathbf{\mu}_{y^i} \Vert_2^2 + \sum_{j=1}^m \sum_{\substack{k=1 \\ k \neq j}}^m \exp \left(-\frac{\Vert\mathbf{\mu}_{j}-\mathbf{\mu}_{k}\Vert_2^2}{\sqrt{V}}\right)

where yiy^i is the class index of the ii-th sample while μj\mathbf{\mu}_{j}, μk\mathbf{\mu}_{k}, and μyi\mathbf{\mu}_{y^i} RV\in \mathbb{R}^V are the center vectors of the j{j}-th, k{k}-th, and yiy^i-th classes, respectively. Intuitively, the first term of LD\mathcal{L}_D encourages the feature vectors of one class to be close to their corresponding center while the second term improves the inter-class discrimination by pushing the cluster centers far away from each other. Finally, the total loss for training is computed as:

\mathcal{L} = \mathcal{L}_{cls} + \gamma\mathcal{L}_D

where γ\gamma is the balancing coefficient between the two losses.
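For reference, here is a minimal sketch of the discriminative loss \mathcal{L}_D; the function signature and the use of learnable class centers stored in a tensor are our own assumptions.

```python
import torch

def discriminative_loss(feats: torch.Tensor, labels: torch.Tensor,
                        lam: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Uncertainty-weighted center loss plus a center-separation term.

    feats:   (n, V) feature vectors v^i
    labels:  (n,)   class indices y^i
    lam:     (n,)   per-sample uncertainty factors lambda^i
    centers: (m, V) learnable class centers mu_j
    """
    V = centers.size(1)
    # pull features towards their class center, down-weighted for uncertain samples
    intra = 0.5 * ((1.0 - lam) * ((feats - centers[labels]) ** 2).sum(dim=1)).sum()
    # push the centers of different classes apart
    dist2 = torch.cdist(centers, centers) ** 2
    off_diag = ~torch.eye(centers.size(0), dtype=torch.bool, device=centers.device)
    inter = torch.exp(-dist2[off_diag] / (V ** 0.5)).sum()
    return intra + inter
```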

References

[1] Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 2019

[2] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, 2017.

[3] B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 2017.

Uncertainty-aware Label Distribution Learning for Facial Expression Recognition (Part 2)

In the previous post, we introduced our proposed method for Facial Expression Recognition. In this post, we will examine its effectiveness and efficiency.

Experimental Results

Noisy and Inconsistent Labels

Table 1. Test performance with synthetic noise.

We conduct experiments to study the robustness of our LDLVA to mislabelled data by adding synthetic noise to the AffectNet, RAF-DB, and SFEW datasets. Specifically, we randomly flip the manual labels to one of the other categories. We report the mean accuracy and standard error in Table 1. The results clearly show that our method consistently outperforms other approaches in all cases. We also observe that the improvements are even more apparent when the noise ratio increases; for example, the accuracy improvement on RAF-DB is 4.7\% with 10\% noise and 6.93\% with 30\% noise. The consistent results under various settings demonstrate the ability of our method to effectively deal with noisy annotations, which is crucial for robustness against label ambiguity.

Table 2. Test performance with inconsistent labels between cross-datasets.

Since the annotations for large-scale FER data are commonly obtained via crowd-sourcing, this can create label inconsistency, especially between different datasets. To examine the effectiveness of our proposed method in dealing with this problem, we also perform experiments with the cross-dataset protocol. Table 2 shows that our method achieves the best performance on all three datasets with the highest average accuracy, surpassing the current state-of-the-art methods. This confirms the advantages of our method over previous works and demonstrates its generalization ability to data with label inconsistency, which is essential for real-world FER applications.

Comparison with state of the arts

Table 3. Comparison with recent methods on the original datasets.

We further compare our method with several state-of-the-art methods on the original AffectNet, RAF-DB, and SFEW datasets to evaluate the robustness of our method to the uncertainty and ambiguity that unavoidably exist in real-world FER datasets. The results are presented in Table 3. By leveraging label distribution learning on the valence-arousal space, our model outperforms other methods and achieves state-of-the-art performance on AffectNet, RAF-DB, and SFEW. Although these datasets are considered to be "clean", the results suggest that they indeed suffer from uncertainty and ambiguity.

Qualitative Analysis

Real-world Ambiguity: To understand more about real-world ambiguous expressions, we conducted a user study in which we asked participants to choose the most clearly expressed emotion on random test images. We compare our model's predictions with the survey results in Figure 3. We can see that these images are ambiguous as they express a combination of different emotions, hence the participants do not fully agree and have different opinions about the most prominent emotion on the faces. It is further shown that our model can give consistent results and agree with the perception of humans to some degree.

Figure 3. Comparison of the results from our user study and our model.

Uncertainty Factor: Figure 4 shows the estimated uncertainty factors of some training images and their original labels. The uncertainty values decrease from top to bottom. Highly uncertain labels can be caused by low-quality inputs (as shown in Angry and Surprise columns) or ambiguous facial expressions. In contrast, when the emotions can be easily recognized as those in the last row, the uncertainty factors are assigned low values. This characteristic can guide the model to decide whether to put more weight on the provided label or the neighborhood information. Therefore, the model can be more robust against uncertainty and ambiguity.

Figure 4. Visualization of uncertainty values of some examples from RAF-DB dataset.

Conclusion

We have introduced a new label distribution learning method for facial expression recognition by leveraging structure information in the valence-arousal space to recover the intensities distributed over emotion categories. The constructed label distribution provides rich information about the emotions and can effectively describe the ambiguity degree of the facial image. Intensive experiments on popular datasets demonstrate the effectiveness of our method over previous approaches under inconsistency and uncertainty conditions in facial expression recognition.

References

[1] Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 2019

[2] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, 2017.

[3] B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 2017.