Style Transfer for 2D Talking Head Generation (Part 4)

In the previous parts of this series, we presented a comprehensive algorithmic overview of our method, which is key to achieving personalized talking head animation. Now, we describe the experimental procedure used to validate the efficacy of the proposed system.

image

Experimental setups

Datasets

Face. We use the VoxCeleb2 dataset to learn facial expressions. All videos are extracted at 60 FPS. We first crop each video to keep the face in the center, then resize it to $512 \times 512$. Our internal face tracker is used to obtain 68 key points on the face, and a face segmentation model is used to obtain the skin mask. The head and torso motion is manually identified in the first frame of each sequence and tracked through the remaining frames using optical flow.

Audio. We use the Common Voice dataset to train the Audio Encoder. There are around 26 hours of unlabeled utterances across all samples. $80$-dimensional log Mel spectrograms are used as the surface representation and are computed with a $\frac{1}{120}$ s frame shift, a $\frac{1}{60}$ s frame length, and a $512$-point STFT.
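To make the frame settings concrete, the sketch below converts the time-based values into sample counts. The post does not state the audio sampling rate, so the 16 kHz value is purely an assumption for illustration, and the helper names `stft_frame_params` and `num_frames` are hypothetical.

```python
# Assumption: the sampling rate is not given in the post; 16 kHz is illustrative.
SAMPLE_RATE = 16_000

FRAME_SHIFT_S = 1 / 120   # frame shift stated in the post
FRAME_LENGTH_S = 1 / 60   # frame length stated in the post
N_FFT = 512               # STFT size stated in the post
N_MELS = 80               # number of Mel bins stated in the post
VIDEO_FPS = 60            # video frame rate stated in the post


def stft_frame_params(sample_rate: int) -> dict:
    """Convert the time-based frame settings into sample counts."""
    hop_length = round(sample_rate * FRAME_SHIFT_S)
    win_length = round(sample_rate * FRAME_LENGTH_S)
    if win_length > N_FFT:
        raise ValueError("analysis window is longer than the FFT size")
    return {
        "hop_length": hop_length,
        "win_length": win_length,
        "n_fft": N_FFT,
        "n_freq_bins": N_FFT // 2 + 1,  # one-sided spectrum size
        "n_mels": N_MELS,
    }


def num_frames(num_samples: int, hop_length: int, win_length: int) -> int:
    """Number of full analysis windows that fit without padding."""
    if num_samples < win_length:
        return 0
    return 1 + (num_samples - win_length) // hop_length


# With a 1/120 s shift and 60 FPS video, each video frame spans two audio frames.
AUDIO_FRAMES_PER_VIDEO_FRAME = round((1 / VIDEO_FPS) / FRAME_SHIFT_S)

params = stft_frame_params(SAMPLE_RATE)
frames_in_one_second = num_frames(
    SAMPLE_RATE, params["hop_length"], params["win_length"]
)
```

Under the assumed 16 kHz rate, the window is about 267 samples and is zero-padded to the 512-point FFT. A different sampling rate changes the sample counts but not the two-to-one ratio between spectrogram frames and video frames.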

We evaluate and benchmark our results on the RAVDESS dataset. RAVDESS is a validated multimodal database of emotional speech and song, which makes it a suitable and challenging benchmark for our method and the baselines. Note that we use this dataset for benchmarking only, to avoid training bias.

Implementation

We implement our framework using PyTorch. We train the network on the NVIDIA Titan V100 GPU with the Adam optimizer. The learning rate is set to $10^{-4}$, $10^{-4}$, $10^{-5}$, and $10^{-4}$ to train the Audio Encoding, the Motion Generator, the Style-Aware Generator, and the Style Mapping, respectively. The batch size is set to $8$ for the Style-Aware Generator and $64$ for the other modules.

Qualitative Evaluation

fig3

Figure 1. Our 2D photo-realistic talking head results with different styles. (a), (c), (e) are ballad, rap, and opera style references, respectively; (b), (d), (f) are the corresponding style transfer results.

Figure 1 shows that our method successfully transfers different styles such as ballad, rap, or opera to a new target character. Figure 2(a) compares our method with recent works on 2D photorealistic talking head animation (MakeItTalk, LSP, ...) when the character sings an opera song. Our method produces better mouth motion variance and eye expression than MakeItTalk and LSP. In Figure 2(b), we compare different styles when they are encoded with one input audio to generate talking heads. Note that, in this case, different input images are used to verify the synthesis effectiveness of our method. Although different styles are encoded into different images to generate different talking heads, the animation is realistic and the lip-synchronization performance is well preserved.

fig3

Figure 2. Comparison between different 2D talking head galleries on multiple styles. Our method generates more natural and realistic motion, especially around the mouth and eyes of the character.

Quantitative Evaluation

Evaluation metrics

We use five different metrics to evaluate how good and natural the animation of the generated talking head is: Cumulative Probability Blur Detection (CPBD), Landmark Distance (D-L), Landmarks Distance around the Mouth (LMD), Landmark Velocity difference (D-V), and Difference in the open mouth area (D-A).

Style transfer metric. To evaluate style transfer results efficiently, we introduce the following three new metrics.

Style-Aware Landmarks Distance (SLD): This metric evaluates the style information encoded in a generated talking head. It calculates the accuracy of the mouth, eye, and head pose shapes between a chunked window of the style reference and a chunked window of the corresponding talking head animation. Lower is better.

Let's assume that a style reference video with $N_s$ frames is split into multiple temporal periods of $\rm F$ frames (the window size), i.e., style reference windows $\rm W_s = \left(w_s^{(0:F)}, w_s^{(v:F+v)}, w_s^{(2v:F+2v)},\cdots, w_s^{(\kappa v:F +\kappa v)}\right)$, where $\rm w_s^{(i:F+i)}$ denotes the frames from the $\rm i$-th to the $\rm (F+i)$-th of the reference video, $\rm v$ is the stride, and $\rm \kappa = \lfloor(N_s - F) / v\rfloor$.

Similar to the reference video, we chunk the generated animation video into smaller windows $\rm W_a = \Bigl(w_a^{(0:F)}, w_a^{(v:F+v)}, w_a^{(2v:F+2v)}, \cdots , w_a^{(\kappa v:F+ \kappa v)}\Bigr)$. The SLD is then computed with the D-L metric at its core:

$$\rm{SLD} = \frac{1}{\rm \vert W_s\vert}\sum_{\rm w_s \in \rm W_s} \left(\underset{\rm w_a \in \rm W_a}{\rm{min}} \left(\rm{D\rm{-}L}\left(\rm w_s, \rm w_a\right)\right)\right)$$

where $\rm{D\rm{-}L}$ is the Landmark Distance metric.

Similarly, we calculate the Style-Aware Landmarks Velocity Difference (SLV) for landmark velocity, and Style-Aware Mouth Area Difference (SMD) for open mouth area accuracy.
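The windowed matching behind SLD can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the reported numbers. It assumes that D-L is the mean Euclidean distance between corresponding landmarks, which the post does not spell out, and that each video is chunked using its own length. The function names are hypothetical.

```python
import math


def landmark_distance(window_a, window_b):
    """Assumed stand-in for D-L: mean Euclidean distance between
    corresponding landmarks of two equally sized windows."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(window_a, window_b):
        for point_a, point_b in zip(frame_a, frame_b):
            total += math.dist(point_a, point_b)
            count += 1
    return total / count


def chunk_windows(frames, window_size, stride):
    """Split frames into windows starting at 0, v, 2v, ..., kappa * v."""
    if len(frames) < window_size:
        raise ValueError("video is shorter than one window")
    kappa = (len(frames) - window_size) // stride
    return [
        frames[i * stride : i * stride + window_size]
        for i in range(kappa + 1)
    ]


def sld(style_frames, anim_frames, window_size, stride):
    """For every style window, take the closest animation window,
    then average those minimum distances. Lower is better."""
    style_windows = chunk_windows(style_frames, window_size, stride)
    anim_windows = chunk_windows(anim_frames, window_size, stride)
    best = [
        min(landmark_distance(ws, wa) for wa in anim_windows)
        for ws in style_windows
    ]
    return sum(best) / len(best)


# Toy data: one 2D landmark per frame, moving along the x axis.
style = [[(float(t), 0.0)] for t in range(6)]
shifted = [[(t + 1.0, 0.0)] for t in range(6)]
same_score = sld(style, style, window_size=4, stride=2)
shifted_score = sld(style, shifted, window_size=4, stride=2)
```

Taking the minimum over animation windows means a style window only needs to appear somewhere in the generated video, not at the same time position. Swapping `landmark_distance` for a velocity or mouth-area difference would give the SLV and SMD analogues under the same assumptions.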

Talking Head Generation Results

fig3

Table 1. Result of different 2D talking head generation methods.

Table 1 compares the 2D talking head results of our method with recent baselines. Our method outperforms recent state-of-the-art approaches by a large margin. In particular, it achieves the best scores on the CPBD, LMD, D-L, D-V, and D-A metrics. These results show that our method successfully renders the 2D talking head and improves the quality of the rendered results. Overall, our method increases the sharpness of the head (measured by the CPBD metric) while generating natural facial motion (measured by the LMD, D-L, D-V, and D-A metrics).

Style Comparison

fig3

Table 2. Result comparison in terms of style transfer between different 2D talking head generation methods.

Table 2 shows the comparison between our method and other baselines in terms of style transfer. Three designed metrics (SLD, SLV, and SMD) are used for evaluation and benchmarking. The results show that our method outperforms others by a large margin in all three metrics. This substantial performance gap strongly suggests the efficacy of our method in accurately capturing style characteristics from the reference image and seamlessly transferring them onto the target image. This not only highlights the robustness of our method but also underscores its potential for practical applications in various domains requiring high-quality style transfer. Furthermore, these results underscore the importance of our approach in advancing the state-of-the-art in style transfer techniques, promising richer and more faithful artistic transformations.

Conclusion

In this work, we have introduced a novel method for generating lifelike 2D talking heads from input audio signals. In addition to the primary audio stream and an accompanying image, our framework uses a curated set of reference frames to effectively learn the style characteristics of a character. Notably, our approach handles demanding and challenging vocal styles, including ballad, opera, and rap, where complex movements are necessary to produce animations that are faithful and natural. Extensive experiments demonstrate the superior performance of our talking head synthesis, showing qualitative and quantitative advantages over recent state-of-the-art methods. The versatility of our framework makes it potentially useful for diverse applications, ranging from dubbing and video conferencing to the creation of dynamic virtual avatars. With the ability to accurately capture and animate diverse head movement styles, we hope to further advance the field of character animation, allowing more expressive and vivid human-like talking animation.

Style Transfer for 2D Talking Head Generation (Part 3)

In the previous part, we presented our general pipeline for generating a stylized talking head animation from an input audio signal, which consists of the Motion Generator, Style Mapping, and Style-Aware Generator modules. Here, let's investigate the style transfer process, that is, how to learn and capture the specific talking styles of individuals to achieve personalized motion synthesis.

image

Style Transfer

The style transfer phase focuses on transferring the styles to a new character by re-weighting the Motion Generator given the input audio. In our transferring phase, we assume that the talking or singing styles are encoded in both the audio stream and reference images, and should be distinguishable from the visual information of the character. Therefore, this style information is learnable and can be transferred from one to another character. Practically, we also mainly rely on the pre-trained models from the training phase to perform the style transfer. Since Style-Aware Generator can cover the visual information generated from different styles, our goal in this phase is to make sure the style encoded in the Intermediate Audio-driven Motions can be adjusted to different styles rather than just the neutral one (i.e., the styles in the training data). We capture both the audio stream and reference images as the input in this stage. Figure 1 shows the details of our style transfer process (the light blue box) that performs appropriate adjustments on the Generator's weights to take into account a specific style.

fig3

Figure 1. A detailed illustration of our Style-aware Talking Head Generator and Style Transferring process.

Given the reference images and an audio stream (e.g., opera, rap, etc.), we first use the pre-trained audio encoding to extract the audio feature and apply the Motion Generator to reconstruct the audio-driven motion $\mathbf{\phi}_{\rm mg}$. The reference images are fed through a pre-trained landmark detector to extract their corresponding facial landmarks $\mathbf{\phi}_{\rm s}$. The generated motions and facial landmarks are vectorized into $(68 \times 3)$-dimensional vectors. Both $\mathbf{\phi}_{\rm mg}$ and $\mathbf{\phi}_{\rm s}$ are then passed through a style transfer network to extract the mean features. A style transfer loss $\mathcal{L}_{\rm transfer}$ is then optimized through back-propagation. The mean features are the latent encoded vectors containing information from both the audio-driven landmarks and the facial landmarks.

Style Transfer Network

The style transfer network $f(\cdot)$ is a discriminator that aims to learn the differences between the motions of the input reference images and the audio-driven motions extracted from the Motion Generator. Thanks to the style transfer loss $\mathcal{L}_{\rm transfer}$, the network is optimized to narrow the gap between these two motions and then re-weight the parameters of the Motion Generator so that it generates output motions that are similar to the target style. After re-weighting, the Motion Generator can produce style-aware audio-driven motions, which are then passed into the Style-Aware Generator to generate 2D animation with style. The style transfer network has three multilayer perceptron (MLP) layers with $1024$, $512$, and $256$ neurons, respectively. The final layer produces the mean features used in the style transfer loss.

Style Transfer Loss

The style transfer loss is proposed to ensure the generated motions take into account the target style. This loss is incorporated with the Motion Generator loss $\mathcal{L}_{\rm mg}$ for fine-tuning the Motion Generator module during the transferring process. The style transfer loss $\mathcal{L}_{\rm transfer}$ is made up of the constraint loss $\mathcal{L}_{\rm sc}$ and the regularization loss $\mathcal{L}_{\rm r}$. The constraint loss is introduced to learn the style from the source motion and then transfer it into the generated one through the style transfer network.

$$\mathcal{L}_{\rm sc} = \left\lVert f(\mathbf{\phi}_{\rm{mg}}) - f(\mathbf{\phi}_{\rm s}) \right\rVert_2^{2} \tag{1}$$

where $f(\cdot)$ is the style transfer network.

The regularization loss $\mathcal{L}_{\rm r}$ aims to increase the generalization of the style transfer process. Besides, it can deal with extreme cases of the generated motions that may break the manifold of valid styles and negatively affect the generated images. This loss is computed as:

$$\mathcal{L}_{\rm r} = \bigg(\left\lVert \nabla_{\mathbf{\hat\phi}_{\rm{mg}}}f(\mathbf{\hat\phi}_{\rm{mg}}) \right\rVert_2 - 1\bigg)^{2} \tag{2}$$

where $\mathbf{\hat\phi}_{\rm{mg}}$ is the joint representation that controls the contribution of source motion $\mathbf{\phi}_{\rm s}$ during the style learning process. $\mathbf{\hat\phi}_{\rm{mg}}$ is computed from $\mathbf{\phi}_{\rm s}$ and $\mathbf{\phi}_{\rm{mg}}$ as follows:

$$\mathbf{\hat\phi}_{\rm {mg}} = \gamma \mathbf{\phi}_{\rm s} + (1 - \gamma) \mathbf{\phi}_{\rm {mg}} \tag{3}$$

where $\gamma$ controls the amount of leveraged style information.

The final transferring loss $\mathcal{L}_{\rm {transfer}}$ is computed as:

$$\mathcal{L}_{\rm {transfer}} = \mathcal{L}_{\rm mg} + \mathcal{L}_{\rm sc} + \mathcal{L}_{\rm r} \tag{4}$$

where $\mathcal{L}_{\rm mg}$ is the Motion Generator reconstruction loss. To control the style, both reference images and the audio stream are required during the transferring process.
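How Eqs. (1) to (4) fit together can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: `linear_f` is a hypothetical linear stand-in for the three-layer MLP $f(\cdot)$, the gradient in Eq. (2) is approximated by finite differences, and its norm is taken as the Frobenius norm of the Jacobian because $f$ outputs a vector here. The Motion Generator loss is passed in as a plain number.

```python
import math


def linear_f(weights, x):
    """Hypothetical stand-in for the style transfer network: y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def constraint_loss(f, phi_mg, phi_s):
    """Eq. (1): squared L2 distance between the two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(f(phi_mg), f(phi_s)))


def joint_representation(phi_s, phi_mg, gamma):
    """Eq. (3): blend of source and generated motion controlled by gamma."""
    return [gamma * s + (1.0 - gamma) * m for s, m in zip(phi_s, phi_mg)]


def regularization_loss(f, phi_hat, eps=1e-6):
    """Eq. (2), with the gradient estimated by forward finite differences
    (an assumption made to keep this sketch dependency-free)."""
    base = f(phi_hat)
    squared_norm = 0.0
    for j in range(len(phi_hat)):
        bumped = list(phi_hat)
        bumped[j] += eps
        column = [(b - a) / eps for a, b in zip(base, f(bumped))]
        squared_norm += sum(c * c for c in column)
    return (math.sqrt(squared_norm) - 1.0) ** 2


def transfer_loss(f, phi_mg, phi_s, gamma, loss_mg):
    """Eq. (4): Motion Generator loss plus constraint and regularization."""
    phi_hat = joint_representation(phi_s, phi_mg, gamma)
    return (
        loss_mg
        + constraint_loss(f, phi_mg, phi_s)
        + regularization_loss(f, phi_hat)
    )


# Toy example with 2-dimensional motions instead of 68 x 3 landmarks.
W = [[1.0, 0.0]]
f = lambda x: linear_f(W, x)
phi_mg, phi_s = [1.0, 2.0], [3.0, 5.0]
total = transfer_loss(f, phi_mg, phi_s, gamma=0.5, loss_mg=0.5)
```

In the toy example the stand-in network has a unit-norm Jacobian, so the regularization term is close to zero and the total is dominated by the constraint term. In a real setup the gradient would come from automatic differentiation rather than finite differences.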

Style Transfer for 2D Talking Head Generation (Part 2)

In the previous part, we introduced an interesting problem in 2D animation and provided an overview of our system for personalized talking head creation via style transfer. We also reviewed existing research and identified crucial problems in the topic. Now we can dive into the details of our proposed method, starting with how to generate a talking head animation from an audio input.

image

Style-Aware Talking Head Generator

Our Style-Aware Talking Head Generator is illustrated in Figure 1. It first takes an audio sequence (such as speech or singing) to synthesize the corresponding Intermediate Audio-driven Motion. The Style Reference is then learned by the Style Mapping to capture the personalized talking style of an arbitrary character, which is then combined with the intermediate motion to create a talking head animation with the desired style.

Audio Stream Representation and Motion Generator

Audio Stream Representation. Representing audio for learning is essential in a talking head generator. Different speaking styles, referred to as individualized styles, can present challenges when using deep speech representation directly, resulting in suboptimal outcomes, particularly when dealing with distant variations in speech features. To improve the generalization of the audio stream extractor, we incorporate Lu et al.’s manifold projection technique.

Motion Generator. Given the extracted audio features, this step generates audio-driven motions in our framework. In practice, the character’s style is mainly defined by the mouth, eye, head, and torso movement. Therefore, we consider the motion around these regions of the face in our work.

styletransfer1

Figure 1. A detailed illustration of our Style-aware Talking Head Generator.

Style Reference Images

To learn the character’s styles more effectively, we define the Style Reference Images as a set of images retrieved from a video of a specific character by using the key motion templates. Inspired by Lu et al. and music theory about rhythm, we use four key motion templates that contain popular motion ranges and behaviors. Each behavior is then plotted as a reference style pattern, which is used to retrieve the most similar frames in each video in the dataset. To retrieve similar patterns, we apply similarity search to each image in the video of the character. The resulting image set is called the Style Reference Images and is used to provide the character’s style information in our framework.

image

Figure 2. An illustration of the four key motion templates.

The style reference is expected to capture the distinctive personal traits of the characters when they are talking or singing. To learn and capture the style information from a target character, we need key motion templates that match the syllables. According to music theory about rhythm, a word can have many syllables and one syllable can have more than one vowel. Vowels are a, e, i, o, u; the other letters (such as b, c, d, f) are consonants. Each word can be split into single syllables that follow open and closed syllable patterns. A closed syllable has a short vowel and ends in a consonant. It corresponds to the ‘None’ case and the ‘M’ case, which are distinguished by differences in mouth shape. An open syllable ends with a vowel sound that is spelled with a single vowel letter. The ‘R’ case and the ‘O’ case are two cases of the open syllable with large differences in mouth motion. Each word can be formed by more than one vowel, and there are seven syllable types in total in English. A visualization of the motion templates is shown in Figure 2.

Style Mapping

The Style Mapping is designed to disentangle the style in the reference images and then map the extracted style to the neutral image. The input of this module is a pair of images: a neutral image $\mathbf{\textit{I}}_{s}$ and a style reference image $\mathbf{\textit{I}}_{r}$. The output is an Intermediate Style Pattern (ISP, an image) whose identity and visual information come from $\mathbf{\textit{I}}_{s}$ and whose style comes from $\mathbf{\textit{I}}_{r}$. In practice, we first disentangle the style information encoded in the pose and expression of both the neutral and reference image, then map the style from the reference image into the neutral image to generate the output ISP image $\mathbf{\textit{I}}_{o}$.

Disentangling Neutral Image. Since the head pose, expression, and keypoints from the neutral image contain the style information of a specific character, they need to be disentangled to learn the style information. In this step, given an input image $\mathbf{\textit{I}}_{s}$, a set of $\rm{k}$ keypoints $\rm{c}_k$ is first extracted by a Keypoint Extractor network to store the geometry signature. Then, we extract the pose, parameterized by a translation vector $\tau \in \mathbb{R}^3$ and a rotation matrix $\rm{R} \in \mathbb{R}^{3\times 3}$, and the expression information $\varepsilon_k$ from the image with a Pose Expression network. The extracted keypoints maintain the geometry signature and style information of the head in the neutral image. The final keypoints $\rm{C}_k$ of the neutral image are computed as:

$$\rm{C}_k = \rm{c}_k \times \rm{R} +\tau + \varepsilon_k \tag{1}$$

Disentangling Style Reference Image. Similar to the neutral image, we use two deep networks to disentangle and extract the head pose and keypoints from the style reference image. However, instead of extracting new keypoints from the reference images, we reuse the keypoints $\rm{c}_k$ extracted from the neutral image, which contain the identity-specific geometry signature of the neutral image. The final keypoints $\bar{\rm{C}}_k$ of the style reference image are computed as:

$$\bar{\rm{C}}_k = \rm{c}_k \times \bar{\rm{R}} +\bar{\tau} + \bar{\varepsilon}_k \tag{2}$$

where $\bar{\tau} \in \mathbb{R}^3$, $\bar{\rm{R}} \in \mathbb{R}^{3\times 3}$, and $\bar{\varepsilon}_k$ are the translation vector, rotation matrix, and expression information extracted from the style reference image, respectively.
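The two keypoint equations share the same canonical keypoints and differ only in pose and expression. The sketch below applies that transform with plain Python lists. It assumes a row-vector convention for the product of a keypoint with the rotation matrix, and the function name `transform_keypoints` and all numeric values are made up for illustration.

```python
def transform_keypoints(canonical, rotation, translation, expression):
    """Apply C_k = c_k x R + tau + eps_k to every keypoint.

    canonical:   list of 3D keypoints c_k
    rotation:    3 x 3 rotation matrix R (row-vector convention, assumed)
    translation: 3-vector tau shared by all keypoints
    expression:  list of per-keypoint 3D offsets eps_k
    """
    transformed = []
    for c, eps in zip(canonical, expression):
        rotated = [
            sum(c[i] * rotation[i][j] for i in range(3)) for j in range(3)
        ]
        transformed.append(
            [rotated[j] + translation[j] + eps[j] for j in range(3)]
        )
    return transformed


# One canonical keypoint, reused for both images as in Eqs. (1) and (2).
c_k = [[1.0, 0.0, 0.0]]

# Neutral image: identity pose and no expression offset (toy values).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
neutral = transform_keypoints(c_k, identity, [0.0, 0.0, 0.0], [[0.0, 0.0, 0.0]])

# Style reference image: 90-degree turn about the z axis plus offsets.
turn_z = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
styled = transform_keypoints(c_k, turn_z, [1.0, 1.0, 1.0], [[0.5, 0.0, 0.0]])
```

Because `c_k` is shared, any difference between `neutral` and `styled` comes from pose and expression alone, which is what lets the identity of the neutral image carry over to the style reference keypoints.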

Style Mapping. To construct the Intermediate Style Pattern $\mathbf{\textit{I}}_{o}$, we first extract two keypoint sets $\rm{C}_k$ and $\bar{\rm{C}}_k$ from the neutral image and the style reference image. We then estimate the warping function based on the two keypoint sets to warp the encoded features of the source (neutral image) to the target so that it can represent the style of the reference image. Then, we feed the warped version of the source encoded features and the extracted style information into an Intermediate Generator to obtain the ISP image. In practice, we choose the neutral image as a general image in the Obama Weekly Address dataset, while the style reference image is one of the four images in the Style Reference Images set. By applying the style mapping process to all four images in the Style Reference Images, we obtain a set of four ISP images. This set (the Intermediate Style Pattern - ISP) is used as the input for the Style-Aware Generator in the next section.

Style-Aware Generator

This module generates a 2D talking head from a source image, the generated intermediate motion, and the style information represented in the Intermediate Style Pattern. In this module, the facial map plays an essential role in explicitly identifying groups of facial keypoints, which makes the style-aware learning process easier to converge. In our experiment, the facial map has the size of $512 \times 512$ and can be obtained by connecting consecutive keypoints in a preset semantic sequence and projecting it onto the 2D image plane using a pre-computed camera matrix.

Style-Aware Loss. To make the generator style-aware, we introduce the style-aware photometric loss $\mathcal{L}_{\rm{sp}}$. This loss is combined with the generator loss $\mathcal{L}_{\mathbf{G}}$ to improve the generation quality and penalize generated output that deviates strongly from the reference style patterns. The style-aware photometric loss is formulated as the pixel-wise error between the generated image $\mathbf{\textit{I}}'$ and the matched style pattern image $\mathbf{\textit{I}}_{\rm {m}}$:

$$\mathcal{L}_{\rm{sp}}=\Vert \mathbf{W} \odot (\mathbf{\textit{I}}' - \mathbf{\textit{I}}_{\rm {m}}) \Vert_{1} \tag{3}$$

where $\mathbf{W}$ is the weighting mask, whose values depend on the face region; $\odot$ denotes the Hadamard product; and the matched style pattern image $\mathbf{\textit{I}}_{\rm {m}}$ is obtained by using a pre-trained deep feature extractor to retrieve the best-matched image corresponding to one of the style reference images. To acquire $\mathbf{W}$, we first use an off-the-shelf face parsing method to generate the segmentation mask of the face. To achieve high-fidelity image generation, we want the network to weight facial regions according to their importance. Specifically, the weights of $\mathbf{W}$ for the mouth, eyes, and skin regions are set to $5.0$, $3.0$, and $1.0$, respectively. Note that the weights for other regions in the weighting mask $\mathbf{W}$, e.g. the background, are set to $0$.
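A small sketch of the weighted photometric loss follows, using nested lists as single-channel images. The region weights come from the text above. The label names, the unnormalized sum, and the function names are assumptions for illustration, since the post does not say how the face parser labels regions or whether the norm is averaged.

```python
# Region weights from the text; any other label (e.g. background) gets 0.
REGION_WEIGHTS = {"mouth": 5.0, "eyes": 3.0, "skin": 1.0}


def weighting_mask(parsing):
    """Turn a per-pixel face parsing map (label strings, assumed format)
    into the weighting mask W."""
    return [[REGION_WEIGHTS.get(label, 0.0) for label in row] for row in parsing]


def style_photometric_loss(generated, matched, mask):
    """L1 norm of W (Hadamard) (I' - I_m), summed over all pixels."""
    total = 0.0
    for row_g, row_m, row_w in zip(generated, matched, mask):
        for g, m, w in zip(row_g, row_m, row_w):
            total += abs(w * (g - m))
    return total


# Toy 2 x 2 example: every pixel differs by 1.0 from the matched pattern.
parsing = [["mouth", "eyes"], ["skin", "background"]]
generated = [[1.0, 1.0], [1.0, 1.0]]
matched = [[0.0, 0.0], [0.0, 0.0]]
mask = weighting_mask(parsing)
loss = style_photometric_loss(generated, matched, mask)
```

With equal per-pixel error, the mouth pixel contributes five times as much as the skin pixel and the background pixel contributes nothing, which is the effect the weighting mask is meant to have.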

Style Transfer for 2D Talking Head Generation (Part 1)

Audio-driven talking head animation is a challenging research topic with many real-world applications. Recent works have focused on creating photo-realistic 2D animation, while learning different talking or singing styles remains an open problem. In this paper, we present a new method to generate talking head animation with learnable style references. Given a set of style reference frames, our framework can reconstruct 2D talking head animation based on a single input image and an audio stream. Our method first produces facial landmark motion from the audio stream and constructs the intermediate style patterns from the style reference images. We then feed both outputs into a style-aware image generator to generate photo-realistic, high-fidelity 2D animation. In practice, our framework can extract the style information of a specific character and transfer it to any new static image for talking head animation. Extensive experimental results show that our method achieves better results than recent state-of-the-art approaches qualitatively and quantitatively.

image

Introduction

Talking head animation is an active research topic in both academia and industry. This task has a wide range of real-world interactive applications such as digital avatars and digital animations. Given an arbitrary input audio and a 2D image (or a set of 2D images) of a character, the goal of talking head animation is to generate photorealistic frames. The output can be a 2D or 3D talking head. With recent advances in deep learning, especially generative adversarial networks, several works have addressed different aspects of the talking head animation task such as head pose control, facial expression, emotion generation, and photo-realistic synthesis.

While there has been considerable advancement in the generation of talking head animation, achieving photo-realistic and high fidelity animation is not a trivial task. It is even more challenging to render natural motion of the head with different styles. In practice, several aspects contribute to this challenge. First, generating a photo-realistic talking head using only a single image and audio as inputs requires multi-modal synchronization and mapping between the audio stream and facial information. In many circumstances, this process may result in fuzzy backgrounds, ambiguous fidelity, or abnormal face attributes. Second, various talking and singing styles can express diverse personalities. Therefore, the animation methods should be able to adapt and generalize well to different styles. Finally, controlling the head motion and connecting it with the full-body animation remains an open problem.

Recently, several methods have been proposed to generate photo-realistic talking heads or to match the pose from a source video, while little work has focused on learning the personalized character style. In practice, apart from personalized talking styles, there are different singing styles such as ballad and rap. These styles pose a more challenging problem for talking head animation as they have unique eye, head, mouth, and torso motion. The facial movements of singing styles are also more varied and dynamic than those of the talking style. Therefore, learning and bringing these styles into 2D talking heads is more challenging. Currently, most style-aware talking head animation methods do not fully disentangle the audio style information and the visual information, which causes ambiguity during the transferring process.

image

Figure 1. Given an audio stream, a single image, and a set of style reference frames, our method generates realistic 2D talking head animation.

In this work, we present a new deep learning framework called Style Transfer for 2D talking head animation. Our framework provides an effective way to transfer talking or singing styles from the style reference to animate a single 2D portrait of a character given an arbitrary input audio stream. We first generate photo-realistic 2D animation with natural expression and motion. We then propose a new method to transfer the personalized style of a character into any talking head with a simple style-aware transfer process. Figure 1 shows an overview of our approach.

Research Overview

2D Talking Head Animation. Creating talking head animation from an input image and audio has been widely studied in the past few years. One of the earliest works treated this as a sorting task that reorders images from video footage. Some works proposed capturing 3D models of the dubber and the actor to synthesize a photo-realistic face, while others introduced a learning approach to build a trainable system that could synthesize a mouth shape for an unseen utterance. Later works focused on audio-driven generation of realistic mouth shapes and faces, or on generating full facial landmarks. The animation quality of these approaches can be improved by creating a talking face that includes pose and expression, mainly by generating high-fidelity talking heads with natural head poses and realistic motion. Recently, some methods have been developed to encode personalized information within the talking head animation, or to take advantage of diffusion models to improve the diversity of the generated talking faces.

Speaker Style Estimation. There are many kinds of speaker styles, such as generic, personal, controlled pose, or special expression. A generic style can be learned by training on multiple videos, while a personalized style can be captured by training specifically on one avatar of a person. In general, some well-known methods aim to generate controllable poses from an input video, or to transfer poses and expressions from another video input, such as mapping the style from dubber to actor. Another interesting method captures motions from the driving video and transfers them to the input image during the generation process. Speaker information and the speaking environment can be further combined to characterize speaker variability in the environment. For example, we can leverage a pre-captured database of 3D mouth shapes and associated speech audio from one speaker to refine the mouth shape of a new actor. Recently, Zhang et al. developed a state-of-the-art method that can generate diverse and synchronized talking videos from input audio and a single reference image by using a conditional variational autoencoder to capture the style code.

Speech Representation for Face Animation. Some prior works used hand-crafted models to match phonemes and mouth shapes in each millisecond of the audio signal as the speech representation. Later, DeepSpeech paved the way for learning a speech recognition system using an end-to-end deep network. Following that, an improvement was made by training Bi-LSTMs to learn a long-term language structure that models the relationship between speech and the complex activity of faces. Additionally, Mel-frequency spectral coefficients can be used to synthesize a high-quality mouth texture of a character, which is then combined with a 3D pose matching method to synchronize the lip motion with the audio in the target animation. With the rise of diffusion techniques, Diff2lip proposed an audio-conditional diffusion model that effectively encodes audio in its generator to solve the lip-sync challenge.

Our goal is to introduce a new deep-learning framework that can transfer talking or singing styles from any personalized style reference to animate a single 2D portrait of a character given an arbitrary input audio stream. Compared to existing approaches, which have mainly focused on conventional talking head animation, our method not only produces animation for common talking styles but also supports transferring several special styles that are much more challenging, such as singing.

To summarize, our research proposes a new framework for generating photorealistic 2D talking head animations from an audio stream. Furthermore, we present a style-aware transfer technique, which enables us to learn and apply any new style to the animated head. Our generated 2D animation is photo-realistic and high fidelity with natural motions. To validate our system, we conduct extensive analysis and demonstrate that our proposed method outperforms recent approaches both qualitatively and quantitatively.