In this blog post, I survey geometry-aware Neural Style Transfer (NST) methods and compare them with standard NST methods. I also analyze some weaknesses of current geometry-aware NST methods. All experiments are based on the implementations of DST, NST, AdaIN, and FastDST.
1. Problem statement
Input: A content image and a style image
Output: A stylized version of the content image
2. Experiment Setting
|Method |Inference Time (s)
|DST |105 + 345
|FastDST |2 + 342
Table 1: Experiment settings. I resize all images to 256 x 256 for all experiments. AdaIN is a one-shot feed-forward method, so it requires no optimization at inference time. DST and FastDST first need time to find the source and target points for the geometric transformation (105 s and 2 s, respectively). Both then run the same optimization, but the NBB (Neural Best-Buddies) matching used by DST is far more time-consuming (105 s) than the facial landmark detection used by FastDST (2 s). Inference times were measured on a Tesla K80 (Google Colab).
NST is the first deep-learning method for the style transfer problem: given a style image and a content image, it produces a stylized version of the content image.
AdaIN is a fast version of NST. Instead of directly optimizing the resulting image, AdaIN trains a feed-forward network during the training phase.
DST is the geometry-aware version of NST.
FastDST is the fast version of DST. More concretely, FastDST replaces the time-consuming NBB step with facial landmark detection using dlib.
AdaIN, DST, and FastDST all retain NST's ability to transfer arbitrary styles.
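To make the AdaIN step more concrete, here is a minimal NumPy sketch of the adaptive instance normalization operation at the core of AdaIN: content features are normalized per channel and then rescaled to the per-channel mean and standard deviation of the style features. The function name `adain`, the (C, H, W) layout, and the `eps` value are my own choices for illustration. The actual method applies this to VGG encoder features and decodes the result with a trained decoder, which is omitted here.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization on feature maps of shape (C, H, W).

    Illustrative sketch only: names, layout and eps are assumptions.
    """
    # Per-channel statistics over the spatial dimensions.
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True) + eps

    # Whiten the content statistics, then re-color with the style statistics.
    normalized = (content_feat - c_mean) / c_std
    return s_std * normalized + s_mean
```

Because only channel-wise statistics are swapped, this step changes texture and color but leaves the spatial layout of the content features untouched, which is one way to see why AdaIN is not geometry-aware.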
3. Finding 1:
DST-based methods (i.e., DST and FastDST) capture the geometric shape of the style image much better than texture-only NST methods (i.e., NST and AdaIN).
NST and AdaIN are driven by a texture-style loss and a content loss. DST and FastDST add a deformation loss on top of these two. Given a set of k source points and matching target points, the deformation loss optimizes the warp parameters so that the source points move onto the target points.
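For reference, the deformation loss can be written roughly as follows. The notation is my assumption, following what I recall from the DST paper: P = {p_1, ..., p_k} are the source points, P' = {p'_1, ..., p'_k} are the matching target points, and θ = {θ_1, ..., θ_k} are the per-point 2D displacements being optimized. The 1/k averaging in particular should be checked against the paper.

```latex
\mathcal{L}_{\text{warp}}(P, P', \theta)
  = \frac{1}{k} \sum_{i=1}^{k} \left\lVert p'_i - (p_i + \theta_i) \right\rVert_2
```

Minimizing this with respect to θ moves each source point toward its matching target point.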
The main difference between DST and FastDST is how each of them finds the source and target points (see 2. Experiment Setting).
4. Finding 2:
DST-based methods do not work well with large geometric deformations (e.g., some anime styles).
The DST outputs are distorted and do not capture the style of the style images well. I assume that DST-based methods only work when the facial landmarks of the two inputs are roughly aligned.
Moreover, FastDST does not work at all with the above style images: the facial landmark detector (dlib) cannot find a face in them.
The geometry gap between cartoon and real faces is very large. My hypothesis is therefore that a single exemplar style image is not enough to bridge this gap.
5. The face-of-art as the facial landmark detector in DST
In the-face-of-art paper, the authors introduce a novel method for detecting facial landmarks across multiple artwork styles. The geometry-aware result is then generated with thin-plate splines (TPS), a classic warping method, from the landmarks of the real face and the art face. Finally, NST transfers the texture onto the geometry-aware result.
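To illustrate the TPS step, below is a minimal NumPy sketch that fits a 2D thin-plate spline mapping source landmarks onto target landmarks and evaluates it at arbitrary points. This is a generic textbook TPS, not the code from the-face-of-art; the function names `fit_tps` and `apply_tps` are hypothetical, and the sketch assumes at least three distinct, non-collinear landmarks.

```python
import numpy as np

def _tps_kernel(r):
    # Radial basis U(r) = r^2 * log(r), with U(0) defined as 0.
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def fit_tps(src, dst):
    """Fit a 2D thin-plate spline so that each src landmark maps to its dst landmark."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Pairwise distances between control points, passed through the kernel.
    dists = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    K = _tps_kernel(dists)
    P = np.hstack([np.ones((n, 1)), src])  # affine part: [1, x, y]

    # Standard TPS linear system [[K, P], [P^T, 0]] [w; a] = [dst; 0].
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst

    params = np.linalg.solve(L, rhs)
    return src, params[:n], params[n:]  # control points, kernel weights, affine coefficients

def apply_tps(model, pts):
    """Map arbitrary 2D points through a fitted thin-plate spline."""
    src, w, a = model
    pts = np.asarray(pts, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - src[None, :, :], axis=-1)
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return _tps_kernel(dists) @ w + P @ a
```

In a warping pipeline, `fit_tps` would be called once with the two landmark sets and `apply_tps` would then be evaluated on a grid of pixel coordinates to resample the image. The spline interpolates the landmarks exactly, so when the two landmark sets differ a lot, the whole displacement has to be absorbed by the warp between them.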
In the DST paper, the authors introduce one-shot-FoA. First, the pre-trained face-of-art landmark detector is used to get the source and target points (instead of NBB in DST and dlib in FastDST). Then the TPS and NST steps are replaced by the optimization step of DST.
The image above (from the paper) shows that DST and one-shot-FoA produce quite similar results, but the geometric proportions of the one-shot-FoA results are closer to those of the style images.
However, my hypothesis is that, when applied to cartoon styles, the results of both the-face-of-art and one-shot-FoA suffer from geometric distortion, as DST does. The reason is that TPS is even more sensitive to large geometric deformations than the optimization strategy of DST.
- A clear challenge for DST-based methods is inference speed.
- All the above methods are exemplar-based NST. Picking a suitable style image is therefore difficult, and the methods do not work well with arbitrary reference style images. The works below propose some solutions.
- Specifically for face transfer, the-face-of-art paper also introduces the common-facial-landmark, built from a set of facial landmarks. It can represent a style more generally than the facial landmarks of a single exemplar style image.
- Works like ASMAGAN learn the style from a set of artworks. If applicable, such methods should embed facial landmark information at training time. There is a repo for cartoon facial landmarks.