Fine-Grained Visual Classification using Self-Assessment Classifier (Part 2)
In the previous part, we discussed our proposed approach to fine-grained image classification. In this part, we verify the effectiveness and efficiency of the proposal through experiments.
1. Experimental Setup
Datasets. We evaluate our method on three popular fine-grained datasets: CUB-200-2011, Stanford Dogs, and FGVC Aircraft (see Table 1).
Dataset | Target | # Categories | # Train | # Test |
---|---|---|---|---|
CUB-200-2011 | Bird | 200 | 5,994 | 5,794 |
Stanford Dogs | Dog | 120 | 12,000 | 8,580 |
FGVC-Aircraft | Aircraft | 100 | 6,667 | 3,333 |
Table 1: Fine-grained classification datasets in our experiments.
Implementation. All experiments are conducted on an NVIDIA Titan V GPU with 12GB RAM. The model is trained using Stochastic Gradient Descent (SGD) with a momentum of 0.9. The maximum number of epochs is set to 80, the weight decay to 0.00001, and the mini-batch size to 12. The initial learning rate is set to 0.001, with an exponential decay of 0.9 applied after every two epochs. Based on validation results, the number of top-k ambiguity classes is set to 10; the remaining parameters of our method are also selected via validation.
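For reference, this training schedule can be written in a few lines of PyTorch. The snippet below is only a minimal sketch of the optimizer and learning-rate settings described above; the model and data loader are random placeholders, not our actual SAC pipeline.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder network and data; the real setup uses a fine-grained backbone plus SAC.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 200))
dataset = TensorDataset(torch.randn(48, 3, 224, 224), torch.randint(0, 200, (48,)))
loader = DataLoader(dataset, batch_size=12, shuffle=True)     # mini-batch size of 12

optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-5)
# "Exponential decay of 0.9 after every two epochs" maps to StepLR(step_size=2, gamma=0.9).
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)

for epoch in range(80):                                       # maximum number of epochs
    for images, labels in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```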
Baselines. To validate the effectiveness and generalization of our method, we integrate it into seven different deep networks: two popular deep CNN backbones, Inception-V3 and ResNet-50, and five fine-grained classification methods, WS, DT, WS_DAN, MMAL, and the recent transformer-based ViT. It is worth noting that we only add our Self Assessment Classifier to these works; all other training setups and hyper-parameters are kept unchanged from the original implementations.
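Conceptually, the integration is a drop-in change: the backbone is left untouched and SAC only re-assesses its top-k ambiguity classes. The sketch below illustrates this wiring with a hypothetical SelfAssessmentClassifier module on top of a ResNet-50; the re-scoring inside the module is a simplified stand-in for the attention-based mechanism described in Part 1, not the released implementation.

```python
import torch
from torch import nn
from torchvision import models

class SelfAssessmentClassifier(nn.Module):
    """Hypothetical SAC head that re-assesses the backbone's top-k ambiguity classes."""
    def __init__(self, feature_dim: int, num_classes: int, top_k: int = 10):
        super().__init__()
        self.top_k = top_k
        # Simplified stand-in for the attention-based re-scoring of Part 1.
        self.rescore = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        topk_idx = logits.topk(self.top_k, dim=1).indices            # ambiguity classes
        refined = self.rescore(features)                              # re-assessed scores
        # Only the top-k ambiguity classes are adjusted; all others keep backbone scores.
        adjustment = torch.zeros_like(logits).scatter(
            1, topk_idx, refined.gather(1, topk_idx))
        return logits + adjustment

backbone = models.resnet50(weights=None)
feature_dim, num_classes = backbone.fc.in_features, 200
backbone.fc = nn.Linear(feature_dim, num_classes)
sac = SelfAssessmentClassifier(feature_dim, num_classes, top_k=10)

x = torch.randn(2, 3, 224, 224)
conv_features = nn.Sequential(*list(backbone.children())[:-2])(x)   # (B, 2048, 7, 7)
features = torch.flatten(backbone.avgpool(conv_features), 1)         # (B, 2048)
logits = backbone.fc(features)
refined_logits = sac(features, logits)   # final prediction uses the refined scores
```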
2. Experimental Results
Table 2 summarises the contribution of our Self Assessment Classifier (SAC) to the fine-grained classification results of different methods on the CUB-200-2011, Stanford Dogs, and FGVC Aircraft datasets. The table clearly shows that integrating SAC into different classifiers consistently improves the fine-grained classification results. In particular, we observe an average improvement of +1.3%, +1.2%, and +1.2% on the CUB-200-2011, Stanford Dogs, and FGVC Aircraft datasets, respectively.
Methods | CUB-200-2011 | Stanford Dogs | FGVC Aircraft |
---|---|---|---|
MAMC | 86.5 | 85.2 | _ |
PC | 86.9 | 83.8 | 89.2 |
MC | 87.3 | _ | 92.9 |
DCL | 87.8 | _ | 93.0 |
ACNet | 88.1 | _ | 92.4 |
DF-GMM | 88.8 | _ | 93.8 |
API-Net | 90.0 | 90.3 | 93.9 |
GHORD | 89.6 | _ | 94.3 |
CAL | 90.6 | _ | 94.2 |
Parts Models | 90.4 | 93.9 | _ |
ViT + DCAL | 91.4 | _ | 91.5 |
P2P-Net | 90.2 | _ | 94.2 |
Inception-V3 | 83.7 | 85.1 | 87.4 |
Inception-V3+SAC | 85.3 (+1.6) | 86.8 (+1.7) | 89.2 (+1.8) |
ResNet-50 | 86.4 | 86.1 | 90.3 |
ResNet-50+SAC | 88.3 (+1.9) | 87.4 (+1.3) | 92.1 (+1.8) |
WS | 88.8 | 91.4 | 92.3 |
WS+SAC | 89.9 (+1.1) | 92.5 (+1.1) | 93.2 (+0.9) |
DT | 89.2 | 88.0 | 90.7 |
DT+SAC | 90.1 (+0.9) | 88.8 (+0.8) | 91.9 (+1.2) |
MMAL | 89.6 | 90.6 | 94.7 |
MMAL+SAC | 90.8 (+1.2) | 91.6 (+1.0) | 95.5 (+0.8) |
WS_DAN | 89.4 | 92.2 | 93.0 |
WS_DAN+SAC | 91.1 (+1.7) | 93.1 (+0.9) | 93.9 (+0.9) |
ViT | 91.0 | 93.2 | 92.1 |
ViT+SAC | 91.8 (+0.8) | 94.5 (+1.3) | 93.1 (+1.0) |
Avg. Improvement | +1.3 | +1.2 | +1.2 |
Table 2: Contribution (% accuracy) of our Self Assessment Classifier (SAC) to fine-grained classification results.
3. Qualitative Results
Attention Maps. Figure 1 visualizes the attention maps between the image feature maps and each ambiguity class. The visualization indicates that, with our Self Assessment Classifier, each fine-grained class focuses on different informative regions.
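To illustrate how such maps can be rendered, the snippet below upsamples a per-class attention map to the image resolution and overlays it as a heatmap. Both the image and the attention map here are random placeholders; in practice the attention would come from the interaction between the feature map and each ambiguity class.

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_attention(image: np.ndarray, attention: np.ndarray, alpha: float = 0.5) -> None:
    """Overlay a low-resolution attention map (h x w) on an H x W x 3 image."""
    h, w = image.shape[:2]
    # Nearest-neighbour upsampling keeps the example dependency-free.
    rows = np.repeat(np.arange(attention.shape[0]), h // attention.shape[0])[:h]
    cols = np.repeat(np.arange(attention.shape[1]), w // attention.shape[1])[:w]
    upsampled = attention[np.ix_(rows, cols)]
    plt.imshow(image)
    plt.imshow(upsampled, cmap="jet", alpha=alpha)   # heatmap over the image
    plt.axis("off")
    plt.show()

# Random placeholders: a 448x448 image and a 14x14 attention map for one ambiguity class.
overlay_attention(np.random.rand(448, 448, 3), np.random.rand(14, 14))
```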
Prediction Results. Figure 2 illustrates the classification results and the corresponding localization areas of different methods. Across all samples, our SAC attends to different regions depending on the hard-to-distinguish classes, focusing on more meaningful areas while ignoring unnecessary ones. As a result, SAC achieves good predictions even in challenging cases.
Conclusion
We introduce a Self Assessment Classifier (SAC) that effectively learns discriminative features in the image and resolves the ambiguity among the top-k predicted classes. Our method generates an attention map and uses this map to dynamically erase unnecessary regions during training. Extensive experiments on the CUB-200-2011, Stanford Dogs, and FGVC Aircraft datasets show that our proposed method can be easily integrated into different fine-grained classifiers and clearly improves their accuracy.
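For intuition, attention-guided erasing can be sketched as masking out the image regions whose attention falls below a per-image threshold before the next forward pass. The quantile-based rule and the keep_ratio parameter below are illustrative assumptions, not the exact scheme used in our training.

```python
import torch

def erase_unnecessary_regions(images: torch.Tensor,
                              attention: torch.Tensor,
                              keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out regions whose attention falls below a per-image quantile threshold.

    images:    (B, 3, H, W) input batch.
    attention: (B, H, W) attention maps, assumed already upsampled to image size.
    """
    # Per-image threshold so that roughly `keep_ratio` of the pixels are kept.
    thresholds = torch.quantile(attention.flatten(1), 1 - keep_ratio, dim=1)
    mask = (attention >= thresholds.view(-1, 1, 1)).float()
    return images * mask.unsqueeze(1)   # broadcast the mask over the RGB channels

masked = erase_unnecessary_regions(torch.randn(4, 3, 448, 448),
                                   torch.rand(4, 448, 448))
```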