
Fine-Grained Visual Classification using Self-Assessment Classifier (Part 2)

Effectiveness and efficiency of Self-Assessment Classifier

In the previous part, we discussed our proposal for fine-grained image classification. In this part, we verify its effectiveness and efficiency.

1. Experimental Setup

Dataset. We evaluate our method on three popular fine-grained datasets: CUB-200-2011, Stanford Dogs, and FGVC Aircraft (see Table 1).

| Dataset | Target | # Categories | # Train | # Test |
|---|---|---|---|---|
| CUB-200-2011 | Bird | 200 | 5,994 | 5,794 |
| Stanford Dogs | Dog | 120 | 12,000 | 8,580 |
| FGVC-Aircraft | Aircraft | 100 | 6,667 | 3,333 |

Table 1: Fine-grained classification datasets in our experiments.

Implementation. All experiments are conducted on an NVIDIA Titan V GPU with 12 GB of memory. The model is trained using Stochastic Gradient Descent with a momentum of 0.9. The maximum number of epochs is 80, the weight decay is 0.00001, and the mini-batch size is 12. The initial learning rate is 0.001, with an exponential decay of 0.9 after every two epochs. Based on validation results, the number of top-k ambiguity classes is set to 10, while the parameters $d_{\phi}$ and $\alpha$ are set to 0.1 and 0.5, respectively.
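To make the learning-rate schedule concrete, here is a minimal sketch in plain Python. It assumes the decay is applied as a staircase (the rate is multiplied by 0.9 once every two epochs, with epochs counted from 0), which is one reasonable reading of the description above. The function name `learning_rate` and the `config` keys are illustrative and are not taken from the original code.

```python
# Hyper-parameters as reported in the text; the key names are illustrative.
config = {
    "momentum": 0.9,
    "max_epochs": 80,
    "weight_decay": 1e-5,
    "batch_size": 12,
    "base_lr": 1e-3,
    "lr_decay": 0.9,
    "lr_decay_every": 2,  # epochs between two decay steps
    "top_k": 10,          # number of ambiguity classes
}


def learning_rate(epoch, base_lr=1e-3, decay=0.9, step=2):
    """Staircase exponential decay (assumed): multiply by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Learning rate used at each epoch, from epoch 0 to epoch 79.
schedule = [
    learning_rate(
        epoch,
        base_lr=config["base_lr"],
        decay=config["lr_decay"],
        step=config["lr_decay_every"],
    )
    for epoch in range(config["max_epochs"])
]
```

Under this assumption, epochs 0 and 1 train at 0.001, epochs 2 and 3 at 0.0009, and the rate falls to roughly 1.6e-5 by the last epoch.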

Baseline. To validate the effectiveness and generalization of our method, we integrate it into seven different deep networks: two popular deep CNN backbones, Inception-V3 and ResNet-50, and five fine-grained classification methods, WS, DT, WS_DAN, MMAL, and the recent transformer-based ViT. We only add our Self Assessment Classifier to these networks; all other setups and training hyper-parameters are kept unchanged from the original code.

2. Experimental Results

Table 2 summarises the contribution of our Self Assessment Classifier (SAC) to the fine-grained classification results of different methods on the three datasets: CUB-200-2011, Stanford Dogs, and FGVC Aircraft. Integrating SAC consistently improves the accuracy of every classifier we tested. On average, the improvement is +1.3, +1.2, and +1.2 percentage points on CUB-200-2011, Stanford Dogs, and FGVC Aircraft, respectively.

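| Methods | CUB-200-2011 | Stanford Dogs | FGVC Aircraft |
|---|---|---|---|
| MAMC | 86.5 | 85.2 | - |
| PC | 86.9 | 83.8 | 89.2 |
| MC | 87.3 | - | 92.9 |
| DCL | 87.8 | - | 93.0 |
| ACNet | 88.1 | - | 92.4 |
| DF-GMM | 88.8 | - | 93.8 |
| API-Net | 90.0 | 90.3 | 93.9 |
| GHORD | 89.6 | - | 94.3 |
| CAL | 90.6 | - | 94.2 |
| Parts Models | 90.4 | 93.9 | - |
| ViT + DCAL | 91.4 | - | 91.5 |
| P2P-Net | 90.2 | - | 94.2 |
| Inception-V3 | 83.7 | 85.1 | 87.4 |
| Inception-V3 + SAC | 85.3 (+1.6) | 86.8 (+1.7) | 89.2 (+1.8) |
| ResNet-50 | 86.4 | 86.1 | 90.3 |
| ResNet-50 + SAC | 88.3 (+1.9) | 87.4 (+1.3) | 92.1 (+1.8) |
| WS | 88.8 | 91.4 | 92.3 |
| WS + SAC | 89.9 (+1.1) | 92.5 (+1.1) | 93.2 (+0.9) |
| DT | 89.2 | 88.0 | 90.7 |
| DT + SAC | 90.1 (+0.9) | 88.8 (+0.8) | 91.9 (+1.2) |
| MMAL | 89.6 | 90.6 | 94.7 |
| MMAL + SAC | 90.8 (+1.2) | 91.6 (+1.0) | 95.5 (+0.8) |
| WS_DAN | 89.4 | 92.2 | 93.0 |
| WS_DAN + SAC | 91.1 (+1.7) | 93.1 (+0.9) | 93.9 (+0.9) |
| ViT | 91.0 | 93.2 | 92.1 |
| ViT + SAC | 91.8 (+0.8) | 94.5 (+1.3) | 93.1 (+1.0) |
| Avg. Improvement | +1.3 | +1.2 | +1.2 |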

Table 2: Contribution (% Acc) of our Self Assessment Classifier (SAC) on fine-grained classification results. "-" means the result was not reported.
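As a quick sanity check, the average improvements can be recomputed from the accuracies in the table. The sketch below copies the seven baseline and SAC rows by hand and averages the per-dataset gains. The variable and function names are illustrative, and rounding to one decimal place is an assumption about how the reported averages were produced.

```python
# Accuracy (%) per method, in the order (CUB-200-2011, Stanford Dogs, FGVC Aircraft).
baseline = {
    "Inception-V3": (83.7, 85.1, 87.4),
    "ResNet-50": (86.4, 86.1, 90.3),
    "WS": (88.8, 91.4, 92.3),
    "DT": (89.2, 88.0, 90.7),
    "MMAL": (89.6, 90.6, 94.7),
    "WS_DAN": (89.4, 92.2, 93.0),
    "ViT": (91.0, 93.2, 92.1),
}
with_sac = {
    "Inception-V3": (85.3, 86.8, 89.2),
    "ResNet-50": (88.3, 87.4, 92.1),
    "WS": (89.9, 92.5, 93.2),
    "DT": (90.1, 88.8, 91.9),
    "MMAL": (90.8, 91.6, 95.5),
    "WS_DAN": (91.1, 93.1, 93.9),
    "ViT": (91.8, 94.5, 93.1),
}
datasets = ("CUB-200-2011", "Stanford Dogs", "FGVC Aircraft")


def average_improvement(before, after):
    """Mean accuracy gain per dataset across all methods, rounded to one decimal."""
    gains = []
    for column in range(len(datasets)):
        deltas = [after[name][column] - before[name][column] for name in before]
        gains.append(round(sum(deltas) / len(deltas), 1))
    return gains


avg = average_improvement(baseline, with_sac)
for dataset, gain in zip(datasets, avg):
    print(f"{dataset}: +{gain}")
```

The recomputed values match the last row of the table: +1.3, +1.2, and +1.2.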

3. Qualitative Results

Attention Maps. Figure 1 visualizes the attention maps between the image feature maps and each ambiguity class. With our Self Assessment Classifier, each fine-grained class attends to a different informative region.

Figure 1. The visualization of the attention map between image feature maps and different ambiguity classes from our method. The red-colored class label denotes that the prediction is matched with the ground-truth.

Prediction Results. Figure 2 shows the classification results and the corresponding localization areas of different methods. In all samples, our SAC focuses on different areas depending on which hard-to-distinguish classes are involved. The method can therefore attend to more meaningful areas and ignore unnecessary ones, which lets SAC make good predictions even in challenging cases.

Figure 2. Qualitative comparison of different classification methods. (a) Input image and its corresponding ground-truth label, (b) ResNet-50, (c) WS_DAN, (d) MMAL, and (e) our SAC. Boxes are localization areas. Red indicates a wrong classification result; blue indicates a correctly predicted label.

Conclusion

We introduce a Self Assessment Classifier (SAC) that effectively learns discriminative features in the image and resolves the ambiguity among the top-k predicted classes. Our method generates an attention map and uses it to dynamically erase unnecessary regions during training. Extensive experiments on the CUB-200-2011, Stanford Dogs, and FGVC Aircraft datasets show that our method can be easily integrated into different fine-grained classifiers and consistently improves their accuracy.
