
Fine-Grained Visual Classification using Self-Assessment Classifier (Part 2)

In the previous part, we discussed our proposal for dealing with fine-grained image classification. In this part, we verify the effectiveness and efficiency of that proposal.

1. Experimental Setup

Dataset. We evaluate our method on three popular fine-grained datasets: CUB-200-2011, Stanford Dogs, and FGVC-Aircraft (see Table 1).

| Dataset | Target | # Categories | # Train | # Test |
|---|---|---|---|---|
| CUB-200-2011 | Bird | 200 | 5,994 | 5,794 |
| Stanford Dogs | Dog | 120 | 12,000 | 8,580 |
| FGVC-Aircraft | Aircraft | 100 | 6,667 | 3,333 |

Table 1: Fine-grained classification datasets in our experiments.

Implementation. All experiments are conducted on an NVIDIA Titan V GPU with 12GB of RAM. The model is trained using Stochastic Gradient Descent with a momentum of 0.9. The maximum number of epochs is set to 80, the weight decay is 0.00001, and the mini-batch size is 12. The initial learning rate is set to 0.001, with an exponential decay of 0.9 applied every two epochs. Based on validation results, the number of top-k ambiguity classes is set to 10, while the parameters $d_{\phi}$ and $\alpha$ are set to $0.1$ and $0.5$, respectively.
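For reference, this training configuration can be sketched in PyTorch roughly as follows. The model, dataset, and input resolution are placeholders, and the original implementation may differ in detail.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; the actual SAC network and fine-grained dataset differ.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 200))
train_loader = DataLoader(
    TensorDataset(torch.randn(48, 3, 224, 224), torch.randint(0, 200, (48,))),
    batch_size=12,                                # mini-batch size 12
)

optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-5)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # decay factor 0.9
criterion = nn.CrossEntropyLoss()

for epoch in range(80):                           # maximum of 80 epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 2 == 0:                      # exponential decay every two epochs
        scheduler.step()
```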

Baseline. To validate the effectiveness and generalization of our method, we integrate it into 7 different deep networks, including two popular Deep CNN backbones, Inception-V3 and ResNet-50, and five fine-grained classification methods: WS, DT, WS_DAN, MMAL, and the recent transformer-based ViT. It is worth noting that we only add our Self Assessment Classifier to these works; all other setups and training hyper-parameters are kept unchanged from the original implementations.

2. Experimental Results

Table 2 summarizes the contribution of our Self Assessment Classifier (SAC) to the fine-grained classification results of different methods on the three datasets: CUB-200-2011, Stanford Dogs, and FGVC-Aircraft. The table clearly shows that integrating SAC into different classifiers consistently improves the fine-grained classification results. In particular, we observe average improvements of +1.3, +1.2, and +1.2 points on the CUB-200-2011, Stanford Dogs, and FGVC-Aircraft datasets, respectively.

| Methods | CUB-200-2011 | Stanford Dogs | FGVC-Aircraft |
|---|---|---|---|
| MAMC | 86.5 | 85.2 | _ |
| PC | 86.9 | 83.8 | 89.2 |
| MC | 87.3 | _ | 92.9 |
| DCL | 87.8 | _ | 93.0 |
| ACNet | 88.1 | _ | 92.4 |
| DF-GMM | 88.8 | _ | 93.8 |
| API-Net | 90.0 | 90.3 | 93.9 |
| GHORD | 89.6 | _ | 94.3 |
| CAL | 90.6 | _ | 94.2 |
| Parts Models | 90.4 | 93.9 | _ |
| ViT + DCAL | 91.4 | _ | 91.5 |
| P2P-Net | 90.2 | _ | 94.2 |
| Inception-V3 | 83.7 | 85.1 | 87.4 |
| Inception-V3 + SAC | 85.3 (+1.6) | 86.8 (+1.7) | 89.2 (+1.8) |
| ResNet-50 | 86.4 | 86.1 | 90.3 |
| ResNet-50 + SAC | 88.3 (+1.9) | 87.4 (+1.3) | 92.1 (+1.8) |
| WS | 88.8 | 91.4 | 92.3 |
| WS + SAC | 89.9 (+1.1) | 92.5 (+1.1) | 93.2 (+0.9) |
| DT | 89.2 | 88.0 | 90.7 |
| DT + SAC | 90.1 (+0.9) | 88.8 (+0.8) | 91.9 (+1.2) |
| MMAL | 89.6 | 90.6 | 94.7 |
| MMAL + SAC | 90.8 (+1.2) | 91.6 (+1.0) | 95.5 (+0.8) |
| WS_DAN | 89.4 | 92.2 | 93.0 |
| WS_DAN + SAC | 91.1 (+1.7) | 93.1 (+0.9) | 93.9 (+0.9) |
| ViT | 91.0 | 93.2 | 92.1 |
| ViT + SAC | 91.8 (+0.8) | 94.5 (+1.3) | 93.1 (+1.0) |
| Avg. Improvement | +1.3 | +1.2 | +1.2 |

Table 2: Contribution (% Acc) of our Self Assessment Classifier (SAC) to fine-grained classification results.

3. Qualitative Results

Attention Maps. Figure 1 visualizes the attention maps between the image feature maps and each ambiguity class. The visualization indicates that, with our Self Assessment Classifier, each fine-grained class focuses on different informative regions.

Figure 1. Visualization of the attention maps between image feature maps and different ambiguity classes from our method. A red-colored class label denotes that the prediction matches the ground truth.

Prediction Results. Figure 2 illustrates the classification results and corresponding localization areas of different methods. In all samples, our SAC attends to different areas depending on the hard-to-distinguish classes. The method can therefore focus on more meaningful areas while ignoring unnecessary ones, and achieves good predictions even in challenging cases.

Figure 2. Qualitative comparison of different classification methods. (a) Input image and its corresponding ground-truth label, (b) ResNet-50, (c) WS_DAN, (d) MMAL, and (e) our SAC. Boxes are localization areas. Red indicates a wrong classification result; blue indicates a correct prediction.

Conclusion

We introduce a Self Assessment Classifier (SAC) which effectively learns discriminative features in the image and resolves ambiguity among the top-k prediction classes. Our method generates an attention map and uses this map to dynamically erase unnecessary regions during training. Extensive experiments on the CUB-200-2011, Stanford Dogs, and FGVC-Aircraft datasets show that our proposed method can be easily integrated into different fine-grained classifiers and clearly improves their accuracy.

Fine-Grained Visual Classification using Self-Assessment Classifier (Part 1)

Extracting discriminative features plays a crucial role in the fine-grained visual classification task. Most existing methods focus on developing attention or augmentation mechanisms to achieve this goal. However, addressing the ambiguity in the top-k prediction classes has not been fully investigated. In this paper, we introduce a Self Assessment Classifier, which simultaneously leverages the representation of the image and the top-k prediction classes to reassess the classification results. Our method is inspired by self-supervised learning with coarse-grained and fine-grained classifiers to increase the discrimination of features in the backbone and produce attention maps of informative areas on the image. In practice, our method works as an auxiliary branch and can be easily integrated into different architectures. We show that by effectively addressing the ambiguity in the top-k prediction classes, our method achieves new state-of-the-art results on the CUB-200-2011, Stanford Dogs, and FGVC-Aircraft datasets. Furthermore, our method also consistently improves the accuracy of different existing fine-grained classifiers under a unified setup.

1. Introduction

The task of fine-grained visual classification involves categorizing images that belong to subcategories of the same meta-class (e.g., various species of birds, types of aircraft, or varieties of flowers). Compared to standard image classification tasks, fine-grained classification poses greater challenges due to three primary factors: (i) significant intra-class variation, where objects within the same category exhibit diverse poses and viewpoints; (ii) subtle inter-class distinctions, where objects from different categories may appear very similar except for minor differences, such as the color pattern of a bird's head often determining its fine-grained class; and (iii) limited training data, as annotating fine-grained categories typically demands specialized expertise and considerable annotation effort. Consequently, achieving accurate classification results solely with standard CNN models such as VGG is nontrivial.

Recent research demonstrates that a crucial strategy for fine-grained classification involves identifying informative regions across various parts of objects and extracting distinguishing features. A common approach is to learn the object's parts through human annotations. However, annotating fine-grained regions is labor-intensive, rendering this method impractical. Some advancements have explored unsupervised or weakly-supervised learning techniques to identify informative object parts or region-of-interest bounding boxes. While these methods offer potential ways to circumvent manual labeling of fine-grained regions, they come with limitations such as reduced accuracy, high computational costs during training or inference, and challenges in accurately detecting distinct bounding boxes.

In this paper, we introduce the Self Assessment Classifier (SAC) method to tackle the inherent ambiguity present in fine-grained classification tasks. Essentially, our approach is devised to reevaluate the top-k prediction outcomes and filter out uninformative regions within the input image. This serves to mitigate inter-class ambiguity and enables the backbone network to learn more discerning features. Throughout training, our method generates attention maps that highlight informative regions within the input image. By integrating this method into a backbone network, we aim to reduce misclassifications among the top-k ambiguous classes. It is important to note that "ambiguity classes" refer to instances where uncertainty in the prediction can lead to incorrect classifications. Our contributions can be succinctly outlined as follows:

  • We propose a novel self-class assessment method that simultaneously learns discriminative features and addresses ambiguity issues in fine-grained visual classification tasks.
  • We demonstrate the versatility of our method by showcasing its seamless integration into various fine-grained classifiers, resulting in improved state-of-the-art performance.

Figure 1. Comparison between generic classification and fine-grained classification.

2. Method Overview

We propose two main steps in our method: Top-k Coarse-grained Class Search (TCCS) and Self Assessment Classifier (SAC). TCCS works as a coarse-grained classifier to extract visual features from the backbone. The Self Assessment Classifier works as a fine-grained classifier to reassess the ambiguity classes and eliminate the non-informative regions. Our SAC has four modules: the Top-k Class Embedding module encodes the information of the ambiguity classes; the Joint Embedding module jointly learns the coarse-grained features and top-k ambiguity classes; the Self Assessment module is designed to differentiate between ambiguity classes; and finally, the Dropping module is a data augmentation method designed to erase unnecessary inter-class similar regions from the input image. Figure 2 shows an overview of our approach.

Figure 2. Method Overview.

3. Top-k Coarse-grained Class Search

The TCCS takes an image as input. Each input image is passed through a Deep CNN to extract the feature map $\textbf{\textit{F}} \in \mathbb{R}^{d_f \times m \times n}$ and the visual feature $\textbf{\textit{V}} \in \mathbb{R}^{d_v}$, where $m$, $n$, and $d_f$ represent the feature map height, width, and number of channels, respectively, and $d_v$ denotes the dimension of the visual feature $\textbf{\textit{V}}$. In practice, the visual feature $\textbf{\textit{V}}$ is usually obtained by applying some fully connected layers after the convolutional feature map $\textbf{\textit{F}}$.

The visual feature $\textbf{\textit{V}}$ is used by the $1^{st}$ classifier, i.e., the original classifier of the backbone, to obtain the top-k prediction results. Assuming that the fine-grained dataset has $N$ classes, the top-k prediction result $C_k = \{C_1, ..., C_k\}$ is a subset of all prediction classes $C_N$, where $k$ is the number of candidates with the $k$-highest confidence scores.
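To make this concrete, below is a minimal sketch of the coarse-grained step with a ResNet-50 backbone (one of the backbones used in our experiments). The input resolution, pooling, and classifier head are illustrative assumptions rather than the exact published configuration.

```python
import torch
from torch import nn
import torchvision

# Sketch of TCCS: extract F and V from a ResNet-50 backbone, then take the top-k classes.
backbone = torchvision.models.resnet50(weights=None)            # pretrained weights omitted here
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])
classifier = nn.Linear(2048, 200)                                # 1st (coarse-grained) classifier, N = 200

images = torch.randn(12, 3, 448, 448)                            # a mini-batch of images (resolution assumed)
F = feature_extractor(images)                                    # feature map F: (batch, d_f, m, n)
V = torch.flatten(nn.functional.adaptive_avg_pool2d(F, 1), 1)    # visual feature V: (batch, d_v)

logits = classifier(V)
topk_scores, C_k = logits.topk(k=10, dim=1)                      # top-k ambiguity classes C_k
```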

4. Self Assessment Classifier

Our Self Assessment Classifier takes the image feature $\textbf{\textit{F}}$ and the top-k predictions $C_k$ from TCCS as input to reassess the fine-grained classification results.

Top-k Class Embedding

The output of the TCCS module, $C_k$, is passed through the top-k class embedding module to output the label embedding set $\textbf{E}_k = \{E_1, ..., E_i, ..., E_k\}$, $i \in \{1, 2, ..., k\}$, $E_i \in \mathbb{R}^{d_e}$. This module contains a word embedding layer (Pennington et al., 2014) for encoding each word in the class labels and a GRU (Cho et al., 2014) layer for learning the temporal information in the class label names. $d_e$ represents the dimension of each class label embedding. It is worth noting that the embedding module is trained end-to-end with the whole model. Hence, the class label representations are learned from scratch without the need for any pre-extracted/pre-trained features or transfer learning.

Given an input class label, we trim the input to a maximum of $4$ words. A class label shorter than $4$ words is zero-padded. Each word is then represented by a $300$-D word embedding. This step results in a sequence of word embeddings with a size of $4 \times 300$, denoted as $\hat{E}_i$ for the $i$-th class label in the $C_k$ class label set. In order to capture the dependencies within the class label name, $\hat{E}_i$ is passed through a Gated Recurrent Unit (GRU), which results in a $1024$-D vector representation $E_i$ for each input class. Note that, although we use the language modality (i.e., the class label name), this is not extra information, as the class label name and the class label identity (used for calculating the loss) represent the same object category.
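A minimal PyTorch sketch of such an embedding module might look as follows; the vocabulary size and the tokenization of the label names are placeholder assumptions.

```python
import torch
from torch import nn

class TopkClassEmbedding(nn.Module):
    """Sketch of the top-k class embedding module: word embedding + GRU.

    vocab_size is a placeholder; in practice it covers the words that appear
    in the class label names of the dataset.
    """
    def __init__(self, vocab_size=1000, word_dim=300, label_dim=1024):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.gru = nn.GRU(word_dim, label_dim, batch_first=True)

    def forward(self, label_tokens):
        # label_tokens: (batch, k, 4) integer word indices, zero-padded to 4 words
        b, k, w = label_tokens.shape
        words = self.word_embedding(label_tokens.view(b * k, w))  # (b*k, 4, 300)
        _, hidden = self.gru(words)                               # hidden: (1, b*k, 1024)
        E_k = hidden.squeeze(0).view(b, k, -1)                    # (batch, k, d_e = 1024)
        return E_k

# Example: embeddings for the top-10 ambiguity classes of a batch of 12 images.
tokens = torch.randint(0, 1000, (12, 10, 4))
E_k = TopkClassEmbedding()(tokens)   # (12, 10, 1024)
```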

Joint Embedding

This module takes the feature map $\textbf{\textit{F}}$ and the top-k class embedding $\textbf{E}_k$ as input to produce the joint representation $\textbf{\textit{J}} \in \mathbb{R}^{d_j}$ and the attention map. We first flatten $\textbf{\textit{F}}$ into $(d_f \times f)$ and $\textbf{E}_k$ into $(d_e \times k)$. The joint representation $\textbf{\textit{J}}$ is calculated from the two modalities $\textbf{\textit{F}}$ and $\textbf{E}_k$ as follows:

$$\textbf{\textit{J}}^T = \left(\mathcal{T} \times_1 \text{vec}(\textbf{\textit{F}}) \right) \times_2 \text{vec}(\textbf{E}_k)$$

where $\mathcal{T} \in \mathbb{R}^{d_{\textbf{\textit{F}}} \times d_{\textbf{E}_k} \times d_j}$ is a learnable tensor; $d_{\textbf{\textit{F}}} = (d_f \times f)$; $d_{\textbf{E}_k} = (d_e \times k)$; $\text{vec}()$ is the vectorization operator; and $\times_i$ denotes the $i$-mode tensor product.

In practice, the tensor $\mathcal{T}$ above is too large and infeasible to learn. Thus, we apply a decomposition that reduces the size of $\mathcal{T}$ while retaining the learning effectiveness. We rely on the idea of the unitary attention mechanism. Specifically, let $\textbf{\textit{J}}_p \in \mathbb{R}^{d_j}$ be the joint representation of the $p^{th}$ couple of channels, where each channel in the couple comes from a different input. The joint representation $\textbf{\textit{J}}$ is approximated by using the joint representations of all couples instead of the fully parameterized interaction in the equation above. Hence, we compute $\textbf{\textit{J}}$ as:

$$\textbf{\textit{J}} = \sum_p \mathcal{M}_p \textbf{\textit{J}}_p$$

Note that in the equation above, we compute a weighted sum over all possible couples. The $p^{th}$ couple is associated with a scalar weight $\mathcal{M}_p$. The set of $\mathcal{M}_p$ forms the attention map $\mathcal{M}$, where $\mathcal{M} \in \mathbb{R}^{f \times k}$.

There are $f \times k$ possible couples over the two modalities. The representations of the channels in a couple are $\textbf{\textit{F}}_{i}$ and $\left(\textbf{E}_k\right)_{j}$, where $i \in [1,f]$ and $j \in [1,k]$, respectively. The joint representation $\textbf{\textit{J}}_p$ is then computed as follows:

$$\textbf{\textit{J}}_p^T = \left(\mathcal{T}_{u} \times_1 \textbf{\textit{F}}_{i} \right) \times_2 \left(\textbf{E}_k\right)_{j}$$

where $\mathcal{T}_{u} \in \mathbb{R}^{d_f \times d_e \times d_j}$ is the learnable tensor between the channels in a couple.

From the equation above, we can compute the attention map $\mathcal{M}$ using a reduced parameterized bilinear interaction over the inputs $\textbf{\textit{F}}$ and $\textbf{E}_k$. The attention map is computed as:

$$\mathcal{M} = \text{softmax}\left(\left(\mathcal{T}_\mathcal{M} \times_1 \textbf{\textit{F}} \right) \times_2 \textbf{E}_k \right)$$

where $\mathcal{T}_\mathcal{M} \in \mathbb{R}^{d_f \times d_e}$ is a learnable tensor.

The joint representation $\textbf{\textit{J}}$ can then be rewritten as:

$$\textbf{\textit{J}}^T = \sum_{i=1}^{f}\sum_{j=1}^{k} \mathcal{M}_{ij} \left( \left( \mathcal{T}_{u} \times_1 \textbf{\textit{F}}_{i}\right) \times_2 \left(\textbf{E}_k\right)_{j} \right)$$

It is also worth noting from the equation above that to compute $\textbf{\textit{J}}$, instead of learning the large tensor $\mathcal{T} \in \mathbb{R}^{d_{\textbf{\textit{F}}} \times d_{\textbf{E}_k} \times d_j}$, we now only need to learn two smaller tensors: $\mathcal{T}_{u} \in \mathbb{R}^{d_f \times d_e \times d_j}$ and $\mathcal{T}_\mathcal{M} \in \mathbb{R}^{d_f \times d_e}$.
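A rough PyTorch sketch of this reduced bilinear interaction with unitary attention is given below. The dimensions are deliberately small for illustration, and the published implementation may further factorize $\mathcal{T}_u$; treat this as a sketch of the equations rather than the exact code.

```python
import torch
from torch import nn

class JointEmbedding(nn.Module):
    """Sketch of the unitary-attention joint embedding (small, illustrative dimensions)."""
    def __init__(self, d_f=256, d_e=256, d_j=256):
        super().__init__()
        self.T_u = nn.Parameter(torch.randn(d_f, d_e, d_j) * 0.01)  # couple tensor T_u
        self.T_M = nn.Parameter(torch.randn(d_f, d_e) * 0.01)       # attention tensor T_M

    def forward(self, F_flat, E_k):
        # F_flat: (batch, f, d_f) flattened feature-map locations
        # E_k:    (batch, k, d_e) top-k class embeddings
        b, f, _ = F_flat.shape
        k = E_k.shape[1]
        # Attention map M over all f x k couples (softmax over the couples).
        scores = torch.einsum('bfd,de,bke->bfk', F_flat, self.T_M, E_k)
        M = torch.softmax(scores.view(b, -1), dim=1).view(b, f, k)
        # Weighted sum of the per-couple joint representations J_p.
        J = torch.einsum('bfd,dej,bke,bfk->bj', F_flat, self.T_u, E_k, M)
        return J, M

# Example: a 14x14 feature map (f = 196) and k = 10 ambiguity classes.
J, M = JointEmbedding()(torch.randn(2, 196, 256), torch.randn(2, 10, 256))
print(J.shape, M.shape)   # torch.Size([2, 256]) torch.Size([2, 196, 10])
```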

Self Assessment

The joint representation $\textbf{\textit{J}}$ from the Joint Embedding module is used as the input to the Self Assessment step to obtain the $2^{nd}$ top-k predictions $\textbf{C}'_k$, where $\textbf{C}'_k = \{C'_1, ..., C'_k\}$. Intuitively, $\textbf{C}'_k$ is the top-k classification result after self-assessment. This module is a fine-grained classifier that produces the $2^{nd}$ predictions to reassess the ambiguous classification results.

The contributions of the coarse-grained and fine-grained classifiers are combined as:

$$\text{Pr}(\rho = \rho_i) = \alpha \text{Pr}_1(\rho = \rho_i) + (1 - \alpha) \text{Pr}_2(\rho = \rho_i)$$

where $\alpha$ is a trade-off hyper-parameter $\left(0 \leq \alpha \leq 1\right)$, and $\text{Pr}_1(\rho = \rho_i)$ and $\text{Pr}_2(\rho = \rho_i)$ denote the prediction probabilities for class $\rho_i$ from the coarse-grained and fine-grained classifiers, respectively.
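As a small illustration, the combination above can be written in PyTorch as below, using $\alpha = 0.5$ as reported in the experimental setup of Part 2; the logits are placeholders.

```python
import torch
import torch.nn.functional as F

def combine_predictions(logits_coarse, logits_fine, alpha=0.5):
    """Blend coarse- and fine-grained predictions as in the equation above.

    alpha = 0.5 follows the experimental setup reported in Part 2;
    both logits tensors have shape (batch, num_classes).
    """
    pr1 = F.softmax(logits_coarse, dim=1)   # Pr_1 from the coarse-grained classifier
    pr2 = F.softmax(logits_fine, dim=1)     # Pr_2 from the fine-grained (self-assessment) classifier
    return alpha * pr1 + (1 - alpha) * pr2

# Example: final prediction for a batch of 12 images over 200 classes.
final_prob = combine_predictions(torch.randn(12, 200), torch.randn(12, 200))
predicted_class = final_prob.argmax(dim=1)
```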

Next

In the next post, we will verify the effectiveness and efficiency of the method.