
Fine-Grained Visual Classification using Self-Assessment Classifier (Part 2)

In the previous part, we discussed our proposal for dealing with fine-grained image classification. In this part, we verify the effectiveness and efficiency of that proposal.

1. Experimental Setup

Dataset. We evaluate our method on three popular fine-grained datasets: CUB-200-2011, Stanford Dogs, and FGVC-Aircraft (see Table 1).

| Dataset | Target | # Categories | # Train | # Test |
|---|---|---|---|---|
| CUB-200-2011 | Bird | 200 | 5,994 | 5,794 |
| Stanford Dogs | Dog | 120 | 12,000 | 8,580 |
| FGVC-Aircraft | Aircraft | 100 | 6,667 | 3,333 |

Table 1: Fine-grained classification datasets in our experiments.

Implementation. All experiments are conducted on an NVIDIA Titan V GPU with 12GB of RAM. The model is trained using Stochastic Gradient Descent with a momentum of 0.9. The maximum number of epochs is set to 80, the weight decay is 0.00001, and the mini-batch size is 12. The initial learning rate is set to 0.001, with an exponential decay of 0.9 applied every two epochs. Based on validation results, the number of top-k ambiguity classes is set to 10, while the parameters $d_{\phi}$ and $\alpha$ are set to $0.1$ and $0.5$, respectively.
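For reference, this training configuration can be sketched in PyTorch roughly as follows. The model, dataset, and input resolution are placeholders, and the original implementation may differ in detail.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; the actual SAC network and fine-grained dataset differ.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 200))
train_loader = DataLoader(
    TensorDataset(torch.randn(48, 3, 224, 224), torch.randint(0, 200, (48,))),
    batch_size=12,                                # mini-batch size 12
)

optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-5)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # decay factor 0.9
criterion = nn.CrossEntropyLoss()

for epoch in range(80):                           # maximum of 80 epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 2 == 0:                      # exponential decay every two epochs
        scheduler.step()
```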

Baseline. To validate the effectiveness and generalization of our method, we integrate it into 7 different deep networks, including two popular Deep CNN backbones, Inception-V3 and ResNet-50, and five fine-grained classification methods: WS, DT, WS_DAN, MMAL, and the recent transformer-based ViT. It is worth noting that we only add our Self Assessment Classifier to these works; all other setups and training hyper-parameters are kept unchanged from the original implementations.

2. Experimental Results

Table 2 summarizes the contribution of our Self Assessment Classifier (SAC) to the fine-grained classification results of different methods on the three datasets: CUB-200-2011, Stanford Dogs, and FGVC-Aircraft. The table clearly shows that integrating SAC into different classifiers consistently improves the fine-grained classification results. In particular, we observe average improvements of +1.3, +1.2, and +1.2 points on the CUB-200-2011, Stanford Dogs, and FGVC-Aircraft datasets, respectively.

| Methods | CUB-200-2011 | Stanford Dogs | FGVC-Aircraft |
|---|---|---|---|
| MAMC | 86.5 | 85.2 | _ |
| PC | 86.9 | 83.8 | 89.2 |
| MC | 87.3 | _ | 92.9 |
| DCL | 87.8 | _ | 93.0 |
| ACNet | 88.1 | _ | 92.4 |
| DF-GMM | 88.8 | _ | 93.8 |
| API-Net | 90.0 | 90.3 | 93.9 |
| GHORD | 89.6 | _ | 94.3 |
| CAL | 90.6 | _ | 94.2 |
| Parts Models | 90.4 | 93.9 | _ |
| ViT + DCAL | 91.4 | _ | 91.5 |
| P2P-Net | 90.2 | _ | 94.2 |
| Inception-V3 | 83.7 | 85.1 | 87.4 |
| Inception-V3 + SAC | 85.3 (+1.6) | 86.8 (+1.7) | 89.2 (+1.8) |
| ResNet-50 | 86.4 | 86.1 | 90.3 |
| ResNet-50 + SAC | 88.3 (+1.9) | 87.4 (+1.3) | 92.1 (+1.8) |
| WS | 88.8 | 91.4 | 92.3 |
| WS + SAC | 89.9 (+1.1) | 92.5 (+1.1) | 93.2 (+0.9) |
| DT | 89.2 | 88.0 | 90.7 |
| DT + SAC | 90.1 (+0.9) | 88.8 (+0.8) | 91.9 (+1.2) |
| MMAL | 89.6 | 90.6 | 94.7 |
| MMAL + SAC | 90.8 (+1.2) | 91.6 (+1.0) | 95.5 (+0.8) |
| WS_DAN | 89.4 | 92.2 | 93.0 |
| WS_DAN + SAC | 91.1 (+1.7) | 93.1 (+0.9) | 93.9 (+0.9) |
| ViT | 91.0 | 93.2 | 92.1 |
| ViT + SAC | 91.8 (+0.8) | 94.5 (+1.3) | 93.1 (+1.0) |
| Avg. Improvement | +1.3 | +1.2 | +1.2 |

Table 2: Contribution (% Acc) of our Self Assessment Classifier (SAC) to fine-grained classification results.

3. Qualitative Results

Attention Maps. Figure 1 visualizes the attention maps between the image feature maps and each ambiguity class. The visualization indicates that, with our Self Assessment Classifier, each fine-grained class focuses on different informative regions.

Figure 1. Visualization of the attention maps between image feature maps and different ambiguity classes from our method. A red-colored class label denotes that the prediction matches the ground truth.

Prediction Results. Figure 2 illustrates the classification results and corresponding localization areas of different methods. In all samples, our SAC attends to different areas depending on the hard-to-distinguish classes. The method can therefore focus on more meaningful areas while ignoring unnecessary ones, and achieves good predictions even in challenging cases.

Figure 2. Qualitative comparison of different classification methods. (a) Input image and its corresponding ground-truth label, (b) ResNet-50, (c) WS_DAN, (d) MMAL, and (e) our SAC. Boxes are localization areas. Red indicates a wrong classification result; blue indicates a correct prediction.

Conclusion

We introduce a Self Assessment Classifier (SAC) which effectively learns discriminative features in the image and resolves ambiguity among the top-k prediction classes. Our method generates an attention map and uses this map to dynamically erase unnecessary regions during training. Extensive experiments on the CUB-200-2011, Stanford Dogs, and FGVC-Aircraft datasets show that our proposed method can be easily integrated into different fine-grained classifiers and clearly improves their accuracy.

Fine-Grained Visual Classification using Self-Assessment Classifier (Part 1)

Extracting discriminative features plays a crucial role in the fine-grained visual classification task. Most existing methods focus on developing attention or augmentation mechanisms to achieve this goal. However, addressing the ambiguity in the top-k prediction classes has not been fully investigated. In this paper, we introduce a Self Assessment Classifier, which simultaneously leverages the representation of the image and the top-k prediction classes to reassess the classification results. Our method is inspired by self-supervised learning with coarse-grained and fine-grained classifiers to increase the discrimination of features in the backbone and produce attention maps of informative areas on the image. In practice, our method works as an auxiliary branch and can be easily integrated into different architectures. We show that by effectively addressing the ambiguity in the top-k prediction classes, our method achieves new state-of-the-art results on the CUB-200-2011, Stanford Dogs, and FGVC-Aircraft datasets. Furthermore, our method also consistently improves the accuracy of different existing fine-grained classifiers under a unified setup.

1. Introduction

The task of fine-grained visual classification involves categorizing images that belong to subcategories of the same meta-class (e.g., various species of birds, types of aircraft, or varieties of flowers). Compared to standard image classification tasks, fine-grained classification poses greater challenges due to three primary factors: (i) significant intra-class variation, where objects within the same category exhibit diverse poses and viewpoints; (ii) subtle inter-class distinctions, where objects from different categories may appear very similar except for minor differences, such as the color pattern of a bird's head often determining its fine-grained class; and (iii) limited training data, as annotating fine-grained categories typically demands specialized expertise and considerable annotation effort. Consequently, achieving accurate classification results solely with standard CNN models such as VGG is nontrivial.

Recent research demonstrates that a crucial strategy for fine-grained classification involves identifying informative regions across various parts of objects and extracting distinguishing features. A common approach is to learn the object's parts through human annotations. However, annotating fine-grained regions is labor-intensive, rendering this method impractical. Some advancements have explored unsupervised or weakly-supervised learning techniques to identify informative object parts or region-of-interest bounding boxes. While these methods offer potential ways to circumvent manual labeling of fine-grained regions, they come with limitations such as reduced accuracy, high computational costs during training or inference, and challenges in accurately detecting distinct bounding boxes.

In this paper, we introduce the Self Assessment Classifier (SAC) method to tackle the inherent ambiguity present in fine-grained classification tasks. Essentially, our approach is devised to reevaluate the top-k prediction outcomes and filter out uninformative regions within the input image. This serves to mitigate inter-class ambiguity and enables the backbone network to learn more discerning features. Throughout training, our method generates attention maps that highlight informative regions within the input image. By integrating this method into a backbone network, we aim to reduce misclassifications among the top-k ambiguous classes. It is important to note that "ambiguity classes" refer to instances where uncertainty in the prediction can lead to incorrect classifications. Our contributions can be succinctly outlined as follows:

  • We propose a novel self-class assessment method that simultaneously learns discriminative features and addresses ambiguity issues in fine-grained visual classification tasks.
  • We demonstrate the versatility of our method by showcasing its seamless integration into various fine-grained classifiers, resulting in improved state-of-the-art performance.

Figure 1. Comparison between generic classification and fine-grained classification.

2. Method Overview

We propose two main steps in our method: Top-k Coarse-grained Class Search (TCCS) and Self Assessment Classifier (SAC). TCCS works as a coarse-grained classifier to extract visual features from the backbone. The Self Assessment Classifier works as a fine-grained classifier to reassess the ambiguity classes and eliminate the non-informative regions. Our SAC has four modules: the Top-k Class Embedding module encodes the information of the ambiguity classes; the Joint Embedding module jointly learns the coarse-grained features and top-k ambiguity classes; the Self Assessment module is designed to differentiate between ambiguity classes; and finally, the Dropping module is a data augmentation method designed to erase unnecessary inter-class similar regions from the input image. Figure 2 shows an overview of our approach.

Figure 2. Method Overview.

3. Top-k Coarse-grained Class Search

The TCCS takes an image as input. Each input image is passed through a Deep CNN to extract the feature map $\textbf{\textit{F}} \in \mathbb{R}^{d_f \times m \times n}$ and the visual feature $\textbf{\textit{V}} \in \mathbb{R}^{d_v}$, where $m$, $n$, and $d_f$ represent the feature map height, width, and number of channels, respectively, and $d_v$ denotes the dimension of the visual feature $\textbf{\textit{V}}$. In practice, the visual feature $\textbf{\textit{V}}$ is usually obtained by applying some fully connected layers after the convolutional feature map $\textbf{\textit{F}}$.

The visual feature $\textbf{\textit{V}}$ is used by the $1^{st}$ classifier, i.e., the original classifier of the backbone, to obtain the top-k prediction results. Assuming that the fine-grained dataset has $N$ classes, the top-k prediction result $C_k = \{C_1, ..., C_k\}$ is a subset of all prediction classes $C_N$, where $k$ is the number of candidates with the $k$-highest confidence scores.
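To make this concrete, below is a minimal sketch of the coarse-grained step with a ResNet-50 backbone (one of the backbones used in our experiments). The input resolution, pooling, and classifier head are illustrative assumptions rather than the exact published configuration.

```python
import torch
from torch import nn
import torchvision

# Sketch of TCCS: extract F and V from a ResNet-50 backbone, then take the top-k classes.
backbone = torchvision.models.resnet50(weights=None)            # pretrained weights omitted here
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])
classifier = nn.Linear(2048, 200)                                # 1st (coarse-grained) classifier, N = 200

images = torch.randn(12, 3, 448, 448)                            # a mini-batch of images (resolution assumed)
F = feature_extractor(images)                                    # feature map F: (batch, d_f, m, n)
V = torch.flatten(nn.functional.adaptive_avg_pool2d(F, 1), 1)    # visual feature V: (batch, d_v)

logits = classifier(V)
topk_scores, C_k = logits.topk(k=10, dim=1)                      # top-k ambiguity classes C_k
```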

4. Self Assessment Classifier

Our Self Assessment Classifier takes the image feature $\textbf{\textit{F}}$ and the top-k predictions $C_k$ from TCCS as input to reassess the fine-grained classification results.

Top-k Class Embedding

The output of the TCCS module, $C_k$, is passed through the top-k class embedding module to output the label embedding set $\textbf{E}_k = \{E_1, ..., E_i, ..., E_k\}$, $i \in \{1, 2, ..., k\}$, $E_i \in \mathbb{R}^{d_e}$. This module contains a word embedding layer (Pennington et al., 2014) for encoding each word in the class labels and a GRU (Cho et al., 2014) layer for learning the temporal information in the class label names. $d_e$ represents the dimension of each class label embedding. It is worth noting that the embedding module is trained end-to-end with the whole model. Hence, the class label representations are learned from scratch without the need for any pre-extracted/pre-trained features or transfer learning.

Given an input class label, we trim the input to a maximum of $4$ words. A class label shorter than $4$ words is zero-padded. Each word is then represented by a $300$-D word embedding. This step results in a sequence of word embeddings with a size of $4 \times 300$, denoted as $\hat{E}_i$ for the $i$-th class label in the $C_k$ class label set. In order to capture the dependencies within the class label name, $\hat{E}_i$ is passed through a Gated Recurrent Unit (GRU), which results in a $1024$-D vector representation $E_i$ for each input class. Note that, although we use the language modality (i.e., the class label name), this is not extra information, as the class label name and the class label identity (used for calculating the loss) represent the same object category.
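A minimal PyTorch sketch of such an embedding module might look as follows; the vocabulary size and the tokenization of the label names are placeholder assumptions.

```python
import torch
from torch import nn

class TopkClassEmbedding(nn.Module):
    """Sketch of the top-k class embedding module: word embedding + GRU.

    vocab_size is a placeholder; in practice it covers the words that appear
    in the class label names of the dataset.
    """
    def __init__(self, vocab_size=1000, word_dim=300, label_dim=1024):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.gru = nn.GRU(word_dim, label_dim, batch_first=True)

    def forward(self, label_tokens):
        # label_tokens: (batch, k, 4) integer word indices, zero-padded to 4 words
        b, k, w = label_tokens.shape
        words = self.word_embedding(label_tokens.view(b * k, w))  # (b*k, 4, 300)
        _, hidden = self.gru(words)                               # hidden: (1, b*k, 1024)
        E_k = hidden.squeeze(0).view(b, k, -1)                    # (batch, k, d_e = 1024)
        return E_k

# Example: embeddings for the top-10 ambiguity classes of a batch of 12 images.
tokens = torch.randint(0, 1000, (12, 10, 4))
E_k = TopkClassEmbedding()(tokens)   # (12, 10, 1024)
```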

Joint Embedding

This module takes the feature map $\textbf{\textit{F}}$ and the top-k class embedding $\textbf{E}_k$ as input to produce the joint representation $\textbf{\textit{J}} \in \mathbb{R}^{d_j}$ and the attention map. We first flatten $\textbf{\textit{F}}$ into $(d_f \times f)$ and $\textbf{E}_k$ into $(d_e \times k)$. The joint representation $\textbf{\textit{J}}$ is calculated from the two modalities $\textbf{\textit{F}}$ and $\textbf{E}_k$ as follows:

$$\textbf{\textit{J}}^T = \left(\mathcal{T} \times_1 \text{vec}(\textbf{\textit{F}}) \right) \times_2 \text{vec}(\textbf{E}_k)$$

where $\mathcal{T} \in \mathbb{R}^{d_{\textbf{\textit{F}}} \times d_{\textbf{E}_k} \times d_j}$ is a learnable tensor; $d_{\textbf{\textit{F}}} = (d_f \times f)$; $d_{\textbf{E}_k} = (d_e \times k)$; $\text{vec}()$ is the vectorization operator; and $\times_i$ denotes the $i$-mode tensor product.

In practice, the tensor $\mathcal{T}$ above is too large and infeasible to learn. Thus, we apply a decomposition that reduces the size of $\mathcal{T}$ while retaining the learning effectiveness. We rely on the idea of the unitary attention mechanism. Specifically, let $\textbf{\textit{J}}_p \in \mathbb{R}^{d_j}$ be the joint representation of the $p^{th}$ couple of channels, where each channel in the couple comes from a different input. The joint representation $\textbf{\textit{J}}$ is approximated by using the joint representations of all couples instead of the fully parameterized interaction in the equation above. Hence, we compute $\textbf{\textit{J}}$ as:

$$\textbf{\textit{J}} = \sum_p \mathcal{M}_p \textbf{\textit{J}}_p$$

Note that in the equation above, we compute a weighted sum over all possible couples. The $p^{th}$ couple is associated with a scalar weight $\mathcal{M}_p$. The set of $\mathcal{M}_p$ forms the attention map $\mathcal{M}$, where $\mathcal{M} \in \mathbb{R}^{f \times k}$.

There are $f \times k$ possible couples over the two modalities. The representations of the channels in a couple are $\textbf{\textit{F}}_{i}$ and $\left(\textbf{E}_k\right)_{j}$, where $i \in [1,f]$ and $j \in [1,k]$, respectively. The joint representation $\textbf{\textit{J}}_p$ is then computed as follows:

$$\textbf{\textit{J}}_p^T = \left(\mathcal{T}_{u} \times_1 \textbf{\textit{F}}_{i} \right) \times_2 \left(\textbf{E}_k\right)_{j}$$

where $\mathcal{T}_{u} \in \mathbb{R}^{d_f \times d_e \times d_j}$ is the learnable tensor between the channels in a couple.

From the equation above, we can compute the attention map $\mathcal{M}$ using a reduced parameterized bilinear interaction over the inputs $\textbf{\textit{F}}$ and $\textbf{E}_k$. The attention map is computed as:

$$\mathcal{M} = \text{softmax}\left(\left(\mathcal{T}_\mathcal{M} \times_1 \textbf{\textit{F}} \right) \times_2 \textbf{E}_k \right)$$

where $\mathcal{T}_\mathcal{M} \in \mathbb{R}^{d_f \times d_e}$ is a learnable tensor.

The joint representation $\textbf{\textit{J}}$ can then be rewritten as:

$$\textbf{\textit{J}}^T = \sum_{i=1}^{f}\sum_{j=1}^{k} \mathcal{M}_{ij} \left( \left( \mathcal{T}_{u} \times_1 \textbf{\textit{F}}_{i}\right) \times_2 \left(\textbf{E}_k\right)_{j} \right)$$

It is also worth noting from the equation above that to compute $\textbf{\textit{J}}$, instead of learning the large tensor $\mathcal{T} \in \mathbb{R}^{d_{\textbf{\textit{F}}} \times d_{\textbf{E}_k} \times d_j}$, we now only need to learn two smaller tensors: $\mathcal{T}_{u} \in \mathbb{R}^{d_f \times d_e \times d_j}$ and $\mathcal{T}_\mathcal{M} \in \mathbb{R}^{d_f \times d_e}$.
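A rough PyTorch sketch of this reduced bilinear interaction with unitary attention is given below. The dimensions are deliberately small for illustration, and the published implementation may further factorize $\mathcal{T}_u$; treat this as a sketch of the equations rather than the exact code.

```python
import torch
from torch import nn

class JointEmbedding(nn.Module):
    """Sketch of the unitary-attention joint embedding (small, illustrative dimensions)."""
    def __init__(self, d_f=256, d_e=256, d_j=256):
        super().__init__()
        self.T_u = nn.Parameter(torch.randn(d_f, d_e, d_j) * 0.01)  # couple tensor T_u
        self.T_M = nn.Parameter(torch.randn(d_f, d_e) * 0.01)       # attention tensor T_M

    def forward(self, F_flat, E_k):
        # F_flat: (batch, f, d_f) flattened feature-map locations
        # E_k:    (batch, k, d_e) top-k class embeddings
        b, f, _ = F_flat.shape
        k = E_k.shape[1]
        # Attention map M over all f x k couples (softmax over the couples).
        scores = torch.einsum('bfd,de,bke->bfk', F_flat, self.T_M, E_k)
        M = torch.softmax(scores.view(b, -1), dim=1).view(b, f, k)
        # Weighted sum of the per-couple joint representations J_p.
        J = torch.einsum('bfd,dej,bke,bfk->bj', F_flat, self.T_u, E_k, M)
        return J, M

# Example: a 14x14 feature map (f = 196) and k = 10 ambiguity classes.
J, M = JointEmbedding()(torch.randn(2, 196, 256), torch.randn(2, 10, 256))
print(J.shape, M.shape)   # torch.Size([2, 256]) torch.Size([2, 196, 10])
```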

Self Assessment

The joint representation $\textbf{\textit{J}}$ from the Joint Embedding module is used as the input to the Self Assessment step to obtain the $2^{nd}$ top-k predictions $\textbf{C}'_k$, where $\textbf{C}'_k = \{C'_1, ..., C'_k\}$. Intuitively, $\textbf{C}'_k$ is the top-k classification result after self-assessment. This module is a fine-grained classifier that produces the $2^{nd}$ predictions to reassess the ambiguous classification results.

The contributions of the coarse-grained and fine-grained classifiers are combined as:

$$\text{Pr}(\rho = \rho_i) = \alpha \text{Pr}_1(\rho = \rho_i) + (1 - \alpha) \text{Pr}_2(\rho = \rho_i)$$

where $\alpha$ is a trade-off hyper-parameter $\left(0 \leq \alpha \leq 1\right)$, and $\text{Pr}_1(\rho = \rho_i)$ and $\text{Pr}_2(\rho = \rho_i)$ denote the prediction probabilities for class $\rho_i$ from the coarse-grained and fine-grained classifiers, respectively.
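As a small illustration, the combination above can be written in PyTorch as below, using $\alpha = 0.5$ as reported in the experimental setup of Part 2; the logits are placeholders.

```python
import torch
import torch.nn.functional as F

def combine_predictions(logits_coarse, logits_fine, alpha=0.5):
    """Blend coarse- and fine-grained predictions as in the equation above.

    alpha = 0.5 follows the experimental setup reported in Part 2;
    both logits tensors have shape (batch, num_classes).
    """
    pr1 = F.softmax(logits_coarse, dim=1)   # Pr_1 from the coarse-grained classifier
    pr2 = F.softmax(logits_fine, dim=1)     # Pr_2 from the fine-grained (self-assessment) classifier
    return alpha * pr1 + (1 - alpha) * pr2

# Example: final prediction for a batch of 12 images over 200 classes.
final_prob = combine_predictions(torch.randn(12, 200), torch.randn(12, 200))
predicted_class = final_prob.argmax(dim=1)
```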

Next

In the next post, we will verify the effectiveness and efficiency of the method.