
Fine-Grained Visual Classification using Self-Assessment Classifier (Part 1)

An introduction to Fine-Grained Visual Classification

Extracting discriminative features plays a crucial role in the fine-grained visual classification task. Most existing methods focus on developing attention or augmentation mechanisms to achieve this goal. However, addressing the ambiguity in the top-k prediction classes has not been fully investigated. In this paper, we introduce a Self Assessment Classifier, which simultaneously leverages the representation of the image and the top-k prediction classes to reassess the classification results. Our method is inspired by self-supervised learning: it uses coarse-grained and fine-grained classifiers to increase the discrimination of features in the backbone and to produce attention maps of informative areas on the image. In practice, our method works as an auxiliary branch and can be easily integrated into different architectures. We show that by effectively addressing the ambiguity in the top-k prediction classes, our method achieves new state-of-the-art results on the CUB-200-2011, Stanford Dogs, and FGVC Aircraft datasets. Furthermore, our method also consistently improves the accuracy of different existing fine-grained classifiers under a unified setup.

1. Introduction

The task of fine-grained visual classification involves distinguishing sub-categories within the same super-category (e.g., various species of birds, types of aircraft, or varieties of flowers). Compared to standard image classification, fine-grained classification poses greater challenges due to three primary factors: (i) significant intra-class variation, where objects within the same category exhibit diverse poses and viewpoints; (ii) subtle inter-class distinctions, where objects from different categories may appear very similar except for minor differences (e.g., the color pattern of a bird's head often determines its fine-grained class); and (iii) limited training data, as annotating fine-grained categories typically demands specialized expertise and considerable annotation effort. Consequently, achieving accurate classification results solely with state-of-the-art CNN models like VGG is nontrivial.

Recent research demonstrates that a crucial strategy for fine-grained classification involves identifying informative regions across various parts of objects and extracting distinguishing features. A common approach to achieving this is by learning the object's parts through human annotations. However, annotating fine-grained regions is labor-intensive, rendering this method impractical. Some advancements have explored unsupervised or weakly-supervised learning techniques to identify informative object parts or region of interest bounding boxes. While these methods offer potential solutions to circumvent manual labeling of fine-grained regions, they come with limitations such as reduced accuracy, high computational costs during training or inference, and challenges in accurately detecting distinct bounding boxes.

In this paper, we introduce the Self Assessment Classifier (SAC) method to tackle the inherent ambiguity present in fine-grained classification tasks. Essentially, our approach is designed to reassess the top-k prediction results and filter out uninformative regions in the input image. This mitigates inter-class ambiguity and enables the backbone network to learn more discriminative features. During training, our method generates attention maps that highlight informative regions in the input image. By integrating this method into a backbone network, we aim to reduce misclassifications among the top-k ambiguous classes. It is important to note that "ambiguity classes" refers to classes whose prediction uncertainty can lead to incorrect classifications. Our contributions can be summarized as follows:

  • We propose a novel self-class assessment method that simultaneously learns discriminative features and addresses ambiguity issues in fine-grained visual classification tasks.
  • We demonstrate the versatility of our method by showcasing its seamless integration into various fine-grained classifiers, resulting in improved state-of-the-art performance.

Figure 1. Comparison between generic classification and fine-grained classification.

2. Method Overview

We propose two main steps in our method: Top-k Coarse-grained Class Search (TCCS) and Self Assessment Classifier (SAC). TCCS works as a coarse-grained classifier that extracts visual features from the backbone. The Self Assessment Classifier works as a fine-grained classifier that reassesses the ambiguity classes and eliminates non-informative regions. Our SAC has four modules: the Top-k Class Embedding module encodes the information of the ambiguity classes; the Joint Embedding module jointly learns the coarse-grained features and the top-k ambiguity classes; the Self Assessment module is designed to differentiate between ambiguity classes; and finally, the Dropping module is a data augmentation method designed to erase unnecessary inter-class similar regions from the input image. Figure 2 shows an overview of our approach.

Figure 2. Method Overview.

3. Top-k Coarse-grained Class Search

The TCCS takes an image as input. Each input image is passed through a deep CNN to extract the feature map $\textbf{\textit{F}} \in \mathbb{R}^{d_f \times m \times n}$ and the visual feature $\textbf{\textit{V}} \in \mathbb{R}^{d_v}$. Here $m$, $n$, and $d_f$ represent the feature map height, width, and number of channels, respectively; $d_v$ denotes the dimension of the visual feature $\textbf{\textit{V}}$. In practice, the visual feature $\textbf{\textit{V}}$ is usually obtained by applying a few fully connected layers on top of the convolutional feature map $\textbf{\textit{F}}$.

The visual feature $\textbf{\textit{V}}$ is used by the $1^{st}$ classifier, i.e., the original classifier of the backbone, to obtain the top-k prediction results. Assume the fine-grained dataset has $N$ classes. The top-k prediction result $C_k = \{C_1, ..., C_k\}$ is a subset of all $N$ prediction classes $C_N$, where $k$ is the number of candidates with the $k$-highest confidence scores.
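To make the TCCS step concrete, below is a minimal PyTorch sketch of how $\textbf{\textit{F}}$, $\textbf{\textit{V}}$, and the top-k candidates could be extracted. It assumes a torchvision ResNet-50 backbone, a 448x448 input, and $k = 5$; all of these choices are illustrative rather than the exact setup of the paper.

```python
# Minimal sketch of TCCS: backbone -> feature map F, visual feature V, top-k classes.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # conv layers only
pool = torch.nn.AdaptiveAvgPool2d(1)
num_classes, k = 200, 5                          # e.g., CUB-200-2011 with top-5 candidates
coarse_head = torch.nn.Linear(2048, num_classes)  # 1st (coarse-grained) classifier

image = torch.randn(1, 3, 448, 448)              # dummy input image
F = feature_extractor(image)                     # F: (1, d_f=2048, m=14, n=14)
V = pool(F).flatten(1)                           # V: (1, d_v=2048)
logits = coarse_head(V)                          # coarse-grained scores over N classes
topk_scores, C_k = logits.topk(k, dim=1)         # C_k: indices of the k most confident classes
```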

4. Self Assessment Classifier

Our Self Assessment Classifier takes the image feature $\textbf{\textit{F}}$ and the top-k predictions $C_k$ from TCCS as input to reassess the fine-grained classification results.

Top-k Class Embedding

The output of the TCCS module, $C_k$, is passed through the top-k class embedding module to produce the label embedding set $\textbf{E}_k = \{E_1, ..., E_i, ..., E_k\}$, $i \in \{1, 2, ..., k\}$, $E_i \in \mathbb{R}^{d_{e}}$. This module contains a word embedding layer~\cite{pennington2014glove} for encoding each word in a class label and a GRU~\cite{2014ChoGRU} layer for learning the temporal information in class label names. $d_{e}$ represents the dimension of each class label embedding. It is worth noting that the embedding module is trained end-to-end with the whole model. Hence, the class label representations are learned from scratch without the need for any pre-extracted/pre-trained embeddings or transfer learning.

Given an input class label, we trim the input to a maximum of 4 words. Class labels shorter than 4 words are zero-padded. Each word is then represented by a 300-D word embedding. This step results in a sequence of word embeddings with a size of $4 \times 300$, denoted as $\hat{E}_i$ for the $i$-th class label in the $C_k$ class label set. To capture the dependency within the class label name, $\hat{E}_i$ is passed through a Gated Recurrent Unit (GRU), which produces a 1024-D vector representation $E_i$ for each input class. Note that, although we use the language modality (i.e., the class label name), it is not extra information, as the class label name and the class label identity (used for calculating the loss) represent the same object category.
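Below is a minimal sketch of how such a class embedding module could look in PyTorch, using the 4-word limit, 300-D word embeddings, and 1024-D GRU state from the text; the vocabulary size, tokenization, and example token ids are placeholder assumptions.

```python
# Minimal sketch of the Top-k Class Embedding module: word embedding + GRU over label words.
import torch
import torch.nn as nn

class TopkClassEmbedding(nn.Module):
    def __init__(self, vocab_size, d_word=300, d_e=1024, max_words=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_word, padding_idx=0)  # id 0 = padding
        self.gru = nn.GRU(d_word, d_e, batch_first=True)

    def forward(self, label_token_ids):          # (k, max_words) int tensor, zero-padded
        w = self.word_emb(label_token_ids)       # (k, 4, 300) word embeddings
        _, h = self.gru(w)                       # h: (1, k, 1024) last hidden state
        return h.squeeze(0)                      # E_k: (k, 1024), one vector per class name

# e.g., one label such as "Black footed Albatross" tokenized and padded to 4 word ids
emb = TopkClassEmbedding(vocab_size=1000)
E_k = emb(torch.tensor([[12, 47, 301, 0]]))      # (1, 1024)
```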

Joint Embedding

This module takes the feature map $\textbf{\textit{F}}$ and the top-k class embedding $\textbf{E}_k$ as input to produce the joint representation $\textbf{\textit{J}} \in \mathbb{R}^{d_j}$ and the attention map. We first flatten $\textbf{\textit{F}}$ into $(d_f \times f)$, where $f = m \times n$, and $\textbf{E}_k$ into $(d_e \times k)$. The joint representation $\textbf{\textit{J}}$ is computed from the two modalities $\textbf{\textit{F}}$ and $\textbf{E}_k$ as follows:

$$\textbf{\textit{J}}^T = \left(\mathcal{T} \times_1 \text{vec}(\textbf{\textit{F}}) \right) \times_2 \text{vec}(\textbf{E}_k)$$

where $\mathcal{T} \in \mathbb{R}^{d_{\textbf{\textit{F}}} \times d_{\textbf{E}_k} \times d_j}$ is a learnable tensor; $d_{\textbf{\textit{F}}} = (d_f \times f)$; $d_{\textbf{E}_k} = (d_e \times k)$; $\text{vec}(\cdot)$ is the vectorization operator; and $\times_i$ denotes the $i$-mode tensor product.

In practice, the tensor $\mathcal{T}$ above is too large to learn feasibly. Thus, we apply a decomposition that reduces the size of $\mathcal{T}$ while retaining its learning effectiveness. We rely on the idea of the unitary attention mechanism. Specifically, let $\textbf{\textit{J}}_p \in \mathbb{R}^{d_j}$ be the joint representation of the $p^{th}$ couple of channels, where each channel in the couple comes from a different input modality. The joint representation $\textbf{\textit{J}}$ is approximated using the joint representations of all couples instead of the fully parameterized interaction in the equation above. Hence, we compute $\textbf{\textit{J}}$ as:

$$\textbf{\textit{J}} = \sum_p \mathcal{M}_p \textbf{\textit{J}}_p$$

Note that in the equation above, we compute a weighted sum over all possible couples. The $p^{th}$ couple is associated with a scalar weight $\mathcal{M}_p$. The set of weights $\mathcal{M}_p$ forms the attention map $\mathcal{M}$, where $\mathcal{M} \in \mathbb{R}^{f \times k}$.

There are $f \times k$ possible couples over the two modalities. The representations of the channels in a couple are $\textbf{\textit{F}}_{i}$ and $\left(\textbf{E}_k\right)_{j}$, where $i \in [1, f]$ and $j \in [1, k]$, respectively. The joint representation $\textbf{\textit{J}}_p$ is then computed as follows:

$$\textbf{\textit{J}}_p^T = \left(\mathcal{T}_{u} \times_1 \textbf{\textit{F}}_{i} \right) \times_2 \left(\textbf{E}_k\right)_{j}$$

where $\mathcal{T}_{u} \in \mathbb{R}^{d_f \times d_e \times d_j}$ is the learnable tensor between the channels in a couple.

From the equation above, we can compute the attention map $\mathcal{M}$ using a reduced parameterized bilinear interaction over the inputs $\textbf{\textit{F}}$ and $\textbf{E}_k$. The attention map is computed as

$$\mathcal{M} = \text{softmax}\left(\left(\mathcal{T}_\mathcal{M} \times_1 \textbf{\textit{F}} \right) \times_2 \textbf{E}_k \right)$$

where $\mathcal{T}_\mathcal{M} \in \mathbb{R}^{d_f \times d_e}$ is a learnable tensor.
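In other words, the attention map reduces to a bilinear score between every (region, class) couple followed by a softmax. A minimal PyTorch sketch, assuming the flattened shapes $(d_f \times f)$ and $(d_e \times k)$ from the text; the dimensions and the choice to normalize over all $f \times k$ entries are assumptions made for illustration.

```python
# Minimal sketch of the attention map M = softmax((T_M x_1 F) x_2 E_k).
import torch

d_f, f, d_e, k = 2048, 14 * 14, 1024, 5
F = torch.randn(d_f, f)                                   # flattened feature map
E_k = torch.randn(d_e, k)                                 # top-k class embeddings
T_M = torch.randn(d_f, d_e, requires_grad=True)           # learnable tensor T_M

scores = F.t() @ T_M @ E_k                                # (f, k): one score per (region, class) couple
M = torch.softmax(scores.flatten(), dim=0).view(f, k)     # attention map M in R^{f x k}
```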

The joint representation $\textbf{\textit{J}}$ can then be rewritten as

$$\textbf{\textit{J}}^T = \sum_{i=1}^{f}\sum_{j=1}^{k} \mathcal{M}_{ij} \left( \left( \mathcal{T}_{u} \times_1 \textbf{\textit{F}}_{i}\right) \times_2 \left(\textbf{E}_k\right)_{j} \right)$$

It is also worth noting from the equation above that to compute $\textbf{\textit{J}}$, instead of learning the large tensor $\mathcal{T} \in \mathbb{R}^{d_{\textbf{\textit{F}}} \times d_{\textbf{E}_k} \times d_j}$, we now only need to learn two much smaller tensors: $\mathcal{T}_{u} \in \mathbb{R}^{d_{f} \times d_{e} \times d_j}$ and $\mathcal{T}_\mathcal{M} \in \mathbb{R}^{d_f \times d_e}$.
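A minimal sketch of this decomposed computation follows, with $d_j = 1024$ and the other dimensions chosen for illustration; the single einsum contraction is equivalent to the double sum above, and the random attention map stands in for the one computed in the previous snippet.

```python
# Minimal sketch of the decomposed joint embedding J = sum_{i,j} M_ij (T_u x_1 F_i) x_2 (E_k)_j.
import torch

d_f, f, d_e, k, d_j = 2048, 14 * 14, 1024, 5, 1024
F = torch.randn(d_f, f)                                        # flattened feature map
E_k = torch.randn(d_e, k)                                      # top-k class embeddings
M = torch.softmax(torch.randn(f, k).flatten(), 0).view(f, k)   # attention map (see previous snippet)
T_u = torch.randn(d_f, d_e, d_j, requires_grad=True)           # learnable tensor T_u, shared by all couples

# Contract over regions i, classes j, and channel indices d, e in one step.
J = torch.einsum('ij,di,ej,dev->v', M, F, E_k, T_u)            # joint representation J in R^{d_j}
```

Under these assumed dimensions, the full tensor $\mathcal{T}$ would hold $(2048 \cdot 196) \cdot (1024 \cdot 5) \cdot 1024 \approx 2 \times 10^{12}$ parameters, whereas $\mathcal{T}_u$ and $\mathcal{T}_\mathcal{M}$ together hold roughly $2 \times 10^9$.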

Self Assessment

The joint representation $\textbf{\textit{J}}$ from the Joint Embedding module is used as input to the Self Assessment step to obtain the $2^{nd}$ top-k predictions $\textbf{C}'_k$. Note that $\textbf{C}'_k = \{C'_1, ..., C'_k\}$. Intuitively, $\textbf{C}'_k$ is the top-k classification result after self-assessment. This module is a fine-grained classifier that produces the $2^{nd}$ predictions to reassess the ambiguous classification results.

The contributions of the coarse-grained and fine-grained classifiers are combined as

$$\text{Pr}(\rho = \rho_i) = \alpha \, \text{Pr}_1(\rho = \rho_i) + (1 - \alpha) \, \text{Pr}_2(\rho = \rho_i)$$

where $\alpha$ is a trade-off hyper-parameter $\left(0 \leq \alpha \leq 1\right)$. $\text{Pr}_1(\rho = \rho_i)$ and $\text{Pr}_2(\rho = \rho_i)$ denote the prediction probabilities for class $\rho_i$ from the coarse-grained and fine-grained classifiers, respectively.
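A minimal sketch of this blending step is shown below, assuming a simple linear head on $\textbf{\textit{J}}$ as the fine-grained classifier scoring all $N$ classes and $\alpha = 0.5$; both are illustrative assumptions, and in the method the reassessment focuses on the top-k ambiguity classes.

```python
# Minimal sketch of combining the coarse-grained and fine-grained predictions with alpha.
import torch
import torch.nn as nn

num_classes, d_j, alpha = 200, 1024, 0.5
fine_head = nn.Linear(d_j, num_classes)          # 2nd (fine-grained) classifier on J

coarse_logits = torch.randn(1, num_classes)      # from TCCS (1st classifier)
J = torch.randn(1, d_j)                          # joint representation from the SAC branch

pr1 = torch.softmax(coarse_logits, dim=1)        # Pr_1: coarse-grained probabilities
pr2 = torch.softmax(fine_head(J), dim=1)         # Pr_2: fine-grained probabilities
pr = alpha * pr1 + (1 - alpha) * pr2             # blended prediction Pr
pred_class = pr.argmax(dim=1)                    # final class decision
```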

Next

In the next post, we will verify the effectiveness and efficiency of the method.
