
Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 7)

Attribute Learning

To provide fine-grained information to the CFOR system, attribute learning is a crucial task that must be optimized for both processing time and the ability to handle large-scale imbalanced datasets.

1. Framework

Local multitask learning is applied to attribute learning. The proposed framework, depicted in Figures 1 and 2 and comprising offline and online phases, consists of three main components. The first component introduces a local multitask transfer learning model with a loss function designed to leverage inner-group correlations among attributes. The second component presents an imbalanced-data resolver based on the Matthews correlation coefficient (MCC) that requires no adjustments to the pretrained model or loss function. The third component discusses the prior knowledge used for local attribute grouping to facilitate local multitask learning.
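As an illustration of the MCC-based resolver, the minimal sketch below tunes a decision threshold per attribute on a validation set so as to maximize MCC, leaving the pretrained model and the loss function untouched. The helper name mcc_thresholds and the candidate grid are illustrative choices, not fixed parts of the framework.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_thresholds(y_true, y_scores, candidates=np.linspace(0.05, 0.95, 19)):
    """Pick, per attribute, the decision threshold that maximizes MCC.

    y_true:   (num_samples, num_attributes) binary ground-truth labels
    y_scores: (num_samples, num_attributes) sigmoid confidence scores
    """
    thresholds = np.zeros(y_true.shape[1])
    for j in range(y_true.shape[1]):
        # MCC stays informative under heavy class imbalance, unlike accuracy.
        mccs = [matthews_corrcoef(y_true[:, j],
                                  (y_scores[:, j] >= t).astype(int))
                for t in candidates]
        thresholds[j] = candidates[int(np.argmax(mccs))]
    return thresholds
```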

Figure 1. Local MTL with an imbalanced data problem solver framework (Offline phase).

The input and output of the learning framework are images and their attribute vectors, respectively. Under the local grouping rule, however, the size of each attribute vector depends on the number of attributes in its group, so the dataset must be merged or split according to that rule.

To evaluate the effectiveness of the proposed framework, we apply it to the fashion field and split the dataset into five local groups: fabric, shape, part, style, and texture. Because fashion attributes exhibit weak intergroup correlations, the shared block is designed to maximize the benefit of inner-group correlations and improve overall performance. For crowd attributes (such as activities, locations, and participants), by contrast, intergroup correlations matter and should be taken into account, so the shared block should be modified to fit that context.
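To make the grouping rule concrete, here is a small sketch that splits a global attribute label matrix into the five local groups. The index layout and equal group sizes are illustrative placeholders, not DeepFashion's actual attribute counts.

```python
import numpy as np

# Assumed index layout of the global attribute vector; the real group
# sizes in DeepFashion differ -- these slices are placeholders.
LOCAL_GROUPS = {
    "fabric":  slice(0, 7),
    "shape":   slice(7, 14),
    "part":    slice(14, 21),
    "style":   slice(21, 28),
    "texture": slice(28, 35),
}

def split_by_group(attribute_matrix):
    """Split an (N, total_attributes) label matrix into per-group matrices."""
    return {name: attribute_matrix[:, idx]
            for name, idx in LOCAL_GROUPS.items()}

# Example: 4 images, 35 global attributes -> five (4, 7) group label matrices.
labels = np.random.randint(0, 2, size=(4, 35))
group_labels = split_by_group(labels)
```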

Figure 2. Local MTL with an imbalanced data problem solver framework (Online phase).

2. Deep Multitask Learning

Our aim is to estimate a number of fashion attributes via a joint estimation model. However, when the attribute set is dynamic, an MTL model that jointly estimates all attributes becomes fragile in the training phase: it cannot be reused once the number of attributes grows. The local grouping method helps resolve this situation.

Framework in Detail

In the experiments, the proposed framework processes the query image and generates a confidence score vector comprising 7 attribute scores per group across 5 groups, which is then thresholded to obtain binary outputs. The architecture is outlined in detail below.
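A minimal sketch of this online step, assuming per-group score vectors and the per-attribute thresholds produced by the MCC step sketched earlier (the dict layout and fixed group order are illustrative):

```python
import numpy as np

def binarize_scores(group_scores, group_thresholds):
    """Concatenate per-group scores into one binary attribute vector.

    group_scores:     dict mapping group name -> (7,) sigmoid scores
    group_thresholds: dict mapping group name -> (7,) per-attribute
                      thresholds (e.g., MCC-selected in the offline phase)
    """
    names = ["fabric", "shape", "part", "style", "texture"]  # fixed order
    parts = [(group_scores[g] >= group_thresholds[g]).astype(int)
             for g in names]
    return np.concatenate(parts)  # (35,) binary output vector
```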

Figure 1 illustrates the overall structure of the proposed method. For each group, a training set is assumed, consisting of N fashion images, each with M attributes. The dataset is represented as D = {(X_i, Y_i)}, where X_i is the image and Y_i is the label encoded as a one-hot vector. Inspired by prior research, we employ an end-to-end DNN architecture as a shared block to learn joint representations for all tasks. The loss function is binary cross-entropy, and the activation function at the output layer is sigmoid, chosen for simplicity and for the flexibility it allows in modifying the DNN architecture.
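The following minimal Keras sketch shows this setup for a single group: a small stand-in shared block (any end-to-end DNN backbone can take its place) topped with one sigmoid unit per attribute and trained with binary cross-entropy. The layer sizes are illustrative only.

```python
import tensorflow as tf

def build_group_model(input_shape=(224, 224, 3), num_attributes=7):
    """One local-group model: shared block + sigmoid multi-label head."""
    inputs = tf.keras.Input(shape=input_shape)
    # Shared block: stands in for any end-to-end DNN backbone; every
    # attribute task in the group learns from this joint representation.
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    # One sigmoid unit per attribute; binary cross-entropy treats each
    # attribute as its own binary task on top of the shared features.
    outputs = tf.keras.layers.Dense(num_attributes, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["binary_accuracy"])
    return model
```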

Network Architecture

NASNet automatically generates network architectures, constructing an optimal model by first searching for architectures on a smaller dataset and then scaling them up to a larger one. In the experiments, the search for the best cells is conducted on the CIFAR-10 dataset; these cells are subsequently applied to the ImageNet dataset by stacking multiple copies of them, each with its own parameters. The resulting model demonstrates a 1.2% improvement in top-1 accuracy over the best human-designed architectures. NASNet thus proves its effectiveness over previous architectures and offers a transfer learning model trained on the diverse ImageNet dataset. Leveraging the NASNet model pretrained on ImageNet, we apply transfer learning to the DeepFashion dataset to speed up convergence and enhance performance, and we add a dropout layer to NASNet to mitigate overfitting. While using the NASNet model generation algorithm to tailor a model to the DeepFashion dataset is a promising approach, generating and training a NASNet model from scratch demands significant time and hardware resources. Due to hardware limitations, only transfer learning is employed.
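As a minimal sketch of this transfer-learning setup, the snippet below loads the NASNetMobile weights bundled with Keras and freezes them; the choice of NASNetMobile and the 224x224 input size are assumptions for illustration, and NASNetLarge works the same way with 331x331 inputs at a higher memory cost.

```python
import tensorflow as tf

# NASNetMobile pretrained on ImageNet, used as a frozen feature extractor
# (include_top=False drops the ImageNet classification head).
base = tf.keras.applications.NASNetMobile(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # transfer learning only: ImageNet features stay fixed
```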

Figure 3. Best normal and reduction cells identified on CIFAR-10; the ImageNet architecture (right) is built from these best convolutional cells. Zoph et al. designed two types of cells so that the architectures could handle images of any size: normal cells return a feature map of the same dimensions, while reduction cells return a feature map whose height and width are reduced by a factor of two.

We experiment with NASNet architectures to determine which one suits each specific task in our CFOR system. In our fashion retrieval experiments, the category classifier and region classifier tasks use transfer learning with single-task learning, while fashion attribute recognition uses local multitask learning. In addition, to adapt to large-scale datasets and reduce overfitting, we recommend replacing the final fully connected layer with a global average pooling layer followed by dropout, as sketched below.
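Putting the pieces together, here is a hedged sketch of that recommendation: a frozen NASNet backbone followed by global average pooling and dropout in place of a large fully connected layer. The function name, num_attributes, and the dropout rate are illustrative assumptions.

```python
import tensorflow as tf

def build_attribute_model(num_attributes=7, dropout_rate=0.5):
    """Frozen NASNet backbone with GAP + dropout replacing a final FC layer."""
    base = tf.keras.applications.NASNetMobile(
        input_shape=(224, 224, 3), include_top=False, weights="imagenet")
    base.trainable = False
    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = base(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)  # replaces the FC layer
    x = tf.keras.layers.Dropout(dropout_rate)(x)     # curbs overfitting at scale
    outputs = tf.keras.layers.Dense(num_attributes, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

Global average pooling keeps the parameter count nearly independent of the backbone's spatial output size, which is what makes it attractive for large-scale datasets compared with a wide fully connected layer.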

Next

In the next post, we will explore the Searching and Indexing Method in the CFOR System, as well as the effectiveness of the whole system.
