8 posts tagged with "object retrieval"

View All Tags

Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 8)

Welcome to the concluding part of our series! In this final installment, we'll delve deeper into the intricacies of the search system that accompanies our proposal. Additionally, we'll conduct a comprehensive evaluation to gauge the overall effectiveness and performance of our solution. Let's explore the various aspects of the search functionality and analyze how it contributes to the success of our proposed system. Let's embark on this journey together as we wrap up our discussion and reflect on the insights gained throughout this series.

1. Searching and Indexing Method in the CFOR System

To adapt our retrieval system for large-scale datasets, we've developed indexes for the CFOR system to facilitate nonexhaustive similarity search using GPU acceleration. Leveraging the searching algorithm outlined by Johnson et al. (referenced as "billion-scale similarity search with GPUs"), we've implemented it within the CFOR system for retrieval tasks. In searching, the CFOR system enhances accuracy by reducing the search space through additional information such as regions, categories, and attributes. For indexing, the object ontology aids in creating multi-indexing files to minimize search time. Our focus is on similarity search in vector collections using the L2 distance in the k-selection algorithm.

In the realm of searching, we distinguish between exact search (exhaustive search) and compressed search (greedy nonexhaustive search).

  • Exact search: Almost all searching algorithms in this type try to compute the full pairwise distance between the query and each data point in the database sequentially or using the index file.
  • Compressed-Domain search: Almost all searching algorithms in this type try to compute distance between the query and each data point in the database by applying space transformation, encoding, subspace splitting, or hashing. These methods can help improve searching time by using index files, but they have a trade-off in searching accuracy.

2. Data

Our fashion retrieval system was built on a subset of approximately 300,000 images of DeepFashion. In the DeepFashion dataset, objects from different aspects are caught in complicated background. The input image in the dataset is annotated with different labels based on details (fine-grained) of input of the current model concern, i.e., rich annotation. The samples given in Figures 1 and 2 show more details about the DeepFashion dataset.

Figure 1. Images from the DeepFashion dataset obtained from different views and complicated background.

Figure 2. Images from the DeepFashion dataset annotated with different labels based on details of input of the current model concern.

In testing, we employ part of the benchmark data to fine tune the trained models. We ensure that there are no fashion item overlaps between fine-tuning and testing sets. The dataset includes ∼220,000 images of the training set, 40,000 images of the validating set, and 40,000 images of the testing set split. However, in attribute learning, we limited the number of attribute labels used for testing and the number of training images for specific attributes to make an imbalanced attribute dataset so as to prove our proposed methods.

3. Results and Discussion

In the CFOR system, object ontology is useful in controlling training flow which impacts the performance of object category classification and attribute multitask classification. For object category classification, ontology controls the amount of training data through concepts. For attribute multitask classification, ontology manages local grouping which directly affects the performance of the proposed local imbalanced data solver on the large-scale dataset.

In this section, we will evaluate the effectiveness of different deep networks with the support of ontology on both category classification and attribute multitask classification in the CFOR system to pick out the best architecture for training the system. We will also compare our results with FashionNet.

Category Classification

We compare the performance between different deep architectures including NASNet, ResNet-18, ResNet-101, FashionNet, NASNet with average pooling dropout (NASNet APD) (proposed by us), and ResNet with average pooling dropout (ResNet APD) (proposed by us). These experiments will be evaluated by top-k accuracy (Figure.3). Our target is to find out the best possible architecture to apply as a core network of the CFOR system. This step can be mentioned as a preparation step before applying the CFOR system for fashion retrieval.

Figure 3. Accuracy plot for top-k accuracy in category classification.

The result of category classification by ResNet-18 APD is higher than 1.23% (at k= 1) after removing nodes and making average pooling in the ResNet-18 architecture (compared with the original ResNet-18 architecture). This increased value is 0.93% with the ResNet-101 architecture (compared with the original ResNet-101 architecture) and 0.02% with the NASNet v3 architecture (compared with the original NASNet v3 architecture). The ResNet-101 APD architecture (the best architecture addressed) outperformed the FashionNet architecture (the best performing architecture in category classification on the DeepFashion dataset versus others such as WTBI or DARN), and the value is 4.6% with k= 3 and 2.58% with k = 5. Based on the above experimental results, the ResNet-101 architecture provides better classification and higher performance compared to others (NASNet and ResNet-18). For this reason, we propose ResNet-101 as the core network architecture for training classification models.

Attribute Learning

Attribute multitask learning is an important part of the CFOR system. In this section, we evaluate the performance of the proposed local imbalanced data solver with MCC in dealing with the imbalanced attribute data on the large-scale fashion dataset.

Precision is the proportion of relevant instances among the retrieved instances which consider both true positives and false positives in each attribute. However, the number of true positives and false positives is bias because of the imbalanced data problem. Thus, precision can also be affected by the imbalanced data problem. Otherwise, recall, which cares about true-positive labels but not false-positive labels, will be used to evaluate experiments because of its good reflection for fewer data attributes.

Local MTL gets over STL and MTL in 28/35 attributes with a 54.70% recall rate (higher than that in STL (17.06%) and that in MTL (28.70%)). While a single task shows its weakness in fewer data attributes and multitasks get struggled with the serious imbalanced problem and lesser intergroup correlations in fashion data, local MTL can lower their negative influences as well as widen the positive effect of inner-group correlations on attribute learning. Thus, local MTL gets over STL and MTL in 13/15 fewer sample attributes (Figure.4).

Figure 4.Recall graph of 14 attributes in STL and local MTL.

Based on the experiment, comparison of chic, solid, and maxi attributes which have equal accuracy between MTL with and without MCC shows that MTL with MCC had higher recall compared to that without MCC in 20/35 remaining attributes. The overall performance increases about 3%. For attributes with fewer data, MTL with MCC had higher recall compared to that without MCC in 9/14 attributes. The overall performance for these fewer data attributes increases 5.14% (see Figure 5 for more details).

Figure 5.Recall graph of 35 attributes using local multitask models with and without MCC.

Retrieval in CFOR System Results

In this experiment, we test the retrieval ability of the CFOR system by using MAP from 1 retrieval result for each query (MAP@1) to 30 retrieval results for each query (MAP@30) so as to evaluate the effectiveness. The similarity retrieval experiment will check whether the extracted attributes in retrieved images are matched with ground-truth attributes in query image. The retrieval method will be based on deep features and over 35 attributes. After experimenting in 35 attributes belonging to 5 groups, the starting MAP@5 is acceptable (hovering 0.531) which shows the effectiveness of the searching method. The MAP@30 hovers 0.815, and the trend keeps rising which shows consistency and stabilization of information prediction methods in the CFOR system. A simple visualization of the retrieval process in the CFOR system is shown in Figure.6.

Figure 6. An example of the retrieval results of the CFOR system.

4. Conclusion

This work presents the coarse-to-fine object retrieval system, a learning framework for e-commerce online retrieval, which is supported to deal with large-scale imbalanced datasets. The framework can impact input and output as well as reconstruct datasets from the coarse-grained level to the fine-grained level and is believed to be an effective method in improving learning performance designed for retrieval. For input reconstruction, the framework based on ontology is used for threading training flow, local grouping in multitask attribute learning, and hierarchical storage and retrieval. For output optimization, we take advantage of MCC to minimize the effect of the imbalanced dataset on multitask attribute learning.

Through extensive experiments, we demonstrate the applicability of object ontology in improving training flow, the effectiveness of different deep networks (ResNet and NASNet) applied on important tasks in fine-grained retrieval, and the usefulness of local multitask attribute learning and an MCC-based imbalanced data solver in attribute multitask learning. The CFOR system is designed to have flexibility so that it can be optimized easily in the future.

Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 7)

To provide fine-grained information to the CFOR system, attribute learning is a most important task which should be optimized in both time-processing performance and ability to deal with large-scale imbalanced datasets.

1. Framework

Local multitask learning is applied to attribute learning. The proposed framework, depicted in Figure.1 and Figure.2 and comprising online and offline phases, consists of three main components. The initial component introduces a local multitask transfer learning model with a loss function designed to leverage inner-group correlations among attributes. The second component presents an imbalanced data resolver based on MCC (Matthews Correlation Coefficient) without any adjustments to the pretrained model or loss function. The third component discusses prior knowledge used for local attribute grouping to facilitate local multitask learning.

Figure 1. Local MTL with an imbalanced data problem solver framework (Offline phase).

The input and output of the learning framework will be images and their attribute vectors, respectively. However, with the local grouping role, the attribute vector’s size will be based on the number of attributes in each group. The dataset should be merged or split based on the local grouping role.

To evaluate the effectiveness of the proposed framework, we apply it in the fashion field and split the dataset into five local groups: fabric, shape, part, style, and texture. Because fashion has lesser intergroup correlations, the shared block should be designed to optimize the effectiveness of inner-group correlations to improve the overall performance. However, in crowd attributes (such as activities, locations, and participants), intergroup correlations should be taken into account to improve performance. Thus, the shared block should be modified to adapt to the context.

Figure 2. Local MTL with an imbalanced data problem solver framework (Online phase).

2. Deep Multitask Learning

Our aim is to estimate a number of fashion attributes via a joint estimation model. However, with the dynamic attributes, MTL which supports creating a joint estimation model becomes vulnerable in the training phase due to its nonusability when the number of attributes increases. Thus, the local grouping method can help solve this situation.

Framework in Detail

In the experiments, the proposed framework processes the query image and generates a confident score vector comprising 7 attribute scores per group across 5 groups, which is then thresholded to obtain binary outputs. The architecture is outlined in detail below.

Figure.1 illustrates the overall structure of the proposed method. For each group, a training set is assumed, consisting of NN fashion images, each with MM attributes. The dataset is represented as D=(Xi,Yi)D = {(X_i, Y_i)}, where XiX_i is the image and YiY_i is the label encoded as a one-hot vector. Inspired by prior researches, we employ an end-to-end DNN architecture as a shared block to learn joint representations for all tasks. The loss function employed is binary cross-entropy, and the activation function at the output layer is sigmoid, chosen for its simplicity and flexibility in modifying the DNN architecture.

Network Architecture

NASNet automatically generates network architectures, constructing an optimal model by initially creating architectures on a smaller dataset and then scaling them up to a larger one. Through experiments, the search for the best cells is conducted on the CIFAR-10 dataset, which are subsequently applied to the ImageNet dataset by stacking multiple copies of them, each with their own parameters. The resulting model demonstrates a 1.2% improvement in top-1 accuracy compared to the best human-designed architectures. NASNet proves its effectiveness over previous architectures and offers a transfer learning model trained on a diverse ImageNet dataset. Leveraging the pretrained NASNet model on ImageNet, transfer learning is applied to the DeepFashion dataset to expedite convergence and enhance performance. Additionally, a dropout layer is incorporated with NASNet to mitigate overfitting. While utilizing the NASNet model generation algorithm to tailor a model for the DeepFashion dataset is a promising approach, the time and hardware resources required for NASNet's model generation and training from scratch are significant. Due to hardware limitations, only transfer learning is employed.

Figure 3. Best normal cells and reduction cells identified with CIFAR-10 and ImageNet architecture (right) are built from the best convolutional cells . Zoph et al. built two types of cells because they want to create architectures for images of any size. While normal cells return a feature map which has the same dimension, reduction cells return a feature map with height and width reduced by a factor of two.

We will do experiments on NASNet architectures to find out which one is suitable for each specific task in our CFOR system. In our fashion retrieval experiments, the category classifier task and region classifier task are applied transfer with single-task learning, while fashion attribute recognition is applied local multitask learning. Besides, to adapt to large-scale datasets and reduce the effect of overfitting, we recommend changing the final fully connected layer to the global average pooling layer along with dropout.

Next

In the next post, we will discover Searching and Indexing Method in the CFOR System as well as the effectiveness of the whole system.

Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 6)

In this section, we will mention about ontology, fashion ontology, and its related information and present the contributions of object ontology to the CFOR system.

1. Ontology Definition for CFOR System

As described by Guarino, ontology is a "formal, explicit specification of a shared conceptualization." Typically, ontologies consist of concepts and their hierarchical structure, aiding in organizing information within a domain. A complete ontology typically includes concepts, relations, and axioms. Additionally, ontologies offer several key advantages:

  • Describing domain knowledge through a semantic hierarchical tree, with concepts represented as nodes identified by words or phrases.
  • Bridging the semantic gap in various tasks, including those in computer vision and other disciplines.
  • Enhancing software engineering practices by improving flexibility, reliability, specification, and reusability.
  • Supporting multitask problem-solving capabilities.

Any proposed ontology should satisfy two fundamental criteria:

  • Wide recognition within the community.
  • Feasibility for formalization using mathematical expressions, enabling digitization.

In our approach, we employ ontological engineering to facilitate communication and information sharing across different levels of data abstraction involved in image fashion retrieval, detection, and information tagging.

The object ontology comprises two primary levels: coarse-grained and fine-grained.

  • At the coarse level, the object ontology includes regions, categories, or high-level conceptual entities, which leverage global features extracted by deep networks for similarity retrieval. However, these deep features are treated as black boxes, lacking explicit semantic information to aid users in their search process.
  • At the fine-grained level, the object ontology encompasses attributes that provide detailed descriptions of objects.

In our experiment, we focus on describing the object "Fashion." The fashion ontology is constructed using prior knowledge and information from the DeepFashion dataset, along with ontology definitions introduced by Guarino. See Figure.1 for an illustration of the fashion ontology.

Figure 1. Fashion ontology in general and a version of ontology for clothes.

The fashion ontology developed comprises three primary semantic levels: 1. Regions: Representing areas such as Top, Bottom, and Body. 2. Categories: Specific objects associated with each region, such as dresses or robes for the Body region. 3. Attributes: Describing detailed visual concepts like denim or fur.

To streamline the discussion, our investigation focuses on the object fashion across three regions (Top, Body, and Bottom), select categories within these regions, and their respective attributes.

Within the CFOR system, a query image undergoes processing starting from the coarse level of the object ontology to identify the region and category of the corresponding object. Subsequently, the object proceeds to the fine-grained concept ontology to ascertain attributes. Once all necessary information is obtained, the object undergoes indexing and similarity distance computation to identify similar images in the database, ranked by a cumulative score derived from similarity scores of global features and attribute learning between the query image and target database images. For a detailed illustration, refer to Figure.2.

Figure 2. An example of a relationship between the query image and semantic information from the coarse-grained level to the fine-grained level of the fashion ontology.

2. Fashion Object Ontology

In this section, we introduce the fashion object ontology. Within the fashion domain, we categorize semantic fashion concepts based on regions. Each region encompasses a detailed ontology comprising categories and attributes. To facilitate experiments using the DeepFashion dataset, we extend the fashion ontology within the "Clothes" branch (refer to Figure 9). It's essential to emphasize that the proposed ontology is not specific to any application and should be viewed as a flexible foundation.

The fashion object ontology consists of multiple levels of concepts, with relations between each level to articulate their associations. Two primary relations are employed: 1. "Part of": This relation specifies that the concepts are components of the main concept. 2. "Has a": This relation describes the main concept in detail.

For this study, we concentrate solely on the Clothes branch to ensure equitable comparisons with other methodologies. The Clothes taxonomy comprises 50 distinct categories. A clothing region taxonomy has been established (refer to Figure.3), organizing all clothing categories hierarchically. The first level of this hierarchy represents the most general clothing region, with three primary regions defined: 1. Top (e.g., tee and tank) 2. Bottom (e.g., skirt and jeans) 3. Body (e.g., dress and robe)

Figure 3. Excerpt from the “Clothes” taxonomy defined in the fashion ontology.

3. Fine-Grained Object Ontology

Fine-grained object ontology is used to describe objects at the attribute level. Semantic information such as attributes can be useful for a customer to retrieve (see Figure.4). It is important to note that the proposed ontology is not application dependent and should be considered as an extensible basis.

Figure 4. Fine-grained group at the attribute level.

Cloth attributes vary across different levels—some attributes, like color, are common across all cloth regions, while others are specific to certain regions or categories. Our ontology is structured into two main parts, each detailed in the following sections: 1. Specific fashion concepts—pertaining to particular characteristics of clothes such as fabric, part, and style. 2. Visual concepts—related to popular visual characteristics like color, shape, and texture, not exclusive to fashion.

Rudd et al. demonstrated in a study that a multitask learning-based model outperforms a combination of single-task learning-based models in face attribute prediction. While this approach shows promising results for fashion attributes as well, there's a significant difference in the quantity of attributes between faces and fashion items. This disparity can pose challenges in scaling the system, such as in training and storage requirements. To address this, we propose applying local multitask learning to attribute learning, providing more flexibility. Further explanation is provided in the subsequent sections.

Next

In the next post, we will discover Attribute Learning and its correlation with multitask learning.

Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 5)

In the previous post, we have discussed about offline phase in details. In this post, we will discover the online phase in details.

1. States in Online Phase

The online phase of the CFOR system corresponding to the demonstration in Figure.1

Figure 1. Online stage of the CFOR system.

2. Technical Details

The used functions will be described as follows:

  • (i) detector(imgQuery): an object in an image is automatically detected by using a trained detector. In this function, we inherit the successful software YOLO (version 3.0) to identify fashion items. Besides, the items identified are also refined by the region identification model, which is trained by “classifyModel” function.
  • (ii) infor_extract(states, obj, onto, classifyModels, multitaskModel): for each query object, all attribute learning models trained in function "multitaskModel" and coarse classification models in function "classifyModel" are run. We extract the region ⟶ category ⟶ attributes and necessary features for each stage of the ontology.
  • (iii) query_expansion(infor, feat): query expansion based on the mean vector is used for reranking retrieval results.
  • (iv) compute_sim_score(database, infor, feat): for each pair of features, asymmetric distance is used to measure the dissimilarity distance between the query and the sample in the database
  • (v) ranking(scorelist, database, top__k, GPU_search = True): based on the score between the query and all samples in the database obtained from function "compute_sim_score," ranking is applied; smaller is better.
  • (vi)retrieval(indexes, score_list, database, GPU_search = True): the retrieval process contains 3 steps including feature retrieval, fine-grained retrieval, and query expansion. For global retrieval, global features of the query object obtained from function “inforextract” and the features of samples in the database are passed to function "ranking" to get 1st top-_m retrieval results. For fine-grained retrieval, attribute features of the query object obtained from function “inforextract” and the features of samples in 1st top-_m retrieval results are passed to function "ranking" to get 2nd top-k retrieval results. For query expansion, the mean vector is computed from 2nd top-k retrieval results, and each feature of 2nd top-k retrieval results is passed to function "ranking" to get final top-k retrieval results, i.e., query expansion-based reranking.

As described in Figure 1, the online phase of the CFOR system contains three stages which will be put into use in real time. They are given as follows.

3. Prediction Stage

This stage will take advantage of object ontology and classification models obtained from the offline phase and then makes predictions from coarse to fine for each query image:

Fine-grained information in terms of regions, categories, and attributes provides more options for a customer to give a full semantic query. The object will be predicted from coarse to fine. In turn, the region, category, and attribute will be predicted based on object ontology and a local MDNN. The object retrieval system uses extracted semantic information as the category and attribute to search in detail.

3. Dissimilarity Measuring Stage

This stage will take advantage of the database as well as the indexing file from the offline phase and a dissimilarity measure to get scores and then rank, rerank, and release retrieval results for each query image. This stage is based on the dissimilarity measure between attribute vectors of query images and database images:

Based on combination of K-nearest neighbour search in terms of L2 distance and asymmetric distance computation, we take advantage of parallel processing by GPU through the Faiss method to compute the distance from the query image to the necessary one in the database. The distance which is also called the score of each image in the database is then sorted to rank the dissimilarity. The smaller the score of the image, the more similar the query. Based on the number of retrieval images required or thresholds, we will have an appropriate cutoff in the score as well as the number of retrieval images. This kind of measurement is used to compute distance for both deep features vectors and attribute vectors.

4. Dissimilarity Measuring Stage

Query expansion is a technique that can help gather additional relevant information from the input to increase retrieval performance. The information can be relevant images, additional features, description, etc. based on the query expansion algorithms and data. In this stage, we would like to take advantage of the previous retrieval results and then expand the query by using the mean vector to rerank and get reranked retrieval results to improve retrieval performance.

Query expansion based on the mean vector is chosen among many methods, the mean vector computed from features of retrieval results and the features of input help reduce the bias between different considered features. Thus, the CFOR system can eliminate unrelated features; that is, retrieval features have high gap from the mean vector features, which helps reduce outliers and rise the precision score.

Query expansion based on computing mean vector is performed very fast, and it can take advantage of the Faiss similarity searching method as well. Query expansion can remove outliers, thanks to the statistic essence of the mean vector.

Next

In the next post, we will mention the Fashion Ontology, a CFOR System Testing in Fashion.

Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 4)

In the previous post, we have discussed about offline phase and online phase in overview. In this post, we will discover the offline phase in details.

1. States in Offline Phase

This phase consisted of three substages:

  • Object Ontology Establishment Stage. This stage defines fashion ontology to control the training flow as well as the online retrieval flow which serves as a bridge between high-level concepts (objects and categories), midlevel concepts (attributes), and raw data.
  • Learning Stage. This stage exploits deep networks with transfer learning in dealing with the specific tasks including object part learning, category learning, and attribute learning.
  • Storing and Indexing Stage. This stage defines a way of storing data as well as making the index list to reduce retrieval or searching time.

From the offline phase, in this section, inherited from previous state-of-the-art methods, we will mention about object part extraction, transfer learning, and its role in the retrieval system as well as data storing. These modules are highly generalized to any object. Other issues including ontology, attribute learning, network architecture, and indexing strategy will be detailed in the following sections.

2. Loss Function

This function inherited the current state-of-the-art ResNet for classification, and cross entropy loss function is applied for multiclass classification in the category classification model and attribute classification model.

For attribute multitask classification models, the loss function is described as follows:

3. Technical Details

Object ontology which is designed manually based on professional experience and public dataset for the community. It is organized into the hierarchical semantic tree with three main levels: region level, category level, and attribute level. Regions, categories, and attributes are learned automatically based on the local MDNN. The DeepFashion dataset, the used functions will be described as follows:

  • (i)extract_predicates(dta): in a rich-annotated dataset, e.g., DeepFashion, a sample image can be annotated by many labels in different fine-grained levels. For each fine-grained level, the function is used to extract the unique possible labels of samples and then store these labels into a corresponding array. For example, in the DeepFashion dataset, Top, Bottom, and Body are unique labels belonging to one fine-grained level, and thus, they are stored into one array. Similarly, fabric, shape, part, style, and texture labels belong to one fine-grained level and are stored into one array.
  • (ii)build_ontology(predicates, prior): this matches the extracted level and its labels from each predicate array into the corresponding stage of the general ontology, i.e., prior. For example, Top, Bottom, and Body belong to one level which is matched with the region stage of the ontology. After the matching is finished, all other unused stages are eliminated from the general ontology to generate the adapted ontology, e.g., fashion ontology.
  • (iii)extract_state(onto): from the built ontology, all stages and their labels are searched and stored into arrays which will be used to reconstruct the data. For example, the region stage array contains three classes, and the category stage array contains 50 classes.
  • (iv)extract_nes_dta(dta, state, onto): based on the stage and the classes extracted from the “extract_state” function, the whole DeepFashion dataset will be split. Only samples having the labels belonging to the stage are stored as the training set of that stage in the ontology. For example, with the region classification model, only samples labelled Top, Body, or Bottom are used for training.
  • (v)classifyModel(architecture, state_dta): in the DeepFashion dataset, based on ontology, there are four classification models: region classification model and category classification model for the Top region, Body region, and Bottom region. These models are retrained from the ImageNet dataset using ResNet-10.
  • (vi)multitaskModel(group_state_dta, architecture, Matthrew_coef = True): for each group state in terms of the fine-grained attribute level, a multitask classification model is built, e.g., fabric attribute group classification model and style attribute group classification model. These models are retrained from the ImageNet dataset using NASNet v3. Besides, the attribute learning and the usage of MCC are mentioned for an imbalanced data solver.
  • (vii)indexing(state_sta): indexing files are created that will be used for run-time retrieval. The method is based on the nonexhaustive compressed-domain search with GPU.
  • (viii)build_storage(onto, states): storage structures are automatically created based on built-in object ontology and extracted states.
  • (ix)infor_extract(states, dta, onto, classifyModels, multitaskModel): for each sample in the database, all attribute learning models trained in “multitaskModel” function are run and then all possible attributes which are higher than thresholds are extracted.
  • (x)feat_extract(dta, onto, classificationModels, multitaskModel): for each sample in the database, the features of the pre-softmax layer in four models trained in “classifyModel” function are obtained.
  • (xi)structure(storage, feat_dta, info_dta, indexFiles): the database is automatically built based on extracted features, extracted information, index, and storage structure.

3.1. Object Part Extraction

For the aforementioned reasons, foreground objects should be extracted from background regions efficiently and accurately before entering the retrieval step. The target of object extraction is to filter the necessary specific subjects. This also improves the efficiency of the following modules as well as increases the overall system performance. There are many successful object detection methods. Among them, YOLO shows the state-of-the-art results. In our system, we inherited the successful software YOLO (version 3.0) to identify fashion items.

3.2. Transfer Learning

Transfer learning is one of the best methods to reduce training time, especially with complicated architectures such as ResNet or NASNet. The key issue is the initial parameters. In the first step of the training process, we have to generate these parameters with some unsupervised learning methods. However, the initial one will be far from the optimal one. In transfer learning, we will reuse the trained parameters on a large and diverse dataset (such as ImageNet dataset). By this way, our training process will be easier to meet convergence. Thus, it reduces the training time.

Transfer learning can be applied in different ways based on the size of the dataset and data similarity. There are four scenarios in total. First, if the data size is small while data similarity is high, we use the pretrained model as a feature extractor. Second, if the data size is small and data similarity is low, we freeze the top layers and train the remaining layers of the pretrained model. Third (ideal situation), if the data size is large and data similarity is high, we can retrain the model by using the weights initialized in the pretrained one. Fourth (worst situation), if the data size is large and data similarity is low, transfer learning cannot be applied, and we have to train our model from scratch. In our fashion example experiments, while DeepFashion is a large dataset and ImageNet (dataset used for transfer learning) is a high diversity one, we can use all of the initialized weights from the pretrained model.

According to our approach, transfer learning will be applied in region, category classification as well as attribute learning along with ResNet and NASNet architectures, respectively. It can also be used in global deep feature extraction to improve the overall retrieval performance.

3.3. Data Storing

Features extracted from the category classification task and attribute learning will be stored in a hierarchical semantic tree based on object ontology. All features belong to a leaf of object ontology and will be stored in one file. In case of the expansion of large-scale data, the mentioned files can be indexed and split with a corresponding mapping key for each image. The folders will be organized based on object ontology in which each name corresponds to each concept. To clarify, data storing for the proposed ontology is defined as follows (see Figure below for an example of data storing):

  • (i) All files are stored in a folder named “database,” which is denoted as the “Object” node.
  • (ii) Based on ontology, “Object” node contains 3 nodes at the “Region” semantic level. Thus, we have 3 smaller folders: “Top,” “Body,” and “Bottom.”
  • (iii) At the next stage of ontology, we have the “Category” semantic level. Thus, we have 50 folders representing all nodes of “Category.”
  • (iv) Finally, we have the “Attribute” semantic level standing for the leaf node state in ontology. At this state, all features belong to the same “Region” and “Category” and are stored in one file.

Figure 1. An example of the storing structure.

Next

In the next post, we will mention the online phase in details.

Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 3)

In the previous post, we have discussed about the imbalance data problem annd the overview of the object retrieval system. The CFOR system is organized into two main phases: offline phase and online phase. In this post, we will discover the overall of the online phase and offline phase of the proposal.

1. Overview of Offline Phase

This phase is designed to generate object ontology, database, indexing file and region detection model, category classification model, and attribute classification model.

  • Object ontology is designed manually based on professional experience and public dataset for the community. It is organized into a hierarchical semantic tree with three main levels: region level, category level, and attribute level.
  • The database is generated to store the preextracted features, regions, categories, and attributes of all images in the dataset. It supports to reduce the online retrieval time and provides the necessary semantic information for each retrieved image.
  • The indexing file which is created to support fast mapping in the online phase of the CFOR retrieval system is the key to perform the retrieval task at runtime.
  • Regions, categories, and attributes are learned automatically based on the local MDNN. Detection models and classification models are created to extract or predict semantic information of the query image and dataset such as regions, categories, and attributes.

2. Overview of Online Phase

This phase of the CFOR system is designed to run the retrieval process including object detection, semantic information extraction, and query expansion and retrieval.

In the object detection stage, we use the trained object detector to detect objects in the query image. In the semantic information extraction stage, the built-in object ontology and classification models are used for extracting the necessary semantic information of each identified object. The extracted semantic information and deep global features of each detected object passed through the searching system along with the indexing file to quickly compute the score between the query object and the sample in the database. Retrieval is applied to rank and export the most similar images to the query object and their relevant information. Query expansion is optional and used to increase the retrieval performance with a trade-off for retrieval time.

The power of mutually supporting object ontology, local MDNN, and imbalanced data solver in the CFOR system: Figure.1 shows the operation of the CFOR system with the interaction of the three main modules object ontology, a local MDNN, and an imbalanced data solver to optimize the learning strategy and improve the overall retrieval performance on large-scale datasets.

Figure 1. Synthesis of object ontology, deep learning, and imbalanced data problem solver in the CFOR system.

Object ontology supports conducting the training flow (with a local MDNN) and retrieval flow (from the coarse-grained level to the fine-grained level) to save computational costs in the training stage and retrieval stage on large-scale datasets. Training flow also paves a way for applying transfer learning which may improve the convergence rate of deep networks. Object ontology which could transform the global imbalance of data into local imbalance of data based on fine-grained groups makes the imbalanced data problem easier to deal with.

Deep multitask NN supports to link the object ontology to the raw data effectively at the category level and attribute level by exploiting inner-group correlations and intergroup correlations. The object ontology supports to update the system at the local level with parallel processing based on the local MDNN. Therefore, CFOR is updated in a flexible manner on large-scale datasets with many variations.And the proposed imbalanced data solver based on MCC which addresses data imbalance has contributed effectively to increasing the quality of object ontology implementation without adjusting network architecture and data augmentation.

Algorithm and demonstration of the CFOR system: an online phase and offline phase (Figures 2 and 3) are used to analyze tasks in the CFOR system. These phases will be demonstrated in detail in this section.

Figure 2. Offline stage of the CFOR system.

Figure 3. Online stage of the CFOR system.

Besides, the CFOR system can be put into use as a general solution for retrieval. To evaluate the performance of the proposed system, fashion objects with attributes are selected in experiments.

Next

In the next post, we will mention the offline phase in details.

Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 2)

In the previous post, we have discussed about multitask learning and object retrieval system. In this post, we will discover imbalance data problem and the main proposal.

1. Imbalanced Data Problem

Imbalanced data are the problem in machine learning in which the class distribution is not uniform between the classes. Usually, they are composed of two types of classes: the majority classes (positive) and the minority classes (negative). Recent research in machine learning shows that using an uneven distribution of class examples during learning can cause learning algorithms with misleading performance (bias). It means a classifier with high accuracy in the majority, but it gives poor accuracy in the minority class. In the case of attribute learning, an imbalance occurs if the number of instances in some attributes varies significantly in quantity compared to other attributes. To deal with this situation, in general, adjusting the distribution of classes is an essence of many popular methods to handle imbalanced data problems.

  • Data sampling: sampling-based methods such as upsampling, downsampling, or data augmentation are considered to be a solution for imbalanced data problems. In addition to making data more balanced, they can help reduce training time (downsampling) or make the learning process more efficient (upsampling). The best approach we know is SMOTE which can solve the situation by automatically generating additional data (upsampling) based on the original dataset. However, these methods increase overfitting when training (upsampling) or losing (downsampling) data. Data augmentation is proved to be robust in dealing with imbalanced training data. However, this method takes up a lot of training resources, and it is difficult to find a proper augmented dataset which is large enough to train. And it is very difficult (or impossible) to augment data to balance the attributes in datasets because each object usually has many attributes.

    Figure 1. A simple data sampling

  • Architecture, loss function, and metric configuration: other methods exploit network architectures, loss functions, or metrics to address the imbalanced data problem when training. The methods (at the algorithm level) enhance the existing classifier by adjusting algorithms to recognize the smaller classes. Internal techniques provide general solutions for the imbalanced data problem because these are not specific to particular problems. These approaches show better performance compared to data sampling; however, they are often difficult to implement as well as configure in the future. Therefore, they are not always the best choice in dynamic retrieval systems in which the attributes have a large variety.Threshold and output-based configuration: instead of generating more data or making changes in the model, these methods find the best thresholds based on output. The essence of these methods is to use scores that show the probability to indicate which test sample is a member of a class in producing several learners by changing the threshold for class members. These methods are particularly effective in resolving imbalanced data problems without changing the configuration in the model. Moreover, they also do not reduce data or increase overfitting. SVM is proposed to find these thresholds. However, Boughorbel et al. proposed Matthews’ correlation coefficient (MCC) to deal with imbalanced data in classification. Although SVM shows better performance, MCC consumes less resources and processing time compared to it. Based on the methods of many other researchers, we found a solution for multitask learning that is suitable to retrieval systems using the end-to-end DCNN for training and MCC for estimating thresholds to get final outputs.

Figure 2. Metric losses

2. Materials and Methods: CFOR System

The CFOR system is very complicated but easy to understand. We focus on the main points of the CFOR system.

CFOR is an object retrieval system integrated by object ontology, a local MDNN (NASNet and ResNet), and an imbalanced data solver (MCC) to improve the performance of the large-scale object retrieval system from the coarse-grained level (categories) to the fine-grained level (attributes) (see Figure below).

Figure 2. Synthesis of object ontology, deep learning, and imbalanced data problem solver in the CFOR system.

Query Image. For traditional content-based image retrieval systems, with query images, one is just able to retrieve the images ranked on visual similarity to query image. It is very difficult (or impossible) for users to provide semantic information to the system based on query images. But the interesting thing is that, in our CFOR system, this challenge has been solved. The semantic information of the query image is extracted automatically by the category and attribute classification system, and users can use the extracted semantic information during the retrieval process.An example is how users can query “Asian face” with only a query image; here, “Asian race” is semantic information. The traditional retrieval methods cannot meet this requirement because of the curse of semantic gap. And the CFOR system can recognize “Asian race” and use it to retrieve. Another example for “Fashion” object based on our CFOR system is described in Figure 3.

Figure 3. Extracting regions, categories, and attributes from a query image with trained models of the CFOR system. After that, users can use this semantic information to reduce the searching space.

From the query image, based on fashion ontology, the detector quickly identifies the region (Top and Bottom; see Figure 4).

Figure 4. Fashion ontology used to retrieve.

After that, the user selects the region (Top; see Figure 5); the CFOR system quickly identifies the category related to the Top region (category: Blazer). Later, specific concepts and visual concepts are extracted according to Blazer, and users can select some of them (or all of them) to retrieve. For user-friendly interaction, only extracted regions, categories, and attributes are shown. Other information such as global deep features, attribute vector, ontology, or group of attributes which are used as searching input of the system will not be displayed. In such a way, users can order the CFOR system at the semantic level, and they can achieve the results that match both the content and semantics of the query image.

Figure 5.Retrieval results.

Next

In the next post, we will mention the online phase and offline phase of the proposed retrieval system.

Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 1)

Object retrieval plays an increasingly important role in video surveillance, digital marketing, e-commerce, etc. It is facing challenges such as large-scale datasets, imbalanced data, viewpoint, cluster background, and fine-grained details (attributes). This paper has proposed a model to integrate object ontology, a local multitask deep neural network (local MDNN), and an imbalanced data solver to take advantages and overcome the shortcomings of deep learning network models to improve the performance of the large-scale object retrieval system from the coarse-grained level (categories) to the fine-grained level (attributes). Our proposed coarse-to-fine object retrieval (CFOR) system can be robust and resistant to the challenges listed above. To the best of our knowledge, the new main point of our CFOR system is the power of mutual support of object ontology, a local MDNN, and an imbalanced data solver in a unified system. Object ontology supports the exploitation of the inner-group correlations to improve the system performance in category classification, attribute classification, and conducting training flow and retrieval flow to save computational costs in the training stage and retrieval stage on large-scale datasets, respectively. A local MDNN supports linking object ontology to the raw data, and an imbalanced data solver based on Matthews’ correlation coefficient (MCC) addresses that the imbalance of data has contributed effectively to increasing the quality of object ontology realization without adjusting network architecture and data augmentation. In order to evaluate the performance of the CFOR system, we experimented on the DeepFashion dataset. This paper has shown that our local MDNN framework based on the pretrained NASNet architecture has achieved better performance (14.2% higher in recall rate) compared to single-task learning (STL) in the attribute learning task; it has also shown that our model with an imbalanced data solver has achieved better performance (5.14% higher in recall rate for fewer data attributes) compared to models that do not take this into account. Moreover, MAP@30 hovers 0.815 in retrieval on an average of 35 imbalanced fashion attributes.

1. Introduction

Nowadays, object retrieval is facing some challenges and has some advantages.

Query format plays a very important role in large-scale object retrieval systems. Thus, the query format should be user-friendly and satisfy user requirements in practice.

Two query formats are popular these days: image-based format and text-based format. The text-based query format is being used widely in many searching systems. However, in many cases, it is very difficult to use query text to express the content that human would like to retrieve because words have some limitations in expressing visual information. Instead, a query image is worth more than thousand words; it allows customers to search objects without typing, and the most important thing is that it can retrieve the results based on content. Nevertheless, the limitations of the query image in expressing semantic information could decrease the overall retrieval performance. Thus, the query image and retrieval image with useful related information (regions, categories, fine-grained attributes, etc.) will be the interesting points that we have to focus on to improve the performance of the coarse-to-fine object retrieval system.

Object retrieval systems should meet the requirements of retrieving from large-scale datasets not only at the coarse level but also at the detailed level (or attribute level). For example, in face retrieval systems, facial attribute retrieval is often required. In fashion retrieval systems, fashion attribute retrieval is an indispensable requirement. In person reidentification systems, in the reidentification stage, besides using the global features of the whole human body, attribute vectors of the face and clothes are also being exploited effectively. In crowd attribute recognition systems, the useful attribute set consisted of location, participants, and activities.

Objects often have multiple attributes, and there are methods to retrieve objects at the attribute level from large-scale datasets without manual annotation. In attribute recognition, the traditional methods often waste a lot of time in selecting hand-crafted features for each attribute group during the trial-and-error process but do not always achieve the desired results. In recent years, the deep convolutional neural network (DCNN) has demonstrated high performance in many tasks in computer vision such as detection, classification, recognition, and retrieval. And without exception, the DCNN is also used for attribute learning, with only one network architecture, and the DCNN model can learn to recognize many attributes.

The performance of the DCNN-based attribute learning model will not achieve high rate if the set of attributes plays the same role in the network architecture at the output level and imbalanced data are unresolved. To exploit the inner-group correlations in coarse-grained groups or fine-grained groups, the DCNN often is revised to the deep multitask NN. The performance of classification will be improved if the elements of fine-grained category groups or fine-grained attribute groups could share similar learning features, so the slope of their error surface will become more uniform and the deep multitask learning algorithm can easily reach the global optimum effectively.

Object ontology plays an important role in category classification, attribute classification, and conducting training flow and retrieval flow to save computational costs in the training stage and retrieval stage on large-scale datasets, respectively. Thus, based on our experience in researching objects related to attributes such as face, cloth, person (reidentification), crowd (monitoring), and fast filters in large-scale object retrieval, we would like to introduce an object ontology as a hierarchical semantic tree with three levels: region, category, and attribute levels. The attribute level consisted of visual concepts and specific concepts. Visual concepts support linking common visual attributes to arbitrary objects.

We introduce an object ontology based on popular large-scale standard datasets in science community, so we hope that our ontology can meet the criterion “widely recognized in community.” And for criterion “realization,” we have proposed the local MDNN to support linking object ontology to the raw data. However, if object ontology could not be linked with high quality, it could not function effectively. And we have proposed the imbalanced data solver based on MCC to address data imbalance that has contributed effectively to increasing the quality of linking object ontology to raw data without adjusting network architecture and data augmentation.

We review some typical works based on object ontology, deep multitask neural networks, and imbalanced data solvers to highlight our contributions.

Most of the works only present the set of attributes in the form of item lists or item groups. A few works used the terminology "ontology", but to the best of our knowledge, there are not works that present the object ontology in full meaning of regions, categories, and attributes.

In [8], FashionNet handles the challenges as deformation and occlusions by explicitly predicting clothing landmarks and pooling features over the estimated landmarks, resulting in more discriminative cloth representation. The authors do not use the terminology "ontology", but the DeepFashion dataset is organized based on a hierarchical tree; it is only deployed according to fashion, and it includes a two-level tree: the first level consisted of 50 categories and the second level consisted of 5 attribute groups (texture, fabric, shape, part, and style) (it does not have color attribute). The coarse-grained groups (at the category level) or fine-grained groups (at the attribute level) have the same role in deep neural networks, and the imbalanced data solver has not been considered yet.

Our idea is to improve the performance of deep neural networks based on object ontology and imbalanced data solvers with inspiration from Gödel’s incompleteness theory. This theory shows the limitation of any consistent formal system as well as the limitation of specific methods in solving problems. When the deep network configuration method is not able to create such a large effect as in the early days it took place, it is necessary to integrate object ontology and imbalanced data solvers into deep learning. Based on appropriate interventions in inputs and outputs, we introduce a new method that can help improve the performance of the object retrieval system.

The main contributions of this paper are as follows.

  • Our proposed unified model consisted of object ontology, a local MDNN, and an imbalanced data solver to improve the performance of the large-scale object retrieval system from the coarse-grained level (categories) to the fine-grained level (attributes).

  • Our proposed object ontology is a hierarchical semantic tree consisting of three main levels: region, category, and attribute levels. It can support the optimal learning strategy and minimize the effect of semantic gap. It is useful to improve the performance of category classification, attribute classification, and conducting training flow and retrieval flow to save computational costs in the training stage and retrieval stage on large-scale datasets, respectively.

  • Our proposed local MDNN is inspired by multitask neural networks. It is based on NASNet, ResNet exploiting the local multitask neural network architecture, to improve the performance of category classification and attribute classification and for flexible system updates. The local MDNN supports linking object ontology to raw data and takes advantage of inner-group correlations of categories and attributes. If the inner-group correlations (or intergroup correlations) are exploited, the performance of classification will be improved because the elements of fine-grained categories or the fine-grained attribute group share similar learning features, the slope of their error surface becomes more uniform, and our deep local multitask learning algorithm can easily reach the global optimum effectively.

Data imbalances often occur for large-scale datasets. Data augmentation is almost impossible because each object can have multiple attributes. The solution based on the loss functions, as in [6], may be possible, but it cannot exploit transfer learning. Our proposed imbalanced data solver is inherited from MCC without adjusting network architecture and data augmentation. It is integrated into the local MDNN to improve the performance of category classification and attribute classification, but it can still exploit transfer learning to reduce computational costs in the training stage on large-scale datasets.

Our proposed query format is based on object ontology with semantic information such as regions, categories, and attributes extracted automatically from the query image. Therefore, we can express semantic information from the image to the retrieval process that the traditional methods have not implemented yet.

Figure 1. A usecase of object retrieval system.

2. Object Retrieval System

Fine-grained object retrieval is supposed to search for similar images that include specific object attributes. It declares a transition model from image retrieval to object attribute retrieval. Specifically, unlike traditional image retrieval systems where queries and results are often coarse (e.g., texts or images), fine-grained image retrieval aims to retrieve semantic information such as categories and attributes. In the fashion field, taking advantages of semantic information, an object retrieval method based on the combination of the global feature with fine-grained attribute information was introduced [8]. Inspired by previous works, we would like to propose a coarse-to-fine object retrieval system which not only takes advantage of the combination of the global feature with fine-grained attribute information but also optimizes the learning strategy based on ontology and resolves the imbalanced data problem by interfering with the output.

In addition to meeting the semantic retrieval results, the object retrieval system must handle large-scale problems to run in real time. However, most solutions did not take advantage of the power of GPUs for parallel processing which can significantly reduce feature-matching time and retrieval time. To leverage the support of GPUs, we inherited the search algorithm introduced by Johnson et al. (billion-scale similarity search with GPUs) which is a nonexhaustive similarity search. The search method perfectly suited the proposed CFOR system which further decreased searching time by creating multi-index files based on built-in object ontology.

Figure 2. Original object retrieval system.

3. Attribute Learning

Attribute learning is a backbone of CFOR, and it has strong effects on performance of fine-grained object retrieval. Therefore, attribute learning is considered one of the important parts of the learning strategy.

Attribute Learning.

This method is used for object recognition systems at the fine-grained level. Unlike learning methods that are used for the high-level concept, attribute learning supports a solution for midlevel semantic concepts or visual concepts which are known to have (more or less) correlations to each other. There are two main different learning methods: single-task learning and multitask learning.Single-task attribute learning: in this type, attributes have their own learning model. Therefore, it leads to the number of models equal to the number of attributes. Moreover, each attribute is treated separately, for which the inner-group correlations are not yet exploited. Many works are known in the fashion field by using single-task learning for fashion attributes. At that time, there were many challenges in multitask learning. A shared CNN is defined to pave a way in the final format of the multitask multilabel predictions. Therefore, multitask learning becomes possible.Multitask attribute learning: to apply this technique to attributes, samples will be collected by merging given datasets into one with one-hot binary vector demonstration. Like single-task learning, the input will be the image. Despite the output of single-task learning which is a value that describes the existence (or not) of an attribute in an image, the output of multitask learning will be a one-hot binary vector which describes the existence (or not) of a group of attributes. Rudd has shown that joint optimization over all attributes outperforms training a single independent network with the same architecture for each attribute, in which the feature space is optimized along with the classifier on a per-attribute basis, both in terms of accuracy and storage, processing efficiency. This result shows that the multitask approach is much more effective in exploiting latent correlations than independent classifiers used to learn them. Although multitask learning can yield better performance compared to single-task learning, its critical weakness is that the model cannot be reused when there is any attribute change. A retraining or additional model will be applied when a new attribute is added. Lack of reuse is the reason that multitask learning methods are not flexible for attributes that change frequently. To address these challenges, we propose that local multitask attribute learning be considered a grouping method based on object ontology to improve its reuse.

Figure 3. Attribute learning model based on deep features with SVM classifiers.
Figure 4. Attribute learning model based on adaptive attribute domain with independent deep convolutional neural networks.
Figure 5. Attribute learning model based on the end-to-end deep neural network as a shared block with adaptive loss function.

Next

In the next post, we will mention Imbalanced Data Problem and our proposal in details.