

Semi-Heterogeneous Three-Way Joint Embedding Network for Sketch-Based Image Retrieval

Jianjun Lei, Senior Member, IEEE, Yuxin Song, Bo Peng, Zhanyu Ma, Senior Member, IEEE, Ling Shao, Senior Member, IEEE, and Yi-Zhe Song

Abstract—Sketch-based image retrieval (SBIR) is a challenging task due to the large cross-domain gap between sketches and natural images. How to align abstract sketches and natural images into a common high-level semantic space remains a key problem in SBIR. In this paper, we propose a novel semi-heterogeneous three-way joint embedding network (Semi3-Net), which integrates three branches (a sketch branch, a natural image branch, and an edgemap branch) to learn more discriminative cross-domain feature representations for the SBIR task. The key insight lies with how we cultivate the mutual and subtle relationships amongst the sketches, natural images, and edgemaps. A semi-heterogeneous feature mapping is designed to extract bottom features from each domain, where the sketch and edgemap branches are shared while the natural image branch is heterogeneous to the other branches. In addition, a joint semantic embedding is introduced to embed the features from different domains into a common high-level semantic space, where all of the three branches are shared. To further capture informative features common to both natural images and the corresponding edgemaps, a co-attention model is introduced to conduct common channel-wise feature recalibration between different domains. A hybrid-loss mechanism is designed to align the three branches, where an alignment loss and a sketch-edgemap contrastive loss are presented to encourage the network to learn invariant cross-domain representations. Experimental results on two widely used category-level datasets (Sketchy and TU-Berlin Extension) demonstrate that the proposed method outperforms state-of-the-art methods.

Index Terms—SBIR, cross-domain learning, co-attention model, hybrid-loss mechanism.

This work was supported in part by the Natural Science Foundation of Tianjin (No. 18ZXZNGX00110, 18JCJQJC45800) and the National Natural Science Foundation of China (No. 61931014, 61922015, 61722112). (Corresponding author: Bo Peng.) J. Lei, Y. Song, and B. Peng are with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China (e-mail: [email protected]; [email protected]; [email protected]). Z. Ma is with the Pattern Recognition and Intelligent System Laboratory, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: [email protected]). L. Shao is with the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates (e-mail: [email protected]). Y.-Z. Song is with the SketchX Lab, Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, Surrey GU2 7XH, U.K. (e-mail: [email protected]).

I. INTRODUCTION

Since the number of digital images on the Internet has increased dramatically in recent years, content-based image retrieval technology has become a hot topic in the computer vision community [1]-[4]. With the popularization of touch screen devices, sketch-based image retrieval (SBIR) has attracted extensive attention and achieved remarkable performance [5]-[8]. Given a hand-drawn sketch, the SBIR task aims to retrieve the natural target images from the image database. However, compared with the natural target images, which are full of color and texture information, sketches only contain simple black and white pixels [9], [10]. Therefore, hand-drawn sketches and natural images belong to two heterogeneous data domains, and aligning these two domains into a common feature space remains the most challenging problem in SBIR.

Traditional SBIR methods describe sketches and natural images using hand-crafted features [11]-[19]. Edgemaps of natural images are usually first extracted as sketch approximations. Then, hand-designed features, such as HOG [20], SIFT [21], and Shape Context [22], are exploited to describe both the sketches and edgemaps. Finally, the K-Nearest Neighbor (KNN) ranking process is utilized to evaluate the similarity between the sketches and natural images to obtain the final retrieval results. However, as mentioned above, hand-drawn sketches and natural images belong to two heterogeneous data domains. It is difficult to design a common type of feature applicable to two different data domains. Besides, sketches are usually drawn by non-professionals, making them full of intra-class variations [23]-[25]. Most hand-crafted features have difficulties in dealing with these intra-class variations and ambiguities of hand-drawn sketches, which also negatively impacts the performance of SBIR.

In recent years, convolutional neural networks (CNNs) have been widely used across fields [26]-[29], such as person re-identification, object detection, and video recommendation. In contrast to traditional hand-crafted methods, CNNs can automatically aggregate shallow features learned from the bottom convolutional layers. Inspired by the learning ability of CNNs, several Siamese networks and Triplet networks have been proposed for the SBIR task [30]-[33]. Most of these methods encode a sketch-image or a sketch-edgemap pair, and learn the similarity between the input pair using a contrastive loss or triplet loss. However, there are still several difficulties and challenges to be solved in these methods. 1) The different characteristics of sketches and natural images make the SBIR task challenging. Generally, a sketch only contains the object to be retrieved, meaning sketches tend to have relatively clean backgrounds. In addition, since sketches are usually drawn by non-professionals, the shapes of objects in sketches are usually deformed and relatively abstract. For natural images, although the objects are not usually significantly deformed, natural images taken by cameras usually have complex backgrounds. Therefore, creating a network that can learn more discriminative features for both sketches and natural images remains a challenge. 2) Since sketches and natural images are from two different data domains, there exists a significant domain gap between the features of these two.
Most deep SBIR methods that adopt a contrastive loss or a triplet loss to learn the cross-domain similarity are not effective enough to cope with the intrinsic domain gap. Therefore, finding a way to eliminate or reduce the cross-domain gap and embed features from different domains into a common high-level semantic space is critical for SBIR. 3) More importantly, most existing methods achieve SBIR by exploring the matching relationship between either sketch-edgemap pairs or sketch-image pairs. However, the methods using sketch-edgemap pairs ignore the discriminative features contained in natural images, while the methods using sketch-image pairs ignore the auxiliary role of edgemaps. Enabling full use of the joint relationships among sketches, natural images, and edgemaps provides a novel way to solve the cross-domain learning problem.

To address the above issues, a novel semi-heterogeneous three-way joint embedding network (Semi3-Net) is proposed in this paper to mitigate the domain gap and align sketches, natural images, and edgemaps into a high-level semantic space. The key insight behind our design is how we enforce mutual cooperation amongst the three branches. We importantly recognize that, when measured in terms of visual abstraction, sketches and edgemaps are more closely linked than sketches and natural images. Sketches are highly abstract and iconic representations of natural images, and edgemaps are reduced versions of natural images, where detailed appearance information such as texture and color is removed. However, compared with edgemaps, natural images contain more discriminative features for SBIR. Motivated by this insight, we purposefully design a semi-heterogeneous joint embedding network, where a semi-heterogeneous weight-sharing setting among the three branches is adopted in the feature mapping part, while a three-branch all-sharing setting is conducted in the joint semantic embedding part. This design essentially promotes edgemaps to act as a "bridge" to help narrow the domain gap between the natural images and sketches. Fig. 1 offers a visualization of the proposed Semi3-Net architecture.

More specifically, the semi-heterogeneous feature mapping part is designed to extract the bottom features for each domain, where a co-attention model is introduced to learn informative features common between different domains. Meanwhile, the joint semantic embedding part is proposed to embed the features from different domains into a common high-level semantic space. In addition, a hybrid-loss mechanism is proposed to achieve a more discriminative embedding, where an alignment loss and a sketch-edgemap contrastive loss are introduced to encourage the network to learn invariant cross-domain representations. The main contributions of this paper are summarized as follows.

1) A novel semi-heterogeneous three-way joint embedding network is proposed, in which the semi-heterogeneous feature mapping and the joint semantic embedding are designed to learn joint feature representations for sketches, natural images, and edgemaps.

2) To capture informative features common between natural images and the corresponding edgemaps, a co-attention model is developed between the natural image and edgemap branches.

3) A hybrid-loss mechanism is designed to mitigate the domain gap, where an alignment loss and a sketch-edgemap contrastive loss are presented to encourage the network to learn invariant cross-domain representations.

4) Experiments on two widely-used datasets, Sketchy and TU-Berlin Extension, demonstrate that the proposed method outperforms state-of-the-art methods.

The rest of the paper is organized as follows. Section II reviews the related works. Section III introduces the proposed method in detail. The experimental results and analysis are presented in Section IV. Finally, the conclusion is drawn in Section V.
II. RELATED WORK

A. Traditional SBIR Methods

Traditional SBIR methods usually utilize edge extraction methods to extract edgemaps from natural images first. Then, hand-crafted features are used as descriptors for both sketches and edgemaps. Finally, a KNN ranking process within a Bag-of-Words (BoW) framework is usually utilized to rank the candidate natural images for each sketch. For instance, Hu et al. [11] incorporated the Gradient Field HOG (GF-HOG) into a BoW scheme for SBIR, and obtained promising performance. Saavedra et al. [12] introduced the Soft-Histogram of Edge Local Orientations (SHELO) as the descriptor for sketches and edgemaps extracted from natural images, which effectively improves the retrieval accuracy. In [13], a novel method for describing hand-drawn sketches was proposed by detecting learned keyshapes (LKS). Xu et al. [14] proposed an academic coupled dictionary learning method to address the cross-domain learning problem in SBIR. Qian et al. [15] introduced re-ranking and relevance feedback schemes to find more similar natural images based on initial retrieval results, thus improving the retrieval performance.

B. Deep SBIR Methods

Recently, many frameworks based on CNNs have been proposed to address the challenges in SBIR [33]-[39]. Aiming to learn the cross-domain similarity between the sketch and natural image domains, several Siamese networks have been proposed to improve the retrieval performance. Qi et al. [33] introduced a novel Siamese CNN architecture for SBIR, which learns the features of sketches and edgemaps by jointly tuning two CNNs. Liu et al. [34] proposed a Siamese-AlexNet based on two AlexNet [40] branches to learn the cross-domain similarity and mitigate the domain gap. Wang et al. [36] proposed a Siamese network to learn the similarity between input sketches and edgemaps of 3D models, which is originally designed for sketch-based 3D shape retrieval. Meanwhile, several Triplet architectures have also been proposed, which include a sketch branch, a positive natural image branch, and a negative natural image branch. In these methods, a ranking loss function is utilized to constrain the feature distance between a sketch and a positive natural image to be smaller than the one between the sketch and a negative natural image. Sangkloy et al. [38] learned a cross-domain mapping through a pre-training strategy to embed natural images and sketches in the same semantic space, and achieved superior retrieval performance. Recently, deep hashing methods [41], [42] have been exploited for the retrieval task and have achieved significant improvements in retrieval performance. Liu et al. [34] integrated a deep architecture into the hashing framework to capture the cross-domain similarities and speed up the SBIR process. Zhang et al. [39] proposed a Generative Domain-migration Hashing (GDH) approach, which uses a generative model to migrate sketches to their indistinguishable natural image counterparts and achieves the best-performing results on two SBIR datasets.
C. Attention Models

Attention models have recently been successfully applied to various deep learning tasks, such as natural language processing (NLP) [43], fine-grained image recognition [44], [45], video moment retrieval [46], and visual question answering (VQA) [47]. In the field of image and video processing, the two most commonly used attention models are soft-attention models [48] and hard-attention models [49]. Soft-attention models assign different weights to different regions or channels in an image or a video by learning an attention mask. In contrast, hard-attention models only indicate one region at a time, using reinforcement learning. For instance, Hu et al. [50] proposed a channel attention model to recalibrate the weights of different channels, which effectively enhances the discriminative power of features and achieves promising classification performance. Gao et al. [51] proposed a novel aLSTMs framework for video captioning, which integrates the attention mechanism and LSTM to capture salient structures for video. In [52], a hierarchical LSTM with adaptive attention approach was proposed for image and video captioning, and achieved state-of-the-art performances on both tasks. Besides, Li et al. [49] proposed a harmonious attention network for person re-identification, where soft attention is used to learn important pixels for fine-grained information matching, and hard attention is applied to search latent discriminative regions. Song et al. [35] proposed a soft attention spatial method for fine-grained SBIR to capture more discriminative fine-grained features. This model reweights the different spatial regions of the feature map by learning a weight mask for each branch of the triplet network. However, although the attention mechanisms above have strong abilities for feature learning, they generally learn discriminative features using only the input itself. For the SBIR task, we are more concerned with learning discriminative cross-domain features for retrieval. In other words, the common features of different domains should be considered simultaneously. To address this, a co-attention model is exploited in this paper for the SBIR task, to focus on capturing informative features common between natural images and the corresponding edgemaps, and further mitigate the cross-domain gap.
[Fig. 1. Illustration of the proposed Semi3-Net. (a) Three-way inputs. (b) Semi-heterogeneous feature mapping. (c) Joint semantic embedding. (d) Hybrid-loss mechanism. Blocks with the same color represent that their weights are shared.]

III. SEMI-HETEROGENEOUS THREE-WAY JOINT EMBEDDING NETWORK

A. Semi-Heterogeneous Feature Mapping

As shown in Fig. 1 (b), the semi-heterogeneous feature mapping part consists of the natural image branch FMB_I, the edgemap branch FMB_E, and the sketch branch FMB_S. Each branch includes a series of convolutional and pooling layers, which aim to learn the bottom features for each domain. As mentioned above, sketches and edgemaps have similar characteristics, and both lack detailed appearance information such as texture and color; thus, the weights of FMB_S and FMB_E are shared. Besides, since the amount of sketch training data is much smaller than that of natural image training data, sharing weights between the sketch and edgemap branches can partly alleviate problems that arise from the lack of sketch training data.

Meanwhile, since there exist obvious feature representation differences between natural images and sketches (edgemaps), the bottom convolutional layers of the natural image branch should be learned separately. Accordingly, the natural image branch FMB_I does not share weights with the other two branches in the feature mapping part. Additionally, with the aim of learning the informative features common between different domains, a co-attention model is introduced to associate the FMB_I and FMB_E branches. By applying the proposed co-attention model, the network is able to focus on the discriminative features common to both natural images and the corresponding edgemaps, and discard information that is not important for the retrieval task. For the sketch branch FMB_S, a channel-wise attention module is also introduced to learn more discriminative features. The co-attention model will be discussed in Section III-C in detail.

Particularly, the term "semi-heterogeneous" in the proposed architecture refers to the three-branch semi-heterogeneous weight-sharing strategy: the weights of the sketch and edgemap branches are shared, while the natural image branch is independent of the others in the semi-heterogeneous feature mapping part. The three branches are integrated into the semi-heterogeneous feature mapping architecture and interact with each other, which ensures that not only are the bottom features of each domain preserved, but that the features from different domains are also prealigned through the co-attention model.
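To make the weight-sharing strategy concrete, the following PyTorch-style sketch (for illustration only; the paper's implementation is in Caffe, and all variable names are ours) shows how reusing a module object realizes the sharing pattern: FMB_S and FMB_E point to the same convolutional stack, FMB_I owns its own copy, and a single embedding head is reused by all three branches.

```python
import torch.nn as nn
from torchvision.models import vgg19

# Branch backbones: reusing a module object is what "shared weights" means
# here. FMB_I keeps its own convolutional stack; FMB_S and FMB_E are one.
conv_image  = vgg19().features     # FMB_I: independent weights
conv_sketch = vgg19().features     # FMB_S
conv_edge   = conv_sketch          # FMB_E: same object => shared with FMB_S

# Joint semantic embedding: VGG-style FC layers plus the extra 256-d
# embedding layer, one module reused by all three branches (all-sharing).
# The per-branch classification layer for the cross-entropy loss is omitted.
embed = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 256),
)

def forward_branch(conv, x):
    # Embedding followed by the L2 normalization layer of Fig. 1.
    return nn.functional.normalize(embed(conv(x)), p=2, dim=1)
```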
B. Joint Semantic Embedding

In the joint semantic embedding part, as shown in Fig. 1 (c), the natural image branch SEB_I, the edgemap branch SEB_E, and the sketch branch SEB_S are developed to embed the features from different domains into a common high-level semantic space. Each branch of the joint semantic embedding part includes several fully-connected (FC) layers. An extra embedding layer, followed by an L2 normalization layer, is also introduced in each branch. As previously stated, the bottom features of different branches are learned respectively by the semi-heterogeneous feature mapping. In order to achieve feature alignment between the natural images, edgemaps, and sketches in a common high-level semantic space, the weights of SEB_I, SEB_E, and SEB_S are completely shared in the joint semantic embedding part. Based on the features learned in the common high-level semantic space, a hybrid-loss mechanism is proposed to learn the invariant cross-domain representations and achieve a more discriminative embedding.

[Fig. 2. Structure of the co-attention model. Yellow blocks represent the layers that belong to the natural image branch, while green blocks represent the layers that belong to the edgemap branch.]

C. Co-attention Model

To capture the informative features common between natural images and the corresponding edgemaps, a co-attention model is proposed between the natural image and edgemap branches. As previously stated, edgemaps are reduced versions of natural images, where detailed appearance information is removed. Therefore, a natural image and its corresponding edgemap are spatially aligned, which can be exploited to narrow the gap between the natural images and edgemaps. Additionally, since the weights of the sketch and edgemap branches are shared, the introduction of the co-attention model actually enforces mutual and subtle cooperation amongst the three branches. Specifically, before the bottom features of the three branches are fed into the joint semantic embedding part, the co-attention model prealigns different domains by conducting common feature recalibration. As shown in Fig. 2, the proposed co-attention model includes two attention modules and a co-mask learning module. The co-attention model takes the output feature maps of the last pooling layers in the natural image and edgemap branches as the inputs, and learns a common mask which is used to re-weight each channel of the feature maps in both branches.

In the attention module, a channel-wise soft attention mechanism is adopted to capture discriminative features of each input. Specifically, the attention module on the right of Fig. 2 consists of a global average pooling (GAP) layer, two FC layers, a ReLU layer, and a sigmoid layer. Let $X_I^{a\_i} \in \mathbb{R}^{h \times w \times c}$ and $X_E^{a\_i} \in \mathbb{R}^{h \times w \times c}$ denote the inputs of the attention module for the natural image branch and edgemap branch, respectively, where $h$, $w$, and $c$ represent the height, width, and channel dimensions of the feature maps. Through the GAP layer, the feature descriptors for the natural image and edgemap branches are obtained by aggregating the global spatial information of $X_I^{a\_i}$ and $X_E^{a\_i}$, which can be formulated as:

$$X_I^{gap} = \frac{1}{hw} \sum_{u=1}^{h} \sum_{v=1}^{w} X_I^{a\_i}(u, v) \qquad (1)$$

$$X_E^{gap} = \frac{1}{hw} \sum_{u=1}^{h} \sum_{v=1}^{w} X_E^{a\_i}(u, v) \qquad (2)$$

Based on $X_I^{gap}$ and $X_E^{gap}$, two FC layers and a ReLU layer are applied to model the channel-wise dependencies, and the attention maps in both domains are obtained.
By deploying a sigmoid layer, each channel of the attention map is normalized to $[0, 1]$. For different domains, the final learned image attention mask $M_I \in \mathbb{R}^{1 \times 1 \times c}$ and edgemap attention mask $M_E \in \mathbb{R}^{1 \times 1 \times c}$ are respectively formulated as:

$$M_I = \mathrm{sigmoid}(W_I^2 \cdot \mathrm{ReLU}(W_I^1 \cdot X_I^{gap})) \qquad (3)$$

$$M_E = \mathrm{sigmoid}(W_E^2 \cdot \mathrm{ReLU}(W_E^1 \cdot X_E^{gap})) \qquad (4)$$

where $W_I^1$ and $W_E^1$ denote the weights of the first FC layer, and $W_I^2$ and $W_E^2$ denote the weights of the second FC layer.

In SBIR, the key challenge is to capture the discriminative information common to different domains and then align the different domains into a common high-level semantic space. Therefore, unlike most existing works that use the obtained attention mask directly to reweight the channel responses, the proposed co-attention model tries to capture the channel-wise dependencies common between different domains by learning a co-mask. Based on the learned image attention mask and edgemap attention mask, the co-mask $M_{CO}$ is defined as:

$$M_{CO} = M_I \odot M_E \qquad (5)$$

where $\odot$ denotes the element-wise product. Elements in $M_{CO} \in \mathbb{R}^{1 \times 1 \times c}$ represent the joint weights of the corresponding channels in $X_I^{a\_i}$ and $X_E^{a\_i}$.

Afterwards, the output feature maps $X_I^{a\_o} \in \mathbb{R}^{h \times w \times c}$ and $X_E^{a\_o} \in \mathbb{R}^{h \times w \times c}$ for the image and edgemap branches are calculated respectively by rescaling the input feature maps $X_I^{a\_i}$ and $X_E^{a\_i}$ with the obtained channel-wise co-mask:

$$X_I^{a\_o} = f_{scale}(M_{CO}, X_I^{a\_i}) \qquad (6)$$

$$X_E^{a\_o} = f_{scale}(M_{CO}, X_E^{a\_i}) \qquad (7)$$

where $f_{scale}(\cdot)$ denotes the channel-wise multiplication between the co-mask and the input feature maps.

The proposed co-attention model not only considers the channel-wise feature responses of each domain by introducing the attention mechanism, but also captures the common channel-wise relationship between different domains simultaneously. Specifically, by introducing the co-attention model between the natural image and edgemap branches, the common informative features from different domains are highlighted, and the cross-domain gap is effectively reduced.
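A compact sketch of the co-attention model of Eqs. (1)-(7) could look as follows (PyTorch-style, for illustration only). The 1x1 convolutions play the role of the two FC layers applied to the pooled 1x1xc descriptors; the channel reduction ratio r is our assumption, since the paper does not state the hidden size of the FC layers.

```python
import torch.nn as nn

class CoAttention(nn.Module):
    """Illustrative sketch of Eqs. (1)-(7); r=16 is an assumed reduction."""
    def __init__(self, c, r=16):
        super().__init__()
        def attention_module():
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                 # Eqs. (1)-(2): GAP
                nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                nn.Conv2d(c // r, c, 1), nn.Sigmoid(),   # Eqs. (3)-(4): mask
            )
        self.att_image = attention_module()
        self.att_edge = attention_module()

    def forward(self, x_image, x_edge):
        m_co = self.att_image(x_image) * self.att_edge(x_edge)  # Eq. (5)
        # Eqs. (6)-(7): channel-wise rescaling of both branches by the co-mask.
        return x_image * m_co, x_edge * m_co
```

For instance, with VGG19 backbones the module would be applied to the two 512-channel outputs of the last pooling layers: `x_i_out, x_e_out = CoAttention(c=512)(feat_image, feat_edge)`.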
D. Hybrid-Loss Mechanism

SBIR aims to obtain a retrieval ranking by learning the similarity between a query sketch and the natural images in a dataset. The distance between positive sketch-image pairs should be smaller than the distance between negative sketch-image pairs. In the hybrid-loss mechanism, the alignment loss and the sketch-edgemap contrastive loss are presented to learn the invariant cross-domain representations and mitigate the domain gap. Besides, we also introduce the cross-entropy loss [38] and the sketch-image contrastive loss [30], which are two typical losses in SBIR. The four types of loss functions are complementary to each other, thus improving the separability between the different sample pairs.

To be clear, the feature maps produced by the L2 normalization layers for the three branches are denoted as $f_{\theta_I}(I)$, $f_{\theta_E}(E)$, and $f_{\theta_S}(S)$, where $f_{\theta}(\cdot)$ denotes the mapping function learned by the network branch, and $\theta_I$, $\theta_E$, and $\theta_S$ denote the weights of the natural image, edgemap, and sketch branches, respectively.

1) Alignment loss. To learn the invariant cross-domain representations and align different domains into a high-level semantic space, a novel alignment loss is proposed between the natural image branch and the edgemap branch. Although the image and the corresponding edgemap come from different data domains, they should have similar high-level semantics after the processing of joint semantic embedding. Motivated by this, aiming to minimize the feature distance between an image and its corresponding edgemap in the high-level semantic space, the proposed alignment loss function is defined as:

$$L_{alignment}(I, E) = \| f_{\theta_I}(I) - f_{\theta_E}(E) \|_2 \qquad (8)$$

By introducing the alignment loss, the cross-domain invariant representations between a natural image and its corresponding edgemap are captured for the SBIR task. In other words, the proposed alignment loss provides a novel way of dealing with the domain gap by constructing a correlation between the natural image and its corresponding edgemap. It potentially encourages the network to learn the discriminative features common to both the natural image and sketch domains, and successfully aligns different domains into a common high-level semantic space.

2) Sketch-edgemap contrastive loss. Considering the one-to-one correlation between an image and its corresponding edgemap, the sketch-edgemap contrastive loss $L_{contrastive}^{SE}(S, E, l_{sim})$ between the sketch and edgemap branches is proposed to further constrain the matching relationship between the sketch-image pair as follows:

$$L_{contrastive}^{SE}(S, E, l_{sim}) = l_{sim} \, d(f_{\theta_S}(S), f_{\theta_E}(E^+)) + (1 - l_{sim}) \max(0, m_1 - d(f_{\theta_S}(S), f_{\theta_E}(E^-))) \qquad (9)$$

where $l_{sim}$ denotes the similarity label, with 1 indicating a positive sketch-edgemap pair and 0 a negative sketch-edgemap pair, $E^+$ and $E^-$ denote the edgemaps corresponding to the positive and negative natural images, respectively, $d(\cdot)$ denotes the Euclidean distance, and $m_1$ denotes the margin. The sketch-edgemap contrastive loss aims to measure the similarity between input pairs from the sketch and edgemap branches, thus further aligning different domains into the high-level semantic space.

3) Cross-entropy loss. In order to learn the discriminative features from each domain, the cross-entropy losses [38] for the three branches are introduced. For each branch of the proposed network, a softmax cross-entropy loss $L_{crossentropy}(p, y)$ is exploited, which is formulated as:

$$L_{crossentropy}(p, y) = -\sum_{k=1}^{K} y_k \log p_k = -\sum_{k=1}^{K} y_k \log \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} \qquad (10)$$

where $p = (p_1, \ldots, p_K)$ denotes the discrete probability distribution of one data sample over $K$ categories, $y = (y_1, \ldots, y_K)$ denotes the typical one-hot label corresponding to each category, and $z = (z_1, \ldots, z_K)$ denotes the feature vector produced by the last FC layer. In the proposed Semi3-Net, the cross-entropy loss forces the network to extract the discriminative features for each domain.

4) Sketch-image contrastive loss. Intuitively, in the SBIR task, the positive sketch-image pair should be close together, while the negative sketch-image pair should be far apart. Given a sketch $S$ and a natural image $I$, the sketch-image contrastive loss [30] can be represented as:

$$L_{contrastive}^{SI}(S, I, l_{sim}) = l_{sim} \, d(f_{\theta_S}(S), f_{\theta_I}(I^+)) + (1 - l_{sim}) \max(0, m_2 - d(f_{\theta_S}(S), f_{\theta_I}(I^-))) \qquad (11)$$

where $I^+$ and $I^-$ denote the positive and negative natural images, respectively, and $m_2$ denotes the margin. By utilizing the sketch-image contrastive loss, the cross-domain similarity between sketches and natural images is effectively measured.

Finally, the alignment loss in Eq. (8), the sketch-edgemap contrastive loss in Eq. (9), the cross-entropy loss in Eq. (10), and the sketch-image contrastive loss in Eq. (11) are combined, and the overall loss function $L(S, I, E, p_D, y_D, l_{sim})$ is derived as:

$$L(S, I, E, p_D, y_D, l_{sim}) = \sum_{D=I,E,S} L_{crossentropy}(p_D, y_D) + \alpha L_{contrastive}^{SI}(S, I, l_{sim}) + \beta L_{alignment}(I, E) + \gamma L_{contrastive}^{SE}(S, E, l_{sim}) \qquad (12)$$

where $\alpha$, $\beta$, and $\gamma$ denote the hyper-parameters that control the trade-off among the different types of losses.

The proposed hybrid-loss mechanism constructs the correlation among sketches, edgemaps, and natural images, providing a novel way to deal with the domain gap by learning the invariant cross-domain representations. By adopting the hybrid-loss mechanism, the proposed network is able to learn more discriminative feature representations and effectively align the sketches, natural images, and edgemaps into a common feature space, thus improving the retrieval accuracy.
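Under the notation above, the hybrid loss of Eqs. (8)-(12) could be assembled as in the following sketch (PyTorch-style, for illustration; the batching and mean reductions are our assumptions, and the original implementation is in Caffe).

```python
import torch.nn.functional as F

def hybrid_loss(f_s, f_i, f_e, logits, labels, l_sim,
                alpha=10.0, beta=100.0, gamma=10.0, m1=0.3, m2=0.3):
    """Sketch of Eqs. (8)-(12). f_s, f_i, f_e are L2-normalized embeddings of
    a batch of sketches, natural images, and the images' edgemaps; logits and
    labels are per-branch classifier outputs and integer class labels; l_sim
    is 1 for a positive sketch-image pair and 0 for a negative one."""
    d_si = F.pairwise_distance(f_s, f_i)   # Euclidean d(., .) per pair
    d_se = F.pairwise_distance(f_s, f_e)

    # Eq. (8): alignment loss between each image and its own edgemap.
    l_align = (f_i - f_e).norm(p=2, dim=1).mean()

    # Eqs. (9), (11): contrastive losses; the paired sample acts as E+/I+
    # when l_sim = 1 and as E-/I- (with margins m1, m2) when l_sim = 0.
    l_se = (l_sim * d_se + (1 - l_sim) * F.relu(m1 - d_se)).mean()
    l_si = (l_sim * d_si + (1 - l_sim) * F.relu(m2 - d_si)).mean()

    # Eq. (10): softmax cross-entropy, summed over the three branches.
    l_ce = sum(F.cross_entropy(z, y) for z, y in zip(logits, labels))

    # Eq. (12): weighted combination with trade-off parameters.
    return l_ce + alpha * l_si + beta * l_align + gamma * l_se
```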
Given are learned jointly, and the cross-domain representations are a sketch S and a natural image I , the sketch-image contrastive obtained by training the whole Semi3-Net. The overall loss loss [30] can be represented as: function in Eq. (12) is utilized in the training stage. As for SI + L (S; I; l ) = l d(f (S); f (I )) sim sim contrastive S I the sketch-edgemap and the sketch-image contrastive losses +(1 l ) max(0; m d(f (S); f (I ))) sim 2 S I illustrated above, the sketch-image and sketch-edgemap pairs (11) for training need to be generated. To this end, for each sketch where I and I denote the positive and negative natural in the training dataset, we randomly select a natural image images, respectively, and m denotes the margin. By utilizing (edgemap) from the same category to form the positive pair the sketch-image contrastive loss, the cross-domain similarity and a natural image (edgemap) from the other categories to between sketches and natural images is effectively measured. form the negative pair. In the training process, the ratio of Finally, the alignment loss in Eq. (8), the sketch-edgemap positive and negative sample pairs is set to 1:1, and the positive contrastive loss in Eq. (9), the cross-entropy loss in Eq. (10), and negative pairs are randomly selected following this rule, and the sketch-image contrastive loss in Eq. (11) are com- for each training batch. bined, thus the overall loss function L(S; I; E;p ;y ; l ) D D sim is derived as: IV. E XPERIM ENTS L(S; I; E;p ;y ; l ) = L (p ;y ) D D sim crossentropy D D A. Experimental Settings D=I;E;S In this paper, two category-level SBIR benchmarks, Sketchy SI + L (S; I; l ) contrastive sim [38] and TU-Berlin Extension [34], are adopted to evaluate the + L (I; E) alignment proposed Semi3-Net. Sketchy consists of 75,471 sketches and SE 12,500 natural images, from 125 categories with 100 objects + L (S; E; l ) sim contrastive (12) per category. In our experiments, we utilize the extended where , , and denote the hyper-parameters that control Sketchy dataset [34] with 73,002 natural images in total, which the trade-off among different types of losses. adds an extra 60,502 natural images collected from ImageNet The proposed hybrid-loss mechanism constructs the correla- [54]. TU-Berlin Extension [34] consists of 204,489 natural tion among sketches, edgemaps, and natural images, providing images and 20k free-hand drawn sketches from 250 categories, a novel way to deal with the domain gap by learning the in- with 80 sketches per category. The natural images in these variant cross-domain representations. By adopting the hybrid- two datasets are all realistic with complex backgrounds and 7 large variations, thus bringing great challenges to the SBIR TABLE I COMPARISON WITH THE STATE-OF-THE-ART SBIR METHODS task. Importantly, for fair comparison against state-of-the-art ON SKETCHY AND TU-BERLIN EXTENSION DATASETS. methods, the same training-testing splits used for the existing methods are adopted in our experiments. For Sketchy and TU- TU-Berlin Sketchy berlin Extension, 50 and 10 sketches, respectively, of each Methods Extension category are utilized as the testing queries, while the rest are MAP MAP used for training. 3D shape [36] 0:084 0:054 All experiments are performed under the simulation envi- HOG [20] 0:115 0:091 ronments of GeForce GTX 1080 Ti GPU and Intel i7-8700K GF-HOG [11] 0:157 0:119 processor @3.70 GHz. 
The training process of the proposed SHELO [12] 0:161 0:123 LKS [13] 0:190 0:157 Semi3-Net is implemented using SGD on Caffe [55] with SaN [37] 0:208 0:154 a batch size of 32. The initial learning rate is set to 2e-4, Siamese CNN [33] 0:481 0:322 and the weight decay and the momentum are set to 5e-4 and Siamese-AlexNet [34] 0:518 0:367 0.9, respectively. For both datasets, the balance parameters GN Triplet [38] 0:529 0:187 , , and are set to 10, 100, and 10, respectively, as they Triplet-AlexNet [34] 0:573 0:448 consistently yield promising result. The margins m and m 1 2 DSH [34] 0:783 0:570 for the two contrastive losses are both set to 0.3. GDH [39] 0:810 0:690 In the proposed method, the Gb method [56] is applied Our method 0:916 0:800 to extract the edgemap of each natural image. During the testing phase, sketches, natural images, and the corresponding edgemaps are fed into the trained model, and the feature semi-heterogeneous joint embedding network architecture. Be- vectors of the three branches are obtained. Then, the cosine sides, by integrating the co-attention model and hybrid-loss distance between the feature vectors of the query sketch and mechanism, the proposed Semi3-Net is encouraged to learn each natural image in the dataset is calculated. Finally, KNN a more discriminative embedding and invariant cross-domain is utilized to sort all the natural images for the final retrieval representations, simultaneously. 4) The proposed Semi3-Net result. Similar to the existing SBIR methods, mean average not only achieves superior performance over all traditional precision (MAP) is used to evaluate the retrieval performance. methods, but also outperforms the current best state-of-the- art deep learning method GDH [39] by 0.106 and 0.110 in B. Comparison Results MAP on Sketchy and TU-Berlin Extension, respectively. This further validates the effectiveness of the proposed Semi3-Net Table I shows the performance of the proposed method for the SBIR task. compared with the state-of-the-art methods, on the Sketchy and TU-Berlin Extension datasets. The comparison methods Figure 3 shows some retrieval examples by the proposed include traditional methods (i.e., LKS [13], HOG [20], SHELO Semi3-Net for the Sketchy dataset. Specifically, 10 relatively [12], and GF-HOG [11]) and deep learning methods (i.e., challenging query sketches with top-15 retrieval rank lists are Siamese CNN [33], SaN [37], GN Triplet [38], 3D shape [36], presented, where the incorrect retrieval results are marked with Siamese-AlexNet [34], Triplet-AlexNet [34], DSH [34] and red bounding boxes. As can be seen, the proposed method GDH [39]). It can be observed from the table that: 1) Com- performs well for retrieving natural images with complex pared with the methods that utilize edgemaps to replace natural backgrounds, such as the queries “Umbrella” and “Racket”. images for retrieval [33], [36], the proposed method obtains This is mainly because the key information common to dif- better retrieval accuracy. This is mainly because the edgemaps ferent domains can be effectively captured by the proposed extracted from natural images may lose certain information semi-heterogeneous feature mapping part with the co-attention useful for retrieval, and the CNNs pre-trained on ImageNet are model. Additionally, for relatively abstract sketches, such as more effective for natural images than edgemaps. 
2) Compared the sketch “Cat”, the proposed method still achieves a con- with the methods that only utilize sketch-image pairs for SBIR sistent superior performance. This indicates that the semantic [38], the proposed Semi3-Net achieves better performance, features of the abstract sketches are extracted by embedding by introducing edgemap information. It demonstrates that an the features from different domains into a high-level semantic edgemap can be utilized as a bridge to effectively narrow the space. In other words, although the abstract sketches only distance between the natural image and sketch domains. 3) consist of black and white pixels, the proposed Semi3-Net Compared with DSH [34], which fuses the natural image and can gain a better understanding of the sketch domain by edgemap features into one feature representation, the proposed introducing the joint semantic embedding part. However, there Semi3-Net achieves 0.133 and 0.230 improvements in terms are also some incorrect retrieval images in Fig. 3. For example, of MAP, on Sketchy and TU-Berlin Extension, respectively. for the query “Fish”, a dolphin is retrieved with the proposed This is mainly because the proposed Semi3-Net makes full method. Meanwhile, for the query “Duck”, several swans are use of the one-to-one matching relationship between a natu- retrieved. However, the incorrect retrieved natural images are ral image and its corresponding edgemap to align different quite similar to the query. Specifically, the fish and duck domains into a high-level semantic space. Meanwhile, the sketches, which lack color and texture information, are very domain shift process is well achieved under the proposed similar in shape to a dolphin and a swan, respectively, which 8 Query Retrieval results Umbrella Eyeglasses Cat Racket Flower Fish Bicycle Elephant Duck Piano Fig. 3. Some retrieval examples for the proposed Semi3-Net on the Sketchy dataset. Incorrect retrieval results are marked with red bounding boxes. TABLE II scatter into nearly the same clusters. This further demonstrates EVALUATION OF THE SEMI-HETEROGENEOUS that the proposed Semi3-Net effectively aligns sketches and ARCHITECTURE ON SKETCHY DATASET. natural images into a common high-level semantic space, thus improving the retrieval accuracy. Methods MAP All sharing 0:857 Only FC layer sharing 0:879 C. Ablation Study Only sketch-edgemap sharing 0:890 1) Evaluation of the semi-heterogeneous architecture: To Semi3-Net 0:916 evaluate the superiority of the proposed semi-heterogeneous network architecture, networks with different weight sharing strategies are compared with the proposed Semi3-Net. For makes the SBIR task more challenging. fair comparison, the proposed network architecture is fixed To further verify the effectiveness of the proposed Semi3- when different weight-sharing strategies are verified. In other Net, the t-SNE [57] visualization of natural images and words, both the proposed co-attention model and the hybrid- sketches from ten categories on Sketchy are reported. As loss mechanism are applied in the networks with different shown in Fig. 4, the ten categories are selected randomly in weight-sharing strategies. The comparison results are shown the dataset, and the labels of the selected categories are also in Table II, where “all sharing” indicates the weights of illustrated in the upper right corner. 
We run t-SNE visualization the three branches are completely shared, “only FC layer on both images and sketches together, then separate the sharing” indicates only the weights of the three branches are projected data points to Fig. 4 (a) and Fig. 4 (b), respectively. shared in the semantic embedding part, and “only sketch- Specifically, the circles in Fig. 4 represent clusters of different edgemap sharing” indicates only the weights of the sketch categories, and the data in each circle belongs to the same cate- branch and edgemap branch are shared in the feature mapping gory in the high-level semantic space. In other words, circles in part. As can be seen, the method with all sharing strategy the same position of Fig. 4 (a) and Fig. 4 (b) correspond to the does not perform as well as the methods with partially-shared natural images and sketches with the same label, respectively. strategies. This supports our views illustrated previously that If data samples with the same label but from two different the bottom discriminative features should be learned separately domains are correctly aligned in the common feature space, it for different domains. Compared with the the proposed Semi3- indicates that the cross-domain learning is successful. As can Net, the method with only FC layer sharing strategy obtains be seen, the natural images and sketches with the same label 0.879 MAP. That is because the intrinsic relevance between the 9 (b) t-SNE visualization of the query sketches (a) t-SNE visualization of the natural images Fig. 4. t-SNE visualization of the natural images and query sketches from ten categories in the Sketchy dataset, during the test phase. Symbols with the same color and shape in (a) and (b) represent the natural images and query sketches with the same label, respectively. sketches and edgemaps is ignored. Meanwhile, experimental TABLE III EVALUATION OF KEY COMPONENTS ON SKETCHY DATASET. results show that the method with only sketch-edgemap shar- ing strategy also results in a decrease in MAP. That is because Methods MAP the three branches are not fully aligned in the common high- w/o CAM and w/o HLM 0:851 level semantic space. Importantly, the proposed Semi3-Net, w/o HLM 0:880 with both the FC layer sharing and sketch-edgemap sharing w/o CAM 0:896 strategies, achieves the best performance, which proves the Semi3-Net 0:916 effectiveness of the proposed network architecture. 2) Evaluation of key components: In order to illustrate TABLE IV the contributions of the key components in the proposed EVALUATION OF DIFFERENT FEATURE SELECTIONS IN THE method, i.e., the self-heterogeneous three-way framework, the TEST PHASE ON SKETCHY DATASET. co-attention model and the hybrid-loss mechanism, leaving- one-out evaluations are conducted on the Sketchy dataset. Methods MAP The experimental results are shown in Table III. Note that Edgemap feature 0:914 Natural image feature 0:916 “w/o CAM and w/o HLM” refers to the proposed self- heterogeneous three-way framework, which has neither a co- attention model nor a hybrid-loss mechanism. For “w/o HLM”, only two types of typical SBIR losses, the cross-entropy loss image branch or the edgemap branch can be used as the and the sketch-image contrastive loss, are used in the training final retrieval feature representation. To evaluate the impact phase. 
As shown in Table III, the proposed semi-heterogeneous three-way framework obtains 0.851 MAP, outperforming the other methods in Table I. Additionally, leaving out either CAM or HLM results in a lower MAP, which verifies that each component contributes to the overall performance. Specifically, the informative features common between natural images and the corresponding edgemaps are captured effectively by the proposed co-attention mechanism, and the three different inputs can be effectively aligned into a common feature space by introducing the hybrid-loss mechanism.

D. Discussions

In this section, the impact of different feature selections in the test phase is discussed for the Sketchy dataset. As mentioned above, either the features extracted from the natural image branch or those from the edgemap branch can be used as the final retrieval feature representation. To evaluate the impact of different feature selections in the test phase, we obtain the 256-d feature vectors from the embedding layers of the natural image branch and the edgemap branch, respectively. The experiments are conducted on the Sketchy dataset, and the experimental results are reported in Table IV.

TABLE IV
EVALUATION OF DIFFERENT FEATURE SELECTIONS IN THE TEST PHASE ON THE SKETCHY DATASET.

Methods               | MAP
----------------------|------
Edgemap feature       | 0.914
Natural image feature | 0.916

As can be seen from the table, there is a small difference between the retrieval performances when using the edgemap feature and the natural image feature. This is mainly because the invariant cross-domain representations of different domains are indeed learned by the proposed method, and the cross-domain gap is narrowed by the joint embedding learning. Besides, the feature extracted from the natural image branch performs a little better than that from the edgemap branch, which is also consistent with the research in [30], [38].
In addition, to verify the performance without the edgemap branch, we conducted experiments on a two-branch framework using only the sketch and natural image branches. Considering that sketches and natural images belong to two different domains, a non-shared setting is first exploited on the two-branch framework. Note that without the edgemap branch, neither the co-attention model nor the hybrid-loss mechanism is added to the two-branch framework. Compared with the proposed Semi3-Net with 0.916 MAP, the two-branch framework obtains 0.837 MAP on the Sketchy dataset, which sufficiently proves the effectiveness of the edgemap branch in the proposed Semi3-Net. In addition, a partially-shared setting is also exploited on the two-branch framework, in which the weights of the convolutional layers are independent while the weights of the fully-connected layers are shared. Compared with the non-shared setting, the partially-shared setting obtains 0.861 MAP. This further verifies the importance of the joint semantic embedding for SBIR.

V. CONCLUSION

In this paper, we propose a novel semi-heterogeneous three-way joint embedding network, where auxiliary edgemap information is introduced as a bridge to narrow the cross-domain gap between sketches and natural images. The semi-heterogeneous feature mapping and the joint semantic embedding are proposed to learn the specific bottom features from each domain and embed different domains into a common high-level semantic space, respectively. Besides, a co-attention model is proposed to capture the informative features common between natural images and the corresponding edgemaps by recalibrating the corresponding channel-wise feature responses. In addition, a hybrid-loss mechanism is designed to construct the correlation among sketches, edgemaps, and natural images, so that the invariant cross-domain representations of different domains can be effectively learned. Experimental results on two datasets demonstrate that Semi3-Net outperforms the state-of-the-art methods, which proves the effectiveness of the proposed method.

In the future, we will focus on extending the proposed cross-domain network to fine-grained image retrieval, and on learning the correspondence of the fine-grained details for sketch-image pairs. Besides, further study may also include extending our method to other cross-domain learning problems.
REFERENCES

[1] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, "Semantic-aware co-indexing for image retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2573-2587, 2015.
[2] D. Wu, Z. Lin, B. Li, J. Liu, and W. Wang, "Deep uniqueness-aware hashing for fine-grained multi-label image retrieval," in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 1683-1687.
[3] B. Peng, J. Lei, H. Fu, C. Zhang, T.-S. Chua, and X. Li, "Unsupervised video action clustering via motion-scene interaction constraint," IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2018.2889514, 2018.
[4] X. Shang, H. Zhang, and T.-S. Chua, "Deep learning generic features for cross-media retrieval," in Proc. International Conference on MultiMedia Modeling (MMM), Jun. 2018, pp. 15-24.
[5] J. M. Saavedra, "RST-SHELO: sketch-based image retrieval using sketch tokens and square root normalization," Multimedia Tools and Applications, vol. 76, no. 1, pp. 931-951, 2017.
[6] K. Li, K. Pang, Y.-Z. Song, T. Hospedales, T. Xiang, and H. Zhan, "Synergistic instance-level subspace alignment for fine-grained sketch-based image retrieval," IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5908-5921, 2017.
[7] G. Tolias and O. Chum, "Asymmetric feature maps with application to sketch based retrieval," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 6185-6193.
[8] J. Lei, K. Zheng, H. Zhang, X. Cao, N. Ling, and Y. Hou, "Sketch based image retrieval via image-aided cross domain learning," in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2017, pp. 3685-3689.
[9] J. Song, K. Pang, Y.-Z. Song, T. Xiang, and T. M. Hospedales, "Learning to sketch with shortcut cycle consistency," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 801-810.
[10] P. Xu, Y. Huang, T. Yuan, K. Pang, Y.-Z. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo, "SketchMate: deep hashing for million-scale human sketch retrieval," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 8090-8098.
[11] R. Hu, M. Barnard, and J. Collomosse, "Gradient field descriptor for sketch based retrieval and localization," in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2010, pp. 1025-1028.
[12] J. M. Saavedra, "Sketch based image retrieval using a soft computation of the histogram of edge local orientations (S-HELO)," in Proc. IEEE International Conference on Image Processing (ICIP), Oct. 2014, pp. 2998-3002.
[13] J. M. Saavedra and J. M. Barrios, "Sketch based image retrieval using learned keyshapes (LKS)," in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 164.1-164.11.
[14] D. Xu, X. Alameda-Pineda, J. Song, E. Ricci, and N. Sebe, "Academic coupled dictionary learning for sketch-based image retrieval," in Proc. ACM International Conference on Multimedia (ACM MM), Oct. 2016, pp. 1326-1335.
[15] X. Qian, X. Tan, Y. Zhang, R. Hong, and M. Wang, "Enhancing sketch-based image retrieval by re-ranking and relevance feedback," IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 195-208, 2016.
[16] S. Wang, J. Zhang, T. Han, and Z. Miao, "Sketch-based image retrieval through hypothesis-driven object boundary selection with HLR descriptor," IEEE Transactions on Multimedia, vol. 17, no. 7, pp. 1045-1057, 2015.
[17] J. M. Saavedra, B. Bustos, and S. Orand, "Sketch-based image retrieval using keyshapes," in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 164.1-164.11.
[18] Y. Zhang, X. Qian, X. Tan, J. Han, and Y. Tang, "Sketch-based image retrieval by salient contour reinforcement," IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1604-1615, 2016.
[19] Y. Qi, Y.-Z. Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li, and J. Guo, "Making better use of edges via perceptual grouping," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1856-1865.
[20] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2005, pp. 886-893.
[21] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE International Conference on Computer Vision (ICCV), Sept. 1999, pp. 1-8.
[22] G. Mori, S. Belongie, and J. Malik, "Efficient shape matching using shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1832-1837, 2005.
[23] M. Eitz, J. Hays, and M. Alexa, "How do humans sketch objects?" ACM Transactions on Graphics, vol. 31, no. 4, pp. 1-10, 2012.
[24] U. R. Muhammad, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales, "Learning deep sketch abstraction," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 8014-.
[25] J. A. Landay and B. A. Myers, "Sketching interfaces: toward more human interface design," IEEE Computer, vol. 34, no. 3, pp. 56-64.
[26] J. Lei, L. Niu, H. Fu, B. Peng, Q. Huang, and C. Hou, "Person re-identification by semantic region representation and topology constraint," IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2018.2866260, 2018.
[27] J. Cao, Y. Pang, and X. Li, "Learning multi-layer channel features for pedestrian detection," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3210-3220, 2017.
[28] Y. Pang, M. Sun, X. Jiang, and X. Li, "Convolution in convolution for network in network," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1587-1597, 2018.
[29] T. Han, H. Yao, C. Xu, X. Sun, Y. Zhang, and J. J. Corso, "Dancelets mining for video recommendation based on dance styles," IEEE Transactions on Multimedia, vol. 19, no. 4, pp. 712-724, 2017.
[30] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse, "Sketching out the details: Sketch-based image retrieval using convolutional neural networks with multi-stage regression," Computers & Graphics, vol. 77, pp. 77-87, 2018.
[31] H. Zhang, C. Zhang, and M. Wu, "Sketch-based cross-domain image retrieval via heterogeneous network," in Proc. IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2017, pp. 1-4.
[32] Q. Yu, F. Liu, Y.-Z. Song, and T. Xiang, "Sketch me that shoe," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 799-807.
[33] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu, "Sketch-based image retrieval via siamese convolutional neural network," in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2016, pp. 2460-2464.
[34] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao, "Deep sketch hashing: fast free-hand sketch-based image retrieval," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 2298-2307.
[35] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales, "Deep spatial-semantic attention for fine-grained sketch-based image retrieval," in Proc. IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 5552-5561.
[36] F. Wang, L. Kang, and Y. Li, "Sketch-based 3d shape retrieval using convolutional neural networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1875-1883.
[37] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. Hospedales, "Sketch-a-net that beats humans," in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 1-12.
[38] P. Sangkloy, N. Burnell, C. Ham, and J. Hays, "The sketchy database: learning to retrieve badly drawn bunnies," ACM Transactions on Graphics, vol. 35, no. 4, pp. 1-12, 2016.
[39] J. Zhang, F. Shen, L. Liu, F. Zhu, M. Yu, L. Shao, H. T. Shen, and L. V. Gool, "Generative domain-migration hashing for sketch-to-image retrieval," in Proc. European Conference on Computer Vision (ECCV), Sept. 2018, pp. 297-314.
[40] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Conference and Workshop on Neural Information Processing Systems (NIPS), Dec. 2012, pp. 1097-1105.
[41] J. Song, H. Zhang, X. Li, L. Gao, M. Wang, and R. Hong, "Self-supervised video hashing with hierarchical binary auto-encoder," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3210-3221, 2018.
[42] J. Song, L. Gao, L. Liu, X. Zhu, and N. Sebe, "Quantization based hashing: A general framework for scalable image and video retrieval," Pattern Recognition, vol. 75, pp. 175-187, 2018.
[43] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: adaptive attention via a visual sentinel for image captioning," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 375-383.
[44] M. Sun, Y. Yuan, F. Zhou, and E. Ding, "Multi-attention multi-class constraint for fine-grained image recognition," in Proc. European Conference on Computer Vision (ECCV), Sept. 2018, pp. 805-821.
[45] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, "The application of two-level attention models in deep convolutional neural network for fine-grained image classification," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 842-850.
[46] M. Liu, X. Wang, L. Nie, X. He, B. Chen, and T.-S. Chua, "Attentive moment retrieval in videos," in Proc. International Conference on Research and Development in Information Retrieval (SIGIR), Jun. 2018, pp. 15-24.
[47] H. Nam, J.-W. Ha, and J. Kim, "Dual attention networks for multimodal reasoning and matching," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 2156-2164.
[48] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 3156-3164.
[49] W. Li, X. Zhu, and S. Gong, "Harmonious attention network for person re-identification," in Proc. European Conference on Computer Vision (ECCV), Sept. 2018, pp. 2285-2294.
[50] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-excitation networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 7132-7141.
[51] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, "Video captioning with attention-based LSTM and semantic consistency," IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045-2055, 2017.
[52] L. Gao, X. Li, J. Song, and H. T. Shen, "Hierarchical LSTMs with adaptive attention for visual captioning," IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2019.2894139, 2019.
[53] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. International Conference on Learning Representations (ICLR), May 2015, pp. 1-14.
[54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: a large-scale hierarchical image database," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2009, pp. 248-255.
[55] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: convolutional architecture for fast feature embedding," in Proc. ACM International Conference on Multimedia (ACM MM), Nov. 2014, pp. 675-678.
[56] M. Leordeanu, R. Sukthankar, and C. Sminchisescu, "Efficient closed-form solution to generalized boundary detection," in Proc. European Conference on Computer Vision (ECCV), Oct. 2012, pp. 516-529.
[57] L. V. D. Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.

Jianjun Lei (M'11-SM'17) received the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications, Beijing, China, in 2007. He was a visiting researcher at the Department of Electrical Engineering, University of Washington, Seattle, WA, from August 2012 to August 2013. He is currently a Professor at Tianjin University, Tianjin, China. His research interests include 3D video processing, virtual reality, and artificial intelligence.

Yuxin Song received the B.S. degree in communication engineering from Hefei University of Technology, Hefei, Anhui, China, in 2017. She is currently pursuing the M.S. degree with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include image retrieval and deep learning.

Bo Peng received the M.S. degree in communication and information systems from Xidian University, Xi'an, Shaanxi, China, in 2016. Currently, she is pursuing the Ph.D. degree at the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include computer vision, image processing, and video action analysis.

Zhanyu Ma has been an Associate Professor at Beijing University of Posts and Telecommunications, Beijing, China, since 2014. He has also been an adjunct Associate Professor at Aalborg University, Aalborg, Denmark, since 2015. He received his Ph.D. degree in Electrical Engineering from KTH (Royal Institute of Technology), Sweden, in 2011. From 2012 to 2013, he was a Postdoctoral research fellow in the School of Electrical Engineering, KTH, Sweden. His research interests include pattern recognition and machine learning fundamentals, with a focus on applications in computer vision, multimedia signal processing, and data mining.
Ling Shao is the CEO and Chief Scientist of the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates. His research inter- ests include computer vision, machine learning, and medical imaging. He is a fellow of IAPR, IET, and BCS. He is an Associate Editor of the IEEE Trans- actions on Neural Networks and Learning Systems, and several other journals. Yi-Zhe Song is a Reader of Computer Vision and Machine Learning at the Centre for Vision Speech and Signal Processing (CVSSP), UKs largest academic research centre for Artificial Intelligence with approx. 200 researchers. Previously, he was a Senior Lecturer at the Queen Mary University of London, and a Research and Teaching Fellow at the University of Bath. He obtained his PhD in 2008 on Computer Vision and Machine Learning from the University of Bath, and received a Best Dissertation Award from his MSc degree at the University of Cambridge in 2004, after getting a First Class Honours degree from the University of Bath in 2003. He is a Senior Member of IEEE, and a Fellow of the Higher Education Academy. He is a full member of the review college of the Engineering and Physical Sciences Research Council (EPSRC), the UK’s main agency for funding research in engineering and the physical sciences, and serves as an expert reviewer for the Czech National Science Foundation. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)

Most hand-crafted features have difficulties in dealing with these intra-class variations and ambiguities of hand-drawn sketches, which also negatively impacts the performance of SBIR.

Index Terms—SBIR, cross-domain learning, co-attention model, hybrid-loss mechanism

I. INTRODUCTION

Since the number of digital images on the Internet has increased dramatically in recent years, content-based image retrieval technology has become a hot topic in the computer vision community [1]-[4].

This work was supported in part by the Natural Science Foundation of Tianjin (No. 18ZXZNGX00110, 18JCJQJC45800), and the National Natural Science Foundation of China (No. 61931014, 61922015, 61722112). Copyright 20xx IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected]. (Corresponding author: Bo Peng)
J. Lei, Y. Song, and B. Peng are with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China (e-mail: [email protected]; [email protected]; [email protected]).
Z. Ma is with the Pattern Recognition and Intelligent System Laboratory, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: [email protected]).
L. Shao is with the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates (e-mail: [email protected]).
Y.-Z. Song is with the SketchX Lab, Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, Surrey GU2 7XH, U.K. (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSVT.2019.2936710
arXiv:1911.04470v1 [cs.CV] 10 Nov 2019

In recent years, convolutional neural networks (CNNs) have been widely used across fields [26]-[29], such as person re-identification, object detection, and video recommendation. In contrast to traditional hand-crafted methods, CNNs can automatically aggregate shallow features learned from the bottom convolutional layers. Inspired by the learning ability of CNNs, several Siamese networks and Triplet networks have been proposed for the SBIR task [30]-[33]. Most of these methods encode a sketch-image or a sketch-edgemap pair, and learn the similarity between the input pair using a contrastive loss or a triplet loss. However, several difficulties and challenges remain to be solved. 1) The different characteristics of sketches and natural images make the SBIR task challenging. Generally, a sketch only contains the object to be retrieved, meaning that sketches tend to have relatively clean backgrounds. In addition, since sketches are usually drawn by non-professionals, the shapes of objects in sketches are usually deformed and relatively abstract. For natural images, although the objects are not usually significantly deformed, images taken by cameras usually have complex backgrounds. Therefore, creating a network that can learn discriminative features for both sketches and natural images remains a challenge. 2) Since sketches and natural images come from two different data domains, there exists a significant domain gap between the features of the two. Most deep SBIR methods that adopt a contrastive loss or a triplet loss to learn the cross-domain similarity are not effective enough to cope with this intrinsic domain gap. Therefore, finding a way to eliminate or reduce the cross-domain gap, and to embed features from different domains into a common high-level semantic space, is critical for SBIR. 3) More importantly, most existing methods achieve SBIR by exploring the matching relationship between either sketch-edgemap pairs or sketch-image pairs. However, the methods using sketch-edgemap pairs ignore the discriminative features contained in natural images, while the methods using sketch-image pairs ignore the auxiliary role of edgemaps. Enabling full use of the joint relationships among sketches, natural images, and edgemaps provides a novel way to solve the cross-domain learning problem.

To address the above issues, a novel semi-heterogeneous three-way joint embedding network (Semi3-Net) is proposed in this paper to mitigate the domain gap and align sketches, natural images, and edgemaps into a common high-level semantic space. The key insight behind our design is how we enforce mutual cooperation amongst the three branches. We importantly recognize that, when measured in terms of visual abstraction, sketches and edgemaps are more closely linked than sketches and natural images: sketches are highly abstract and iconic representations of natural images, and edgemaps are reduced versions of natural images, where detailed appearance information such as texture and color is removed. However, compared with edgemaps, natural images contain more discriminative features for SBIR. Motivated by this insight, we purposefully design a semi-heterogeneous joint embedding network, where a semi-heterogeneous weight-sharing setting among the three branches is adopted in the feature mapping part, while a three-branch all-sharing setting is conducted in the joint semantic embedding part. This design essentially promotes edgemaps to act as a "bridge" that helps narrow the domain gap between natural images and sketches. Fig. 1 offers a visualization of the proposed Semi3-Net architecture.

More specifically, the semi-heterogeneous feature mapping part is designed to extract the bottom features for each domain, where a co-attention model is introduced to learn informative features common between different domains. Meanwhile, the joint semantic embedding part is proposed to embed the features from different domains into a common high-level semantic space. In addition, a hybrid-loss mechanism is proposed to achieve a more discriminative embedding, where an alignment loss and a sketch-edgemap contrastive loss are introduced to encourage the network to learn invariant cross-domain representations. The main contributions of this paper are summarized as follows.

1) A novel semi-heterogeneous three-way joint embedding network is proposed, in which the semi-heterogeneous feature mapping and the joint semantic embedding are designed to learn joint feature representations for sketches, natural images, and edgemaps.

2) To capture informative features common between natural images and the corresponding edgemaps, a co-attention model is developed between the natural image and edgemap branches.

3) A hybrid-loss mechanism is designed to mitigate the domain gap, where an alignment loss and a sketch-edgemap contrastive loss are presented to encourage the network to learn invariant cross-domain representations.

4) Experiments on two widely-used datasets, Sketchy and TU-Berlin Extension, demonstrate that the proposed method outperforms state-of-the-art methods.

The rest of the paper is organized as follows. Section II reviews the related works. Section III introduces the proposed method in detail. The experimental results and analysis are presented in Section IV. Finally, the conclusion is drawn in Section V.
II. RELATED WORK

A. Traditional SBIR Methods

Traditional SBIR methods usually utilize edge extraction methods to extract edgemaps from natural images first. Then, hand-crafted features are used as descriptors for both sketches and edgemaps. Finally, a KNN ranking process within a Bag-of-Words (BoW) framework is usually utilized to rank the candidate natural images for each sketch. For instance, Hu et al. [11] incorporated the Gradient Field HOG (GF-HOG) into a BoW scheme for SBIR and obtained promising performance. Saavedra [12] introduced the Soft-Histogram of Edge Local Orientations (SHELO) as the descriptor for sketches and edgemaps extracted from natural images, which effectively improves the retrieval accuracy. In [13], a novel method for describing hand-drawn sketches was proposed by detecting learned keyshapes (LKS). Xu et al. [14] proposed an academic coupled dictionary learning method to address the cross-domain learning problem in SBIR. Qian et al. [15] introduced re-ranking and relevance feedback schemes to find more similar natural images based on initial retrieval results, thus improving the retrieval performance.

B. Deep SBIR Methods

Recently, many frameworks based on CNNs have been proposed to address the challenges in SBIR [33]-[39]. Aiming to learn the cross-domain similarity between the sketch and natural image domains, several Siamese networks have been proposed to improve the retrieval performance. Qi et al. [33] introduced a novel Siamese CNN architecture for SBIR, which learns the features of sketches and edgemaps by jointly tuning two CNNs. Liu et al. [34] proposed a Siamese-AlexNet based on two AlexNet [40] branches to learn the cross-domain similarity and mitigate the domain gap. Wang et al. [36] proposed a Siamese network, originally designed for sketch-based 3D shape retrieval, to learn the similarity between input sketches and the edgemaps of 3D models. Meanwhile, several Triplet architectures have also been proposed, which include a sketch branch, a positive natural image branch, and a negative natural image branch. In these methods, a ranking loss function is utilized to constrain the feature distance between a sketch and a positive natural image to be smaller than that between the sketch and a negative natural image. Sangkloy et al. [38] learned a cross-domain mapping through a pre-training strategy to embed natural images and sketches in the same semantic space, and achieved superior retrieval performance. Recently, deep hashing methods [41], [42] have been exploited for the retrieval task and have achieved significant improvements in retrieval performance. Liu et al. [34] integrated a deep architecture into the hashing framework to capture the cross-domain similarities and speed up the SBIR process. Zhang et al. [39] proposed a Generative Domain-migration Hashing (GDH) approach, which uses a generative model to migrate sketches to their indistinguishable natural image counterparts, and achieves the best-performing results on two SBIR datasets.

C. Attention Models

Attention models have recently been successfully applied to various deep learning tasks, such as natural language processing (NLP) [43], fine-grained image recognition [44], [45], video moment retrieval [46], and visual question answering (VQA) [47]. In the field of image and video processing, the two most commonly used attention models are soft-attention models [48] and hard-attention models [49]. Soft-attention models assign different weights to different regions or channels of an image or a video by learning an attention mask. In contrast, hard-attention models only attend to one region at a time, typically using reinforcement learning. For instance, Hu et al. [50] proposed a channel attention model to recalibrate the weights of different channels, which effectively enhances the discriminative power of features and achieves promising classification performance. Gao et al. [51] proposed a novel aLSTMs framework for video captioning, which integrates the attention mechanism and LSTM to capture salient structures for video. In [52], a hierarchical LSTM with adaptive attention was proposed for image and video captioning, and achieved state-of-the-art performance on both tasks. Besides, Li et al. [49] proposed a harmonious attention network for person re-identification, where soft attention is used to learn important pixels for fine-grained information matching, and hard attention is applied to search latent discriminative regions. Song et al. [35] proposed a soft spatial attention method for fine-grained SBIR to capture more discriminative fine-grained features; this model reweights the different spatial regions of the feature map by learning a weight mask for each branch of a triplet network. However, although the attention mechanisms above have strong abilities for feature learning, they generally learn discriminative features using only the input itself. For the SBIR task, we are more concerned with learning discriminative cross-domain features for retrieval; in other words, the common features of different domains should be considered simultaneously. To address this, a co-attention model is exploited in this paper, which focuses on capturing the informative features common between natural images and the corresponding edgemaps, and further mitigates the cross-domain gap.

Fig. 1. Illustration of the proposed Semi3-Net. (a) Three-way inputs. (b) Semi-heterogeneous feature mapping. (c) Joint semantic embedding. (d) Hybrid-loss mechanism. Blocks with the same color indicate that their weights are shared.
III. SEMI-HETEROGENEOUS THREE-WAY JOINT EMBEDDING NETWORK

A. Semi-Heterogeneous Feature Mapping

As shown in Fig. 1 (b), the semi-heterogeneous feature mapping part consists of the natural image branch FMB_I, the edgemap branch FMB_E, and the sketch branch FMB_S. Each branch includes a series of convolutional and pooling layers, which aim to learn the bottom features for each domain. As mentioned above, sketches and edgemaps have similar characteristics, and both lack detailed appearance information such as texture and color; thus, the weights of FMB_S and FMB_E are shared. Besides, since the amount of sketch training data is much smaller than that of natural image training data, sharing weights between the sketch and edgemap branches can partly alleviate problems that arise from the lack of sketch training data.

Meanwhile, since there exist obvious feature representation differences between natural images and sketches (edgemaps), the bottom convolutional layers of the natural image branch should be learned separately. Accordingly, the natural image branch FMB_I does not share weights with the other two branches in the feature mapping part. Additionally, with the aim of learning the informative features common between different domains, a co-attention model is introduced to associate the FMB_I and FMB_E branches. By applying the proposed co-attention model, the network is able to focus on the discriminative features common to both natural images and the corresponding edgemaps, and discard information that is not important for the retrieval task. For the sketch branch FMB_S, a channel-wise attention module is also introduced to learn more discriminative features. The co-attention model is discussed in detail in Section III-C.

Particularly, the "semi-heterogeneous" in the proposed architecture refers to the three-branch semi-heterogeneous weight-sharing strategy: the weights of the sketch and edgemap branches are shared, while the natural image branch is independent of the others in the semi-heterogeneous feature mapping part. The three branches are integrated into the semi-heterogeneous feature mapping architecture and interact with each other, which ensures that not only are the bottom features of each domain preserved, but the features from different domains are also prealigned through the co-attention model.

B. Joint Semantic Embedding

In the joint semantic embedding part, as shown in Fig. 1 (c), the natural image branch SEB_I, the edgemap branch SEB_E, and the sketch branch SEB_S are developed to embed the features from different domains into a common high-level semantic space. Each branch of the joint semantic embedding part includes several fully-connected (FC) layers. An extra embedding layer, followed by an L2 normalization layer, is also introduced in each branch. As previously stated, the bottom features of the different branches are learned separately by the semi-heterogeneous feature mapping. In order to achieve feature alignment between the natural images, edgemaps, and sketches in a common high-level semantic space, the weights of SEB_I, SEB_E, and SEB_S are completely shared in the joint semantic embedding part. Based on the features learned in the common high-level semantic space, a hybrid-loss mechanism is proposed to learn the invariant cross-domain representations and achieve a more discriminative embedding.
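To make the weight-sharing strategy concrete, the sketch below gives a minimal PyTorch rendering of the branch layout described above (our illustration, not the authors' released code). The VGG19 trunks and the 256-d embedding layer follow the implementation details given later in Section III-E; all module names are ours, and the co-attention model between the image and edgemap trunks is omitted here for brevity.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class Semi3NetSketch(nn.Module):
    """Minimal sketch of the semi-heterogeneous branch layout (names are ours)."""
    def __init__(self, num_classes, embed_dim=256):
        super().__init__()
        # Feature mapping: FMB_I is heterogeneous (its own weights), while
        # FMB_S and FMB_E share a single convolutional trunk.
        self.fmb_image = vgg19(weights=None).features
        self.fmb_sketch_edge = vgg19(weights=None).features
        # Joint semantic embedding: one head fully shared by all three
        # branches, ending in the extra 256-d embedding layer.
        self.seb_shared = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, embed_dim),
        )
        # Last FC layer resized to the number of dataset categories
        # (its placement relative to the L2 norm is our choice).
        self.classifier = nn.Linear(embed_dim, num_classes)

    def embed(self, feature_map):
        emb = self.seb_shared(feature_map)
        return F.normalize(emb, p=2, dim=1)  # L2 normalization layer

    def forward(self, image, edgemap, sketch):
        f_i = self.embed(self.fmb_image(image))
        f_e = self.embed(self.fmb_sketch_edge(edgemap))
        f_s = self.embed(self.fmb_sketch_edge(sketch))
        logits = [self.classifier(f) for f in (f_i, f_e, f_s)]
        return (f_i, f_e, f_s), logits
```

Because `fmb_sketch_edge` is a single module invoked by both the edgemap and sketch inputs, its weights receive gradients from both domains, which is exactly how the scarce sketch data can benefit from the more plentiful edgemaps.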
Fig. 2. Structure of the co-attention model. Yellow blocks represent layers that belong to the natural image branch, while green blocks represent layers that belong to the edgemap branch.

C. Co-attention Model

To capture the informative features common between natural images and the corresponding edgemaps, a co-attention model is proposed between the natural image and edgemap branches. As previously stated, edgemaps are reduced versions of natural images, where detailed appearance information is removed. Therefore, a natural image and its corresponding edgemap are spatially aligned, which can be fully exploited to narrow the gap between natural images and edgemaps. Additionally, since the weights of the sketch and edgemap branches are shared, the introduction of the co-attention model actually enforces mutual and subtle cooperation amongst the three branches. Specifically, before the bottom features of the three branches are fed into the joint semantic embedding part, the co-attention model prealigns the different domains by conducting common feature recalibration. As shown in Fig. 2, the proposed co-attention model includes two attention modules and a co-mask learning module. It takes the output feature maps of the last pooling layers in the natural image and edgemap branches as its inputs, and learns a common mask which is used to re-weight each channel of the feature maps in both branches.

In the attention module, a channel-wise soft attention mechanism is adopted to capture the discriminative features of each input. Specifically, the attention module on the right of Fig. 2 consists of a global average pooling (GAP) layer, two FC layers, a ReLU layer, and a sigmoid layer. Let $X_I^{a_i} \in \mathbb{R}^{h \times w \times c}$ and $X_E^{a_i} \in \mathbb{R}^{h \times w \times c}$ denote the inputs of the attention module for the natural image branch and the edgemap branch, respectively, where $h$, $w$, and $c$ represent the height, width, and channel dimensions of the feature maps. Through the GAP layer, the feature descriptors for the natural image and edgemap branches are obtained by aggregating the global spatial information of $X_I^{a_i}$ and $X_E^{a_i}$, which can be formulated as:

$$X_I^{gap} = \frac{1}{hw} \sum_{u=1}^{h} \sum_{v=1}^{w} X_I^{a_i}(u, v) \qquad (1)$$

$$X_E^{gap} = \frac{1}{hw} \sum_{u=1}^{h} \sum_{v=1}^{w} X_E^{a_i}(u, v) \qquad (2)$$

Based on $X_I^{gap}$ and $X_E^{gap}$, two FC layers and a ReLU layer are applied to model the channel-wise dependencies, and the attention maps in the two domains are obtained. By deploying a sigmoid layer, each channel of the attention map is normalized to $[0, 1]$. For the two domains, the final learned image attention mask $M_I \in \mathbb{R}^{1 \times 1 \times c}$ and edgemap attention mask $M_E \in \mathbb{R}^{1 \times 1 \times c}$ are respectively formulated as:

$$M_I = \mathrm{sigmoid}(W_I^2 \cdot \mathrm{ReLU}(W_I^1 \cdot X_I^{gap})) \qquad (3)$$

$$M_E = \mathrm{sigmoid}(W_E^2 \cdot \mathrm{ReLU}(W_E^1 \cdot X_E^{gap})) \qquad (4)$$

where $W_I^1$ and $W_E^1$ denote the weights of the first FC layer, and $W_I^2$ and $W_E^2$ denote the weights of the second FC layer.

In SBIR, the key challenge is to capture the discriminative information common to different domains and then align the different domains into a common high-level semantic space. Therefore, unlike most existing works that use the obtained attention mask directly to reweight the channel responses, the proposed co-attention model tries to capture the channel-wise dependencies common between different domains by learning a co-mask. Based on the learned image attention mask and edgemap attention mask, the co-mask $M_{CO}$ is defined as:

$$M_{CO} = M_I \odot M_E \qquad (5)$$

where $\odot$ denotes the element-wise product. Elements in $M_{CO} \in \mathbb{R}^{1 \times 1 \times c}$ represent the joint weights of the corresponding channels in $X_I^{a_i}$ and $X_E^{a_i}$. Afterwards, the output feature maps $X_I^{a_o} \in \mathbb{R}^{h \times w \times c}$ and $X_E^{a_o} \in \mathbb{R}^{h \times w \times c}$ for the image and edgemap branches are calculated by rescaling the input feature maps $X_I^{a_i}$ and $X_E^{a_i}$ with the obtained channel-wise co-mask:

$$X_I^{a_o} = f_{scale}(M_{CO}, X_I^{a_i}) \qquad (6)$$

$$X_E^{a_o} = f_{scale}(M_{CO}, X_E^{a_i}) \qquad (7)$$

where $f_{scale}(\cdot)$ denotes the channel-wise multiplication between the co-mask and the input feature maps.

The proposed co-attention model not only considers the channel-wise feature responses of each domain by introducing the attention mechanism, but also captures the common channel-wise relationship between different domains simultaneously. Specifically, by introducing the co-attention model between the natural image and edgemap branches, the common informative features from the different domains are highlighted, and the cross-domain gap is effectively reduced.
D. Hybrid-Loss Mechanism

SBIR aims to obtain a retrieval ranking by learning the similarity between a query sketch and the natural images in a dataset. The distance between positive sketch-image pairs should be smaller than the distance between negative sketch-image pairs. In the hybrid-loss mechanism, an alignment loss and a sketch-edgemap contrastive loss are presented to learn the invariant cross-domain representations and mitigate the domain gap. Besides, we also introduce the cross-entropy loss [38] and the sketch-image contrastive loss [30], which are two typical losses in SBIR. The four types of loss functions are complementary to each other, thus improving the separability between the different sample pairs.

To be clear, the feature maps produced by the L2 normalization layers of the three branches are denoted as $f_{\theta_I}(I)$, $f_{\theta_E}(E)$, and $f_{\theta_S}(S)$, where $f_{\theta}(\cdot)$ denotes the mapping function learned by the network branch, and $\theta_I$, $\theta_E$, and $\theta_S$ denote the weights of the natural image, edgemap, and sketch branches, respectively.

1) Alignment loss. To learn the invariant cross-domain representations and align the different domains into a high-level semantic space, a novel alignment loss is proposed between the natural image branch and the edgemap branch. Although an image and its corresponding edgemap come from different data domains, they should have similar high-level semantics after the processing of the joint semantic embedding. Motivated by this, and aiming to minimize the feature distance between an image and its corresponding edgemap in the high-level semantic space, the proposed alignment loss function is defined as:

$$L_{alignment}(I, E) = \| f_{\theta_I}(I) - f_{\theta_E}(E) \|_2 \qquad (8)$$

By introducing the alignment loss, the cross-domain invariant representations between a natural image and its corresponding edgemap are captured for the SBIR task. In other words, the proposed alignment loss provides a novel way of dealing with the domain gap, by constructing a correlation between a natural image and its corresponding edgemap. It potentially encourages the network to learn the discriminative features common to both the natural image and sketch domains, and successfully aligns the different domains into a common high-level semantic space.

2) Sketch-edgemap contrastive loss. Considering the one-to-one correlation between an image and its corresponding edgemap, the sketch-edgemap contrastive loss $L^{SE}_{contrastive}(S, E, l_{sim})$ between the sketch and edgemap branches is proposed to further constrain the matching relationship within a sketch-image pair, as follows:

$$L^{SE}_{contrastive}(S, E, l_{sim}) = l_{sim} \, d(f_{\theta_S}(S), f_{\theta_E}(E^+)) + (1 - l_{sim}) \max(0, m_1 - d(f_{\theta_S}(S), f_{\theta_E}(E^-))) \qquad (9)$$

where $l_{sim}$ denotes the similarity label, with 1 indicating a positive sketch-edgemap pair and 0 a negative sketch-edgemap pair, $E^+$ and $E^-$ denote the edgemaps corresponding to the positive and negative natural images, respectively, $d(\cdot)$ denotes the Euclidean distance, and $m_1$ denotes the margin. The sketch-edgemap contrastive loss aims to measure the similarity between input pairs from the sketch and edgemap branches, thus further aligning the different domains into the high-level semantic space.

3) Cross-entropy loss. In order to learn the discriminative features of each domain, the cross-entropy losses [38] for the three branches are introduced. For each branch of the proposed network, a softmax cross-entropy loss $L_{crossentropy}(p, y)$ is exploited, which is formulated as:

$$L_{crossentropy}(p, y) = -\sum_{k=1}^{K} y_k \log p_k = -\sum_{k=1}^{K} y_k \log \left( \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} \right) \qquad (10)$$

where $p = (p_1, \ldots, p_K)$ denotes the discrete probability distribution of one data sample over the $K$ categories, $y = (y_1, \ldots, y_K)$ denotes the typical one-hot label corresponding to each category, and $z = (z_1, \ldots, z_K)$ denotes the feature vector produced by the last FC layer. In the proposed Semi3-Net, the cross-entropy loss forces the network to extract discriminative features for each domain.

4) Sketch-image contrastive loss. Intuitively, in the SBIR task, a positive sketch-image pair should be close together, while a negative sketch-image pair should be far apart. Given a sketch $S$ and a natural image $I$, the sketch-image contrastive loss [30] can be represented as:

$$L^{SI}_{contrastive}(S, I, l_{sim}) = l_{sim} \, d(f_{\theta_S}(S), f_{\theta_I}(I^+)) + (1 - l_{sim}) \max(0, m_2 - d(f_{\theta_S}(S), f_{\theta_I}(I^-))) \qquad (11)$$

where $I^+$ and $I^-$ denote the positive and negative natural images, respectively, and $m_2$ denotes the margin. By utilizing the sketch-image contrastive loss, the cross-domain similarity between sketches and natural images is effectively measured.

Finally, the alignment loss in Eq. (8), the sketch-edgemap contrastive loss in Eq. (9), the cross-entropy loss in Eq. (10), and the sketch-image contrastive loss in Eq. (11) are combined, and the overall loss function $L(S, I, E, p_D, y_D, l_{sim})$ is derived as:

$$L(S, I, E, p_D, y_D, l_{sim}) = \sum_{D=I,E,S} L_{crossentropy}(p_D, y_D) + \alpha L^{SI}_{contrastive}(S, I, l_{sim}) + \beta L_{alignment}(I, E) + \gamma L^{SE}_{contrastive}(S, E, l_{sim}) \qquad (12)$$

where $\alpha$, $\beta$, and $\gamma$ denote the hyper-parameters that control the trade-off among the different types of losses.

The proposed hybrid-loss mechanism constructs the correlation among sketches, edgemaps, and natural images, providing a novel way to deal with the domain gap by learning the invariant cross-domain representations. By adopting the hybrid-loss mechanism, the proposed network is able to learn more discriminative feature representations and effectively align the sketches, natural images, and edgemaps into a common feature space, thus improving the retrieval accuracy.
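As a concrete reference, the hybrid loss of Eqs. (8)-(12) can be sketched in PyTorch as below. This is a simplified batch formulation under our assumptions: the embeddings are the L2-normalized branch outputs, `sim` is the pair label $l_{sim}$, each branch carries its own category labels, and the assignment of $\alpha$, $\beta$, $\gamma$ to the three pair losses follows their order in Eq. (12); the weights and margins shown are the values reported later in Section IV-A.

```python
import torch.nn.functional as F

def alignment_loss(f_i, f_e):
    # Eq. (8): L2 distance between an image and its own edgemap embedding.
    return (f_i - f_e).norm(p=2, dim=1).mean()

def contrastive_loss(f_a, f_b, sim, margin):
    # Eqs. (9)/(11): pull positive pairs together (sim=1), push negative
    # pairs (sim=0) apart until they clear the margin.
    d = (f_a - f_b).norm(p=2, dim=1)
    return (sim * d + (1 - sim) * F.relu(margin - d)).mean()

def hybrid_loss(logits, labels, f_i, f_e, f_s, sim,
                alpha=10.0, beta=100.0, gamma=10.0, m1=0.3, m2=0.3):
    # Eq. (10): softmax cross-entropy, summed over the three branches
    # (logits and labels are per-branch lists).
    ce = sum(F.cross_entropy(z, y) for z, y in zip(logits, labels))
    # Eq. (12): weighted combination of the four losses.
    return (ce
            + alpha * contrastive_loss(f_s, f_i, sim, m2)   # Eq. (11)
            + beta * alignment_loss(f_i, f_e)               # Eq. (8)
            + gamma * contrastive_loss(f_s, f_e, sim, m1))  # Eq. (9)
```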
Given are learned jointly, and the cross-domain representations are a sketch S and a natural image I , the sketch-image contrastive obtained by training the whole Semi3-Net. The overall loss loss [30] can be represented as: function in Eq. (12) is utilized in the training stage. As for SI + L (S; I; l ) = l d(f (S); f (I )) sim sim contrastive S I the sketch-edgemap and the sketch-image contrastive losses +(1 l ) max(0; m d(f (S); f (I ))) sim 2 S I illustrated above, the sketch-image and sketch-edgemap pairs (11) for training need to be generated. To this end, for each sketch where I and I denote the positive and negative natural in the training dataset, we randomly select a natural image images, respectively, and m denotes the margin. By utilizing (edgemap) from the same category to form the positive pair the sketch-image contrastive loss, the cross-domain similarity and a natural image (edgemap) from the other categories to between sketches and natural images is effectively measured. form the negative pair. In the training process, the ratio of Finally, the alignment loss in Eq. (8), the sketch-edgemap positive and negative sample pairs is set to 1:1, and the positive contrastive loss in Eq. (9), the cross-entropy loss in Eq. (10), and negative pairs are randomly selected following this rule, and the sketch-image contrastive loss in Eq. (11) are com- for each training batch. bined, thus the overall loss function L(S; I; E;p ;y ; l ) D D sim is derived as: IV. E XPERIM ENTS L(S; I; E;p ;y ; l ) = L (p ;y ) D D sim crossentropy D D A. Experimental Settings D=I;E;S In this paper, two category-level SBIR benchmarks, Sketchy SI + L (S; I; l ) contrastive sim [38] and TU-Berlin Extension [34], are adopted to evaluate the + L (I; E) alignment proposed Semi3-Net. Sketchy consists of 75,471 sketches and SE 12,500 natural images, from 125 categories with 100 objects + L (S; E; l ) sim contrastive (12) per category. In our experiments, we utilize the extended where , , and denote the hyper-parameters that control Sketchy dataset [34] with 73,002 natural images in total, which the trade-off among different types of losses. adds an extra 60,502 natural images collected from ImageNet The proposed hybrid-loss mechanism constructs the correla- [54]. TU-Berlin Extension [34] consists of 204,489 natural tion among sketches, edgemaps, and natural images, providing images and 20k free-hand drawn sketches from 250 categories, a novel way to deal with the domain gap by learning the in- with 80 sketches per category. The natural images in these variant cross-domain representations. By adopting the hybrid- two datasets are all realistic with complex backgrounds and 7 large variations, thus bringing great challenges to the SBIR TABLE I COMPARISON WITH THE STATE-OF-THE-ART SBIR METHODS task. Importantly, for fair comparison against state-of-the-art ON SKETCHY AND TU-BERLIN EXTENSION DATASETS. methods, the same training-testing splits used for the existing methods are adopted in our experiments. For Sketchy and TU- TU-Berlin Sketchy berlin Extension, 50 and 10 sketches, respectively, of each Methods Extension category are utilized as the testing queries, while the rest are MAP MAP used for training. 3D shape [36] 0:084 0:054 All experiments are performed under the simulation envi- HOG [20] 0:115 0:091 ronments of GeForce GTX 1080 Ti GPU and Intel i7-8700K GF-HOG [11] 0:157 0:119 processor @3.70 GHz. 
IV. EXPERIMENTS

A. Experimental Settings

In this paper, two category-level SBIR benchmarks, Sketchy [38] and TU-Berlin Extension [34], are adopted to evaluate the proposed Semi3-Net. Sketchy consists of 75,471 sketches and 12,500 natural images from 125 categories, with 100 objects per category. In our experiments, we utilize the extended Sketchy dataset [34] with 73,002 natural images in total, which adds an extra 60,502 natural images collected from ImageNet [54]. TU-Berlin Extension [34] consists of 204,489 natural images and 20k free-hand drawn sketches from 250 categories, with 80 sketches per category. The natural images in these two datasets are all realistic, with complex backgrounds and large variations, thus bringing great challenges to the SBIR task. Importantly, for a fair comparison against state-of-the-art methods, the same training-testing splits used by the existing methods are adopted in our experiments. For Sketchy and TU-Berlin Extension, 50 and 10 sketches of each category, respectively, are utilized as the testing queries, while the rest are used for training.

All experiments are performed under the simulation environment of a GeForce GTX 1080 Ti GPU and an Intel i7-8700K processor @3.70 GHz. The training process of the proposed Semi3-Net is implemented using SGD on Caffe [55] with a batch size of 32. The initial learning rate is set to 2e-4, and the weight decay and the momentum are set to 5e-4 and 0.9, respectively. For both datasets, the balance parameters $\alpha$, $\beta$, and $\gamma$ are set to 10, 100, and 10, respectively, as they consistently yield promising results. The margins $m_1$ and $m_2$ for the two contrastive losses are both set to 0.3.

In the proposed method, the Gb method [56] is applied to extract the edgemap of each natural image. During the testing phase, sketches, natural images, and the corresponding edgemaps are fed into the trained model, and the feature vectors of the three branches are obtained. Then, the cosine distance between the feature vectors of the query sketch and each natural image in the dataset is calculated. Finally, KNN is utilized to sort all the natural images for the final retrieval result. Similar to the existing SBIR methods, mean average precision (MAP) is used to evaluate the retrieval performance.
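The test-time protocol (cosine distance plus a KNN-style full ranking, scored by MAP) can be summarized as follows; this is our paraphrase of the procedure described above, not released evaluation code.

```python
import torch

def rank_gallery(query_emb, gallery_embs):
    # Embeddings are L2-normalized, so cosine similarity reduces to a dot
    # product; descending similarity = ascending cosine distance.
    return torch.argsort(gallery_embs @ query_emb, descending=True)

def mean_average_precision(orderings, relevance):
    # orderings: ranked gallery indices per query;
    # relevance: 0/1 tensor over the gallery for each query.
    ap_sum = 0.0
    for order, rel in zip(orderings, relevance):
        hits = rel[order].float()                 # 0/1 in ranked order
        precision = hits.cumsum(0) / torch.arange(1, len(hits) + 1)
        ap_sum += (precision * hits).sum() / hits.sum().clamp(min=1)
    return ap_sum / len(orderings)
```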
B. Comparison Results

Table I shows the performance of the proposed method compared with the state-of-the-art methods on the Sketchy and TU-Berlin Extension datasets. The comparison methods include traditional methods (i.e., LKS [13], HOG [20], SHELO [12], and GF-HOG [11]) and deep learning methods (i.e., Siamese CNN [33], SaN [37], GN Triplet [38], 3D shape [36], Siamese-AlexNet [34], Triplet-AlexNet [34], DSH [34], and GDH [39]).

TABLE I
COMPARISON WITH THE STATE-OF-THE-ART SBIR METHODS ON THE SKETCHY AND TU-BERLIN EXTENSION DATASETS.

Methods               | Sketchy (MAP) | TU-Berlin Extension (MAP)
3D shape [36]         | 0.084         | 0.054
HOG [20]              | 0.115         | 0.091
GF-HOG [11]           | 0.157         | 0.119
SHELO [12]            | 0.161         | 0.123
LKS [13]              | 0.190         | 0.157
SaN [37]              | 0.208         | 0.154
Siamese CNN [33]      | 0.481         | 0.322
Siamese-AlexNet [34]  | 0.518         | 0.367
GN Triplet [38]       | 0.529         | 0.187
Triplet-AlexNet [34]  | 0.573         | 0.448
DSH [34]              | 0.783         | 0.570
GDH [39]              | 0.810         | 0.690
Our method            | 0.916         | 0.800

The following can be observed from the table. 1) Compared with the methods that utilize edgemaps to replace natural images for retrieval [33], [36], the proposed method obtains better retrieval accuracy. This is mainly because the edgemaps extracted from natural images may lose information useful for retrieval, and CNNs pre-trained on ImageNet are more effective for natural images than for edgemaps. 2) Compared with the methods that only utilize sketch-image pairs for SBIR [38], the proposed Semi3-Net achieves better performance by introducing edgemap information. This demonstrates that an edgemap can be utilized as a bridge to effectively narrow the distance between the natural image and sketch domains. 3) Compared with DSH [34], which fuses the natural image and edgemap features into one feature representation, the proposed Semi3-Net achieves improvements of 0.133 and 0.230 in MAP on Sketchy and TU-Berlin Extension, respectively. This is mainly because the proposed Semi3-Net makes full use of the one-to-one matching relationship between a natural image and its corresponding edgemap to align the different domains into a high-level semantic space; meanwhile, the domain shift process is well achieved under the proposed semi-heterogeneous joint embedding network architecture. Besides, by integrating the co-attention model and the hybrid-loss mechanism, the proposed Semi3-Net is encouraged to learn a more discriminative embedding and invariant cross-domain representations simultaneously. 4) The proposed Semi3-Net not only achieves superior performance over all traditional methods, but also outperforms the current best state-of-the-art deep learning method, GDH [39], by 0.106 and 0.110 in MAP on Sketchy and TU-Berlin Extension, respectively. This further validates the effectiveness of the proposed Semi3-Net for the SBIR task.

Fig. 3 shows some retrieval examples produced by the proposed Semi3-Net on the Sketchy dataset. Specifically, 10 relatively challenging query sketches with their top-15 retrieval rank lists are presented, where the incorrect retrieval results are marked with red bounding boxes. As can be seen, the proposed method performs well when retrieving natural images with complex backgrounds, such as the queries "Umbrella" and "Racket". This is mainly because the key information common to different domains can be effectively captured by the proposed semi-heterogeneous feature mapping part with the co-attention model. Additionally, for relatively abstract sketches, such as the sketch "Cat", the proposed method still achieves consistently superior performance. This indicates that the semantic features of abstract sketches are extracted by embedding the features from different domains into a high-level semantic space. In other words, although abstract sketches only consist of black and white pixels, the proposed Semi3-Net can gain a better understanding of the sketch domain by introducing the joint semantic embedding part. However, there are also some incorrect retrieval results in Fig. 3. For example, for the query "Fish", a dolphin is retrieved by the proposed method; meanwhile, for the query "Duck", several swans are retrieved. The incorrectly retrieved natural images are, however, quite similar to the query: the fish and duck sketches, which lack color and texture information, are very similar in shape to a dolphin and a swan, respectively, which makes the SBIR task more challenging.

Fig. 3. Some retrieval examples for the proposed Semi3-Net on the Sketchy dataset. Incorrect retrieval results are marked with red bounding boxes.

To further verify the effectiveness of the proposed Semi3-Net, the t-SNE [57] visualizations of natural images and sketches from ten categories of Sketchy are reported. As shown in Fig. 4, the ten categories are selected randomly from the dataset, and the labels of the selected categories are also illustrated in the upper right corner. We run the t-SNE visualization on both images and sketches together, then separate the projected data points into Fig. 4 (a) and Fig. 4 (b), respectively. Specifically, the circles in Fig. 4 represent clusters of different categories, and the data in each circle belong to the same category in the high-level semantic space. In other words, circles in the same position of Fig. 4 (a) and Fig. 4 (b) correspond to the natural images and sketches with the same label, respectively. If data samples with the same label but from two different domains are correctly aligned in the common feature space, it indicates that the cross-domain learning is successful. As can be seen, the natural images and sketches with the same label scatter into nearly the same clusters. This further demonstrates that the proposed Semi3-Net effectively aligns sketches and natural images into a common high-level semantic space, thus improving the retrieval accuracy.

Fig. 4. t-SNE visualization of the natural images and query sketches from ten categories in the Sketchy dataset during the test phase. Symbols with the same color and shape in (a) and (b) represent the natural images and query sketches with the same label, respectively.
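The Fig. 4 protocol, as described above, amounts to a joint t-SNE projection that is split back into the two domains afterwards; below is a minimal scikit-learn sketch (our illustration).

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_split(image_embs, sketch_embs, seed=0):
    """Run t-SNE on images and sketches together, then separate the
    projected points back into the two domains for plotting."""
    joint = np.concatenate([image_embs, sketch_embs], axis=0)
    proj = TSNE(n_components=2, random_state=seed).fit_transform(joint)
    return proj[: len(image_embs)], proj[len(image_embs):]
```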
C. Ablation Study

1) Evaluation of the semi-heterogeneous architecture: To evaluate the superiority of the proposed semi-heterogeneous network architecture, networks with different weight-sharing strategies are compared with the proposed Semi3-Net. For a fair comparison, the proposed network architecture is fixed while the different weight-sharing strategies are verified; in other words, both the proposed co-attention model and the hybrid-loss mechanism are applied in all of the compared networks. The comparison results are shown in Table II, where "all sharing" indicates that the weights of the three branches are completely shared, "only FC layer sharing" indicates that the weights of the three branches are shared only in the semantic embedding part, and "only sketch-edgemap sharing" indicates that only the weights of the sketch branch and the edgemap branch are shared in the feature mapping part.

TABLE II
EVALUATION OF THE SEMI-HETEROGENEOUS ARCHITECTURE ON THE SKETCHY DATASET.

Methods                     | MAP
All sharing                 | 0.857
Only FC layer sharing       | 0.879
Only sketch-edgemap sharing | 0.890
Semi3-Net                   | 0.916

As can be seen, the method with the all-sharing strategy does not perform as well as the methods with partially-shared strategies. This supports our view, illustrated previously, that the bottom discriminative features should be learned separately for the different domains. Compared with the proposed Semi3-Net, the method with only the FC layer sharing strategy obtains 0.879 MAP, because the intrinsic relevance between the sketches and edgemaps is ignored. Meanwhile, the experimental results show that the method with only the sketch-edgemap sharing strategy also results in a decrease in MAP, because the three branches are not fully aligned in the common high-level semantic space. Importantly, the proposed Semi3-Net, with both the FC layer sharing and sketch-edgemap sharing strategies, achieves the best performance, which proves the effectiveness of the proposed network architecture.

2) Evaluation of key components: In order to illustrate the contributions of the key components in the proposed method, i.e., the semi-heterogeneous three-way framework, the co-attention model (CAM), and the hybrid-loss mechanism (HLM), leaving-one-out evaluations are conducted on the Sketchy dataset. The experimental results are shown in Table III. Note that "w/o CAM and w/o HLM" refers to the proposed semi-heterogeneous three-way framework with neither the co-attention model nor the hybrid-loss mechanism. For "w/o HLM", only two types of typical SBIR losses, the cross-entropy loss and the sketch-image contrastive loss, are used in the training phase.

TABLE III
EVALUATION OF KEY COMPONENTS ON THE SKETCHY DATASET.

Methods             | MAP
w/o CAM and w/o HLM | 0.851
w/o HLM             | 0.880
w/o CAM             | 0.896
Semi3-Net           | 0.916
As shown in Table III, the proposed semi-heterogeneous three-way framework alone obtains 0.851 MAP, already outperforming the other methods in Table I. Additionally, leaving out either CAM or HLM results in a lower MAP, which verifies that each component contributes to the overall performance. Specifically, the informative features common between natural images and the corresponding edgemaps are captured effectively by the proposed co-attention mechanism, and the three different inputs can be effectively aligned into a common feature space by introducing the hybrid-loss mechanism.

D. Discussions

In this section, the impact of different feature selections in the test phase is discussed for the Sketchy dataset. As mentioned above, features extracted from either the natural image branch or the edgemap branch can be used as the final retrieval feature representation. To evaluate the impact of the different feature selections in the test phase, we obtain the 256-d feature vectors from the embedding layers of the natural image branch and the edgemap branch, respectively. The experiments are conducted on the Sketchy dataset, and the experimental results are reported in Table IV.

TABLE IV
EVALUATION OF DIFFERENT FEATURE SELECTIONS IN THE TEST PHASE ON THE SKETCHY DATASET.

Methods               | MAP
Edgemap feature       | 0.914
Natural image feature | 0.916

As can be seen from the table, there is only a small difference between the retrieval performances obtained with the edgemap feature and the natural image feature. This is mainly because the invariant cross-domain representations of the different domains are indeed learned by the proposed method, and the cross-domain gap is narrowed by the joint embedding learning. Besides, the feature extracted from the natural image branch performs a little better than that from the edgemap branch, which is also consistent with the research in [30], [38].

In addition, to verify the performance without the edgemap branch, we conducted experiments on a two-branch framework using only the sketch and natural image branches. Considering that sketches and natural images belong to two different domains, a non-shared setting is first exploited on the two-branch framework. Note that, without the edgemap branch, neither the co-attention model nor the hybrid-loss mechanism is added to the two-branch framework. Compared with the proposed Semi3-Net with 0.916 MAP, the two-branch framework obtains 0.837 MAP on the Sketchy dataset, which sufficiently proves the effectiveness of the edgemap branch in the proposed Semi3-Net. In addition, a partially-shared setting is also exploited on the two-branch framework, in which the weights of the convolutional layers are independent while the weights of the fully-connected layers are shared. Compared with the non-shared setting, the partially-shared setting obtains 0.861 MAP. This further verifies the importance of the joint semantic embedding for SBIR.

V. CONCLUSION

In this paper, we propose a novel semi-heterogeneous three-way joint embedding network, where auxiliary edgemap information is introduced as a bridge to narrow the cross-domain gap between sketches and natural images. The semi-heterogeneous feature mapping and the joint semantic embedding are proposed to learn the specific bottom features of each domain and to embed the different domains into a common high-level semantic space, respectively. Besides, a co-attention model is proposed to capture the informative features common between natural images and the corresponding edgemaps, by recalibrating the corresponding channel-wise feature responses. In addition, a hybrid-loss mechanism is designed to construct the correlation among sketches, edgemaps, and natural images, so that the invariant cross-domain representations of the different domains can be effectively learned. Experimental results on two datasets demonstrate that Semi3-Net outperforms the state-of-the-art methods, which proves the effectiveness of the proposed method.

In the future, we will focus on extending the proposed cross-domain network to fine-grained image retrieval, and on learning the correspondence of fine-grained details for sketch-image pairs. Besides, further study may also include extending our method to other cross-domain learning problems.

REFERENCES

[1] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, “Semantic-aware co-indexing for image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2573-2587, 2015.
[2] D. Wu, Z. Lin, B. Li, J. Liu, and W. Wang, “Deep uniqueness-aware hashing for fine-grained multi-label image retrieval,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 1683-1687.
[3] B. Peng, J. Lei, H. Fu, C. Zhang, T.-S. Chua, and X. Li, “Unsupervised video action clustering via motion-scene interaction constraint,” IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2018.2889514, 2018.
[4] X. Shang, H. Zhang, and T.-S. Chua, “Deep learning generic features for cross-media retrieval,” in Proc. International Conference on MultiMedia Modeling (MMM), Jun. 2018, pp. 15-24.
[5] J. M. Saavedra, “RST-SHELO: sketch-based image retrieval using sketch tokens and square root normalization,” Multimedia Tools and Applications, vol. 76, no. 1, pp. 931-951, 2017.
[6] K. Li, K. Pang, Y.-Z. Song, T. Hospedales, T. Xiang, and H. Zhan, “Synergistic instance-level subspace alignment for fine-grained sketch-based image retrieval,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5908-5921, 2017.
[7] G. Tolias and O. Chum, “Asymmetric feature maps with application to sketch based retrieval,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 6185-6193.
[8] J. Lei, K. Zheng, H. Zhang, X. Cao, N. Ling, and Y. Hou, “Sketch based image retrieval via image-aided cross domain learning,” in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2017, pp. 3685-3689.
[9] J. Song, K. Pang, Y.-Z. Song, T. Xiang, and T. M. Hospedales, “Learning to sketch with shortcut cycle consistency,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 801-810.
[10] P. Xu, Y. Huang, T. Yuan, K. Pang, Y.-Z. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo, “SketchMate: deep hashing for million-scale human sketch retrieval,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 8090-8098.
[11] R. Hu, M. Barnard, and J. Collomosse, “Gradient field descriptor for sketch based retrieval and localization,” in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2010, pp. 1025-1028.
[12] J. M. Saavedra, “Sketch based image retrieval using a soft computation of the histogram of edge local orientations (S-HELO),” in Proc. IEEE International Conference on Image Processing (ICIP), Oct. 2014, pp. 2998-3002.
[13] J. M. Saavedra and J. M. Barrios, “Sketch based image retrieval using learned keyshapes (LKS),” in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 164.1-164.11.
[14] D. Xu, X. Alameda-Pineda, J. Song, E. Ricci, and N. Sebe, “Academic coupled dictionary learning for sketch-based image retrieval,” in Proc. ACM International Conference on Multimedia (ACM MM), Oct. 2016, pp. 1326-1335.
[15] X. Qian, X. Tan, Y. Zhang, R. Hong, and M. Wang, “Enhancing sketch-based image retrieval by re-ranking and relevance feedback,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 195-208, 2016.
[16] S. Wang, J. Zhang, T. Han, and Z. Miao, “Sketch-based image retrieval through hypothesis-driven object boundary selection with HLR descriptor,” IEEE Transactions on Multimedia, vol. 17, no. 7, pp. 1045-1057, 2015.
[17] J. M. Saavedra, B. Bustos, and S. Orand, “Sketch-based image retrieval using keyshapes,” in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 164.1-164.11.
[18] Y. Zhang, X. Qian, X. Tan, J. Han, and Y. Tang, “Sketch-based image retrieval by salient contour reinforcement,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1604-1615, 2016.
[19] Y. Qi, Y.-Z. Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li, and J. Guo, “Making better use of edges via perceptual grouping,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1856-1865.
[20] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2005, pp. 886-893.
Lowe, “Object recognition from local scale-invariant features,” in Proc. IEEE International Conference on Computer Vision (ICCV), Sept. pairs. Besides, further study may also include extending our 1999, pp. 1-8. method to other cross-domain learning problems. [22] G. Mori, S. Belongie, and J. Malik, “Efficient shape matching using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1832-1837, 2005. REFERENCES [23] M. Eitz, J. Hays, and M. Alexa, “How do humans sketch objects?” ACM Transactions on Graphics, vol. 31, no. 4, pp. 1-10, 2012. [1] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, “Semantic-aware co- [24] U. R. Muhammad, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales, indexing for image retrieval,” IEEE Transactions on Pattern Analysis “Learning deep sketch abstraction,” in Proc. IEEE Conference on and Machine Intelligence, vol. 37, no. 12, pp. 2573-2587, 2015. Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 8014- [2] D. Wu, Z. Lin, B. Li, J. Liu, and W. Wang, “Deep uniqueness- aware hashing for fine-grained multi-label image retrieval,” in Proc. [25] J. A. Landay and B. A. Myers, “Sketching interfaces: toward more International Conference on Acoustics, Speech and Signal Processing human interface design,” IEEE Computer, vol. 34, no. 3, pp. 56-64, (ICASSP), Apr. 2018, pp. 1683-1687. [3] B. Peng, J. Lei, H. Fu, C. Zhang, T.-S. Chua, and X. Li, “Unsupervised [26] J. Lei, L. Niu, H. Fu, Bo, Peng, Q. Huang, and C. Hou, “Person video action clustering via motion-scene interaction constraint,” IEEE re-identification by semantic region representation and topology con- Transactions on Circuits and Systems for Video Technology, DOI: straint,” IEEE Transactions on Circuits and Systems for Video Technol- 10.1109/TCSVT.2018.2889514, 2018. ogy, DOI: 10.1109/TCSVT.2018.2866260, 2018. [4] X. Shang, H. Zhang, and T.-S. Chua, “Deep learning generic features for [27] J. Cao, Y. Pang, and X. Li, “Learning multi-layer channel features for cross-media retrieval,” in Proc. International Conference on MultiMedia pedestrian detection,” IEEE Transactions on Image Processing, vol. 26, Modeling (MMM), Jun. 2018, pp. 15-24. no. 7, pp. 3210-3220, 2017. [5] J. M. Saavedra, “Rst-shelo: sketch-based image retrieval using sketch [28] Y. Pang, M. Sun, X. Jiang, and X. Li, “Convolution in convolution tokens and square root normalization,” Multimedia Tools and Applica- for network in network,” IEEE Transactions on Neural Networks and tions, vol. 76, no. 1, pp. 931-951, 2017. Learning Systems, vol. 29, no. 5, pp. 1587-1597, 2018. 11 [29] T. Han, H. Yao, C. Xu, X. Sun, Y. Zhang, and J. J. Corso, “Dancelets [51] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Videos captioning mining for video recommendation based on dance styles,” IEEE Trans- with attention-based LSTM and semantic consistency,” IEEE Transac- actions on Multimedia, vol. 19, no. 4, pp. 712-724, 2017. tions on Multimedia, vol. 19, no. 9, pp. 2045-2055, 2017. [30] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse, “Sketching out the de- [52] L. Gao, X. Li, J. Song, and H. T. Shen, “Hierarchical LSTMs with adap- tails: Sketch-based image retrieval using convolutional neural networks tive attention for visual captioning,” IEEE Transactions on Pattern Anal- with multi-stage regression,” Computers & Graphics, vol. 77, pp. 77-87, ysis and Machine Intelligence, DOI: 10.1109/TPAMI.2019.2894139, 2018. 2019. [31] H. Zhang, C. Zhang, and M. Wu, “Sketch-based cross-domain image [53] K. 
REFERENCES

[1] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, “Semantic-aware co-indexing for image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2573-2587, 2015.
[2] D. Wu, Z. Lin, B. Li, J. Liu, and W. Wang, “Deep uniqueness-aware hashing for fine-grained multi-label image retrieval,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 1683-1687.
[3] B. Peng, J. Lei, H. Fu, C. Zhang, T.-S. Chua, and X. Li, “Unsupervised video action clustering via motion-scene interaction constraint,” IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2018.2889514, 2018.
[4] X. Shang, H. Zhang, and T.-S. Chua, “Deep learning generic features for cross-media retrieval,” in Proc. International Conference on MultiMedia Modeling (MMM), Jun. 2018, pp. 15-24.
[5] J. M. Saavedra, “RST-SHELO: sketch-based image retrieval using sketch tokens and square root normalization,” Multimedia Tools and Applications, vol. 76, no. 1, pp. 931-951, 2017.
[10] …, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 8090-8098.
[11] R. Hu, M. Barnard, and J. Collomosse, “Gradient field descriptor for sketch based retrieval and localization,” in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2010, pp. 1025-1028.
[12] J. M. Saavedra, “Sketch based image retrieval using a soft computation of the histogram of edge local orientations (S-HELO),” in Proc. IEEE International Conference on Image Processing (ICIP), Oct. 2014, pp. 2998-3002.
[13] J. M. Saavedra and J. M. Barrios, “Sketch based image retrieval using learned keyshapes (LKS),” in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 164.1-164.11.
[14] D. Xu, X. Alameda-Pineda, J. Song, E. Ricci, and N. Sebe, “Academic coupled dictionary learning for sketch-based image retrieval,” in Proc. ACM International Conference on Multimedia (ACM MM), Oct. 2016, pp. 1326-1335.
[15] X. Qian, X. Tan, Y. Zhang, R. Hong, and M. Wang, “Enhancing sketch-based image retrieval by re-ranking and relevance feedback,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 195-208, 2016.
[16] S. Wang, J. Zhang, T. Han, and Z. Miao, “Sketch-based image retrieval through hypothesis-driven object boundary selection with HLR descriptor,” IEEE Transactions on Multimedia, vol. 17, no. 7, pp. 1045-1057, 2015.
[17] J. M. Saavedra, B. Bustos, and S. Orand, “Sketch-based image retrieval using keyshapes,” in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 164.1-164.11.
[18] Y. Zhang, X. Qian, X. Tan, J. Han, and Y. Tang, “Sketch-based image retrieval by salient contour reinforcement,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1604-1615, 2016.
[19] Y. Qi, Y.-Z. Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li, and J. Guo, “Making better use of edges via perceptual grouping,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1856-1865.
[20] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2005, pp. 886-893.
[21] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. IEEE International Conference on Computer Vision (ICCV), Sept. 1999, pp. 1-8.
[22] G. Mori, S. Belongie, and J. Malik, “Efficient shape matching using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1832-1837, 2005.
[23] M. Eitz, J. Hays, and M. Alexa, “How do humans sketch objects?” ACM Transactions on Graphics, vol. 31, no. 4, pp. 1-10, 2012.
[24] U. R. Muhammad, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales, “Learning deep sketch abstraction,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 8014-8023.
[25] J. A. Landay and B. A. Myers, “Sketching interfaces: toward more human interface design,” IEEE Computer, vol. 34, no. 3, pp. 56-64, 2001.
[26] J. Lei, L. Niu, H. Fu, B. Peng, Q. Huang, and C. Hou, “Person re-identification by semantic region representation and topology constraint,” IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2018.2866260, 2018.
[27] J. Cao, Y. Pang, and X. Li, “Learning multi-layer channel features for pedestrian detection,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3210-3220, 2017.
[28] Y. Pang, M. Sun, X. Jiang, and X. Li, “Convolution in convolution for network in network,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1587-1597, 2018.
[29] T. Han, H. Yao, C. Xu, X. Sun, Y. Zhang, and J. J. Corso, “Dancelets mining for video recommendation based on dance styles,” IEEE Transactions on Multimedia, vol. 19, no. 4, pp. 712-724, 2017.
[30] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse, “Sketching out the details: Sketch-based image retrieval using convolutional neural networks with multi-stage regression,” Computers & Graphics, vol. 77, pp. 77-87, 2018.
[31] H. Zhang, C. Zhang, and M. Wu, “Sketch-based cross-domain image retrieval via heterogeneous network,” in Proc. IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2017, pp. 1-4.
[32] Q. Yu, F. Liu, Y.-Z. Song, and T. Xiang, “Sketch me that shoe,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 799-807.
[33] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu, “Sketch-based image retrieval via siamese convolutional neural network,” in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2016, pp. 2460-2464.
[34] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao, “Deep sketch hashing: fast free-hand sketch-based image retrieval,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 2298-2307.
[35] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales, “Deep spatial-semantic attention for fine-grained sketch-based image retrieval,” in Proc. IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 5552-5561.
[36] F. Wang, L. Kang, and Y. Li, “Sketch-based 3D shape retrieval using convolutional neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1875-1883.
[37] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. Hospedales, “Sketch-a-net that beats humans,” in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 1-12.
[38] P. Sangkloy, N. Burnell, C. Ham, and J. Hays, “The sketchy database: learning to retrieve badly drawn bunnies,” ACM Transactions on Graphics, vol. 35, no. 4, pp. 1-12, 2016.
[39] J. Zhang, F. Shen, L. Liu, F. Zhu, M. Yu, L. Shao, H. T. Shen, and L. Van Gool, “Generative domain-migration hashing for sketch-to-image retrieval,” in Proc. European Conference on Computer Vision (ECCV), Sept. 2018, pp. 297-314.
[40] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Advances in Neural Information Processing Systems (NIPS), Dec. 2012, pp. 1097-1105.
[41] J. Song, H. Zhang, X. Li, L. Gao, M. Wang, and R. Hong, “Self-supervised video hashing with hierarchical binary auto-encoder,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3210-3221, 2018.
[42] J. Song, L. Gao, L. Liu, X. Zhu, and N. Sebe, “Quantization based hashing: A general framework for scalable image and video retrieval,” Pattern Recognition, vol. 75, pp. 175-187, 2018.
[43] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: adaptive attention via a visual sentinel for image captioning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 375-383.
[44] M. Sun, Y. Yuan, F. Zhou, and E. Ding, “Multi-attention multi-class constraint for fine-grained image recognition,” in Proc. European Conference on Computer Vision (ECCV), Sept. 2018, pp. 805-821.
[45] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application of two-level attention models in deep convolutional neural network for fine-grained image classification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 842-850.
[46] M. Liu, X. Wang, L. Nie, X. He, B. Chen, and T.-S. Chua, “Attentive moment retrieval in videos,” in Proc. International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Jun. 2018, pp. 15-24.
[47] H. Nam, J.-W. Ha, and J. Kim, “Dual attention networks for multimodal reasoning and matching,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 2156-2164.
[48] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 3156-3164.
[49] W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 2285-2294.
[50] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 7132-7141.
[51] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention-based LSTM and semantic consistency,” IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045-2055, 2017.
[52] L. Gao, X. Li, J. Song, and H. T. Shen, “Hierarchical LSTMs with adaptive attention for visual captioning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2019.2894139, 2019.
[53] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. International Conference on Learning Representations (ICLR), May 2015, pp. 1-14.
[54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: a large-scale hierarchical image database,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2009, pp. 248-255.
[55] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: convolutional architecture for fast feature embedding,” in Proc. ACM International Conference on Multimedia (ACM MM), Nov. 2014, pp. 675-678.
[56] M. Leordeanu, R. Sukthankar, and C. Sminchisescu, “Efficient closed-form solution to generalized boundary detection,” in Proc. European Conference on Computer Vision (ECCV), Oct. 2012, pp. 516-529.
[57] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
Jianjun Lei (M'11-SM'17) received the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications, Beijing, China, in 2007. He was a visiting researcher at the Department of Electrical Engineering, University of Washington, Seattle, WA, from August 2012 to August 2013. He is currently a Professor at Tianjin University, Tianjin, China. His research interests include 3D video processing, virtual reality, and artificial intelligence.

Yuxin Song received the B.S. degree in communication engineering from Hefei University of Technology, Hefei, Anhui, China, in 2017. She is currently pursuing the M.S. degree with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include image retrieval and deep learning.

Bo Peng received the M.S. degree in communication and information systems from Xidian University, Xi'an, Shaanxi, China, in 2016. She is currently pursuing the Ph.D. degree at the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include computer vision, image processing, and video action analysis.

Zhanyu Ma has been an Associate Professor at Beijing University of Posts and Telecommunications, Beijing, China, since 2014. He has also been an adjunct Associate Professor at Aalborg University, Aalborg, Denmark, since 2015. He received the Ph.D. degree in Electrical Engineering from KTH (Royal Institute of Technology), Sweden, in 2011. From 2012 to 2013, he was a Postdoctoral research fellow in the School of Electrical Engineering, KTH, Sweden. His research interests include pattern recognition and machine learning fundamentals, with a focus on applications in computer vision, multimedia signal processing, and data mining.
Ling Shao is the CEO and Chief Scientist of the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates. His research interests include computer vision, machine learning, and medical imaging. He is a fellow of the IAPR, the IET, and the BCS. He is an Associate Editor of the IEEE Transactions on Neural Networks and Learning Systems and several other journals.

Yi-Zhe Song is a Reader of Computer Vision and Machine Learning at the Centre for Vision, Speech and Signal Processing (CVSSP), the UK's largest academic research centre for artificial intelligence, with approximately 200 researchers. Previously, he was a Senior Lecturer at Queen Mary University of London, and a Research and Teaching Fellow at the University of Bath. He obtained his Ph.D. in computer vision and machine learning from the University of Bath in 2008, received a Best Dissertation Award for his M.Sc. degree at the University of Cambridge in 2004, and earned a First Class Honours degree from the University of Bath in 2003. He is a Senior Member of the IEEE and a Fellow of the Higher Education Academy. He is a full member of the review college of the Engineering and Physical Sciences Research Council (EPSRC), the UK's main agency for funding research in engineering and the physical sciences, and serves as an expert reviewer for the Czech National Science Foundation.
