Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

An overview of multi-task learning

An overview of multi-task learning Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 National Science Review 5: 30–43, 2018 REVIEW doi: 10.1093/nsr/nwx105 Advance access publication 1 September 2017 COMPUTER SCIENCE Special Topic: Machine Learning ∗ ∗ Yu Zhang and Qiang Yang ABSTRACT As a promising area in machine learning, multi-task learning (MTL) aims to improve the performance of multiple related learning tasks by leveraging useful information among them. In this paper, we give an overview of MTL by first giving a definition of MTL. Then several different settings of MTL are introduced, including multi-task supervised learning, multi-task unsupervised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement learning, multi-task online learning and multi-task multi-view learning. For each setting, representative MTL models are presented. In order to speed up the learning process, parallel and distributed MTL models are introduced. Many areas, including computer vision, bioinformatics, health informatics, speech, natural language processing, web applications and ubiquitous computing, use MTL to improve the performance of the applications involved and some representative works are reviewed. Finally, recent theoretical analyses for MTL are presented. Keywords: multi-task learning fied into several settings, including multi-task su- INTRODUCTION pervised learning, multi-task unsupervised learning, Machine learning, which exploits useful information multi-task semi-supervised learning, multi-task ac- in historical data and utilizes the information to help tive learning, multi-task reinforcement learning, and analyze future data, usually needs a large amount of multi-task online learning. In multi-task supervised labeled data for training a good learner. One typical learning, each task, which can be a classification or learner in machine learning is deep-learning mod- regression problem, is to predict labels for unseen els, which are neural networks with many hidden lay- data instances given a training dataset consisting of Department of ers and also many parameters; these models usually training data instances and their labels. In multi- Computer Science and need millions of data instances to learn accurate pa- Engineering, Hong task unsupervised learning, each task, which can be rameters. However, some applications such as med- Kong University of a clustering problem, aims to identify useful patterns ical image analysis cannot satisfy this requirement Science and contained in a training dataset consisting of data in- since it needs more manual labor to label data in- Technology, Hong stances only. In multi-task semi-supervised learning, stances. In these cases, multi-task learning (MTL) Kong, China each task is similar to that in multi-task supervised [1] is a good recipe by exploiting useful information learning with the difference that the training set in- from other related learning tasks to help alleviate this ∗ cludes not only labeled data but also unlabeled ones. Corresponding data sparsity problem. In multi-task active learning, each task exploits un- authors. E-mails: As a promising area in machine learning, MTL yuzhangcse@cse.ust.hk, labeled data to help learn from labeled data similar aims to leverage useful information contained in qyang@cse.ust.hk to multi-task semi-supervised learning but in a dif- multiple learning tasks to help learn a more ac- ferent way by selecting unlabeled data instances to curate learner for each task. Based on an assump- actively query their labels. In multi-task reinforce- tion that all the tasks, or at least a subset of them, Received 27 June ment learning, each task aims to choose actions to 2017; Revised 22 are related, jointly learning multiple tasks is empir- maximize the cumulative reward. In multi-task on- July 2017; Accepted ically and theoretically found to lead to better per- line learning, each task handles sequential data. In 8 August 2017 formance than learning them independently. Based multi-task multi-view learning, each task handles on the nature of the tasks, MTL can be classi- The Author(s) 2017. Published by Oxford University Press on behalf of China Science Publishing & Media Ltd. All rights reserved. For permissions, plea se e-mail: journals.permissions@oup.com Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 31 multi-view data in which there are multiple sets of learning, multi-task online learning and multi-task features to describe each data instance. multi-view learning. The section entitled ‘Parallel MTL can be viewed as one way for machines to and distributed MTL’ discusses parallel and dis- mimic human learning activities since people often tributed MTL models. The section entitled ‘Appli- transfer knowledge from one task to another and cations of multi-task learning’ shows how MTL can vice versa when these tasks are related. One example help other areas and that entitled ‘Theoretical anal- from our own experience is that the skills for playing ysis’ focuses on theoretical analyses of MTL. Fi- squash and tennis can help improve each other. Sim- nally, the section entitled ‘Conclusions’ concludes ilar to human learning, it is useful to learn multiple the whole paper. learning tasks simultaneously since the knowledge in a task can be utilized by other related tasks. MTL is related to other areas in machine learn- MULTI-TASK LEARNING ing, including transfer learning [2], multi-label learn- To start with, we give a definition of MTL. ing [3] and multi-output regression, but exhibits dif- Definition 1. (Multi-task learning) Given m learning ferent characteristics. For example, similar to MTL, tasks {T } where all the tasks or a subset of them transfer learning also aims to transfer knowledge i i =1 are related but not identical, multi-task learning aims from one task to another but the difference lies in to help improve the learning of a model forT by us- that transfer learning hopes to use one or more ing the knowledge contained in the m tasks. tasks to help a target task while MTL uses multi- Based on this definition, we can see that there are ple tasks to help each other. When different tasks two elementary factors for MTL. in multi-task supervised learning share the training The first factor is the task relatedness. The task re- data, it becomes multi-label learning or multi-output latedness is based on the understanding of how dif- regression. In this sense, MTL can be viewed as ferent tasks are related, which will be encoded into a generalization of multi-label learning and multi- the design of MTL models, as we will see later. output regression. The second factor is the definition of task. In ma- In this paper, we give an overview of MTL. We chine learning, learning tasks mainly include super- first briefly introduce MTL by giving its definition. vised tasks such as classification and regression tasks, After that, based on the nature of each learning unsupervised tasks such as clustering tasks, semi- task, we discuss different settings of MTL, including supervised tasks, active learning tasks, reinforcement multi-task supervised learning, multi-task unsuper- learning tasks, online learning tasks and multi-view vised learning, multi-task semi-supervised learning, learning tasks. Hence different learning tasks lead to multi-task active learning, multi-task reinforcement different settings in MTL, which is what the follow- learning, multi-task online learning and multi-task ing sections focus on. In the following sections, we multi-view learning. For each setting of MTL, rep- will review representative MTL models in different resentative MTL models are presented. When the MTL settings. number of tasks is large or data in different tasks are located in different machines, parallel and dis- tributed MTL models become necessary and sev- MULTI-TASK SUPERVISED LEARNING eral models are introduced. As a promising learning paradigm, MTL has been applied to several areas, in- The multi-task supervised learning (MTSL) setting cluding computer vision, bioinformatics, health in- means that each task in MTL is a supervised learn- formatics, speech, natural language processing, web ing task, which models the functional mapping from applications and ubiquitous computing, and sev- data instances to labels. Mathematically, suppose eral representative applications in each area are pre- there are m supervised learning tasksT for i = 1, ..., sented. Moreover, theoretical analyses for MTL, m and each supervised task is associated with a train- which can give us a deep understanding of MTL, are i i i ing datasetD ={(x , y )} , where each data in- j j j =1 reviewed. i i stance x lies in a d-dimensional space and y is the j j The remainder of this paper is organized as fol- label for x . So, for the ith taskT , there are n pairs of i i lows. The section entitled ‘Multi-task learning’ in- data instances and labels. When y is in a continuous troduces the definition of MTL. From the section space or equivalently a real scalar, the correspond- entitled ‘Multi-task supervised learning’ to that en- ing task is a regression task and if y is discrete, i.e. titled ‘Multi-task multi-view learning’, we give an y ∈{−1, 1}, the corresponding task is a classifica- overview of different settings in MTL, including tion task. multi-task supervised learning, multi-task unsuper- vised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement For a more technical or complete survey on MTL, please refer to [4]. Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 32 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW MTSL aims to learn m functions { f (x)} for i =1 Input layer Hidden layer Output layer the m tasks from the training set such that f (x )is Output for task 1 Input 1 a good approximation of y for all the i and j. After learning the m functions, MTSL uses f (·)topredict labels of unseen data instances from the ith task. Input d Output for task m As discussed before, the understanding of task relatedness affects the design of MTSL models. Specifically, existing MTSL models reflect the task Figure 1. A multi-task feedforward neural network with one relatedness in three aspects: feature, parameter and input layer, hidden layer and output layer. instance, leading to three categories of MTSL mod- els including feature-based, parameter-based, and instance-based MTSL models. Specifically, feature- activation units and receives the transformed output based MTSL models assume that different tasks of the input layer as the input where the transforma- share identical or similar feature representations, tion depends on the weights connecting the input which can be a subset or a transformation of the and hidden layers. As a transformation of the origi- original features. Parameter-based MTSL models nal features, the output of the hidden layer is the fea- aim to encode the task relatedness into the learning ture representation shared by all the tasks. The out- model via the regularization or prior on model put of the hidden layer is first transformed based on parameters. Instance-based MTSL models propose the weights connecting the hidden and output lay- to use data instances from all the tasks to construct ers, and then fed into the output layer, which has m a learner for each task via instance weighting. In the units, each of which corresponds to a task. following, we will review representative models in Unlike multi-layer feedforward neural networks, the three categories. which are based on neural networks, the multi- task feature learning (MTFL) method [5,6] and the multi-task sparse coding (MTSC) method [7] Feature-based MTSL are formulated under the regularization framework i T i by first transforming data instances as xˆ = U x In this category, all MTL models assume that differ- j j ent tasks share a feature representation, which is in- and then learning a linear function as f (x ) = i T i duced by the original feature representation. Based (a ) xˆ + b . Based on this formulation, we can see on how the shared feature representation appears, that these two methods aim to learn a linear transfor- we further categorize multi-task models into three mation U instead of the nonlinear transformation in approaches, including the feature transformation ap- multi-layer feedforward neural networks. Moreover, proach, the feature selection approach and the deep- for the MTFL and MTSC methods, there exist sev- learning approach. The feature transformation ap- eral differences. For example, in the MTFL method, proach learns the shared feature representation as U is supposed to be orthogonal and the parame- 1 m a linear or nonlinear transformation of the origi- ter matrix A = (a ,..., a ) is row-sparse via the nal features. The feature selection approach assumes regularization, while in the MTSC method, U is 2,1 that the shared feature representation is a subset of overcomplete, implying that the number of columns the original features. The deep-learning approach ap- in U is much larger than the number of rows, and A plies deep neural networks to learn the shared fea- is sparse via the  regularization. ture representation, which is encoded in the hidden layers, for multiple tasks. Feature selection approach The feature selection approach aims to select a sub- Feature transformation approach set of original features as the shared feature repre- In this approach, the shared feature representation sentation for different tasks. There are two ways to is a linear or nonlinear transformation of the orig- do the multi-task feature selection. The first way is 1 m inal feature representation. A representative model based on the regularization on W = (w ,..., w ), i T is the multi-layer feedforward neural network [1] where f (x) = (w ) x + b defines the linear learn- i i and an example of a multi-layer feedforward neu- ing function for T , and another one is based on ral network is shown in Fig. 1. In this example, the sparse probabilistic priors on W. In the following, we multi-layer feedforward neural network consists of will give details of these two ways. an input layer, a hidden layer, and an output layer. Among all the regularized methods for multi-task The input layer has d units to receive data instances feature selection, the most widely used technique from the m tasks as inputs with one unit for a fea- is  regularization to minimize W ,the p, q p,q p, q ture. The hidden layer contains multiple nonlinear norm of W, plus the training loss on the training Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 33 set, where w denotes the jth row of W, · de- multi-layer feedforward neural network, most deep- j q notes the  norm of a vector, and W equals learning models [18–22] in this category treat the q p,q (w  ,..., w  ) . The effect of the  reg- output of one hidden layer as the shared fea- 1 p d p p, q ture representation. Unlike these deep models, the ularization is to make W row-sparse and hence some cross-stitch network proposed in [23] combines the unimportant features for all the tasks can be filtered hidden feature representations of two tasks to con- out. Concrete instances of the  regularization in- p, q struct more powerful hidden feature representa- clude the  regularization proposed in [8,9] and 2,1 tions. Specifically, given two deep neural networks the  regularization proposed in [10]. In order ∞,1 A and B with the same network architecture for two to obtain a smaller subset of useful features for mul- A B tasks, where x and x denote the hidden features tiple tasks, a capped- penalty, which is defined p,1 i , j i , j contained in the jth unit of the ith hidden layer for as min(w  ,θ), is proposed in [11]. It is i p i =1 networks A and B, the cross-stitch operation on x easy to see that when θ becomes large enough, this i , j and x can be defined as capped- penalty will degenerate to the  regu- p,1 p,1 i , j larization. Besides the  regularization, there is an- A A p, q x α α x 11 12 i , j i , j other type of regularized method, which can select a = , B B x α α x 21 22 i , j i , j feature for MTL. For example, in [12], a multi-level lasso is proposed by decomposing w ,the(j, i)th en- ji A B where x˜ and x˜ denote new hidden features af- i , j i , j try in W,as w = θ w ˆ . It is easy to see that when ji j ji ter the joint learning of the two tasks. Matrix α = θ equals 0, w becomes a zero row, implying that the j j α α 11 12 ( ) as well as the parameters in the two net- jth feature is not useful for all the tasks, and hence θ α α 21 22 is an indicator of the usefulness of the jth feature for works are learned from data via the back propagation all the tasks. Moreover, when w ˆ becomes 0, w will ji ji method and hence this method is more flexible than also become 0 and hence w ˆ is an indicator of the ji directly sharing hidden layers. usefulness of the jth feature forT only. By regulariz- ing θ and w ˆ via the  norm to enforce them to be j ji 1 sparse, the multi-level lasso can learn sparse features Parameter-based MTSL in two levels. This model is extended in [ 13,14]to Parameter-based MTSL uses model parameters to more general settings. relate the learning of different tasks. Based on how For multi-task feature selection methods based the model parameters of different tasks are related, on the  regularization, a probabilistic inter- p,1 we classify them into five approaches, including the pretation is proposed in [15], which shows that low-rank approach, the task-clustering approach, the the  regularizer corresponds to a prior: w ∼ p,1 ji task-relation learning approach, the dirty approach GN (0,ρ , p), where GN (·, ·, ·) denotes the gen- and the multi-level approach. Specifically, since tasks eralized normal distribution. Then this prior is ex- are assumed to be related, the parameter matrix W tended in [15] to the matrix-variate generalized nor- is likely to be low-rank, which is the motivation for mal prior to learn relations among tasks and identify the low-rank approach. The task-clustering approach outlier tasks simultaneously. In [16,17], the horse- aims to divide tasks into several clusters and all the shoe prior is utilized to select features for MTL. The tasks in a cluster are assumed to share identical or difference between [ 16] and [17]isthatin[16], the similar model parameters. The task-relation learning horseshoe prior is generalized to learn feature covari- approach directly learns the pairwise task relations ance, while in [17], the horseshoe prior is used as a from data. The dirty approach assumes the decom- basic prior and the whole model is to identify outlier position of the parameter matrix W into two com- tasks in a way different from [ 15]. ponent matrices, each of which is regularized by a type of the sparsity. As a generalization of the dirty approach, the multi-level approach decomposes the Deep-learning approach parameter matrix into more than 2 component ma- Similar to the multi-layer feedforward neural net- trices to model complex relations among all the work model in the feature transformation approach, tasks. In the following sections, we will discuss each basic models in the deep-learning approach include approach in detail. advanced neural network models such as convo- lutional neural networks and recurrent neural net- works. However, unlike the multi-layer feedforward Low-rank approach neural network with a small number of hidden lay- Similar tasks usually have similar model param- ers (e.g. 2 or 3), the deep-learning approach in- eters, which makes W likely to be low-rank. In volves neural networks with tens of or even hun- [24], the model parameters of the m tasks are as- dreds of hidden layers. Moreover, similar to the sumed to share a low-rank subspace, leading to a Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 34 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW i i i T i task clustering. Inspired by the k-means cluster- parametrization of w as w = u +  v , where h×d ing method, Jacob et al. [31] devise a regularizer, ∈ R is a low-rank subspace shared by all the −1 T i.e. tr(W W ), to identify task clusters by tasks with h < d and u is specific to task T . With considering between-cluster and within-cluster vari- an assumption on  that  is orthonormal (i.e. ances, where tr(·) gives the trace of a square ma- = I where I denotes an identity matrix with trix,  denotes an m × m centering matrix, A an appropriate size) to remove the redundancy, u , B for two square matrices A, B means that B − v and are learned by minimizing the training loss A is positive semidefinite (PSD), and with three on all the tasks. This model is then generalized in hyperparameters α, β, γ ,  is required to satisfy [25] by adding a squared Frobenius regularization αI    βI and tr() = γ.TheMTFLmethod on W and this generalized model can be relaxed to is extended in [32] to the case of multiple clusters, have a convex objective function. where each cluster applies the MTFL method, and Based on the analysis in optimization, regu- in order to learn the cluster structure, a regularizer, larizing with the trace norm, which is defined as min(m,d) i.e. WQ  , is employed, where a 0/1di- i =1 W = μ (W), can make a matrix S(1) S(1) i i =1 agonal matrix Q satisfying Q = I can help low-rank and hence trace-norm regularization is i i i =1 identify the structure of the ith cluster. In order to widely used in MTL with [26] as a representative automatically determine the number of clusters, a work. Similar to what the capped- penalty did to p,1 i j structurally sparse regularizer, w − w  ,is the  norm, a variant of the trace-norm regulariza- 2 p,1 j >i proposed in [34] to enforce any pair of model pa- tion called the capped-trace regularizer is proposed min(m,d) rameters to be fused. After learning the parameter in [27] and defined as min(μ (W),θ), i =1 matrix W, the cluster structure can be determined by where θ is a parameter defined by users. Based on θ, i j comparing whether w − w  is below a thresh- only small singular values of W will be penalized and old or not for any pair (i, j). Both works [33,35]de- hence it can lead to a matrix with a lower rank. When compose W as W = LS wherecolumnsin L consist θ becomes large enough, the capped-trace regular- of basis parameter vectors in different clusters and izer will reduce to the trace norm. S contains combination coefficients. Both methods penalize the complexity of L via the squared Frobe- Task-clustering approach nius norm but they learn S in different ways. Specif- The task-clustering approach applies the idea of ically, the method in [33] aims to identify overlap- data-clustering methods to group tasks into several ping task clusters where each task can belong to mul- clusters, each of which has similar tasks in terms of tiple clusters and hence it learns a sparse S via the model parameters. regularization, while in [35], each task lies in only The first task-clustering algorithm proposed in one cluster and hence the  norm of each column [28] decouples the task-clustering procedure and in the 0/1 matrix S is enforced to be 1. the model-learning procedure. Specifically, it first clusters tasks based on the model parameters learned Task-relation learning approach separately under the single-task setting and then In this approach, task relations are used to reflect the pools the training data of all the tasks in a task clus- task relatedness and some examples for the task re- ter to learn a more accurate learner for all the tasks lations include task similarities and task covariances, in this task cluster. This two-stage method may be just to name a few. suboptimal since model parameters learned under In earlier studies on this approach, task relations the single-task setting may be inaccurate, making the are either defined by model assumptions [ 36,37]or task-clustering procedure not so good. So follow-up given by a priori information [38–41]. These two research aims to identify the task clusters and learn ways are not ideal and practical since model assump- model parameters together. tions are hard to verify for real-world applications A multi-task Bayesian neural network, whose and a priori information is difficult to obtain. A more structure is similar to that of the multi-layer neural advanced way is to learn the task relations from data, network shown in Fig. 1,isproposed in [29]toclus- which is the focus of this section. ter tasks based on the Gaussian mixture model in A multi-task Gaussian process is proposed in terms of model parameters (i.e. weights connecting [42] to define a prior on f , the functional value the hidden and output layers). The Dirichlet process, corresponding to x ,as f ∼ N (0,), where f = which is widely used in Bayesian learning to do data 1 m T ( f ,..., f ) .Theentryin  corresponding to clustering, is employed in [30] to do task clustering 1 n the covariance between f and f is defined as based on model parameters {w }. q p p i i Unlike [29,30], which are Bayesian models, there σ ( f , f ) = ω k(x , x ), where k( ·, ·)definesa q ip q j j are several regularized methods [31–35]todo kernel function and ω is the covariance between ip Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 35 tasks T and T . Then, based on the Gaussian like- Dirty approach i p lihood on labels given f, the marginal likelihood, The dirty approach assumes the decomposition of which has an analytical form, is used to learn ,the the parameter matrix W as W = U + V, where U task covariance to reflect the task relatedness, with its and V capture different parts of the task relatedness. (i, p)th entry as ω . In order to utilize Bayesian av- The objective functions of different models in this ip eraging to achieve better performance, a multi-task approach can be unified to minimize the training loss generalized t process is proposed in [43] by placing on all the tasks as well as two regularizers, g (U) and an inverse-Wishart prior on. h(V), on U and V, respectively. Hence, the differ- A regularized model called multi-task- ent methods belonging to this approach differ in the relationship learning (MTRL) method is proposed choices of g (U) and h(V). in [44,45] by placing a matrix-variate normal prior Here we introduce five methods in this approach, on W: W ∼ MN (0, I,), where MN (M, A, B) i.e. [53–57]. Different choices of g (U) and h(V) denotes a matrix-variate normal distribution with for the five methods are shown in Table 1. Based M, A, B as the mean, row covariance and column on Table 1, we can see that the choices of g (U) covariance. This prior corresponds to a regularizer in [53,56]make U row-sparse via the  and ∞,1 2,1 −1 T tr(W W ) where the PSD task covariance  is norms, respectively. The choices of g (U)in[54,55] required to satisfy tr() ≤ 1. The MTRL method enforce U to be low-rank via the trace norm as the is generalized to multi-task boosting [46] and regularizer and constraint, respectively. Unlike these multi-label learning [47], where each label is treated methods, g (U)in[57] penalizes its complexity via as a task, and extended to learn sparse task relations the squared Frobenius norm and clusters feature in [48]. Amodel similartothe MTRL method in different tasks based on the fused lasso regular- is proposed in [49] by assigning a prior on W as izer. For V, h(V) makes it sparse via the  norm W ∼ MN (0, , ), and it learns the sparse in [53,54] and column-sparse via the  norm in 1 2 2,1 inverse of  and  . Since the prior used in the [55,56], while in [57], h(V) penalizes the complex- 1 2 MTRL method implies that W W follows a Wishart ity of V via the squared Frobenius norm. distribution as W(0,), the MTRL method In the decomposition, U mainly identifies the is generalized in [50] by studying a high-order task relatedness among tasks similar to the feature T t prior: (W W) ∼ W(0,), where t is a positive selection approach or low-rank approach while V is integer. In [51], a similar regularizer to that of the capable of capturing noises or outliers via the spar- MTRL method is proposed by assuming a para- sity. The combination of U and V can help the learner −1 T metric form of  as  = (I − A)(I − A) , become more robust. m m where A is an asymmetric task relation claimed in [51]. Unlike the aforementioned methods, Multi-level approach which rely on global learning models, local learning As a generalization of the dirty approach, the multi- methodssuchasthe k-nearest-neighbor (kNN) level approach decomposes the parameter matrix classifier are extended in [ 52] to the multi-task W into h component matrices {W } ,i.e. W = setting and the learning function is defined i i =1 p p i i as f (x ) = σ s (x , x )y , where W , where the number of levels, h,isno ip q q i j ( p,q)∈N (i , j ) j k i =1 N (i, j) denotes the set of task and instance indices smaller than 2. In the following, we show how the for k nearest neighbors of x , s(·, ·) defines the multi-level decomposition can help model complex similarity between instances, and σ represents the task structures. ip similarity of task T to T . By enforcing σ to be In the task-clustering approach, different task p i ip T 2 close to σ , a regularizer  −   is proposed clusters usually have no overlap, which may re- pi in [52] to learn task similarities, where each σ strict the expressive power of the resulting learners. ip needs to satisfy that σ ≥ 0 and |σ |≤ σ for i = p. In [58], all possible task clusters are enumerated, ii ip ii Table 1. Choices of g(U) and h(V) for different methods in the dirty approach. Method g (U) h(V) [53] g (U) = λ U h(V) = λ V 1 ∞,1 2 1 0, if U ≤ λ S(1) 1 [54] g (U) = h(V) = λ V 2 1 +∞, otherwise. [55] g (U) = λ U h(V) = λ V 1 S(1) 2 2,1 [56] g (U) = λ U h(V) = λ V 1 2,1 2 2,1 d 2 2 [57] g (U) = λ |u − u |+ λ U h(V) = λ V 1 ij ik 2 3 i =1 k> j F F Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 36 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW leading to 2 − 1 task clusters, and they are or- for each task based on weighted instances from all ganized in a tree with the root node as a dummy the tasks. node, where the parent–child relation in the tree is the ‘subset of’ relation. This tree has 2 nodes, Discussion each of which corresponds to a level, and hence an Feature-based MTSL can learn a common feature index t denotes both a node in the tree and the representation for different tasks and it is more corresponding level. In order to handle a tree with suitable for applications whose original feature such a large number of nodes, authors make an as- representation is not so informative and discrim- sumption that if a cluster is not useful then none of inative, e.g. in computer vision, natural language its supersets are either, which means that if a node processing and speech. However, feature-based in the tree is not helpful then none of its descen- MTSL can easily be affected by outlier tasks that dants are either. Based on this assumption, a regu- are unrelated to other tasks, since it is difficult to larizer based on the squared  norm is devised, p,1 learn a common feature representation for outlier p p i.e. λ s (W ) , where V de- v t v∈V t ∈D(v) tasks that are unrelated to each other. Given a notes the set of nodes in the tree, λ is a regulariza- good feature representation, parameter-based tion parameter for node v, and D(v) denotes the set MTSL can learn more accurate model parameters of descendants of v.Here s (W ) uses the regularizer and it is more robust to outlier tasks via a robust proposed in [36] to enforce different columns in W representation of model parameters. Hence feature- to be close to their average. Unlike [58] where each based MTSL is complemental to parameter-based level involves a subset of tasks, a multi-level task- MTSL. Instance-based MTSL, which is currently clustering method is proposed in [34] to cluster all being explored, seems parallel to the other two the tasks at each level based on a structurally sparse categories. λ k regularizer w − w  . i −1 2 i =1 k> j i i In summary, the MTSL setting is the most im- In [59], each component matrix is assumed to portant one in the research of MTL since it sets the be jointly sparse and row-sparse but in different stage for research in other settings. Among the ex- proportions, which are more similar in successive isting research efforts in MTL, about 90% of works component matrices. In order to achieve this, a study the MTSL setting, while in the MTSL setting, h−i i −1 regularizer, i.e. W  + W  , i 2,1 i 1 the feature-based and parameter-based MTSL at- i =1 h−1 h−1 is constructed. tract most attention from the community. Unlike the aforementioned methods where dif- ferent component matrices have no direct interac- tion, in [60], with direct connections between com- MULTI-TASK UNSUPERVISED LEARNING ponent matrices at successive levels, the complex hi- Unlike multi-task supervised learning where each erarchical/tree structure among tasks can be learned data instance is associated with a label, in multi-task from data. Specifically, built on the multi-level task- unsupervised learning, the training set D of the ith clustering method [34], a sequential constraint, i.e. task consists of only n data instances {x } and the j j j k k |w − w |≥|w − w |∀i ≥ 2 ∀k > j,isde- i −1 i −1 i i goal of multi-task unsupervised learning is to exploit vised in [60] to help make the whole structure be- the information contained in D . Typical unsuper- come a tree. vised learning tasks include clustering, dimension- Compared with the dirty approach that focuses ality reduction, manifold learning, visualization and on identifying noises or outliers, the multi-level ap- so on, but multi-task unsupervised learning mainly proach is capable of modeling more complex task focuses on multi-task clustering. Clustering is to di- structures such as complex task clusters and tree vide a set of data instances into several groups, each structures. of which has similar instances, and hence multi-task clustering aims to conduct clustering on multiple datasets by leveraging useful information contained Instance-based MTSL in different datasets. There are few works in this category with the Not very many studies on multi-task clustering multi-task distribution matching method proposed exist. In [62], two multi-task-clustering methods are in [61] as a representative work. Specifically, it first proposed. These two methods extend the MTFL and estimates the ratio between probabilities that each MTRL methods [5,44], two models in the MTSL instance is from its own task and from a mixture of setting, to the clustering scenario and the formu- all the tasks. After determining ratios via softmax lations in the proposed two multi-task-clustering functions, this method uses ratios to determine the methods are almost identical to those in the MTFL instance weights and then learns model parameters and MTRL methods, with the only difference being Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 37 that the labels are treated as unknown cluster indica- tradeoff between the learning risk of a low-rank MTL tors that need to be learned from data. model based on the trace-norm regularization and a confidence bound similar to multi-armed bandits, is proposed in [68]. MULTI-TASK SEMI-SUPERVISED LEARNING MULTI-TASK REINFORCEMENT LEARNING In many applications, data usually require a great deal of manual labor to label, making labeled data Inspired by behaviorist psychology, reinforcement not so sufficient, but in many situations, unlabeled learning studies how to take actions in an envi- data are ample. So in this case, unlabeled data are uti- ronment to maximize the cumulative reward and it lized to help improve the performance of supervised shows good performance in many applications with learning, leading to semi-supervised learning, whose AlphaGo, which beats humans in the Go game, as a training set consists of a mixture of labeled and un- representative application. When environments are labeled data. In multi-task semi-supervised learning, similar, different reinforcement learning tasks can the goal is the same, where unlabeled data are used use similar policies to make decisions, which is a mo- to improve the performance of supervised learning tivation of the proposal of multi-task reinforcement while different supervised tasks share useful informa- learning [69–73]. tion to help each other. Specifically, in [ 69], each reinforcement learn- Based on the nature of each task, multi-task semi- ing task is modeled by a Markov decision process supervised learning can be classified into two cate- (MDP) and different MDPs in all the tasks are gories: multi-task semi-supervised classification and related via a hierarchical Bayesian infinite mixture multi-task semi-supervised regression. For multi- model. In [70], each task is characterized via a re- task semi-supervised classification, a method pro- gionalized policy and a Dirichlet process is used to posed in [63,64] follows the task-clustering ap- cluster tasks. In [71], the reinforcement learning proach to do task clustering on different tasks based model for each task is a Gaussian process temporal- on a relaxed Dirichlet process, while in each task, difference value function model and a hierarchical random walk is used to exploit useful information Bayesian model relates value functions of different contained in the unlabeled data. Unlike [63,64], tasks. In [72], the value functions in different tasks a semi-supervised multi-task regression method is are assumed to share sparse parameters and it ap- proposed in [65], where each task adopts a Gaus- plies the multi-task feature selection method with sian process and unlabeled data are used to define the  regularization [8] and the MTFL method 2,1 the kernel function, and Gaussian processes in all the [5] to learn all the value functions simultaneously. In tasks share a common prior on kernel parameters. [73], an actor–mimic method, which is a combina- tion of deep reinforcement learning and model com- pression techniques, is proposed to learn policy net- MULTI-TASK ACTIVE LEARNING works for multiple tasks. The setting of multi-task active learning, where each task has a small number of labeled data and MULTI-TASK ONLINE LEARNING a large amount of unlabeled data in the train- ing set, is almost identical to that of multi-task When the training data in multiple tasks come in a semi-supervised learning. However, unlike multi- sequential way, traditional MTL models cannot han- task semi-supervised learning, which exploits infor- dle them but multi-task online learning is capable mation contained in the unlabeled data, in multi-task of doing this job, as shown in some representative active learning, each task selects informative unla- works [74–79]. beled data to query an oracle to actively acquire their Specifically, in [ 74,75], where different tasks are labels. Hence the criterion for the selection of unla- assumed to have a common goal, a global loss func- beled data is the main research focus in multi-task ac- tion, a combination of individual losses on each task, tive learning [66–68]. measures the relations between tasks, and by using Specifically, two criteria are proposed in [ 66]to absolute norms for the global loss function, several make sure that the selected unlabeled instances are online MTL algorithms are proposed. In [76], the informative for all the tasks instead of only one task. proposed online MTL algorithms model task rela- Unlike [66], in [67] where the learner in each task tions by placing constraints on actions taken for all is a supervised latent Dirichlet allocation model, the the tasks. In [77], online MTL algorithms, which selection criterion for unlabeled data is the expected adopt perceptrons as a basic model and measure error reduction. Moreover, a selection strategy, a task relations based on shared geometric structures Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 38 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW among tasks, are proposed for multi-task classifica- hinge, -insensitive and square losses, are studied in tion problems. In [78], a Bayesian online algorithm [82], making this parallel method applicable to both is proposed for a multi-task Gaussian process that classification and regression problems in MTSL. shares kernel parameters among tasks. In [79], an In some cases, training data for different tasks online algorithm is proposed for the MTRL method may exist in different machines, which makes it dif- [44] by updating model parameters and task covari- ficult for conventional MTL models to work, even ance together. though all the training data can be moved to one machine, which incurs additional transmission and storage costs. A better option is to devise distributed MULTI-TASK MULTI-VIEW LEARNING MTL models that can directly operate on data dis- In some applications such as computer vision, each tributed on multiple machines. In [83], a distributed data point can be described by different feature algorithm is proposed based on a debiased lasso representations; one example is image data, whose model and by learning one task in a machine, this al- features include SIFT and wavelet, to name just gorithm achieves efficient communications. a few. In this case, each feature representation is called a view and multi-view learning, a learning APPLICATIONS OF MULTI-TASK paradigm in machine learning, is proposed to han- LEARNING dle such data with multiple views. Similar to super- vised learning, each multi-view data point is usu- Several areas, including computer vision, bioinfor- ally associated with a label. Multi-view learning aims matics, health informatics, speech, natural language to exploit useful information contained in multi- processing, web applications and ubiquitous com- ple views to further improve the performance over puting, use MTL to boost the performance of their supervised learning, which can be considered as a respective applications. In this section, we review single-view learning paradigm. As a multi-task ex- some related works. tension of multi-view learning, multi-task multi-view learning [80,81] hopes to exploit multiple multi- Computer vision view learning problems to improve the performance over each multi-view learning problem by leveraging The applications of MTL in computer vision can be useful information contained in related tasks. divided into two categories, including image-based Specifically, in [ 80], the first multi-task multi- and video-based applications. view classifier is proposed to utilize the task related- Image-based MTL applications include two sub- ness based on common views shared by tasks and categories: facial images and non-facial images. view consistency among views in each task. In [81], Specifically, applications of MTL based on facial different views in each task achieve consensus on un- images include face verification [ 84], personalized labeled data and different tasks are learned by ex- age estimation [85], multi-cue face recognition ploiting a priori information as in [38] or learning [86], head-pose estimation [22,87], facial landmark task relations as the MTRL method did. detection [18], and facial image rotation [88]. Ap- plications of MTL based on non-facial images in- clude object categorization [86], image segmenta- PARALLEL AND DISTRIBUTED MTL tion [89,90], identifying brain imaging predictors When the number of tasks is large, if we directly [91], saliency detection [92], action recognition apply a multi-task learner, the computational com- [93], scene classification [ 94], multi-attribute pre- plexity may be high. Nowadays the computational diction [95], multi-camera person re-identification capacity of a computer is very powerful due to the [96], and immediacy prediction [97]. multi-CPU or multi-GPU architecture involved. So Applications of MTL based on videos include we can make use of these powerful computing facil- visual tracking [98–100] and thumbnail selection ities to devise parallel MTL algorithms to accelerate [19]. the training process. In [82], a parallel MTL method is devised to solve a subproblem of the MTRL model Bioinformatics and health informatics [44], which also occurs in many regularized meth- ods belonging to the task-relation learning approach. Applications of MTL in bioinformatics and health Specifically, this method utilizes the FISTA algo- informatics include organism modeling [101], rithm to design a decomposable surrogate function mechanism identification of response to therapeu- with respect to all the tasks and this surrogate func- tic targets [102], cross-platform siRNA efficacy tion can be parallelized to speed up the learning pro- prediction [103], detection of causal genetic cess. Moreover, three loss functions, including the markers through association analysis of multiple Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 39 difficult to model, the generalization performance populations [104], construction of personalized cannot be computed and instead the generalization brain–computer interfaces [105], MHC-I binding prediction [106], splice-site prediction [106], bound is used to provide an upper bound for the gen- protein subcellular location prediction [107], eralization performance. Alzheimer’s disease assessment scale cognitive The first generalization bound for MTL is derived subscale [108], prediction of cognitive outcomes in [133] for a general MTL model. Then there are from neuroimaging measures in Alzheimer’s disease many studies to analyze generalization bounds of dif- [109], identification of longitudinal phenotypic ferent MTL approaches, including e.g. [7,134]for markers for Alzheimer’s disease progression pre- the feature transform approach, [135]for thefea- diction [110], prioritization of disease genes [111], ture selection approach, [24,135–138] for the low- biological image analysis based on natural images rank approach, [136] for the task-relation learning [20], survival analysis [112], and multiple genetic approach, and [138] for the dirty approach. trait prediction [113]. CONCLUSIONS In this paper, we give an overview of MTL. Firstly, Speech and natural language processing we give a definition of MTL. After that, different Applications of MTL in speech include speech syn- settings of MTL are presented, including multi-task thesis [114,115] and those for natural language supervised learning, multi-task unsupervised learn- processing include joint learning of six NLP tasks ing, multi-task semi-supervised learning, multi-task (i.e. part-of-speech tagging, chunking, named en- active learning, multi-task reinforcement learning, tity recognition, semantic role labeling, language multi-task online learning and multi-task multi-view modeling and semantically related words) [116], learning. For each setting, we introduce its represen- multi-domain sentiment classification [ 117], multi- tative models. Then parallel and distributed MTL domain dialog state tracking [21], machine transla- models, which can help speed up the learning pro- tion [118], syntactic parsing [118], and microblog cess, are discussed. Finally, we review the applica- analysis [119,120]. tions of MTL in various areas and present theoretical analyses for MTL. Recently deep learning has become popular in Web applications many applications and several deep models have Web applications based on MTL include learning been devised for MTL. Almost all the deep mod- to rank in web searches [121], web search ranking els just share hidden layers for different tasks; this [122], multi-domain collaborative filtering [ 123], way of sharing knowledge among tasks is very use- behavioral targeting [124], and conversion maxi- ful when all the tasks are very similar, but when this mization in display advertising [125]. assumption is violated, the performance will signif- icantly deteriorate. We think one future direction Ubiquitous computing for multi-task deep models is to design more flex- ible architectures that can tolerate dissimilar tasks Applications of MTL in ubiquitous computing in- and even outlier tasks. Moreover, the deep-learning, clude stock prediction [126], multi-device local- task-clustering and multi-level approaches lack theo- ization [127], the inverse dynamics problem for retical foundations and more analyses are needed to robotics [128,129], estimation of travel costs on guide the research in these approaches. road networks [130], travel-time prediction on road networks [131], and traffic-sign recognition [ 132]. FUNDING This work was supported by the National Basic Research Pro- THEORETICAL ANALYSIS gram of China (973 Program) (2014CB340304), the Hong Kong CERG projects (16211214, 16209715 and 16244616), Learning theory, an area in machine learning, studies the National Natural Science Foundation of China (61473087 the theoretical aspect of learning models including and 61673202), and the Natural Science Foundation of Jiangsu MTL models. In the following, we introduce some Province (BK20141340). representative works. The theoretical analysis in MTL mainly focuses on deriving the generalization bound of MTL mod- REFERENCES els. It is well known that the generalization per- 1. Caruana R. Multitask learning. Mach Learn 1997; 28: 41–75. formance of MTL models on unseen test data is 2. Pan SJ and Yang Q. A survey on transfer learning. IEEE Trans the main concern in MTL and machine learning. Knowl Data Eng 2010; 22: 1345–59. However, since the underlying data distribution is Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 40 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW 3. Zhang M and Zhou Z. A review on multi-label learning algorithms. IEEE Trans 24. Ando RK and Zhang T. A framework for learning predictive structures from Knowl Data Eng 2014; 26: 1819–37. multiple tasks and unlabeled data. J Mach Learn Res 2005; 6: 1817–53. 4. Zhang Y and Yang Q. A survey on multi-task learning. arXiv:1707.08114. 25. Chen J, Tang L and Liu J et al. A convex formulation for learning shared struc- 5. Argyriou A, Evgeniou T and Pontil M. Multi-task feature learning. In: Advances tures from multiple tasks. In:Proceedingsofthe26thInternationalConference in Neural Information Processing Systems 19. 2006, 41–8. on Machine Learning. 2009, 137–44. 6. Argyriou A, Evgeniou T and Pontil M. Convex multi-task feature learning.Mach 26. Pong TK, Tseng P and Ji S et al. Trace norm regularization: reformu- Learn 2008; 73: 243–72. lations, algorithms, and multi-task learning. SIAM J Optim 2010; 20: 7. Maurer A, Pontil M and Romera-Paredes B. Sparse coding for multitask and 3465–89. transfer learning. In: Proceedings of the 30th International Conference on Ma- 27. Han L and Zhang Y. Multi-stage multi-task learning with reduced rank. In: chine Learning. 2013, 343–51. Proceedings of the 30th AAAI Conference on Artificial Intelligence . 2016. 8. Obozinski G, Taskar B and Jordan M. Multi-task feature selection. Ph.D. The- 28. Thrun S and O’Sullivan J. Discovering structure in multiple learning tasks: the sis. University of California, Berkeley Department of Statistics 2006. TC algorithm. In: Proceedings of the 13th International Conference on Ma- 9. Obozinski G, Taskar B and Jordan M. Joint covariate selection and joint sub- chine Learning. 1996, 489–97. space selection for multiple classification problems. Stat Comput 2010; 20: 29. Bakker B and Heskes T. Task clustering and gating for Bayesian multitask 231–52. learning. J Mach Learn Res 2003; 4: 83–99. 10. Liu H, Palatucci M and Zhang J. Blockwise coordinate descent procedures for 30. Xue Y, Liao X and Carin L et al. Multi-task learning for classification with the multi-task lasso, with applications to neural semantic basis discovery. In: Dirichlet process priors. J Mach Learn Res 2007; 8: 35–63. Proceedings of the 26th International Conference on Machine Learning. 2009, 31. Jacob L, Bach FR and Vert JP. Clustered multi-task learning: a convex for- 649–56. mulation. In: Advances in Neural Information Processing Systems 21. 2008, 11. Gong P, Ye J and Zhang C. Multi-stage multi-task feature learning. JMach 745–52. Learn Res 2013; 14: 2979–3010. 32. Kang Z, Grauman K and Sha F. Learning with whom to share in multi-task 12. Lozano AC and Swirszcz G. Multi-level lasso for sparse multi-task regression. feature learning. In: Proceedings of the 28th International Conference on Ma- In: Proceedings of the 29th International Conference on Machine Learning. chine Learning. 2011, 521–8. 2012. 33. Kumar A and III HD. Learning task grouping and overlap in multi-task learning. 13. Wang X, Bi J and Yu S et al. On multiplicative multitask feature learning. In: In: Proceedings of the 29th International Conference on Machine Learning. Advances in Neural Information Processing Systems 27. 2014, 2411–9. 2012. 14. Han L, Zhang Y and Song G et al. Encoding tree sparsity in multi-task learning: 34. Han L and Zhang Y. Learning multi-level task groups in multi-task learning. In: a probabilistic framework. In: Proceedings of the 28th AAAI Conference on Proceedings of the 29th AAAI Conference on Artificial Intelligence . 2015. Artificial Intelligence . 2014, 1854–60. 35. Barzilai A and Crammer K. Convex multi-task learning by clustering. In: Pro- 15. Zhang Y, Yeung DY and Xu Q. Probabilistic multi-task feature selection. In: ceedings of the 18th International Conference on Artificial Intelligence and Advances in Neural Information Processing Systems 23. 2010, 2559–67. Statistics. 2015. 16. Hernandez-Lobato ´ D and Hernandez-Lobato ´ JM. Learning feature selection 36. Evgeniou T and Pontil M. Regularized multi-task learning. In: Proceedings of dependencies in multi-task learning. In: Advances in Neural Information Pro- the 10th ACM SIGKDD International Conference on Knowledge Discovery and cessing Systems 26. 2013, 746–54. Data Mining. 2004, 109–17. 17. Hernandez-Lobato ´ D, Hernandez-Lobato ´ JM and Ghahramani Z. A probabilis- 37. Parameswaran S and Weinberger KQ. Large margin multi-task metric learn- tic model for dirty multi-task feature selection. In: Proceedings of the 32nd ing. In: Advances in Neural Information Processing Systems 23. 2010, International Conference on Machine Learning. 2015, 1073–82. 1867–75. 18. Zhang Z, Luo P and Loy CC et al. Facial landmark detection by deep multi- 38. Evgeniou T, Micchelli CA and Pontil M. Learning multiple tasks with kernel task learning. In: Proceedings of the 13th European Conference on Computer methods. J Mach Learn Res 2005; 6: 615–37. Vision. 2014, 94–108. 39. Kato T, Kashima H and Sugiyama M et al. Multi-task learning via conic pro- 19. Liu W, Mei T and Zhang Y et al. Multi-task deep visual-semantic embedding gramming. In: Advances in Neural Information Processing Systems 20. 2007, for video thumbnail selection. In:ProceedingsofIEEEConferenceonComputer 737–44. Vision and Pattern Recognition. 2015, 3707–15. 40. Kato T, Kashima H and Sugiyama M et al. Conic programming for multitask 20. Zhang W, Li R and Zeng T et al. Deep model based transfer and multi- learning. IEEE Trans Knowl Data Eng 2010; 22: 957–68. task learning for biological image analysis. In: Proceedings of the 21th ACM 41. Gornitz ¨ N, Widmer C and Zeller G et al. Hierarchical multitask structured out- SIGKDD International Conference on Knowledge Discovery and Data Mining. put learning for large-scale sequence segmentation. In: Advances in Neural 2015, 1475–84. Information Processing Systems 24. 2011, 2690–8. 21. Mrksic N, Seaghdha ´ DO and Thomson Betal. Multi-domain dialog state track- 42. Bonilla EV, Chai KMA and Williams CKI. Multi-task Gaussian process predic- ing using recurrent neural networks. In: Proceedings of the 53rd Annual Meet- tion. In: Advances in Neural Information Processing Systems 20. 2007, 153– ing of the Association for Computational Linguistics. 2015, 794–9. 60. 22. Li S, Liu Z and Chan AB. Heterogeneous multi-task learning for human pose 43. Zhang Y and Yeung DY. Multi-task learning using generalized t process. In: estimation with deep convolutional neural network. Int J Comput Vis 2015; Proceedings of the 13th International Conference on Artificial Intelligence and 113: 19–36. Statistics. 2010, 964–71. 23. Misra I, Shrivastava A and Gupta A et al. Cross-stitch networks for multi-task 44. Zhang Y and Yeung DY. A convex formulation for learning task relationships learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern in multi-task learning. In: Proceedings of the 26th Conference on Uncertainty Recognition. 2016, 3994–4003. in Artificial Intelligence . 2010, 733–42. Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 41 45. Zhang Y and Yeung DY. A regularization approach to learning task relation- 66. Reichart R, Tomanek K and Hahn U et al. Multi-task active learning for linguis- ships in multitask learning. ACM Trans Knowl Discov Data 2014; 8: 12. tic annotations. In:Proceedingsofthe46thAnnualMeetingoftheAssociation 46. Zhang Y and Yeung DY. Multi-task boosting by exploiting task relationships. for Computational Linguistics. 2008, 861–9. In: Proceedings of European Conference on Machine Learning and Principles 67. Acharya A, Mooney RJ and Ghosh J. Active multitask learning using both and Practice of Knowledge Discovery in Databases. 2012, 697–710. latent and supervised shared topics. In: Proceedings of the 2014 SIAM Inter- 47. Zhang Y and Yeung DY. Multilabel relationship learning. ACM Trans Knowl national Conference on Data Mining. 2014, 190–8. Discov Data 2013; 7:7. 68. Fang M and Tao D. Active multi-task learning via bandits. In: Proceedings of 48. Zhang Y and Yang Q. Learning sparse task relations in multi-task learning. In: the 2015 SIAM International Conference on Data Mining. 2015, 505–13. Proceedings of the 31st AAAI Conference on Artificial Intelligence . 2017. 69. Wilson A, Fern A and Ray S et al. Multi-task reinforcement learning: a hierar- 49. Zhang Y and Schneider JG. Learning multiple tasks with a sparse matrix- chical Bayesian approach. In: Proceedings of the Twenty-Fourth International normal penalty. In: Advances in Neural Information Processing Systems 23. Conference on Machine Learning. 2007, 1015–22. 2010, 2550–8. 70. Li H, Liao X and Carin L. Multi-task reinforcement learning in partially observ- 50. Zhang Y and Yeung DY. Learning high-order task relationships in multi-task able stochastic environments. J Mach Learn Res 2009; 10: 1131–86. learning. In: Proceedings of the 23rd International Joint Conference on Artifi- 71. Lazaric A and Ghavamzadeh M. Bayesian multi-task reinforcement learning. cial Intelligence. 2013. In: Proceedings of the 27th International Conference on Machine Learning. 51. Lee G, Yang E and Hwang SJ. Asymmetric multi-task learning based on task 2010, 599–606. relatedness and loss. In: Proceedings of the 33rd International Conference on 72. Calandriello D, Lazaric A and Restelli M. Sparse multi-task reinforcement Machine Learning. 2016, 230–8. learning. In: Advances in Neural Information Processing Systems 27. 2014, 52. Zhang Y. Heterogeneous-neighborhood-based multi-task local learning algo- 819–27. rithms. In: Advances in Neural Information Processing Systems 26. 2013. 73. Parisotto E, Ba J and Salakhutdinov R. Actor-mimic: deep multitask and trans- 53. Jalali A, Ravikumar P and Sanghavi S et al. A dirty model for multi-task learn- fer reinforcement learning. In:Proceedingsofthe4thInternationalConference ing. In: Advances in Neural Information Processing Systems 23. 2010, 964–72. on Learning Representations. 2016. 54. Chen J, Liu J and Ye J. Learning incoherent sparse and low-rank patterns 74. Dekel O, Long PM and Singer Y. Online multitask learning. In: Proceedings of from multiple tasks. In: Proceedings of the 16th ACM SIGKDD International the 19th Annual Conference on Learning Theory. 2006, 453–67. Conference on Knowledge Discovery and Data Mining. 2010, 1179–88. 75. Dekel O, Long PM and Singer Y. Online learning of multiple tasks with a shared 55. Chen J, Zhou J and Ye J. Integrating low-rank and group-sparse structures for loss. J Mach Learn Res 2007; 8: 2233–64. robust multi-task learning. In: Proceedings of the 17th ACM SIGKDD Interna- 76. Lugosi G, Papaspiliopoulos O and Stoltz G. Online multi-task learning with tional Conference on Knowledge Discovery and Data Mining. 2011, 42–50. hard constraints. In: Proceedings of the 22nd Conference on Learning Theory. 56. Gong P, Ye J and Zhang C. Robust multi-task feature learning. In: Proceedings 2009. of the 18th ACM SIGKDD International Conference on Knowledge Discovery 77. Cavallanti G, Cesa-Bianchi N and Gentile C. Linear algorithms for online mul- and Data Mining. 2012, 895–903. titask classification. J Mach Learn Res 2010; 11: 2901–34. 57. Zhong W and Kwok JT. Convex multitask learning with flexible task clusters. 78. Pillonetto G, Dinuzzo F and Nicolao GD. Bayesian online multitask learn- In: Proceedings of the 29th International Conference on Machine Learning. ing of Gaussian processes. IEEE Trans Pattern Anal Mach Intell 2010; 32: 2012. 193–205. 58. Jawanpuria P and Nath JS. A convex feature learning formulation for latent 79. Saha A, Rai P, Daume´ H and Venkatasubramanian S. Online learning of task structure discovery. In: Proceedings of the 29th International Conference multiple tasks and their relationships. In: Proceedings of the Fourteenth on Machine Learning. 2012. International Conference on Artificial Intelligence and Statistics . 2011, 59. Zweig A and Weinshall D. Hierarchical regularization cascade for joint learn- 643–51. ing. In: Proceedings of the 30th International Conference on Machine Learn- 80. He J and Lawrence R. A graph-based framework for multi-task multi-view ing. 2013, 37–45. learning. In: Proceedings of the 28th International Conference on Machine 60. Han L and Zhang Y. Learning tree structure in multi-task learning. In: Proceed- Learning. 2011, 25–32. ings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data 81. Zhang J and Huan J. Inductive multi-task learning with multiple view data. In: Mining. 2015. Proceedingsofthe18thACMSIGKDDInternationalConferenceonKnowledge 61. Bickel S, Bogojeska J and Lengauer T et al. Multi-task learning for HIV therapy Discovery and Data Mining. 2012, 543–51. screening. In: Proceedings of the Twenty-Fifth International Conference on 82. Zhang Y. Parallel multi-task learning. In: Proceedings of the IEEE International Machine Learning. 2008, 56–63. Conference on Data Mining. 2015. 62. Zhang X. Convex discriminative multitask clustering. IEEE Trans Pattern Anal 83. Wang J, Kolar M and Srebro N. Distributed multi-task learning. In: Proceed- Mach Intell 2015; 37: 28–40. ings of the 19th International Conference on Artificial Intelligence and Statis- 63. Liu Q, Liao X and Carin L. Semi-supervised multitask learning. In: Advances in tics. 2016, 751–60. Neural Information Processing Systems 20. 2007, 937–44. 84. Wang X, Zhang C and Zhang Z. Boosted multi-task learning for face verifica- 64. Liu Q, Liao X and Li H et al. Semisupervised multitask learning. IEEE Trans tion with applications to web image and video search. In: Proceedings of IEEE Pattern Anal Mach Intell 2009; 31: 1074–86. Conference on Computer Vision and Pattern Recognition. 2009, 142–9. 65. Zhang Y and Yeung D. Semi-supervised multi-task regression. In: Proceedings 85. Zhang Y and Yeung DY. Multi-task warped Gaussian process for personalized of European Conference on Machine Learning and Knowledge Discovery in age estimation. In: Proceedings of IEEE Conference on Computer Vision and Databases. 2009, 617–31. Pattern Recognition. 2010. Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 42 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW 86. Yuan X and Yan S. Visual classification with multi-task joint sparse represen- 106. Widmer C, Toussaint NC and Altun Y et al. Inferring latent task structure tation. In: Proceedings of IEEE Conference on Computer Vision and Pattern for multitask learning by multiple kernel learning. BMC Bioinformatics 2010; Recognition. 2010, 3493–500. 11:S5. 87. Yan Y, Ricci E and Ramanathan S et al. No matter where you are: flexible 107. Xu Q, Pan SJ and Xue HH et al. Multitask learning for protein subcellu- graph-guided multi-task learning for multi-view head pose classification under lar location prediction. IEEE ACM Trans Comput Biol Bioinformatics 2011; 8: target motion. In: Proceedings of IEEE International Conference on Computer 748–59. Vision. 2013, 1177–84. 108. Zhou J, Yuan L and Liu J et al. A multi-task learning formulation for pre- 88. Yim J, Jung H and Yoo B et al. Rotating your face using multi-task deep neural dicting disease progression. In: Proceedings of the 17th ACM SIGKDD In- network. In: Proceedings of IEEE Conference on Computer Vision and Pattern ternational Conference on Knowledge Discovery and Data Mining. 2011, Recognition. 2015, 676–84. 814–22. 89. An Q, Wang C and Shterev I et al. Hierarchical kernel stick-breaking process 109. Wan J, Zhang Z and Yan J et al. Sparse Bayesian multi-task learning for for multi-task image analysis. In: Proceedings of the 25th International Con- predicting cognitive outcomes from neuroimaging measures in Alzheimer’s ference on Machine Learning. 2008, 17–24. disease. In: Proceedings of IEEE Conference on Computer Vision and Pattern 90. Cheng B, Liu G and Wang J et al. Multi-task low-rank affinity pursuit for image Recognition. 2012, 940–7. segmentation. In: Proceedings of IEEE International Conference on Computer 110. Wang H, Nie F and Huang H et al. High-order multi-task feature learning to Vision. 2011, 2439–46. identify longitudinal phenotypic markers for alzheimer’s disease progression 91. Wang H, Nie F and Huang H et al. Sparse multi-task regression and feature prediction. In: Advances in Neural Information Processing Systems 25. 2012, selection to identify brain imaging predictors for memory performance. In:Pro- 1286–94. ceedings of IEEE International Conference on Computer Vision. 2011, 557–62. 111. Mordelet F and Vert J. ProDiGe: Prioritization of disease genes with multitask 92. Lang C, Liu G and Yu J et al. Saliency detection by multitask sparsity pursuit. machine learning from positive and unlabeled examples. BMC Bioinformatics IEEE Trans Image Process 2012; 21: 1327–38. 2011; 12: 389. 93. Yuan C, Hu W and Tian G et al. Multi-task sparse learning with beta process 112. Li Y, Wang J and Ye J et al. A multi-task learning formulation for survival prior for action recognition. In: Proceedings of IEEE Conference on Computer analysis. In: Proceedings of the 22nd ACM SIGKDD International Conference Vision and Pattern Recognition. 2013, 423–9. on Knowledge Discovery and Data Mining. 2016, 1715–24. 94. Lapin M, Schiele B and Hein M. Scalable multitask representation learning for 113. He D, Kuhn D and Parida L. Novel applications of multitask learning and multi- scene classification. In: Proceedings of IEEE Conference on Computer Vision ple output regression to multiple genetic trait prediction. Bioinformatics 2016; and Pattern Recognition. 2014, 1434–41. 32: 37–43. 95. Abdulnabi AH, Wang G and Lu J et al. Multi-task CNN model for attribute 114. Wu Z, Valentini-Botinhao C and Watts O et al. Deep neural networks employ- prediction. IEEE Trans Multimed 2015; 17: 1949–59. ing multi-task learning and stacked bottleneck features for speech synthe- 96. Su C, Yang F and Zhang S et al. Multi-task learning with low rank attribute sis. In: Proceedings of the 2015 IEEE International Conference on Acoustics, embedding for person re-identification. In: Proceedings of IEEE International Speech and Signal Processing. 2015, 4460–4. Conference on Computer Vision. 2015, 3739–47. 115. Hu Q, Wu Z and Richmond K et al. Fusion of multiple parameterisations for 97. Chu X, Ouyang W and Yang W et al. Multi-task recurrent neural network for DNN-based sinusoidal speech synthesis with multi-task learning. In: Proceed- immediacy prediction. In: Proceedings of IEEE International Conference on ings of the 16th Annual Conference of the International Speech Communica- Computer Vision. 2015, 3352–60. tion Association. 2015, 854–8. 98. Zhang T, Ghanem B and Liu Setal. Robust visual tracking via multi-task sparse 116. Collobert R and Weston J. A unified architecture for natural language pro- learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern cessing: deep neural networks with multitask learning. In: Proceedings of the Recognition. 2012, 2042–9. 25th International Conference on Machine Learning. 2008, 160–7. 99. Zhang T, Ghanem B and Liu S et al. Robust visual tracking via structured multi- 117. Wu F and Huang Y. Collaborative multi-domain sentiment classification. In: task sparse learning. Int J Comput Vis 2013; 101: 367–83. Proceedings of the 2015 IEEE International Conference on Data Mining. 2015, 100. Hong Z, Mei X and Prokhorov DV et al. Tracking via robust multi-task multi- 459–68. view joint sparse representation. In: Proceedings of IEEE International Confer- 118. Luong M, Le QV and Sutskever I et al. Multi-task sequence to sequence learn- ence on Computer Vision. 2013, 649–56. ing. In: Proceedings of the 4th International Conference on Learning Repre- 101. Widmer C, Leiva J and Altun Y et al. Leveraging sequence classification by sentations. 2016. taxonomy-based multitask learning. In: Proceedings of the 14th Annual Inter- 119. Zhao L, Sun Q and Ye J et al. Multi-task learning for spatio-temporal event national Conference on Research in Computational Molecular Biology. 2010, forecasting. In: Proceedings of the 21th ACM SIGKDD International Confer- 522–34. ence on Knowledge Discovery and Data Mining. 2015, 1503–12. 102. Zhang K, Gray JW and Parvin B. Sparse multitask regression for identifying 120. Zhao L, Sun Q and Ye J et al. Feature constrained multi-task learning models common mechanism of response to therapeutic targets. Bioinformatics 2010; for spatiotemporal event forecasting. IEEE Trans Knowl Data Eng 2017; 29: 26: 97–105. 1059–72. 103. Liu Q, Xu Q and Zheng VW et al. Multi-task learning for cross-platform siRNA 121. Bai J, Zhou K and Xue G et al. Multi-task learning for learning to rank in efficacy prediction: an in-silico study. BMC Bioinformatics 2010; 11: 181. web search. In: Proceedings of the 18th ACM Conference on Information and 104. Puniyani K, Kim S and Xing EP. Multi-population GWA mapping via multi-task Knowledge Management. 2009, 1549–52. regularized regression. Bioinformatics 2010; 26: 208–16. 122. Chapelle O, Shivaswamy PK and Vadrevu Setal. Multi-task learning for boost- 105. Alamgir M, Grosse-Wentrup M and Altun Y. Multitask learning for brain- ing with application to web search ranking. In: Proceedings of the 16th ACM computer interfaces. In: Proceedings of the 13th International Conference on SIGKDD International Conference on Knowledge Discovery and Data Mining. Artificial Intelligence and Statistics . 2010, 17–24. 2010, 1189–98. Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 43 123. Zhang Y, Cao B and Yeung DY. Multi-domain collaborative filtering. In: Pro- 130. Zheng J and Ni LM. Time-dependent trajectory regression on road networks ceedingsofthe26thConferenceonUncertaintyinArtificialIntelligence . 2010, via multi-task learning. In: Proceedings of the 27th AAAI Conference on Arti- 725–32. ficial Intelligence . 2013. 124. Ahmed A, Aly M and Das A et al. Web-scale multi-task feature se- 131. Huang A, Xu L and Li Y et al. Robust dynamic trajectory regression on road net- lection for behavioral targeting. In: Proceedings of the 21st ACM Inter- works: a multi-task learning framework. In: Proceedings of IEEE International national Conference on Information and Knowledge Management. 2012, Conference on Data Mining. 2014, 857–62. 1737–41. 132. Lu X, Wang Y and Zhou X et al. Traffic sign recognition via multi-modal tree- 125. Ahmed A, Das A and Smola AJ. Scalable hierarchical multitask learning al- structure embedded multi-task learning. IEEE Trans Intell Transport Syst 2017; gorithms for conversion optimization in display advertising. In: Proceedings 18: 960–72. of the 7th ACM International Conference on Web Search and Data Mining. 133. Baxter J. A model of inductive bias learning. J Artif Intell Res 2000; 12: 149– 2014, 153–62. 98. 126. Ghosn J and Bengio Y. Multi-task learning for stock selection. In: Advances 134. Maurer A. Bounds for linear multi-task learning. J Mach Learn Res 2006; 7: in Neural Information Processing Systems 9. 1996, 946–52. 117–39. 127. Zheng VW, Pan SJ and Yang Q et al. Transferring multi-device localization 135. Kakade SM, Shalev-Shwartz S and Tewari A. Regularization techniques for models using latent multi-task learning. In: Proceedings of the 23rd AAAI Con- learning with matrices. J Mach Learn Res 2012; 13: 1865–90. ference on Artificial Intelligence . 2008, 1427–32. 136. Maurer A. The Rademacher complexity of linear transformation classes. In: 128. Chai KMA, Williams CKI and Klanke S et al. Multi-task Gaussian process Proceedings of the 19th Annual Conference on Learning Theory. 2006, 65–78. learning of robot inverse dynamics. In: Advances in Neural Information Pro- 137. Pontil M and Maurer A. Excess risk bounds for multitask learning with trace cessing Systems 21, December 8-11, 2008. 2008, 265–72. norm regularization. In: Proceedings of the 26th Annual Conference on Learn- 129. Yeung DY and Zhang Y. Learning inverse dynamics by Gaussian process re- ing Theory. 2013, 55–76. gression under the multi-task learning framework. In:ThePathtoAutonomous 138. Zhang Y. Multi-task learning and algorithmic stability. In: Proceedings of the Robots. Berlin: Springer, 2009, 131–42. 29th AAAI Conference on Artificial Intelligence . 2015. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png National Science Review Oxford University Press

An overview of multi-task learning

National Science Review , Volume 5 (1) – Jan 1, 2018

Loading next page...
1
 
/lp/ou_press/an-overview-of-multi-task-learning-oJSU536QO4

References (160)

Publisher
Oxford University Press
Copyright
Copyright © 2022 China Science Publishing & Media Ltd. (Science Press)
ISSN
2095-5138
eISSN
2053-714X
DOI
10.1093/nsr/nwx105
Publisher site
See Article on Publisher Site

Abstract

Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 National Science Review 5: 30–43, 2018 REVIEW doi: 10.1093/nsr/nwx105 Advance access publication 1 September 2017 COMPUTER SCIENCE Special Topic: Machine Learning ∗ ∗ Yu Zhang and Qiang Yang ABSTRACT As a promising area in machine learning, multi-task learning (MTL) aims to improve the performance of multiple related learning tasks by leveraging useful information among them. In this paper, we give an overview of MTL by first giving a definition of MTL. Then several different settings of MTL are introduced, including multi-task supervised learning, multi-task unsupervised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement learning, multi-task online learning and multi-task multi-view learning. For each setting, representative MTL models are presented. In order to speed up the learning process, parallel and distributed MTL models are introduced. Many areas, including computer vision, bioinformatics, health informatics, speech, natural language processing, web applications and ubiquitous computing, use MTL to improve the performance of the applications involved and some representative works are reviewed. Finally, recent theoretical analyses for MTL are presented. Keywords: multi-task learning fied into several settings, including multi-task su- INTRODUCTION pervised learning, multi-task unsupervised learning, Machine learning, which exploits useful information multi-task semi-supervised learning, multi-task ac- in historical data and utilizes the information to help tive learning, multi-task reinforcement learning, and analyze future data, usually needs a large amount of multi-task online learning. In multi-task supervised labeled data for training a good learner. One typical learning, each task, which can be a classification or learner in machine learning is deep-learning mod- regression problem, is to predict labels for unseen els, which are neural networks with many hidden lay- data instances given a training dataset consisting of Department of ers and also many parameters; these models usually training data instances and their labels. In multi- Computer Science and need millions of data instances to learn accurate pa- Engineering, Hong task unsupervised learning, each task, which can be rameters. However, some applications such as med- Kong University of a clustering problem, aims to identify useful patterns ical image analysis cannot satisfy this requirement Science and contained in a training dataset consisting of data in- since it needs more manual labor to label data in- Technology, Hong stances only. In multi-task semi-supervised learning, stances. In these cases, multi-task learning (MTL) Kong, China each task is similar to that in multi-task supervised [1] is a good recipe by exploiting useful information learning with the difference that the training set in- from other related learning tasks to help alleviate this ∗ cludes not only labeled data but also unlabeled ones. Corresponding data sparsity problem. In multi-task active learning, each task exploits un- authors. E-mails: As a promising area in machine learning, MTL yuzhangcse@cse.ust.hk, labeled data to help learn from labeled data similar aims to leverage useful information contained in qyang@cse.ust.hk to multi-task semi-supervised learning but in a dif- multiple learning tasks to help learn a more ac- ferent way by selecting unlabeled data instances to curate learner for each task. Based on an assump- actively query their labels. In multi-task reinforce- tion that all the tasks, or at least a subset of them, Received 27 June ment learning, each task aims to choose actions to 2017; Revised 22 are related, jointly learning multiple tasks is empir- maximize the cumulative reward. In multi-task on- July 2017; Accepted ically and theoretically found to lead to better per- line learning, each task handles sequential data. In 8 August 2017 formance than learning them independently. Based multi-task multi-view learning, each task handles on the nature of the tasks, MTL can be classi- The Author(s) 2017. Published by Oxford University Press on behalf of China Science Publishing & Media Ltd. All rights reserved. For permissions, plea se e-mail: journals.permissions@oup.com Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 31 multi-view data in which there are multiple sets of learning, multi-task online learning and multi-task features to describe each data instance. multi-view learning. The section entitled ‘Parallel MTL can be viewed as one way for machines to and distributed MTL’ discusses parallel and dis- mimic human learning activities since people often tributed MTL models. The section entitled ‘Appli- transfer knowledge from one task to another and cations of multi-task learning’ shows how MTL can vice versa when these tasks are related. One example help other areas and that entitled ‘Theoretical anal- from our own experience is that the skills for playing ysis’ focuses on theoretical analyses of MTL. Fi- squash and tennis can help improve each other. Sim- nally, the section entitled ‘Conclusions’ concludes ilar to human learning, it is useful to learn multiple the whole paper. learning tasks simultaneously since the knowledge in a task can be utilized by other related tasks. MTL is related to other areas in machine learn- MULTI-TASK LEARNING ing, including transfer learning [2], multi-label learn- To start with, we give a definition of MTL. ing [3] and multi-output regression, but exhibits dif- Definition 1. (Multi-task learning) Given m learning ferent characteristics. For example, similar to MTL, tasks {T } where all the tasks or a subset of them transfer learning also aims to transfer knowledge i i =1 are related but not identical, multi-task learning aims from one task to another but the difference lies in to help improve the learning of a model forT by us- that transfer learning hopes to use one or more ing the knowledge contained in the m tasks. tasks to help a target task while MTL uses multi- Based on this definition, we can see that there are ple tasks to help each other. When different tasks two elementary factors for MTL. in multi-task supervised learning share the training The first factor is the task relatedness. The task re- data, it becomes multi-label learning or multi-output latedness is based on the understanding of how dif- regression. In this sense, MTL can be viewed as ferent tasks are related, which will be encoded into a generalization of multi-label learning and multi- the design of MTL models, as we will see later. output regression. The second factor is the definition of task. In ma- In this paper, we give an overview of MTL. We chine learning, learning tasks mainly include super- first briefly introduce MTL by giving its definition. vised tasks such as classification and regression tasks, After that, based on the nature of each learning unsupervised tasks such as clustering tasks, semi- task, we discuss different settings of MTL, including supervised tasks, active learning tasks, reinforcement multi-task supervised learning, multi-task unsuper- learning tasks, online learning tasks and multi-view vised learning, multi-task semi-supervised learning, learning tasks. Hence different learning tasks lead to multi-task active learning, multi-task reinforcement different settings in MTL, which is what the follow- learning, multi-task online learning and multi-task ing sections focus on. In the following sections, we multi-view learning. For each setting of MTL, rep- will review representative MTL models in different resentative MTL models are presented. When the MTL settings. number of tasks is large or data in different tasks are located in different machines, parallel and dis- tributed MTL models become necessary and sev- MULTI-TASK SUPERVISED LEARNING eral models are introduced. As a promising learning paradigm, MTL has been applied to several areas, in- The multi-task supervised learning (MTSL) setting cluding computer vision, bioinformatics, health in- means that each task in MTL is a supervised learn- formatics, speech, natural language processing, web ing task, which models the functional mapping from applications and ubiquitous computing, and sev- data instances to labels. Mathematically, suppose eral representative applications in each area are pre- there are m supervised learning tasksT for i = 1, ..., sented. Moreover, theoretical analyses for MTL, m and each supervised task is associated with a train- which can give us a deep understanding of MTL, are i i i ing datasetD ={(x , y )} , where each data in- j j j =1 reviewed. i i stance x lies in a d-dimensional space and y is the j j The remainder of this paper is organized as fol- label for x . So, for the ith taskT , there are n pairs of i i lows. The section entitled ‘Multi-task learning’ in- data instances and labels. When y is in a continuous troduces the definition of MTL. From the section space or equivalently a real scalar, the correspond- entitled ‘Multi-task supervised learning’ to that en- ing task is a regression task and if y is discrete, i.e. titled ‘Multi-task multi-view learning’, we give an y ∈{−1, 1}, the corresponding task is a classifica- overview of different settings in MTL, including tion task. multi-task supervised learning, multi-task unsuper- vised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement For a more technical or complete survey on MTL, please refer to [4]. Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 32 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW MTSL aims to learn m functions { f (x)} for i =1 Input layer Hidden layer Output layer the m tasks from the training set such that f (x )is Output for task 1 Input 1 a good approximation of y for all the i and j. After learning the m functions, MTSL uses f (·)topredict labels of unseen data instances from the ith task. Input d Output for task m As discussed before, the understanding of task relatedness affects the design of MTSL models. Specifically, existing MTSL models reflect the task Figure 1. A multi-task feedforward neural network with one relatedness in three aspects: feature, parameter and input layer, hidden layer and output layer. instance, leading to three categories of MTSL mod- els including feature-based, parameter-based, and instance-based MTSL models. Specifically, feature- activation units and receives the transformed output based MTSL models assume that different tasks of the input layer as the input where the transforma- share identical or similar feature representations, tion depends on the weights connecting the input which can be a subset or a transformation of the and hidden layers. As a transformation of the origi- original features. Parameter-based MTSL models nal features, the output of the hidden layer is the fea- aim to encode the task relatedness into the learning ture representation shared by all the tasks. The out- model via the regularization or prior on model put of the hidden layer is first transformed based on parameters. Instance-based MTSL models propose the weights connecting the hidden and output lay- to use data instances from all the tasks to construct ers, and then fed into the output layer, which has m a learner for each task via instance weighting. In the units, each of which corresponds to a task. following, we will review representative models in Unlike multi-layer feedforward neural networks, the three categories. which are based on neural networks, the multi- task feature learning (MTFL) method [5,6] and the multi-task sparse coding (MTSC) method [7] Feature-based MTSL are formulated under the regularization framework i T i by first transforming data instances as xˆ = U x In this category, all MTL models assume that differ- j j ent tasks share a feature representation, which is in- and then learning a linear function as f (x ) = i T i duced by the original feature representation. Based (a ) xˆ + b . Based on this formulation, we can see on how the shared feature representation appears, that these two methods aim to learn a linear transfor- we further categorize multi-task models into three mation U instead of the nonlinear transformation in approaches, including the feature transformation ap- multi-layer feedforward neural networks. Moreover, proach, the feature selection approach and the deep- for the MTFL and MTSC methods, there exist sev- learning approach. The feature transformation ap- eral differences. For example, in the MTFL method, proach learns the shared feature representation as U is supposed to be orthogonal and the parame- 1 m a linear or nonlinear transformation of the origi- ter matrix A = (a ,..., a ) is row-sparse via the nal features. The feature selection approach assumes regularization, while in the MTSC method, U is 2,1 that the shared feature representation is a subset of overcomplete, implying that the number of columns the original features. The deep-learning approach ap- in U is much larger than the number of rows, and A plies deep neural networks to learn the shared fea- is sparse via the  regularization. ture representation, which is encoded in the hidden layers, for multiple tasks. Feature selection approach The feature selection approach aims to select a sub- Feature transformation approach set of original features as the shared feature repre- In this approach, the shared feature representation sentation for different tasks. There are two ways to is a linear or nonlinear transformation of the orig- do the multi-task feature selection. The first way is 1 m inal feature representation. A representative model based on the regularization on W = (w ,..., w ), i T is the multi-layer feedforward neural network [1] where f (x) = (w ) x + b defines the linear learn- i i and an example of a multi-layer feedforward neu- ing function for T , and another one is based on ral network is shown in Fig. 1. In this example, the sparse probabilistic priors on W. In the following, we multi-layer feedforward neural network consists of will give details of these two ways. an input layer, a hidden layer, and an output layer. Among all the regularized methods for multi-task The input layer has d units to receive data instances feature selection, the most widely used technique from the m tasks as inputs with one unit for a fea- is  regularization to minimize W ,the p, q p,q p, q ture. The hidden layer contains multiple nonlinear norm of W, plus the training loss on the training Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 33 set, where w denotes the jth row of W, · de- multi-layer feedforward neural network, most deep- j q notes the  norm of a vector, and W equals learning models [18–22] in this category treat the q p,q (w  ,..., w  ) . The effect of the  reg- output of one hidden layer as the shared fea- 1 p d p p, q ture representation. Unlike these deep models, the ularization is to make W row-sparse and hence some cross-stitch network proposed in [23] combines the unimportant features for all the tasks can be filtered hidden feature representations of two tasks to con- out. Concrete instances of the  regularization in- p, q struct more powerful hidden feature representa- clude the  regularization proposed in [8,9] and 2,1 tions. Specifically, given two deep neural networks the  regularization proposed in [10]. In order ∞,1 A and B with the same network architecture for two to obtain a smaller subset of useful features for mul- A B tasks, where x and x denote the hidden features tiple tasks, a capped- penalty, which is defined p,1 i , j i , j contained in the jth unit of the ith hidden layer for as min(w  ,θ), is proposed in [11]. It is i p i =1 networks A and B, the cross-stitch operation on x easy to see that when θ becomes large enough, this i , j and x can be defined as capped- penalty will degenerate to the  regu- p,1 p,1 i , j larization. Besides the  regularization, there is an- A A p, q x α α x 11 12 i , j i , j other type of regularized method, which can select a = , B B x α α x 21 22 i , j i , j feature for MTL. For example, in [12], a multi-level lasso is proposed by decomposing w ,the(j, i)th en- ji A B where x˜ and x˜ denote new hidden features af- i , j i , j try in W,as w = θ w ˆ . It is easy to see that when ji j ji ter the joint learning of the two tasks. Matrix α = θ equals 0, w becomes a zero row, implying that the j j α α 11 12 ( ) as well as the parameters in the two net- jth feature is not useful for all the tasks, and hence θ α α 21 22 is an indicator of the usefulness of the jth feature for works are learned from data via the back propagation all the tasks. Moreover, when w ˆ becomes 0, w will ji ji method and hence this method is more flexible than also become 0 and hence w ˆ is an indicator of the ji directly sharing hidden layers. usefulness of the jth feature forT only. By regulariz- ing θ and w ˆ via the  norm to enforce them to be j ji 1 sparse, the multi-level lasso can learn sparse features Parameter-based MTSL in two levels. This model is extended in [ 13,14]to Parameter-based MTSL uses model parameters to more general settings. relate the learning of different tasks. Based on how For multi-task feature selection methods based the model parameters of different tasks are related, on the  regularization, a probabilistic inter- p,1 we classify them into five approaches, including the pretation is proposed in [15], which shows that low-rank approach, the task-clustering approach, the the  regularizer corresponds to a prior: w ∼ p,1 ji task-relation learning approach, the dirty approach GN (0,ρ , p), where GN (·, ·, ·) denotes the gen- and the multi-level approach. Specifically, since tasks eralized normal distribution. Then this prior is ex- are assumed to be related, the parameter matrix W tended in [15] to the matrix-variate generalized nor- is likely to be low-rank, which is the motivation for mal prior to learn relations among tasks and identify the low-rank approach. The task-clustering approach outlier tasks simultaneously. In [16,17], the horse- aims to divide tasks into several clusters and all the shoe prior is utilized to select features for MTL. The tasks in a cluster are assumed to share identical or difference between [ 16] and [17]isthatin[16], the similar model parameters. The task-relation learning horseshoe prior is generalized to learn feature covari- approach directly learns the pairwise task relations ance, while in [17], the horseshoe prior is used as a from data. The dirty approach assumes the decom- basic prior and the whole model is to identify outlier position of the parameter matrix W into two com- tasks in a way different from [ 15]. ponent matrices, each of which is regularized by a type of the sparsity. As a generalization of the dirty approach, the multi-level approach decomposes the Deep-learning approach parameter matrix into more than 2 component ma- Similar to the multi-layer feedforward neural net- trices to model complex relations among all the work model in the feature transformation approach, tasks. In the following sections, we will discuss each basic models in the deep-learning approach include approach in detail. advanced neural network models such as convo- lutional neural networks and recurrent neural net- works. However, unlike the multi-layer feedforward Low-rank approach neural network with a small number of hidden lay- Similar tasks usually have similar model param- ers (e.g. 2 or 3), the deep-learning approach in- eters, which makes W likely to be low-rank. In volves neural networks with tens of or even hun- [24], the model parameters of the m tasks are as- dreds of hidden layers. Moreover, similar to the sumed to share a low-rank subspace, leading to a Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 34 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW i i i T i task clustering. Inspired by the k-means cluster- parametrization of w as w = u +  v , where h×d ing method, Jacob et al. [31] devise a regularizer, ∈ R is a low-rank subspace shared by all the −1 T i.e. tr(W W ), to identify task clusters by tasks with h < d and u is specific to task T . With considering between-cluster and within-cluster vari- an assumption on  that  is orthonormal (i.e. ances, where tr(·) gives the trace of a square ma- = I where I denotes an identity matrix with trix,  denotes an m × m centering matrix, A an appropriate size) to remove the redundancy, u , B for two square matrices A, B means that B − v and are learned by minimizing the training loss A is positive semidefinite (PSD), and with three on all the tasks. This model is then generalized in hyperparameters α, β, γ ,  is required to satisfy [25] by adding a squared Frobenius regularization αI    βI and tr() = γ.TheMTFLmethod on W and this generalized model can be relaxed to is extended in [32] to the case of multiple clusters, have a convex objective function. where each cluster applies the MTFL method, and Based on the analysis in optimization, regu- in order to learn the cluster structure, a regularizer, larizing with the trace norm, which is defined as min(m,d) i.e. WQ  , is employed, where a 0/1di- i =1 W = μ (W), can make a matrix S(1) S(1) i i =1 agonal matrix Q satisfying Q = I can help low-rank and hence trace-norm regularization is i i i =1 identify the structure of the ith cluster. In order to widely used in MTL with [26] as a representative automatically determine the number of clusters, a work. Similar to what the capped- penalty did to p,1 i j structurally sparse regularizer, w − w  ,is the  norm, a variant of the trace-norm regulariza- 2 p,1 j >i proposed in [34] to enforce any pair of model pa- tion called the capped-trace regularizer is proposed min(m,d) rameters to be fused. After learning the parameter in [27] and defined as min(μ (W),θ), i =1 matrix W, the cluster structure can be determined by where θ is a parameter defined by users. Based on θ, i j comparing whether w − w  is below a thresh- only small singular values of W will be penalized and old or not for any pair (i, j). Both works [33,35]de- hence it can lead to a matrix with a lower rank. When compose W as W = LS wherecolumnsin L consist θ becomes large enough, the capped-trace regular- of basis parameter vectors in different clusters and izer will reduce to the trace norm. S contains combination coefficients. Both methods penalize the complexity of L via the squared Frobe- Task-clustering approach nius norm but they learn S in different ways. Specif- The task-clustering approach applies the idea of ically, the method in [33] aims to identify overlap- data-clustering methods to group tasks into several ping task clusters where each task can belong to mul- clusters, each of which has similar tasks in terms of tiple clusters and hence it learns a sparse S via the model parameters. regularization, while in [35], each task lies in only The first task-clustering algorithm proposed in one cluster and hence the  norm of each column [28] decouples the task-clustering procedure and in the 0/1 matrix S is enforced to be 1. the model-learning procedure. Specifically, it first clusters tasks based on the model parameters learned Task-relation learning approach separately under the single-task setting and then In this approach, task relations are used to reflect the pools the training data of all the tasks in a task clus- task relatedness and some examples for the task re- ter to learn a more accurate learner for all the tasks lations include task similarities and task covariances, in this task cluster. This two-stage method may be just to name a few. suboptimal since model parameters learned under In earlier studies on this approach, task relations the single-task setting may be inaccurate, making the are either defined by model assumptions [ 36,37]or task-clustering procedure not so good. So follow-up given by a priori information [38–41]. These two research aims to identify the task clusters and learn ways are not ideal and practical since model assump- model parameters together. tions are hard to verify for real-world applications A multi-task Bayesian neural network, whose and a priori information is difficult to obtain. A more structure is similar to that of the multi-layer neural advanced way is to learn the task relations from data, network shown in Fig. 1,isproposed in [29]toclus- which is the focus of this section. ter tasks based on the Gaussian mixture model in A multi-task Gaussian process is proposed in terms of model parameters (i.e. weights connecting [42] to define a prior on f , the functional value the hidden and output layers). The Dirichlet process, corresponding to x ,as f ∼ N (0,), where f = which is widely used in Bayesian learning to do data 1 m T ( f ,..., f ) .Theentryin  corresponding to clustering, is employed in [30] to do task clustering 1 n the covariance between f and f is defined as based on model parameters {w }. q p p i i Unlike [29,30], which are Bayesian models, there σ ( f , f ) = ω k(x , x ), where k( ·, ·)definesa q ip q j j are several regularized methods [31–35]todo kernel function and ω is the covariance between ip Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 35 tasks T and T . Then, based on the Gaussian like- Dirty approach i p lihood on labels given f, the marginal likelihood, The dirty approach assumes the decomposition of which has an analytical form, is used to learn ,the the parameter matrix W as W = U + V, where U task covariance to reflect the task relatedness, with its and V capture different parts of the task relatedness. (i, p)th entry as ω . In order to utilize Bayesian av- The objective functions of different models in this ip eraging to achieve better performance, a multi-task approach can be unified to minimize the training loss generalized t process is proposed in [43] by placing on all the tasks as well as two regularizers, g (U) and an inverse-Wishart prior on. h(V), on U and V, respectively. Hence, the differ- A regularized model called multi-task- ent methods belonging to this approach differ in the relationship learning (MTRL) method is proposed choices of g (U) and h(V). in [44,45] by placing a matrix-variate normal prior Here we introduce five methods in this approach, on W: W ∼ MN (0, I,), where MN (M, A, B) i.e. [53–57]. Different choices of g (U) and h(V) denotes a matrix-variate normal distribution with for the five methods are shown in Table 1. Based M, A, B as the mean, row covariance and column on Table 1, we can see that the choices of g (U) covariance. This prior corresponds to a regularizer in [53,56]make U row-sparse via the  and ∞,1 2,1 −1 T tr(W W ) where the PSD task covariance  is norms, respectively. The choices of g (U)in[54,55] required to satisfy tr() ≤ 1. The MTRL method enforce U to be low-rank via the trace norm as the is generalized to multi-task boosting [46] and regularizer and constraint, respectively. Unlike these multi-label learning [47], where each label is treated methods, g (U)in[57] penalizes its complexity via as a task, and extended to learn sparse task relations the squared Frobenius norm and clusters feature in [48]. Amodel similartothe MTRL method in different tasks based on the fused lasso regular- is proposed in [49] by assigning a prior on W as izer. For V, h(V) makes it sparse via the  norm W ∼ MN (0, , ), and it learns the sparse in [53,54] and column-sparse via the  norm in 1 2 2,1 inverse of  and  . Since the prior used in the [55,56], while in [57], h(V) penalizes the complex- 1 2 MTRL method implies that W W follows a Wishart ity of V via the squared Frobenius norm. distribution as W(0,), the MTRL method In the decomposition, U mainly identifies the is generalized in [50] by studying a high-order task relatedness among tasks similar to the feature T t prior: (W W) ∼ W(0,), where t is a positive selection approach or low-rank approach while V is integer. In [51], a similar regularizer to that of the capable of capturing noises or outliers via the spar- MTRL method is proposed by assuming a para- sity. The combination of U and V can help the learner −1 T metric form of  as  = (I − A)(I − A) , become more robust. m m where A is an asymmetric task relation claimed in [51]. Unlike the aforementioned methods, Multi-level approach which rely on global learning models, local learning As a generalization of the dirty approach, the multi- methodssuchasthe k-nearest-neighbor (kNN) level approach decomposes the parameter matrix classifier are extended in [ 52] to the multi-task W into h component matrices {W } ,i.e. W = setting and the learning function is defined i i =1 p p i i as f (x ) = σ s (x , x )y , where W , where the number of levels, h,isno ip q q i j ( p,q)∈N (i , j ) j k i =1 N (i, j) denotes the set of task and instance indices smaller than 2. In the following, we show how the for k nearest neighbors of x , s(·, ·) defines the multi-level decomposition can help model complex similarity between instances, and σ represents the task structures. ip similarity of task T to T . By enforcing σ to be In the task-clustering approach, different task p i ip T 2 close to σ , a regularizer  −   is proposed clusters usually have no overlap, which may re- pi in [52] to learn task similarities, where each σ strict the expressive power of the resulting learners. ip needs to satisfy that σ ≥ 0 and |σ |≤ σ for i = p. In [58], all possible task clusters are enumerated, ii ip ii Table 1. Choices of g(U) and h(V) for different methods in the dirty approach. Method g (U) h(V) [53] g (U) = λ U h(V) = λ V 1 ∞,1 2 1 0, if U ≤ λ S(1) 1 [54] g (U) = h(V) = λ V 2 1 +∞, otherwise. [55] g (U) = λ U h(V) = λ V 1 S(1) 2 2,1 [56] g (U) = λ U h(V) = λ V 1 2,1 2 2,1 d 2 2 [57] g (U) = λ |u − u |+ λ U h(V) = λ V 1 ij ik 2 3 i =1 k> j F F Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 36 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW leading to 2 − 1 task clusters, and they are or- for each task based on weighted instances from all ganized in a tree with the root node as a dummy the tasks. node, where the parent–child relation in the tree is the ‘subset of’ relation. This tree has 2 nodes, Discussion each of which corresponds to a level, and hence an Feature-based MTSL can learn a common feature index t denotes both a node in the tree and the representation for different tasks and it is more corresponding level. In order to handle a tree with suitable for applications whose original feature such a large number of nodes, authors make an as- representation is not so informative and discrim- sumption that if a cluster is not useful then none of inative, e.g. in computer vision, natural language its supersets are either, which means that if a node processing and speech. However, feature-based in the tree is not helpful then none of its descen- MTSL can easily be affected by outlier tasks that dants are either. Based on this assumption, a regu- are unrelated to other tasks, since it is difficult to larizer based on the squared  norm is devised, p,1 learn a common feature representation for outlier p p i.e. λ s (W ) , where V de- v t v∈V t ∈D(v) tasks that are unrelated to each other. Given a notes the set of nodes in the tree, λ is a regulariza- good feature representation, parameter-based tion parameter for node v, and D(v) denotes the set MTSL can learn more accurate model parameters of descendants of v.Here s (W ) uses the regularizer and it is more robust to outlier tasks via a robust proposed in [36] to enforce different columns in W representation of model parameters. Hence feature- to be close to their average. Unlike [58] where each based MTSL is complemental to parameter-based level involves a subset of tasks, a multi-level task- MTSL. Instance-based MTSL, which is currently clustering method is proposed in [34] to cluster all being explored, seems parallel to the other two the tasks at each level based on a structurally sparse categories. λ k regularizer w − w  . i −1 2 i =1 k> j i i In summary, the MTSL setting is the most im- In [59], each component matrix is assumed to portant one in the research of MTL since it sets the be jointly sparse and row-sparse but in different stage for research in other settings. Among the ex- proportions, which are more similar in successive isting research efforts in MTL, about 90% of works component matrices. In order to achieve this, a study the MTSL setting, while in the MTSL setting, h−i i −1 regularizer, i.e. W  + W  , i 2,1 i 1 the feature-based and parameter-based MTSL at- i =1 h−1 h−1 is constructed. tract most attention from the community. Unlike the aforementioned methods where dif- ferent component matrices have no direct interac- tion, in [60], with direct connections between com- MULTI-TASK UNSUPERVISED LEARNING ponent matrices at successive levels, the complex hi- Unlike multi-task supervised learning where each erarchical/tree structure among tasks can be learned data instance is associated with a label, in multi-task from data. Specifically, built on the multi-level task- unsupervised learning, the training set D of the ith clustering method [34], a sequential constraint, i.e. task consists of only n data instances {x } and the j j j k k |w − w |≥|w − w |∀i ≥ 2 ∀k > j,isde- i −1 i −1 i i goal of multi-task unsupervised learning is to exploit vised in [60] to help make the whole structure be- the information contained in D . Typical unsuper- come a tree. vised learning tasks include clustering, dimension- Compared with the dirty approach that focuses ality reduction, manifold learning, visualization and on identifying noises or outliers, the multi-level ap- so on, but multi-task unsupervised learning mainly proach is capable of modeling more complex task focuses on multi-task clustering. Clustering is to di- structures such as complex task clusters and tree vide a set of data instances into several groups, each structures. of which has similar instances, and hence multi-task clustering aims to conduct clustering on multiple datasets by leveraging useful information contained Instance-based MTSL in different datasets. There are few works in this category with the Not very many studies on multi-task clustering multi-task distribution matching method proposed exist. In [62], two multi-task-clustering methods are in [61] as a representative work. Specifically, it first proposed. These two methods extend the MTFL and estimates the ratio between probabilities that each MTRL methods [5,44], two models in the MTSL instance is from its own task and from a mixture of setting, to the clustering scenario and the formu- all the tasks. After determining ratios via softmax lations in the proposed two multi-task-clustering functions, this method uses ratios to determine the methods are almost identical to those in the MTFL instance weights and then learns model parameters and MTRL methods, with the only difference being Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 37 that the labels are treated as unknown cluster indica- tradeoff between the learning risk of a low-rank MTL tors that need to be learned from data. model based on the trace-norm regularization and a confidence bound similar to multi-armed bandits, is proposed in [68]. MULTI-TASK SEMI-SUPERVISED LEARNING MULTI-TASK REINFORCEMENT LEARNING In many applications, data usually require a great deal of manual labor to label, making labeled data Inspired by behaviorist psychology, reinforcement not so sufficient, but in many situations, unlabeled learning studies how to take actions in an envi- data are ample. So in this case, unlabeled data are uti- ronment to maximize the cumulative reward and it lized to help improve the performance of supervised shows good performance in many applications with learning, leading to semi-supervised learning, whose AlphaGo, which beats humans in the Go game, as a training set consists of a mixture of labeled and un- representative application. When environments are labeled data. In multi-task semi-supervised learning, similar, different reinforcement learning tasks can the goal is the same, where unlabeled data are used use similar policies to make decisions, which is a mo- to improve the performance of supervised learning tivation of the proposal of multi-task reinforcement while different supervised tasks share useful informa- learning [69–73]. tion to help each other. Specifically, in [ 69], each reinforcement learn- Based on the nature of each task, multi-task semi- ing task is modeled by a Markov decision process supervised learning can be classified into two cate- (MDP) and different MDPs in all the tasks are gories: multi-task semi-supervised classification and related via a hierarchical Bayesian infinite mixture multi-task semi-supervised regression. For multi- model. In [70], each task is characterized via a re- task semi-supervised classification, a method pro- gionalized policy and a Dirichlet process is used to posed in [63,64] follows the task-clustering ap- cluster tasks. In [71], the reinforcement learning proach to do task clustering on different tasks based model for each task is a Gaussian process temporal- on a relaxed Dirichlet process, while in each task, difference value function model and a hierarchical random walk is used to exploit useful information Bayesian model relates value functions of different contained in the unlabeled data. Unlike [63,64], tasks. In [72], the value functions in different tasks a semi-supervised multi-task regression method is are assumed to share sparse parameters and it ap- proposed in [65], where each task adopts a Gaus- plies the multi-task feature selection method with sian process and unlabeled data are used to define the  regularization [8] and the MTFL method 2,1 the kernel function, and Gaussian processes in all the [5] to learn all the value functions simultaneously. In tasks share a common prior on kernel parameters. [73], an actor–mimic method, which is a combina- tion of deep reinforcement learning and model com- pression techniques, is proposed to learn policy net- MULTI-TASK ACTIVE LEARNING works for multiple tasks. The setting of multi-task active learning, where each task has a small number of labeled data and MULTI-TASK ONLINE LEARNING a large amount of unlabeled data in the train- ing set, is almost identical to that of multi-task When the training data in multiple tasks come in a semi-supervised learning. However, unlike multi- sequential way, traditional MTL models cannot han- task semi-supervised learning, which exploits infor- dle them but multi-task online learning is capable mation contained in the unlabeled data, in multi-task of doing this job, as shown in some representative active learning, each task selects informative unla- works [74–79]. beled data to query an oracle to actively acquire their Specifically, in [ 74,75], where different tasks are labels. Hence the criterion for the selection of unla- assumed to have a common goal, a global loss func- beled data is the main research focus in multi-task ac- tion, a combination of individual losses on each task, tive learning [66–68]. measures the relations between tasks, and by using Specifically, two criteria are proposed in [ 66]to absolute norms for the global loss function, several make sure that the selected unlabeled instances are online MTL algorithms are proposed. In [76], the informative for all the tasks instead of only one task. proposed online MTL algorithms model task rela- Unlike [66], in [67] where the learner in each task tions by placing constraints on actions taken for all is a supervised latent Dirichlet allocation model, the the tasks. In [77], online MTL algorithms, which selection criterion for unlabeled data is the expected adopt perceptrons as a basic model and measure error reduction. Moreover, a selection strategy, a task relations based on shared geometric structures Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 38 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW among tasks, are proposed for multi-task classifica- hinge, -insensitive and square losses, are studied in tion problems. In [78], a Bayesian online algorithm [82], making this parallel method applicable to both is proposed for a multi-task Gaussian process that classification and regression problems in MTSL. shares kernel parameters among tasks. In [79], an In some cases, training data for different tasks online algorithm is proposed for the MTRL method may exist in different machines, which makes it dif- [44] by updating model parameters and task covari- ficult for conventional MTL models to work, even ance together. though all the training data can be moved to one machine, which incurs additional transmission and storage costs. A better option is to devise distributed MULTI-TASK MULTI-VIEW LEARNING MTL models that can directly operate on data dis- In some applications such as computer vision, each tributed on multiple machines. In [83], a distributed data point can be described by different feature algorithm is proposed based on a debiased lasso representations; one example is image data, whose model and by learning one task in a machine, this al- features include SIFT and wavelet, to name just gorithm achieves efficient communications. a few. In this case, each feature representation is called a view and multi-view learning, a learning APPLICATIONS OF MULTI-TASK paradigm in machine learning, is proposed to han- LEARNING dle such data with multiple views. Similar to super- vised learning, each multi-view data point is usu- Several areas, including computer vision, bioinfor- ally associated with a label. Multi-view learning aims matics, health informatics, speech, natural language to exploit useful information contained in multi- processing, web applications and ubiquitous com- ple views to further improve the performance over puting, use MTL to boost the performance of their supervised learning, which can be considered as a respective applications. In this section, we review single-view learning paradigm. As a multi-task ex- some related works. tension of multi-view learning, multi-task multi-view learning [80,81] hopes to exploit multiple multi- Computer vision view learning problems to improve the performance over each multi-view learning problem by leveraging The applications of MTL in computer vision can be useful information contained in related tasks. divided into two categories, including image-based Specifically, in [ 80], the first multi-task multi- and video-based applications. view classifier is proposed to utilize the task related- Image-based MTL applications include two sub- ness based on common views shared by tasks and categories: facial images and non-facial images. view consistency among views in each task. In [81], Specifically, applications of MTL based on facial different views in each task achieve consensus on un- images include face verification [ 84], personalized labeled data and different tasks are learned by ex- age estimation [85], multi-cue face recognition ploiting a priori information as in [38] or learning [86], head-pose estimation [22,87], facial landmark task relations as the MTRL method did. detection [18], and facial image rotation [88]. Ap- plications of MTL based on non-facial images in- clude object categorization [86], image segmenta- PARALLEL AND DISTRIBUTED MTL tion [89,90], identifying brain imaging predictors When the number of tasks is large, if we directly [91], saliency detection [92], action recognition apply a multi-task learner, the computational com- [93], scene classification [ 94], multi-attribute pre- plexity may be high. Nowadays the computational diction [95], multi-camera person re-identification capacity of a computer is very powerful due to the [96], and immediacy prediction [97]. multi-CPU or multi-GPU architecture involved. So Applications of MTL based on videos include we can make use of these powerful computing facil- visual tracking [98–100] and thumbnail selection ities to devise parallel MTL algorithms to accelerate [19]. the training process. In [82], a parallel MTL method is devised to solve a subproblem of the MTRL model Bioinformatics and health informatics [44], which also occurs in many regularized meth- ods belonging to the task-relation learning approach. Applications of MTL in bioinformatics and health Specifically, this method utilizes the FISTA algo- informatics include organism modeling [101], rithm to design a decomposable surrogate function mechanism identification of response to therapeu- with respect to all the tasks and this surrogate func- tic targets [102], cross-platform siRNA efficacy tion can be parallelized to speed up the learning pro- prediction [103], detection of causal genetic cess. Moreover, three loss functions, including the markers through association analysis of multiple Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 39 difficult to model, the generalization performance populations [104], construction of personalized cannot be computed and instead the generalization brain–computer interfaces [105], MHC-I binding prediction [106], splice-site prediction [106], bound is used to provide an upper bound for the gen- protein subcellular location prediction [107], eralization performance. Alzheimer’s disease assessment scale cognitive The first generalization bound for MTL is derived subscale [108], prediction of cognitive outcomes in [133] for a general MTL model. Then there are from neuroimaging measures in Alzheimer’s disease many studies to analyze generalization bounds of dif- [109], identification of longitudinal phenotypic ferent MTL approaches, including e.g. [7,134]for markers for Alzheimer’s disease progression pre- the feature transform approach, [135]for thefea- diction [110], prioritization of disease genes [111], ture selection approach, [24,135–138] for the low- biological image analysis based on natural images rank approach, [136] for the task-relation learning [20], survival analysis [112], and multiple genetic approach, and [138] for the dirty approach. trait prediction [113]. CONCLUSIONS In this paper, we give an overview of MTL. Firstly, Speech and natural language processing we give a definition of MTL. After that, different Applications of MTL in speech include speech syn- settings of MTL are presented, including multi-task thesis [114,115] and those for natural language supervised learning, multi-task unsupervised learn- processing include joint learning of six NLP tasks ing, multi-task semi-supervised learning, multi-task (i.e. part-of-speech tagging, chunking, named en- active learning, multi-task reinforcement learning, tity recognition, semantic role labeling, language multi-task online learning and multi-task multi-view modeling and semantically related words) [116], learning. For each setting, we introduce its represen- multi-domain sentiment classification [ 117], multi- tative models. Then parallel and distributed MTL domain dialog state tracking [21], machine transla- models, which can help speed up the learning pro- tion [118], syntactic parsing [118], and microblog cess, are discussed. Finally, we review the applica- analysis [119,120]. tions of MTL in various areas and present theoretical analyses for MTL. Recently deep learning has become popular in Web applications many applications and several deep models have Web applications based on MTL include learning been devised for MTL. Almost all the deep mod- to rank in web searches [121], web search ranking els just share hidden layers for different tasks; this [122], multi-domain collaborative filtering [ 123], way of sharing knowledge among tasks is very use- behavioral targeting [124], and conversion maxi- ful when all the tasks are very similar, but when this mization in display advertising [125]. assumption is violated, the performance will signif- icantly deteriorate. We think one future direction Ubiquitous computing for multi-task deep models is to design more flex- ible architectures that can tolerate dissimilar tasks Applications of MTL in ubiquitous computing in- and even outlier tasks. Moreover, the deep-learning, clude stock prediction [126], multi-device local- task-clustering and multi-level approaches lack theo- ization [127], the inverse dynamics problem for retical foundations and more analyses are needed to robotics [128,129], estimation of travel costs on guide the research in these approaches. road networks [130], travel-time prediction on road networks [131], and traffic-sign recognition [ 132]. FUNDING This work was supported by the National Basic Research Pro- THEORETICAL ANALYSIS gram of China (973 Program) (2014CB340304), the Hong Kong CERG projects (16211214, 16209715 and 16244616), Learning theory, an area in machine learning, studies the National Natural Science Foundation of China (61473087 the theoretical aspect of learning models including and 61673202), and the Natural Science Foundation of Jiangsu MTL models. In the following, we introduce some Province (BK20141340). representative works. The theoretical analysis in MTL mainly focuses on deriving the generalization bound of MTL mod- REFERENCES els. It is well known that the generalization per- 1. Caruana R. Multitask learning. Mach Learn 1997; 28: 41–75. formance of MTL models on unseen test data is 2. Pan SJ and Yang Q. A survey on transfer learning. IEEE Trans the main concern in MTL and machine learning. Knowl Data Eng 2010; 22: 1345–59. However, since the underlying data distribution is Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 40 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW 3. Zhang M and Zhou Z. A review on multi-label learning algorithms. IEEE Trans 24. Ando RK and Zhang T. A framework for learning predictive structures from Knowl Data Eng 2014; 26: 1819–37. multiple tasks and unlabeled data. J Mach Learn Res 2005; 6: 1817–53. 4. Zhang Y and Yang Q. A survey on multi-task learning. arXiv:1707.08114. 25. Chen J, Tang L and Liu J et al. A convex formulation for learning shared struc- 5. Argyriou A, Evgeniou T and Pontil M. Multi-task feature learning. In: Advances tures from multiple tasks. In:Proceedingsofthe26thInternationalConference in Neural Information Processing Systems 19. 2006, 41–8. on Machine Learning. 2009, 137–44. 6. Argyriou A, Evgeniou T and Pontil M. Convex multi-task feature learning.Mach 26. Pong TK, Tseng P and Ji S et al. Trace norm regularization: reformu- Learn 2008; 73: 243–72. lations, algorithms, and multi-task learning. SIAM J Optim 2010; 20: 7. Maurer A, Pontil M and Romera-Paredes B. Sparse coding for multitask and 3465–89. transfer learning. In: Proceedings of the 30th International Conference on Ma- 27. Han L and Zhang Y. Multi-stage multi-task learning with reduced rank. In: chine Learning. 2013, 343–51. Proceedings of the 30th AAAI Conference on Artificial Intelligence . 2016. 8. Obozinski G, Taskar B and Jordan M. Multi-task feature selection. Ph.D. The- 28. Thrun S and O’Sullivan J. Discovering structure in multiple learning tasks: the sis. University of California, Berkeley Department of Statistics 2006. TC algorithm. In: Proceedings of the 13th International Conference on Ma- 9. Obozinski G, Taskar B and Jordan M. Joint covariate selection and joint sub- chine Learning. 1996, 489–97. space selection for multiple classification problems. Stat Comput 2010; 20: 29. Bakker B and Heskes T. Task clustering and gating for Bayesian multitask 231–52. learning. J Mach Learn Res 2003; 4: 83–99. 10. Liu H, Palatucci M and Zhang J. Blockwise coordinate descent procedures for 30. Xue Y, Liao X and Carin L et al. Multi-task learning for classification with the multi-task lasso, with applications to neural semantic basis discovery. In: Dirichlet process priors. J Mach Learn Res 2007; 8: 35–63. Proceedings of the 26th International Conference on Machine Learning. 2009, 31. Jacob L, Bach FR and Vert JP. Clustered multi-task learning: a convex for- 649–56. mulation. In: Advances in Neural Information Processing Systems 21. 2008, 11. Gong P, Ye J and Zhang C. Multi-stage multi-task feature learning. JMach 745–52. Learn Res 2013; 14: 2979–3010. 32. Kang Z, Grauman K and Sha F. Learning with whom to share in multi-task 12. Lozano AC and Swirszcz G. Multi-level lasso for sparse multi-task regression. feature learning. In: Proceedings of the 28th International Conference on Ma- In: Proceedings of the 29th International Conference on Machine Learning. chine Learning. 2011, 521–8. 2012. 33. Kumar A and III HD. Learning task grouping and overlap in multi-task learning. 13. Wang X, Bi J and Yu S et al. On multiplicative multitask feature learning. In: In: Proceedings of the 29th International Conference on Machine Learning. Advances in Neural Information Processing Systems 27. 2014, 2411–9. 2012. 14. Han L, Zhang Y and Song G et al. Encoding tree sparsity in multi-task learning: 34. Han L and Zhang Y. Learning multi-level task groups in multi-task learning. In: a probabilistic framework. In: Proceedings of the 28th AAAI Conference on Proceedings of the 29th AAAI Conference on Artificial Intelligence . 2015. Artificial Intelligence . 2014, 1854–60. 35. Barzilai A and Crammer K. Convex multi-task learning by clustering. In: Pro- 15. Zhang Y, Yeung DY and Xu Q. Probabilistic multi-task feature selection. In: ceedings of the 18th International Conference on Artificial Intelligence and Advances in Neural Information Processing Systems 23. 2010, 2559–67. Statistics. 2015. 16. Hernandez-Lobato ´ D and Hernandez-Lobato ´ JM. Learning feature selection 36. Evgeniou T and Pontil M. Regularized multi-task learning. In: Proceedings of dependencies in multi-task learning. In: Advances in Neural Information Pro- the 10th ACM SIGKDD International Conference on Knowledge Discovery and cessing Systems 26. 2013, 746–54. Data Mining. 2004, 109–17. 17. Hernandez-Lobato ´ D, Hernandez-Lobato ´ JM and Ghahramani Z. A probabilis- 37. Parameswaran S and Weinberger KQ. Large margin multi-task metric learn- tic model for dirty multi-task feature selection. In: Proceedings of the 32nd ing. In: Advances in Neural Information Processing Systems 23. 2010, International Conference on Machine Learning. 2015, 1073–82. 1867–75. 18. Zhang Z, Luo P and Loy CC et al. Facial landmark detection by deep multi- 38. Evgeniou T, Micchelli CA and Pontil M. Learning multiple tasks with kernel task learning. In: Proceedings of the 13th European Conference on Computer methods. J Mach Learn Res 2005; 6: 615–37. Vision. 2014, 94–108. 39. Kato T, Kashima H and Sugiyama M et al. Multi-task learning via conic pro- 19. Liu W, Mei T and Zhang Y et al. Multi-task deep visual-semantic embedding gramming. In: Advances in Neural Information Processing Systems 20. 2007, for video thumbnail selection. In:ProceedingsofIEEEConferenceonComputer 737–44. Vision and Pattern Recognition. 2015, 3707–15. 40. Kato T, Kashima H and Sugiyama M et al. Conic programming for multitask 20. Zhang W, Li R and Zeng T et al. Deep model based transfer and multi- learning. IEEE Trans Knowl Data Eng 2010; 22: 957–68. task learning for biological image analysis. In: Proceedings of the 21th ACM 41. Gornitz ¨ N, Widmer C and Zeller G et al. Hierarchical multitask structured out- SIGKDD International Conference on Knowledge Discovery and Data Mining. put learning for large-scale sequence segmentation. In: Advances in Neural 2015, 1475–84. Information Processing Systems 24. 2011, 2690–8. 21. Mrksic N, Seaghdha ´ DO and Thomson Betal. Multi-domain dialog state track- 42. Bonilla EV, Chai KMA and Williams CKI. Multi-task Gaussian process predic- ing using recurrent neural networks. In: Proceedings of the 53rd Annual Meet- tion. In: Advances in Neural Information Processing Systems 20. 2007, 153– ing of the Association for Computational Linguistics. 2015, 794–9. 60. 22. Li S, Liu Z and Chan AB. Heterogeneous multi-task learning for human pose 43. Zhang Y and Yeung DY. Multi-task learning using generalized t process. In: estimation with deep convolutional neural network. Int J Comput Vis 2015; Proceedings of the 13th International Conference on Artificial Intelligence and 113: 19–36. Statistics. 2010, 964–71. 23. Misra I, Shrivastava A and Gupta A et al. Cross-stitch networks for multi-task 44. Zhang Y and Yeung DY. A convex formulation for learning task relationships learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern in multi-task learning. In: Proceedings of the 26th Conference on Uncertainty Recognition. 2016, 3994–4003. in Artificial Intelligence . 2010, 733–42. Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 41 45. Zhang Y and Yeung DY. A regularization approach to learning task relation- 66. Reichart R, Tomanek K and Hahn U et al. Multi-task active learning for linguis- ships in multitask learning. ACM Trans Knowl Discov Data 2014; 8: 12. tic annotations. In:Proceedingsofthe46thAnnualMeetingoftheAssociation 46. Zhang Y and Yeung DY. Multi-task boosting by exploiting task relationships. for Computational Linguistics. 2008, 861–9. In: Proceedings of European Conference on Machine Learning and Principles 67. Acharya A, Mooney RJ and Ghosh J. Active multitask learning using both and Practice of Knowledge Discovery in Databases. 2012, 697–710. latent and supervised shared topics. In: Proceedings of the 2014 SIAM Inter- 47. Zhang Y and Yeung DY. Multilabel relationship learning. ACM Trans Knowl national Conference on Data Mining. 2014, 190–8. Discov Data 2013; 7:7. 68. Fang M and Tao D. Active multi-task learning via bandits. In: Proceedings of 48. Zhang Y and Yang Q. Learning sparse task relations in multi-task learning. In: the 2015 SIAM International Conference on Data Mining. 2015, 505–13. Proceedings of the 31st AAAI Conference on Artificial Intelligence . 2017. 69. Wilson A, Fern A and Ray S et al. Multi-task reinforcement learning: a hierar- 49. Zhang Y and Schneider JG. Learning multiple tasks with a sparse matrix- chical Bayesian approach. In: Proceedings of the Twenty-Fourth International normal penalty. In: Advances in Neural Information Processing Systems 23. Conference on Machine Learning. 2007, 1015–22. 2010, 2550–8. 70. Li H, Liao X and Carin L. Multi-task reinforcement learning in partially observ- 50. Zhang Y and Yeung DY. Learning high-order task relationships in multi-task able stochastic environments. J Mach Learn Res 2009; 10: 1131–86. learning. In: Proceedings of the 23rd International Joint Conference on Artifi- 71. Lazaric A and Ghavamzadeh M. Bayesian multi-task reinforcement learning. cial Intelligence. 2013. In: Proceedings of the 27th International Conference on Machine Learning. 51. Lee G, Yang E and Hwang SJ. Asymmetric multi-task learning based on task 2010, 599–606. relatedness and loss. In: Proceedings of the 33rd International Conference on 72. Calandriello D, Lazaric A and Restelli M. Sparse multi-task reinforcement Machine Learning. 2016, 230–8. learning. In: Advances in Neural Information Processing Systems 27. 2014, 52. Zhang Y. Heterogeneous-neighborhood-based multi-task local learning algo- 819–27. rithms. In: Advances in Neural Information Processing Systems 26. 2013. 73. Parisotto E, Ba J and Salakhutdinov R. Actor-mimic: deep multitask and trans- 53. Jalali A, Ravikumar P and Sanghavi S et al. A dirty model for multi-task learn- fer reinforcement learning. In:Proceedingsofthe4thInternationalConference ing. In: Advances in Neural Information Processing Systems 23. 2010, 964–72. on Learning Representations. 2016. 54. Chen J, Liu J and Ye J. Learning incoherent sparse and low-rank patterns 74. Dekel O, Long PM and Singer Y. Online multitask learning. In: Proceedings of from multiple tasks. In: Proceedings of the 16th ACM SIGKDD International the 19th Annual Conference on Learning Theory. 2006, 453–67. Conference on Knowledge Discovery and Data Mining. 2010, 1179–88. 75. Dekel O, Long PM and Singer Y. Online learning of multiple tasks with a shared 55. Chen J, Zhou J and Ye J. Integrating low-rank and group-sparse structures for loss. J Mach Learn Res 2007; 8: 2233–64. robust multi-task learning. In: Proceedings of the 17th ACM SIGKDD Interna- 76. Lugosi G, Papaspiliopoulos O and Stoltz G. Online multi-task learning with tional Conference on Knowledge Discovery and Data Mining. 2011, 42–50. hard constraints. In: Proceedings of the 22nd Conference on Learning Theory. 56. Gong P, Ye J and Zhang C. Robust multi-task feature learning. In: Proceedings 2009. of the 18th ACM SIGKDD International Conference on Knowledge Discovery 77. Cavallanti G, Cesa-Bianchi N and Gentile C. Linear algorithms for online mul- and Data Mining. 2012, 895–903. titask classification. J Mach Learn Res 2010; 11: 2901–34. 57. Zhong W and Kwok JT. Convex multitask learning with flexible task clusters. 78. Pillonetto G, Dinuzzo F and Nicolao GD. Bayesian online multitask learn- In: Proceedings of the 29th International Conference on Machine Learning. ing of Gaussian processes. IEEE Trans Pattern Anal Mach Intell 2010; 32: 2012. 193–205. 58. Jawanpuria P and Nath JS. A convex feature learning formulation for latent 79. Saha A, Rai P, Daume´ H and Venkatasubramanian S. Online learning of task structure discovery. In: Proceedings of the 29th International Conference multiple tasks and their relationships. In: Proceedings of the Fourteenth on Machine Learning. 2012. International Conference on Artificial Intelligence and Statistics . 2011, 59. Zweig A and Weinshall D. Hierarchical regularization cascade for joint learn- 643–51. ing. In: Proceedings of the 30th International Conference on Machine Learn- 80. He J and Lawrence R. A graph-based framework for multi-task multi-view ing. 2013, 37–45. learning. In: Proceedings of the 28th International Conference on Machine 60. Han L and Zhang Y. Learning tree structure in multi-task learning. In: Proceed- Learning. 2011, 25–32. ings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data 81. Zhang J and Huan J. Inductive multi-task learning with multiple view data. In: Mining. 2015. Proceedingsofthe18thACMSIGKDDInternationalConferenceonKnowledge 61. Bickel S, Bogojeska J and Lengauer T et al. Multi-task learning for HIV therapy Discovery and Data Mining. 2012, 543–51. screening. In: Proceedings of the Twenty-Fifth International Conference on 82. Zhang Y. Parallel multi-task learning. In: Proceedings of the IEEE International Machine Learning. 2008, 56–63. Conference on Data Mining. 2015. 62. Zhang X. Convex discriminative multitask clustering. IEEE Trans Pattern Anal 83. Wang J, Kolar M and Srebro N. Distributed multi-task learning. In: Proceed- Mach Intell 2015; 37: 28–40. ings of the 19th International Conference on Artificial Intelligence and Statis- 63. Liu Q, Liao X and Carin L. Semi-supervised multitask learning. In: Advances in tics. 2016, 751–60. Neural Information Processing Systems 20. 2007, 937–44. 84. Wang X, Zhang C and Zhang Z. Boosted multi-task learning for face verifica- 64. Liu Q, Liao X and Li H et al. Semisupervised multitask learning. IEEE Trans tion with applications to web image and video search. In: Proceedings of IEEE Pattern Anal Mach Intell 2009; 31: 1074–86. Conference on Computer Vision and Pattern Recognition. 2009, 142–9. 65. Zhang Y and Yeung D. Semi-supervised multi-task regression. In: Proceedings 85. Zhang Y and Yeung DY. Multi-task warped Gaussian process for personalized of European Conference on Machine Learning and Knowledge Discovery in age estimation. In: Proceedings of IEEE Conference on Computer Vision and Databases. 2009, 617–31. Pattern Recognition. 2010. Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 42 Natl Sci Rev, 2018, Vol. 5, No. 1 REVIEW 86. Yuan X and Yan S. Visual classification with multi-task joint sparse represen- 106. Widmer C, Toussaint NC and Altun Y et al. Inferring latent task structure tation. In: Proceedings of IEEE Conference on Computer Vision and Pattern for multitask learning by multiple kernel learning. BMC Bioinformatics 2010; Recognition. 2010, 3493–500. 11:S5. 87. Yan Y, Ricci E and Ramanathan S et al. No matter where you are: flexible 107. Xu Q, Pan SJ and Xue HH et al. Multitask learning for protein subcellu- graph-guided multi-task learning for multi-view head pose classification under lar location prediction. IEEE ACM Trans Comput Biol Bioinformatics 2011; 8: target motion. In: Proceedings of IEEE International Conference on Computer 748–59. Vision. 2013, 1177–84. 108. Zhou J, Yuan L and Liu J et al. A multi-task learning formulation for pre- 88. Yim J, Jung H and Yoo B et al. Rotating your face using multi-task deep neural dicting disease progression. In: Proceedings of the 17th ACM SIGKDD In- network. In: Proceedings of IEEE Conference on Computer Vision and Pattern ternational Conference on Knowledge Discovery and Data Mining. 2011, Recognition. 2015, 676–84. 814–22. 89. An Q, Wang C and Shterev I et al. Hierarchical kernel stick-breaking process 109. Wan J, Zhang Z and Yan J et al. Sparse Bayesian multi-task learning for for multi-task image analysis. In: Proceedings of the 25th International Con- predicting cognitive outcomes from neuroimaging measures in Alzheimer’s ference on Machine Learning. 2008, 17–24. disease. In: Proceedings of IEEE Conference on Computer Vision and Pattern 90. Cheng B, Liu G and Wang J et al. Multi-task low-rank affinity pursuit for image Recognition. 2012, 940–7. segmentation. In: Proceedings of IEEE International Conference on Computer 110. Wang H, Nie F and Huang H et al. High-order multi-task feature learning to Vision. 2011, 2439–46. identify longitudinal phenotypic markers for alzheimer’s disease progression 91. Wang H, Nie F and Huang H et al. Sparse multi-task regression and feature prediction. In: Advances in Neural Information Processing Systems 25. 2012, selection to identify brain imaging predictors for memory performance. In:Pro- 1286–94. ceedings of IEEE International Conference on Computer Vision. 2011, 557–62. 111. Mordelet F and Vert J. ProDiGe: Prioritization of disease genes with multitask 92. Lang C, Liu G and Yu J et al. Saliency detection by multitask sparsity pursuit. machine learning from positive and unlabeled examples. BMC Bioinformatics IEEE Trans Image Process 2012; 21: 1327–38. 2011; 12: 389. 93. Yuan C, Hu W and Tian G et al. Multi-task sparse learning with beta process 112. Li Y, Wang J and Ye J et al. A multi-task learning formulation for survival prior for action recognition. In: Proceedings of IEEE Conference on Computer analysis. In: Proceedings of the 22nd ACM SIGKDD International Conference Vision and Pattern Recognition. 2013, 423–9. on Knowledge Discovery and Data Mining. 2016, 1715–24. 94. Lapin M, Schiele B and Hein M. Scalable multitask representation learning for 113. He D, Kuhn D and Parida L. Novel applications of multitask learning and multi- scene classification. In: Proceedings of IEEE Conference on Computer Vision ple output regression to multiple genetic trait prediction. Bioinformatics 2016; and Pattern Recognition. 2014, 1434–41. 32: 37–43. 95. Abdulnabi AH, Wang G and Lu J et al. Multi-task CNN model for attribute 114. Wu Z, Valentini-Botinhao C and Watts O et al. Deep neural networks employ- prediction. IEEE Trans Multimed 2015; 17: 1949–59. ing multi-task learning and stacked bottleneck features for speech synthe- 96. Su C, Yang F and Zhang S et al. Multi-task learning with low rank attribute sis. In: Proceedings of the 2015 IEEE International Conference on Acoustics, embedding for person re-identification. In: Proceedings of IEEE International Speech and Signal Processing. 2015, 4460–4. Conference on Computer Vision. 2015, 3739–47. 115. Hu Q, Wu Z and Richmond K et al. Fusion of multiple parameterisations for 97. Chu X, Ouyang W and Yang W et al. Multi-task recurrent neural network for DNN-based sinusoidal speech synthesis with multi-task learning. In: Proceed- immediacy prediction. In: Proceedings of IEEE International Conference on ings of the 16th Annual Conference of the International Speech Communica- Computer Vision. 2015, 3352–60. tion Association. 2015, 854–8. 98. Zhang T, Ghanem B and Liu Setal. Robust visual tracking via multi-task sparse 116. Collobert R and Weston J. A unified architecture for natural language pro- learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern cessing: deep neural networks with multitask learning. In: Proceedings of the Recognition. 2012, 2042–9. 25th International Conference on Machine Learning. 2008, 160–7. 99. Zhang T, Ghanem B and Liu S et al. Robust visual tracking via structured multi- 117. Wu F and Huang Y. Collaborative multi-domain sentiment classification. In: task sparse learning. Int J Comput Vis 2013; 101: 367–83. Proceedings of the 2015 IEEE International Conference on Data Mining. 2015, 100. Hong Z, Mei X and Prokhorov DV et al. Tracking via robust multi-task multi- 459–68. view joint sparse representation. In: Proceedings of IEEE International Confer- 118. Luong M, Le QV and Sutskever I et al. Multi-task sequence to sequence learn- ence on Computer Vision. 2013, 649–56. ing. In: Proceedings of the 4th International Conference on Learning Repre- 101. Widmer C, Leiva J and Altun Y et al. Leveraging sequence classification by sentations. 2016. taxonomy-based multitask learning. In: Proceedings of the 14th Annual Inter- 119. Zhao L, Sun Q and Ye J et al. Multi-task learning for spatio-temporal event national Conference on Research in Computational Molecular Biology. 2010, forecasting. In: Proceedings of the 21th ACM SIGKDD International Confer- 522–34. ence on Knowledge Discovery and Data Mining. 2015, 1503–12. 102. Zhang K, Gray JW and Parvin B. Sparse multitask regression for identifying 120. Zhao L, Sun Q and Ye J et al. Feature constrained multi-task learning models common mechanism of response to therapeutic targets. Bioinformatics 2010; for spatiotemporal event forecasting. IEEE Trans Knowl Data Eng 2017; 29: 26: 97–105. 1059–72. 103. Liu Q, Xu Q and Zheng VW et al. Multi-task learning for cross-platform siRNA 121. Bai J, Zhou K and Xue G et al. Multi-task learning for learning to rank in efficacy prediction: an in-silico study. BMC Bioinformatics 2010; 11: 181. web search. In: Proceedings of the 18th ACM Conference on Information and 104. Puniyani K, Kim S and Xing EP. Multi-population GWA mapping via multi-task Knowledge Management. 2009, 1549–52. regularized regression. Bioinformatics 2010; 26: 208–16. 122. Chapelle O, Shivaswamy PK and Vadrevu Setal. Multi-task learning for boost- 105. Alamgir M, Grosse-Wentrup M and Altun Y. Multitask learning for brain- ing with application to web search ranking. In: Proceedings of the 16th ACM computer interfaces. In: Proceedings of the 13th International Conference on SIGKDD International Conference on Knowledge Discovery and Data Mining. Artificial Intelligence and Statistics . 2010, 17–24. 2010, 1189–98. Downloaded from https://academic.oup.com/nsr/article/5/1/30/4101432 by DeepDyve user on 20 July 2022 REVIEW Zhang and Yang 43 123. Zhang Y, Cao B and Yeung DY. Multi-domain collaborative filtering. In: Pro- 130. Zheng J and Ni LM. Time-dependent trajectory regression on road networks ceedingsofthe26thConferenceonUncertaintyinArtificialIntelligence . 2010, via multi-task learning. In: Proceedings of the 27th AAAI Conference on Arti- 725–32. ficial Intelligence . 2013. 124. Ahmed A, Aly M and Das A et al. Web-scale multi-task feature se- 131. Huang A, Xu L and Li Y et al. Robust dynamic trajectory regression on road net- lection for behavioral targeting. In: Proceedings of the 21st ACM Inter- works: a multi-task learning framework. In: Proceedings of IEEE International national Conference on Information and Knowledge Management. 2012, Conference on Data Mining. 2014, 857–62. 1737–41. 132. Lu X, Wang Y and Zhou X et al. Traffic sign recognition via multi-modal tree- 125. Ahmed A, Das A and Smola AJ. Scalable hierarchical multitask learning al- structure embedded multi-task learning. IEEE Trans Intell Transport Syst 2017; gorithms for conversion optimization in display advertising. In: Proceedings 18: 960–72. of the 7th ACM International Conference on Web Search and Data Mining. 133. Baxter J. A model of inductive bias learning. J Artif Intell Res 2000; 12: 149– 2014, 153–62. 98. 126. Ghosn J and Bengio Y. Multi-task learning for stock selection. In: Advances 134. Maurer A. Bounds for linear multi-task learning. J Mach Learn Res 2006; 7: in Neural Information Processing Systems 9. 1996, 946–52. 117–39. 127. Zheng VW, Pan SJ and Yang Q et al. Transferring multi-device localization 135. Kakade SM, Shalev-Shwartz S and Tewari A. Regularization techniques for models using latent multi-task learning. In: Proceedings of the 23rd AAAI Con- learning with matrices. J Mach Learn Res 2012; 13: 1865–90. ference on Artificial Intelligence . 2008, 1427–32. 136. Maurer A. The Rademacher complexity of linear transformation classes. In: 128. Chai KMA, Williams CKI and Klanke S et al. Multi-task Gaussian process Proceedings of the 19th Annual Conference on Learning Theory. 2006, 65–78. learning of robot inverse dynamics. In: Advances in Neural Information Pro- 137. Pontil M and Maurer A. Excess risk bounds for multitask learning with trace cessing Systems 21, December 8-11, 2008. 2008, 265–72. norm regularization. In: Proceedings of the 26th Annual Conference on Learn- 129. Yeung DY and Zhang Y. Learning inverse dynamics by Gaussian process re- ing Theory. 2013, 55–76. gression under the multi-task learning framework. In:ThePathtoAutonomous 138. Zhang Y. Multi-task learning and algorithmic stability. In: Proceedings of the Robots. Berlin: Springer, 2009, 131–42. 29th AAAI Conference on Artificial Intelligence . 2015.

Journal

National Science ReviewOxford University Press

Published: Jan 1, 2018

There are no references for this article.