Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Identification and characterization of moonlighting long non-coding RNAs based on RNA and protein interactome

Identification and characterization of moonlighting long non-coding RNAs based on RNA and protein... Motivation: Moonlighting proteins are a class of proteins having multiple distinct functions, which play essential roles in a variety of cellular and enzymatic functioning systems. Although there have long been calls for computational algorithms for the identification of moonlighting proteins, research on approaches to identify moonlighting long non-coding RNAs (lncRNAs) has never been undertaken. Here, we introduce a novel methodology, MoonFinder, for the identification of moon- lighting lncRNAs. MoonFinder is a statistical algorithm identifying moonlighting lncRNAs without a priori knowledge through the integration of protein interactome, RNA–protein interactions and functional annotation of proteins. Results: We identify 155 moonlighting lncRNA candidates and uncover that they are a distinct class of lncRNAs characterized by specific sequence and cellular localization features. The non-coding genes that transcript moonlighting lncRNAs tend to have shorter but more exons and the moon- lighting lncRNAs have a variable localization pattern with a high chance of residing in the cytoplas- mic compartment in comparison to the other lncRNAs. Moreover, moonlighting lncRNAs and moonlighting proteins are rather mutually exclusive in terms of both their direct interactions and interacting partners. Our results also shed light on how the moonlighting candidates and their interacting proteins implicated in the formation and development of cancers and other diseases. Availability and implementation: The code implementing MoonFinder is supplied as an R package in the supplementary material. Contact: lxcheng@cse.cuhk.edu.hk or ksleung@cse.cuhk.edu.hk Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction possession of additional roles such as signal transduction, transcrip- Protein moonlighting is a common phenomenon in nature involving tional regulation, apoptosis, motility and structural proteins a protein with a single polypeptide chain that can perform more (Jeffery, 2015). For example, crystallins, a class of well-studied than one independent cellular function (Boukouris et al., 2016; moonlighting proteins, function as enzymes when expressed at low Monaghan and Whitmarsh, 2015). Enzymes, receptors, ion channels levels in many tissues, but are densely packed to form lenses when or chaperones are the typical form of moonlighting proteins (MPs). expressed at high levels in eye tissue (Piatigorsky et al., 1988; Enzyme is the most common form of moonlighting proteins Piatigorsky and Wistow, 1989). The genes encoding crystallins need whose primary function is enzymatic catalysis, but they are also in to sustain functions of both catalytic and transparency maintenance. V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 3519 3520 L.Cheng and K.-S. Leung Another example is glycolysis, an ancient universal metabolic interactions, gene expression, phylogenetic profiles, genetic interac- pathway, in which a high percentage of proteins are moonlighting tions, network-based graph properties and protein structural proper- proteins (Boukouris et al., 2016; Sriram et al., 2005). Moreover, ties (Khan and Kihara, 2016). In general, MPs are assigned in more some proteins work on their moonlighting by being assembled to clusters because they interact with proteins of diverse functions, so supramolecules, such as the ribosome, which usually composed of the number of clusters that a protein involved is used as an omics more than a hundred of proteins and RNAs. However, the studies of feature. For proteins that do not have an available record of certain moonlighting merely concentrated on proteins and the genes coding features, an imputation step using random forest is used to predict them, yet the moonlighting of non-coding RNAs has not been the missing features. Eventually, these features are combined with investigated, despite the fact that ncRNAs have gained widespread machine learning classifiers to make moonlighting protein predic- attention due to their functional importance over the last decade tion. DextMP is a text mining algorithm consisting of four logical (Chen, 2016a,b; Liao et al., 2011; Quinn and Chang, 2016; Zhou steps (Khan et al., 2017). First, it extracts textual information of et al., 2017). proteins from literature and functional description in the UniProt Currently, the information of MPs, such as protein functions, database. Next, it constructs a k-dimensional feature vector from cell localization and primary structures, is scattered across a number each text using three language models, i.e. paragraph vector, Term of publications, since the MPs perform a variety of functions in dif- Frequency-Inverse Document Frequency (TFIDF) in the bag- ferent tissues and cell types. Some researchers have summarized the of-words category and Latent Dirichlet Allocation (LDA) in the literature about MPs from different aspects of the functional diver- topic modeling category. Third, using four machine learning classi- sity, such as regulation circuits or signaling pathways. The Jeffery fiers, a text is classified to either MP or non-MP based on the text lab constructed a manually curated database MoonProt, which con- features. Finally, the text predictions for each protein are separately sists of over 200 MPs that have been experimentally verified (Mani summarized to predict which ones are MPs. et al., 2015). The structures and function information about the Long non-coding RNAs (lncRNAs) is a subclass of non-coding MPs can aid researchers to understand how proteins function in a RNAs with little coding potential whose transcript consists of no moonlighting manner and help in designing proteins with novel less than 200 nucleotides. lncRNAs are implicated in a variety of functions. Min et al. summarized the MPs from the perspective biological processes through diverse functional mechanisms such as of a coupled intracellular signaling pathway (Min et al., 2016). chromatin remodeling, chromatin interactions and functioning as Numerous proteins are localized in more than one compartment in competing endogenous RNAs (Ferre et al., 2016). Specific expres- cells and the aberrant translocation of proteins may cause cancer or sion patterns of lncRNA in cells correspond to certain cell develop- other disorders. Hence, it is necessary to study the localization dy- ment and disorder. Nuclear and cytoplasmic lncRNAs can regulate namic and trans localization activity of MPs. Monaghan et al. gene expression and function in multiple ways, e.g. i) affecting reviewed several MPs with dual mitochondrial and nuclear func- mRNA translation directly, ii) interfering with protein post- tions (Monaghan and Whitmarsh, 2015). It is pointed out that translational modifications to disturb signal transduction, iii) func- the nuclear moonlighting of mitochondrial proteins is part of a tioning as decoys for miRNAs and proteins, iv) acting as miRNA mitochondria-to-nucleus signaling pathway in cells. They also dis- sponges, v) interacting proteins to enhancer regions and vi) encoding cussed various mechanisms commanding the dual localization of small proteins and functional micro peptides, etc. (Cabili et al., proteins and indicated that the nuclear moonlighters perform as a 2015; Ferre et al., 2016; Quinn and Chang, 2016; Zhou et al., 2017; regulating loop to maintain homeostasis in mitochondria. Boukouris Zhu et al., 2016). Many lncRNAs diversely reside in the nucleus and et al. summarized the moonlighting functions of metabolic enzymes play an essential role as modulators for nuclear functions. Some in the nucleus (Boukouris et al., 2016). They proposed a new mech- others are translocated to the cytoplasm to enforce their regulatory anistic connection between metabolic flux and differential expres- roles. Moreover, lncRNAs have a variety of subcellular localization sion of genes, which is implemented via nuclear translocation or patterns, which are not limited to specific nuclear and cytoplasm lo- moonlighting of nuclear metabolic enzymes. This mechanistic link calization but also nonspecific localization in both the nucleus and aids cells in adapting a changing environment in normal and disease cytoplasm (Barabasi and Oltvai, 2004; Buxbaum et al., 2015). For states, such as cancer, and thus has the potential to be explored for the lncRNAs localized in multiple compartments, the intercommuni- the novel therapeutic target. cation can modulate the interaction pattern or expression abun- In parallel to the serendipitous findings of MPs through experi- dance, e.g. regulating the lncRNA abundance in one compartment ments, some computational approaches have been developed to pre- may influence the function of the other cell compartment. Also, in- dict MPs in recent years (Pritykin et al., 2015). Specifically, three appropriate moonlighting may trigger genetic diseases (Abumrad algorithms were proposed for moonlighting protein identification, and Lange, 2006; Espinosa-Cantu et al., 2015; Min et al., 2016). MoonGO (Chapple et al., 2015), MPFit (Khan and Kihara, 2016) Hence, it is necessary to study the localization dynamic and expres- and DextMP (Khan et al., 2017), executing statistic, machine sion activity of moonlighting lncRNAs (mlncRNAs) and to investi- learning and text mining techniques, respectively. These studies gate the mechanism of how the mlncRNAs modulate and switch the investigated different aspects of MPs such as conserved sequence functions in the metabolic processes, which is of vital importance domains, structural disorders, protein interaction patterns and net- for cancer therapeutics and will provide tremendous opportunities work topology. MoonGO first identifies overlapping protein clus- for anti-cancer purposes (Du et al., 2013; Liu et al., 2014; Wang ters in the human interactome (Chapple et al., 2015). Then, the et al., 2015; Zhu et al., 2016). clusters are annotated to the Gene Ontology (GO) terms of biologic- We have demonstrated that using clustering algorithms is able to al process. GO terms annotating more than half of a cluster’s pro- group proteins into functional modules allowing the identification teins are assigned to the cluster. Each individual protein then of MPs (Chapple et al., 2015; Khan et al., 2014; Pritykin et al., inherits the annotations of its clusters in addition to its own. 2015). A module corresponds to a functional unit, which is com- Finally, the proteins shared by dissimilar functional clusters are posed of several closely interacted proteins involved in specific tasks identified as MPs. MPFit uses a variety of protein features to address in the cell. Therefore, it is promising to use the functional module the diverse nature of MPs, including functional annotation, protein approach to identify mlncRNAs that exhibit multiple but distinct Identification of moonlighting ncRNAs 3521 functionalities. Our study is focused on the moonlighting of human and the rest share the negligible weight of 0.05. The workflow of lncRNAs, since lncRNAs are pervasively transcribed in the mamma- MoonFinder is explained in more detail in Figure 1. lian genome and several of them play the roles as oncogenic or tumor- suppressor genes in multiple cancers (Ning et al.,2016; Quinn and 2.2 Statistical methods Chang, 2016; Wahlestedt, 2013; Zhu et al.,2016). We first propose a 2.2.1 Module-function and lncRNA–module association novel algorithm MoonFinder to identify mlncRNAs that have mul- The hypergeometric test (HGT) was used to evaluate the statistical tiple but unrelated functions. Then, we characterize the sequence fea- significance of the association between lncRNAs and functional tures and the localization tendency of these mlncRNA candidates. modules. A lncRNA is considered to be interacting with a module After that, we construct an mlncRNA–module network and topo- if the proteins interacting with the lncRNA are significantly logically analyzed the mlncRNAs with regard to the association with enriched in the module, i.e. P-value <0.05, which is shown as cancer, diseases, drug targets and moonlighting proteins. We also pre- follows, dict two cancer lncRNAs exclusively interacted with functional mod- k1 i mi ules from the network. x nx p ¼ 1  (2.1) i¼0 n 2 Material and methods where n is the total number of all proteins in the protein interaction 2.1 The workflow of MoonFinder network, m is the number of proteins analyzed in a module, x is the Moonlighting lncRNAs (mlncRNAs) may perform their distinct number of proteins interacting with a specific lncRNA, and i is functions through interacting with different functional modules and the number of proteins in the module interacting with the lncRNA. module is a cluster of proteins interconnected densely and associated The P-value describes the probability of randomly select no less than with specific tasks in cells. Hence, it is promising to identify k interacted proteins in the module with size m. Similarly, the model mlncRNAs from the aspect of whether they are targeting functional- was also utilized to carry out the functional annotation for the iden- ly unrelated protein modules. MoonFinder integrates several statis- tified modules by functional categories (or GO terms), where in this tical models based on the RNA and protein interactome as well case n is still the total number of gene products in the protein as the protein functional annotations for the identification of moon- lighting macromolecules. The workflow of MoonFinder contains the following six steps, 1. Protein interaction network refinement. Refine the protein inter- action network by filtering out protein pairs sharing no identical cell compartments and only the co-localized interactions are considered for further analysis. 2. Protein module identification. Detect protein modules from the co-localized protein interaction network (using ClusterONE by default). Each of the detected modules is highly interconnected and expected to be implicated in a specific cellular process. 3. Functional annotation of modules. Establish the annotation of the modules with GO terms by performing the functional cat- egory enrichment using hypergeometric test (HGT). The pair- wise association between the modules and the GO terms are constructed. 4. Establish RNA–module interactions. Assess whether the target proteins of an RNA are significantly overrepresented in a mod- ule. If yes, we define the RNA functionally regulates the module. 5. Construct similarity matrix of modules. All pairwise functional similarities are pre-calculated by a given semantic similarity measurement using a BMA combination approach and eventual- ly a similarity matrix is produced. 6. Moonlighting determination. Calculate the principal compo- nents (PCs) and their contribution weights of the similarity ma- trix of the modules targeted by an RNA using principal component analysis (PCA). An RNA is determined as moon- lighting if more than one PC play an indispensable role (weight Fig. 1. Schematic diagram of MoonFinder. The orange, green and blue boxes > 0.1) and none of these PCs play a dominant role (weight > represent the input, generated, and output data, respectively. The clustering 0.7). The weight can be customized. The higher the value, the algorithm and statistical models are shown in the dotted boxes. The inputs more moonlighting candidates can be identified. are the four datasets, i.e. interactions of proteins, localization annotation of proteins, function annotation of proteins, and interactions between ncRNAs For instance, we say an RNA is moonlighting if the weights of the and proteins, respectively. The output is a list of predicted moonlighting top three PCs of the module similarity matrix are 0.4, 0.3 and 0.2, lncRNA candidates. The associations between modules and GO terms (or respectively, because three PCs (more than one) are substantial. On ncRNAs) are established using Hypergeometric Test. The similarity among the other hand, an RNA is multifunctional but not moonlighting if modules is quantified using semantic similarity. PCA is used to obtain the la- the weight of the first PC is 0.85, the weight of the second PC is 0.1, tent features of the input ncRNAs. Please refer to Section 2.1 for more details 3522 L.Cheng and K.-S. Leung interaction network, but m represents the module size and x repre- less than 0.1, so we also define k  0:1 in Eqt 2.4 and accordingly sents the term size. i is the number of gene products annotated in the the number of latent features with key contribution is term and involved in the module. The P-value shows the probability of randomly choosing no less than k proteins annotated by a GO n ¼ sgnðk  k Þ (2.6) i cutoff i¼1 term for a given module. Monte Carlo simulation was adopted to model the probability of 1; x  0 interacting mlncRNAs and MPs. N pairs of mlncRNAs and MPs where sgn is defined as sgnðÞ x ¼ and k ¼ 0:1. cutoff 0; x < 0 were randomly extracted from the lncRNA–protein interaction Consequently, the number of latent functions for an RNA is network and then we calculated the interacting pair number, T . The P-value is the ratio of the number of simulated interactions (T ) l ¼ minðk; nÞ (2.7) that is larger than the number of practical interactions (T), which is An RNA is determined as moonlighting RNA if l is larger than one. mathematically defined as: Namely, the RNA has more than one latent feature in terms of the sgnðT  TÞ i¼1 interacting functional modules. p ¼ (2.2) 1; x  0 3 Results where sgn is defined as sgnðÞ x ¼ . 0; x < 0 3.1 Experimental and parametric setups of MoonFinder To gain the statistical significance, all comparisons in this study Moonlighting lncRNAs (mlncRNAs) are assumed to execute multiple between two lists of genes or gene products were analyzed using distinct functions through interactions with proteins that are localized Wilcoxon Ranksum Test (WRT, R-3.4.1). in different cell compartments, so we identify mlncRNAs from the as- pect of whether they are targeting multiple but functionally unrelated 2.2.2 Decomposition of the functional similarity matrix protein modules in a co-localized protein interactome. The human Principal component analysis (PCA) is a statistical procedure used to protein interactome was first refined and only the co-localized interac- reduce the number of features used to represent data. We accom- tions were maintained for module identification, since proteins closely plish the reduction by projecting data from a higher dimension to a interacted with each other in a module are more likely to reside in the lower dimensional manifold such that the error incurred by recon- same cell compartment. Eventually, a compartment-specific protein structing the data in the higher dimension is minimized (Ma and interaction network with 210 410 interactions among 10 111 proteins Dai, 2011; Ma and Kosorok, 2009). Mathematically, we want to was constructed. After that, a total of 765 functional modules were p q map the features x 2 R to x ~ 2 R where q < p. Here, eigenvalue identified using ClusterONE, an algorithm that can detect overlap- decomposition of the similarity matrix was used to calculate the ping clusters of proteins highly connected inside but sparse outside principal components (PCs) and their weights for the modules inter- (Nepusz et al.,2012). We choose ClusterONE because not only it can acted by each lncRNA. According to the Spectral Theorem, detect biological relevant clusters that can be appropriately mapped to modules, but also its ability to softly identify the overlapping mod- Av~ ¼ k v~ (2.3) i i i ules considering the network topological structure. we can calculate the eigenvalues k þ þ k of the Similarity ma- 1 m Functional enrichment analysis was employed to annotate each trix (arranged in decreasing order) and accordingly the trace of the identified module with specific and significant function categories. matrix is We applied the hypergeometric test (HGT) to obtain the enrichment P-values and the FDR adjusted P-values of 0.01 were eventually T ¼ k þ þ k (2.4) 1 m used as the threshold. To establish the RNA–module interactions, Only the lncRNAs whose similarity matrix can be decomposed into similarly, we used HGT to assess whether there are significant con- PCs without a dominant role were selected as the candidates, such nections between the lncRNAs and the functional modules. The tar- as the sum of the weights of the first k PCs is less than a threshold s get proteins of a lncRNA are significantly overrepresented in a (s ¼ 0:7 by default) as follows, module if the FDR adjusted enrichment P-values were less than a given threshold of 0.01. Accordingly, we obtained a bipartite net- argmax < s (2.5) work with 2726 interactions among 538 lncRNA and 106 protein k¼0;1;2... i¼1 modules. The function similarity matrices were calculated using the semantic similarities among the modules of each lncRNA using five The threshold s is a user-defined parameter. The higher the value, semantic similarity measurements, i.e. Resnik, Lin, Jiang and the more mlncRNA can be identified. Here, we set s ¼ 0:7because Conrath, Schlicker and Wang (see Supplementary Section S4). Only a PC will play a dominant role with a weight distribution coeffi- the intersection of the five sets of mlncRNAs identified using the five cient less than 0.5. The weight distribution coefficient is defined measures was determined as the candidates (in total 155), because as wdc ¼ 1  , which captures the distribution of the one measure may outperform the others in different expression or weightsofPCs foralncRNA.If the weights are extremely interaction scenarios. Importantly, we utilized eigenvalue decompos- similar then the distribution coefficient is close to 1, while if ition to calculate the number of latent features of each lncRNA. The some weights play a dominant role the distribution coefficient lncRNAs whose module similarity matrix can be decomposed into approaches 0. Then the maximum k is the number of latent features. The eigen- the principal components without a dominant role (more than one value or the weight of a latent feature has minor influence when it is latent feature) were selected as the candidates. The workflow of Identification of moonlighting ncRNAs 3523 MoonFinder is described in more detail in the Materials and as 28% when using Lin’s measure, while the ratios increase to about methods Section 2.2 as well as in Figure 1. 60% when the target modules are randomly picked. Here, we take ANCR as an example to illustrate its moonlight- ing functions. ANCR is an anti-differentiation lncRNA that is 3.2 Overview of the mlncRNA candidates required to enforce the undifferentiated state in somatic progenitor As shown in the Venn diagram of Figure 2A, among the 1284 populations of epidermis. Using MoonFinder, we observed ANCR lncRNAs with interacted proteins (background lncRNAs), 538 mainly interacts with three functional modules with closely linked lncRNAs (flncRNAs) are annotated to at least one functional mod- proteins inside and no proteins are shared between any two mod- ule and eventually 155 out of them are determined as moonlighting ules, as shown in Figure 2D. Specifically, one module was enriched lncRNAs (mlncRNAs), whose target modules are functionally unre- in a variety of metabolic processes, such as estrogen, retinal and ster- lated. The mlncRNAs were identified using five semantic similarity oid, etc., whereas another module was enriched in functions like tis- (SS) measures to quantify the functional similarity between modules. sue morphogenesis and development. Signaling pathways for To obtain more reliable results, only lncRNAs in the intersection of enforcing intracellular receptors like toll-like receptor 5 and 10 were the sets of identified mlncRNAs using the five different measures highly represented for the other module (Fig. 2E–G). Consequently, were defined as mlncRNAs (Fig. 2B). These identified mlncRNA ANCR was determined as a mlncRNA who shows its functional di- candidates are displayed in Supplementary Table S1, including the versity via regulating protein modules taking part in distinct bio- number of interacting modules and the number of latent features for logical processes and it would serve as a highly reliable candidate of different SS measurements. The modules targeted by an identical moonlighting lncRNA. lncRNA have a higher probability of sharing the same functions than the randomly picked modules. Not surprisingly, significantly 3.3 Sequence features of mlncRNAs more mlncRNAs are detected when simulating the randomly To investigate whether the mlncRNAs form a distinct group of selected modules as the target modules. Specifically, Figure 2C lncRNAs, we analyzed the sequences of the corresponding non- shows that around 40% of the lncRNAs with target module can be coding genes to detect common features such as biotype, gene identified as mlncRNAs using either SS measures, the ratio is as low length, transcript length, transcript number, exon length and exon Fig. 2. Overview of the identified mlncRNAs. (A) Venn diagram of the lncRNAs. (B) Venn diagram of the mlncRNAs identified using five semantic similarity measures. (C) The ratio of real and randomly identified mlncRNAs. (D) An example of mlncRNA ANCR.(E–G) Gene Ontology functional enrichment of the three modules regulated by ANCR 3524 L.Cheng and K.-S. Leung number, as well as the evolutionary conservation and expression The ratios of flncRNAs are lower than mlncRNAs but still higher correlation with orthologous genomes. The candidates were com- than that of the background lncRNAs. Similarly, mlncRNAs have pared with other two groups of lncRNAs, i.e. the entire set of the largest proportion of expression-conserved ones and the ratio lncRNAs with protein targets (background lncRNAs) and the func- of flncRNAs is second to it. Consequently, the evolution and ex- tional module related lncRNAs (flncNRAs), which interacts with pression patterns of mlncRNAs are more conserved than those of functionally related or unrelated modules (Fig. 2A). Hence, it is im- the other lncRNAs, suggesting that the lncRNAs moonlighting in portant to identify the unique characteristics of mlncRNAs that the the cells may play more important biological roles. Besides, RNA other types of lncRNAs do not exhibit. Although no significant dif- species were officially grouped into several biotypes by their tran- ferences of gene length were detected for distinct categories of scriptional direction and exosome sensitivity (Hon et al.,2017). lncRNAs (Fig. 3A), the candidate mlncRNAs have a significantly Here we also examined the relationship between the functional longer transcript than the other types of lncRNAs. On average, the categories and biotypes, but no significant correlation was detected transcript length is about 19 500 compared with about 11 000 for (Supplementary Fig. S1A). the flncRNAs (WRT P-value ¼ 1.27e-3; Fig. 3B). But the number of transcripts in these lncRNA categories are similar to each other, 3.4 Subcellular localization features only a marginal significance was tested between the mlncRNAs and Next, we aimed to understand how the mlncRNAs behave relative the background lncRNAs (WRT P-value ¼ 4.7e-2; Fig. 3C). to the other lncRNAs in terms of subcellular localization. To investi- Importantly, we observed that the number of exon of mlncRNAs gate the spatial distribution of lncRNAs at a subcellular level, we is much larger than the regular lncRNAs (10 versus 1, WRT applied Relative Concentration Index (RCI) (Mas-Ponte et al., P-value ¼ 2.2-e16; Fig. 3D), but these exons are significantly shorter 2017), a ratio of a transcript’s concentration between two cellular than the other categories of lncRNAs (WRT P-value ¼ 3.4e-6 and compartments, to measure the localization tendency of non-coding P-value ¼ 8.3e-9; Fig. 3E), probably owing to the transcripts of RNAs. Essentially, RCI is the log2 transformed ratio of FPKM (frag- mlncRNAs are on average much longer than that of the regular ments per kilobase per million mapped) in two compartments like lncRNAs. In short, mlncRNAs have short but more exons, which is cytoplasm and nucleus. First, we calculated the cytoplasmic-nuclear a potential sequence feature for lncRNAs to moonlight in between RCI to measure the relative concentration of a lncRNA between multiple biological functions. Additionally, Tilgner et al. observed the cytoplasm and the nucleus in 15 cell lines. Figure 4A illustrates that the splicing efficiency of an exon is negatively correlated with the RCI distributions of mlncRNAs, flncRNAs, and background the length of the downstream intron, supposedly owing to the RNA- lncRNAs. It is apparent that mlncRNAs tend to have higher RCI seq reads spanning the junction between exon and intron can only values compared to the other two categories of lncRNAs in almost be detected when transcription occurs, which is a necessary condi- all these cell lines except SK.N.DZ, SK.MEL.5 and K562, suggesting tion for splicing to be carried out (Tilgner et al., 2012). So, we also a more variable localization pattern of mlncRNAs with much higher explored if there is a difference in the intron length or splicing effi- chance of residing in the cytoplasmic compartment in comparison to ciency compare to background lncRNAs and flncRNAs. As shown the other lncRNAs. Then, we further investigated the localization of in Figure 3F, unfortunately, no significant difference in the average mlncRNAs at the sub-compartment level, since LncATLAS also pro- intron length was observed among the three lncRNA categories. vides information about enrichment in the cytoplasmic and nuclear Phylogenetic conservation and expression correlation are strong sub-compartments of the K562 cells. As shown in Figure 4B, the evidence for inferring functions of coding or non-coding genes sub-nuclear RCI values of mlncRNAs are higher than that of the (Cheng et al., 2016a,b; Park et al., 2014). As lncRNAs are crucial other two groups of lncRNAs while the sub-cytoplasmic RCI values in biological processes if they are evolutionarily conserved or are relatively small. Namely, in the nucleus, the mlncRNAs are expression-correlated across species, we checked whether the prone to appear in the sub-compartments of nucleoplasm, nucleolus mlncRNAs tend to be conserved among orthologous genomes and and chromatin, whereas in the cytoplasm, the mlncRNAs are not whose expression patterns are highly correlated in orthologs. As likely to reside in insoluble and membrane relative to the other shown in Figure 3G, the mlncRNAs are prone to be evolutionarily con- lncRNAs. That is why the cytoplasmic-nuclear RCI values of served (>12%) than flncRNAs and background lncRNAs (<10%). mlncRNAs are almost the same as the background lncRNAs in the K562 cell line (Fig. 4A). Meanwhile, we also calculated the expres- sion value distribution of lncRNAs in each sub-compartment in the K562 cell line. More importantly, the expression values of the mlncRNAs are significantly higher in all the sub-compartments, re- vealing that the expression abundance of lncRNAs is crucial in exe- cuting the part-time functions (Fig. 4C). In addition, we also used another RNA localization resource RNALocate, which contains manually-curated localizations classifi- cations, to investigate the localization tendency of mlncRNAs. In this database, the lncRNAs were collected and annotated to differ- ent cell compartments, e.g. nucleus, cytosol and cytoplasm. We cal- culated the ratio of lncRNAs in these compartments separately for each category of lncRNAs. We found that mlncRNAs tend to appear Fig. 3. The distributions of lncRNAs in different functional groups with regard in more than one compartment and localize in the cytoplasmic com- to distinct sequence features. (A) Gene length. (B) Transcript length. (C) partments such as cytosol and cytoplasm (Fig. 4D–G). The ratio of Transcript number. (D) Exon number. (E) Exon length. (F) Intron length. (G) multilocation mlncRNAs is as high as 0.35, which is much higher Evolutionary conservation and expression correlation. Outliers are not than that of flncRNAs and the background lncRNAs (about 0.3 shown. The bars from left to right correspond to lncRNA, flncRNA, and mlncRNA, respectively and 0.26, respectively; Fig. 4D). More importantly, mlncRNAs were Identification of moonlighting ncRNAs 3525 Fig. 4. The mlncRNA localization and expression features. (A) The boxes from top to bottom correspond to lncRNA, flncRNA, and mlncRNA. (B) Sub-compartment expression value distribution of lncRNAs in the K562 cell line. (C) Sub-compartment RCI distribution of lncRNAs in the K562 cell line. (D) The ratios of different groups of lncRNAs residing in multiple locations. (E–G) The ratios of different groups of lncRNAs residing in the nucleus, the cytosol, and the cytoplasm, respect- ively. RCI, Relative Concentration Index found to be enriched in cytosol and cytoplasm with the ratios of process, gene expression and amniotic stem cell differentiation 0.55 and 0.3 (Fig. 4F, G), respectively, whereas the ratio is compar- (Fig. 5B). able to the other categories of lncRNAs in the nucleus (Fig. 4E). To determine whether the moonlighting of lncRNAs is impli- Consequently, we can draw the same conclusion that mlncRNAs cated in the formation and development of cancers and other dis- have a higher chance of residing in the cytoplasmic compartment eases, we associated the mlncRNAs with public available cancer and and therefore have a more variable spatial distribution. disease lncRNAs as well as cancer proteins and drug target proteins (see Materials and methods). Around 13% of the mlncRNA candi- dates have been studied involved in cancer processes, while the 3.5 Topological features of mlncRNAs and its roles in ratios of flncRNAs and background lncRNAs are less than 10% cancers (Fig. 5C). When considering other complex diseases, not surprising- lncRNA functions through its interacting partners. Accumulating ly, the ratio is as high as 23% for mlncRNAs, which is still much studies show that the multi-functionality of lncRNAs as interacting higher than that the other categories (16 and 17%, respectively; hubs with other molecules such as proteins, DNAs and RNAs. Fig. 5D). From the perspective of regulated functional modules, Apparently, the candidate mlncRNAs connect a significantly larger about 39% of the candidates are included in modules significantly number of proteins and modules than the other lncRNAs according enriching cancer proteins, whereas the ratios decrease to about 29 to the identification methodology. On average, the number of part- and 33% for the other two groups (Fig. 5E). Similarly, for the mod- ner proteins is 36.1 for mlncRNA while less than 20 for the others ules enriched of drug targets, they are more likely to interact with (WRT P-value ¼ 7.4e-8; Supplementary Fig. S1B). The number of mlncRNAs than the flncRNAs and the background lncRNAs (12% the interacted module is around 6.8 for mlncRNA compared with versus 7% versus 8%; Fig. 5F). Besides, we can draw the same con- 5 for the other lncRNAs (WRT P-value ¼ 8.8e-14; Supplementary clusion when concentrating on the proportion of cancer modules or Fig. S1C). To illustrate the combinatorial regulation and give a sys- tematic description, we constructed an mlncRNA–module regula- drug target modules regulated by the mlncRNAs (Supplementary Fig. S1D and E). Therefore, these results demonstrate that the tory network, in which the edges link the candidate mlncRNAs mlncRNAs exercise a great influence on cancer metastasis and pro- to their corresponding functional modules. This network contains gression through pairwise interactions and clustered modules of pro- in total 1055 predicted regulatory interactions between 155 mlncRNAs and 83 modules (Fig. 5A). Some modules connect more teins in the regulatory network. candidate mlncRNAs than others, indicating that they might be More importantly, we observed that the mlncRNAs and moon- engaged in a larger number of moonlighting regulations. In particu- lighting proteins (MPs) tend to be mutually exclusive in terms of lar, the largest module in the center of the network shows the high- interactions and interacted partners. Only five MPs directly inter- acts with the mlncRNAs, whereas on average the value is as high est degree of 70, suggesting that it is subjected to the regulations of 70 mlncRNAs. The proteins in this module were found to be mainly as 18.9 when randomly selected the same number of MPs (Monte implicated in biological processes such as nucleic acid metabolic Carlo P-value ¼ 1e-04, HGT P-value ¼ 5.4e-18, Fig. 5G). Also, we 3526 L.Cheng and K.-S. Leung Fig. 5. Association between mlncRNAs and diseases. (A) The mlncRNA–module regulation network. RNAs are represented in squares while modules are in circles. The bars from left to right correspond to lncRNA, flncRNA, and mlncRNA. The circle size corresponds to the module size. (B) Gene Ontology function enrichment of the module with the largest number of regulated mlncRNAs. (C, D) The ratio of cancer and disease lncRNAs among the three lncRNA categories. (E, F) The ratio of lncRNAs associated with cancer and drug target module. (G) mlncRNAs tend not to interact with MPs. T, the number of practical mlncRNA-MP interactions. (H) mlncRNAs and MPs tend to share less interacting partners. T, the number of common partners practically interacted by mlncRNAs and MPs. (I) Summary of the top ten mlncRNAs with the highest OR scores simulated the interacted partners of both mlncRNAs and MPs 10 4 Discussion 000 times and found the number of common partners between In this study, we introduced a computational framework them is 691 on average, which is significantly higher than the MoonFinder to systematically identify moonlighting lncRNAs practical value of 529 (Monte Carlo P-value ¼ 0, HGT (mlncRNAs) based on the integrated lncRNA and protein inter- P-value ¼ 2.1e-86, Fig. 5H). In other words, the number of action network as well as the protein functional annotations. In total common partner proteins that the moonlighting lncRNAs and 155 lncRNAs were determined as candidates with multiple but dis- proteins shared is significantly less than that of randomly tinct functions. Also, we characterized them from various aspects of selected ones. These results indicate the possible mechanism sequence features, evolutionary conservation, expression correl- that the cells make full use of the macromolecules to efficiently ation, expression abundance, localization tendency and interaction and systematically perform cellular tasks avoiding the redundant patterns, which will facilitate our further understanding of their implementations. functions and the mechanism of moonlighting. Additionally, from the mlncRNA–module network in Remarkably, we observed that the non-coding genes that tran- Figure 5A, we found the mlncRNAs that exclusively interact func- script mlncRNAs tend to have shorter but more exons, which is a tional modules tend to be cancer-related. Accordingly, we intro- potential sequence feature for lncRNAs to moonlight in between duced a score, Interactor Share Rate (ISR), to measure how likely multiple biological functions. Also, we found the evolution and ex- the interactors of a given lncRNA are shared by the other pression patterns of mlncRNAs are more conserved than the other lncRNAs (see Supplementary Section S11). We found that the lncRNAs, which in contrast with the conventional knowledge that cancer mlncRNAs have significantly higher ISRs than that of lncRNAs are generally less conserved than mRNAs and proteins the others (WRT P-value ¼ 3.2e-03, Supplementary Fig. S1F). For (Hon et al., 2017; Park et al., 2014), suggesting that mlncRNAs are the mlncRNAs with the top ten highest ISR scores, six out of central for the homeostasis maintenance of human. them are cancer lncRNAs (Fig. 5I). When strengthening the More importantly, we found that mlncRNAs have a more thresholdto0.5,six outofeight (75%)ofthe mlncRNAs are can- variable localization pattern with a high chance of residing in the cer genes and the other two, ANCR and LRRC75A-AS1,could be cytoplasmic compartment, although they display high expression considered as the candidates of cancer mlncRNA, where the dys- function or inappropriate switching of these RNAs in different across all the cell compartments. mlncRNAs are expressed signifi- cantly higher in all the sub-compartments of the K562 cell lines in cell compartments may result in the biological activity of cancer, comparison with the other lncRNAs, suggesting that the high ex- although further experimental works are needed to warrant this claim. pression abundance is necessary for executing the part-time Identification of moonlighting ncRNAs 3527 functions. We studied the localization tendency and translocation importance for cancer therapeutics and will provide tremendous activity of these mlncRNAs because lncRNAs are diversely resided opportunities for anti-cancer strategies. The moonlighting feature of in the cells and play a crucial role as modulators to regulate gene ex- the other types of RNAs, such as miRNA and circRNA (Chen, pression in multiple ways (Cabili et al., 2015; Ferre et al., 2016; 2016a,b), will also be studied and compared and eventually a moon- Quinn and Chang, 2016; Zhou et al., 2017; Zhu et al., 2016). It is lighting atlas of both RNAs and proteins will be provided. Similar to well known that the competing endogenous RNAs (ceRNAs) can animal cells, plant cells are heavily compartmentalized. Besides the protect the coding RNAs from repression by binding them to the co- typical organelles in animal cells, plant cells usually contain a large targeted miRNAs. mlncRNAs may function diversely in the same number of plastids and a central vacuole, which is critical in storing way by competing or shared miRNAs, which mainly reside in the organelles and involved in the regulation of cellular osmoregulation cytoplasm to execute their functions, leading to a more variable spa- (Gao et al., 2015; Zeng et al.,2015; Zhuang et al., 2017). tial distribution (Rashid et al., 2016). In the future, we will investi- Hence, we will explore and establish methodologies to identify gate whether the mlncRNAs act as ceRNAs to modulate their moonlighting ncRNAs and proteins for in-depth functional studies expression abundance and interaction pattern, e.g. regulating the in plants. abundance of lncRNAs in one compartment may influence the func- tion of the other cell compartment. Our result also shows that mlncRNAs and MPs are rather mutu- Funding ally exclusive in terms of their direct interactions and interacting This work was supported by The Chinese University of Hong Kong Direct partners. In other words, lncRNAs and proteins with moonlighting Grant; and the Research Grants Council of Hong Kong GRF Grant [414413]. functions are not likely to interact with each other and they even Conflict of Interest: none declared. tend to share fewer neighbors in the regulatory network. The reason might be that the macromolecules in cells are usually organized to be efficient to perform different cellular tasks without redundancy. References According to the mlncRNA–module bipartite network, we also pre- Abumrad,N.A. and Lange,A.J. (2006) The metabolism of cancer cells: moon- dicted eight cancer lncRNAs and six out of them were previously lighting proteins and growth control. Curr. Opin. Clin. Nutrition Metab. identified as cancer lncRNAs by different experimental assays. Care, 9, 337–338. A potential limitation is that the low value of semantic similarity Barabasi,A.L. and Oltvai,Z.N. (2004) Network biology: understanding the between a pair of modules might due to the incompleteness of GO. cell’s functional organization. Nat. Rev. Genet., 5, 101–113. For instance, one lncRNA is detected as a moonlighting candidate Boukouris,A.E. et al. (2016) Metabolic enzymes moonlighting in the nucleus: and it regulates some modules with low semantic similarity, which metabolic regulation of gene transcription. Trends Biochem. Sci., 41, may partially due to the proteins involved in these modules show no 712–730. evidence of association with GO terms, but the association might Buxbaum,A.R. et al. (2015) In the right place at the right time: visualizing have not been established yet. It is noted that the absence of annota- and understanding mRNA localization. Nat. Rev. Mol. Cell Biol., 16, tion does not imply the absence of function, because the information 95–109. of GO is still far from complete, although the creation of new anno- Cabili,M.N. et al. (2015) Localization and abundance analysis of human tations increased dramatically. Another potential bias is that the GO lncRNAs at single-cell and single-molecule resolution. Genome Biol., 16, is structured as a graph with unbalance of high and low levels of 20. Chapple,C.E. et al. (2015) Extreme multifunctional proteins identified from a details among branches (Pritykin et al., 2015). This has a strong im- human protein interaction network. Nat. Commun., 6, 7412. plication on measuring the semantic similarity of GO terms. For in- Chen,L.L. (2016a) The biogenesis and emerging roles of circular RNAs. Nat. stance, sibling GO terms can be semantically very similar or Rev. Mol. Cell Biol., 17, 205–211. extremely distinct in different branches of the GO structure. In this Chen,L.L. (2016b) Linking long noncoding RNA localization and function. study, we leveraged semantic similarity among the GO terms to Trends Biochem. Sci., 41, 761–772. identify moonlighting lncRNAs. Since similar processes may be cate- Cheng,L. et al. (2016a) CrossNorm: a novel normalization strategy for micro- gorized in distinct branches in GO, low values of semantic similarity array data in cancers. Sci. Rep., 6, 18898. do not always imply multifunctionality and hence our findings may Cheng,L. et al. (2016b) ICN: a normalization method for gene expression data include some false positive discoveries. Pritykin et al. proposed an considering the over-expression of informative genes. Mol. Biosyst., 12, approach to identify multifunctional genes based on GO, which 3057–3066. defined a multifunctional gene as it is annotated with at least ‘two Du,Z. et al. (2013) Integrative genomic analyses reveal clinically relevant sufficiently distinct functional terms of comparable specificity’ long noncoding RNAs in human cancer. Nat. Struct. Mol. Biol., 20, (Pritykin et al., 2015). The definition of multifunctionality might 908–913. Espinosa-Cantu, A. et al. (2015) Gene duplication and the evolution of moon- help mitigate the problem of false positive discovery in detecting lighting proteins. Front. Genet., 6, 227. moonlighting lncRNAs. In our future work, we will integrate Ferre,F. et al. (2016) Revealing protein–lncRNA interaction. Brief. this co-occurrence based definition of multifunctionality into Bioinform., 17, 106–116. MoonFinder. Gao,C. et al. (2015) Dual roles of an Arabidopsis ESCRT component FREE1 We believe our observations can aid our and other research in regulating vacuolar protein transport and autophagic degradation. Proc. groups to understand how they function in a moonlighting manner Natl. Acad. Sci. USA, 112, 1886–1891. and help in designing RNAs with novel functions and applications. Hon,C.C. et al. (2017) An atlas of human long non-coding RNAs with accur- Moreover, investigating the mechanisms that determine the func- ate 5’ ends. Nature, 543, 199–204. tional diversity of mlncRNAs has the potential to provide new Jeffery,C.J. (2015) Why study moonlighting proteins? Front. Genet., 6, 211. insights into their biogenesis, physical interaction, subcellular Khan,I. et al. (2014) Genome-scale identification and characterization of localization and therapeutic application in diseases. In the future, moonlighting proteins. Biol. Direct, 9, 30. we will investigate the mechanism of how the mlncRNAs modulate Khan,I.K. et al. (2017) DextMP: deep dive into text for predicting moonlight- and switch the functions in metabolic processes, which is of vital ing proteins. Bioinformatics, 33, i83–i91. 3528 L.Cheng and K.-S. Leung Khan,I.K. and Kihara,D. (2016) Genome-scale prediction of moonlighting proteins Piatigorsky,J. and Wistow,G.J. (1989) Enzyme/crystallins: gene sharing as an using diverse protein association information. Bioinformatics, 32, 2281–2288. evolutionary strategy. Cell, 57, 197–199. Liao,Q. et al. (2011) Large-scale prediction of long non-coding RNA functions Pritykin,Y. et al. (2015) Genome-wide detection and analysis of multifunc- in a coding-non-coding gene co-expression network. Nucleic Acids Res., 39, tional genes. PLoS Comput. Biol., 11, e1004467. 3864–3878. Quinn,J.J. and Chang,H.Y. (2016) Unique features of long non-coding RNA Liu,W. et al. (2014) Gene co-expression analysis identifies common modules biogenesis and function. Nat. Rev. Genet., 17, 47–62. related to prognosis and drug resistance in cancer cell lines. Int. J. Cancer, Rashid,F. et al. (2016) Long non-coding RNAs in the cytoplasm. Genomics 135, 2795–2803. Proteomics Bioinform., 14, 73–80. Ma,S. and Dai,Y. (2011) Principal component analysis based methods in bio- Sriram,G. et al. (2005) Single-gene disorders: what role could moonlighting informatics studies. Brief. Bioinform., 12, 714–722. enzymes play? Am. J. Hum. Genet., 76, 911–924. Ma,S. and Kosorok,M.R. (2009) Identification of differential gene pathways Tilgner,H. et al. (2012) Deep sequencing of subcellular RNA fractions shows with principal component analysis. Bioinformatics, 25, 882–889. splicing to be predominantly co-transcriptional in the human genome but in- Mani,M. et al. (2015) MoonProt: a database for proteins that are known to efficient for lncRNAs. Genome Res., 22, 1616–1625. moonlight. Nucleic Acids Res., 43, D277–D282. Wahlestedt,C. (2013) Targeting long non-coding RNA to therapeutically Mas-Ponte,D. et al. (2017) LncATLAS database for subcellular localization of upregulate gene expression. Nat. Rev. Drug Discov., 12, 433–446. long noncoding RNAs. RNA, 23, 1080–1087. Wang,P. et al. (2015) Identification of lncRNA-associated competing triplets Min,K.W. et al. (2016) Moonlighting proteins in cancer. Cancer Lett., 370, reveals global patterns and prognostic markers for cancer. Nucleic Acids 108–116. Res., 43, 3478–3489. Monaghan,R.M. and Whitmarsh,A.J. (2015) Mitochondrial proteins moon- Zeng,Y. et al. (2015) Unique COPII component AtSar1a/AtSec23a pair is lighting in the nucleus. Trends Biochem. Sci., 40, 728–735. required for the distinct function of protein ER export in Arabidopsis thali- Nepusz,T. et al. (2012) Detecting overlapping protein complexes in protein– ana. Proc. Natl. Acad. Sci. USA, 112, 14360–14365. protein interaction networks. Nat. Methods, 9, 471–472. Zhou,J. et al. (2017) LncFunNet: an integrated computational framework for Ning,S. et al. (2016) Lnc2Cancer: a manually curated database of experimen- identification of functional long noncoding RNAs in mouse skeletal muscle tally supported lncRNAs associated with various human cancers. Nucleic cells. Nucleic Acids Res., 45, e108. Acids Res., 44, D980–D985. Zhu,X. et al. (2016) A long non-coding RNA signature to improve prognosis Park,C. et al. (2014) lncRNAtor: a comprehensive resource for functional prediction of gastric cancer. Mol. Cancer, 15, 60. investigation of long non-coding RNAs. Bioinformatics, 30, 2480–2485. Zhuang,X. et al. (2017) ATG9 regulates autophagosome progression from the Piatigorsky,J. et al. (1988) Gene sharing by delta-crystallin and argininosucci- endoplasmic reticulum in Arabidopsis. Proc. Natl. Acad. Sci. USA, 114, nate lyase. Proc. Natl. Acad. Sci. USA, 85, 3479–3483. E426–E435. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Identification and characterization of moonlighting long non-coding RNAs based on RNA and protein interactome

Bioinformatics , Volume 34 (20): 10 – May 16, 2018

Loading next page...
 
/lp/ou_press/identification-and-characterization-of-moonlighting-long-non-coding-Wy1a6TGegm
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bty399
Publisher site
See Article on Publisher Site

Abstract

Motivation: Moonlighting proteins are a class of proteins having multiple distinct functions, which play essential roles in a variety of cellular and enzymatic functioning systems. Although there have long been calls for computational algorithms for the identification of moonlighting proteins, research on approaches to identify moonlighting long non-coding RNAs (lncRNAs) has never been undertaken. Here, we introduce a novel methodology, MoonFinder, for the identification of moon- lighting lncRNAs. MoonFinder is a statistical algorithm identifying moonlighting lncRNAs without a priori knowledge through the integration of protein interactome, RNA–protein interactions and functional annotation of proteins. Results: We identify 155 moonlighting lncRNA candidates and uncover that they are a distinct class of lncRNAs characterized by specific sequence and cellular localization features. The non-coding genes that transcript moonlighting lncRNAs tend to have shorter but more exons and the moon- lighting lncRNAs have a variable localization pattern with a high chance of residing in the cytoplas- mic compartment in comparison to the other lncRNAs. Moreover, moonlighting lncRNAs and moonlighting proteins are rather mutually exclusive in terms of both their direct interactions and interacting partners. Our results also shed light on how the moonlighting candidates and their interacting proteins implicated in the formation and development of cancers and other diseases. Availability and implementation: The code implementing MoonFinder is supplied as an R package in the supplementary material. Contact: lxcheng@cse.cuhk.edu.hk or ksleung@cse.cuhk.edu.hk Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction possession of additional roles such as signal transduction, transcrip- Protein moonlighting is a common phenomenon in nature involving tional regulation, apoptosis, motility and structural proteins a protein with a single polypeptide chain that can perform more (Jeffery, 2015). For example, crystallins, a class of well-studied than one independent cellular function (Boukouris et al., 2016; moonlighting proteins, function as enzymes when expressed at low Monaghan and Whitmarsh, 2015). Enzymes, receptors, ion channels levels in many tissues, but are densely packed to form lenses when or chaperones are the typical form of moonlighting proteins (MPs). expressed at high levels in eye tissue (Piatigorsky et al., 1988; Enzyme is the most common form of moonlighting proteins Piatigorsky and Wistow, 1989). The genes encoding crystallins need whose primary function is enzymatic catalysis, but they are also in to sustain functions of both catalytic and transparency maintenance. V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 3519 3520 L.Cheng and K.-S. Leung Another example is glycolysis, an ancient universal metabolic interactions, gene expression, phylogenetic profiles, genetic interac- pathway, in which a high percentage of proteins are moonlighting tions, network-based graph properties and protein structural proper- proteins (Boukouris et al., 2016; Sriram et al., 2005). Moreover, ties (Khan and Kihara, 2016). In general, MPs are assigned in more some proteins work on their moonlighting by being assembled to clusters because they interact with proteins of diverse functions, so supramolecules, such as the ribosome, which usually composed of the number of clusters that a protein involved is used as an omics more than a hundred of proteins and RNAs. However, the studies of feature. For proteins that do not have an available record of certain moonlighting merely concentrated on proteins and the genes coding features, an imputation step using random forest is used to predict them, yet the moonlighting of non-coding RNAs has not been the missing features. Eventually, these features are combined with investigated, despite the fact that ncRNAs have gained widespread machine learning classifiers to make moonlighting protein predic- attention due to their functional importance over the last decade tion. DextMP is a text mining algorithm consisting of four logical (Chen, 2016a,b; Liao et al., 2011; Quinn and Chang, 2016; Zhou steps (Khan et al., 2017). First, it extracts textual information of et al., 2017). proteins from literature and functional description in the UniProt Currently, the information of MPs, such as protein functions, database. Next, it constructs a k-dimensional feature vector from cell localization and primary structures, is scattered across a number each text using three language models, i.e. paragraph vector, Term of publications, since the MPs perform a variety of functions in dif- Frequency-Inverse Document Frequency (TFIDF) in the bag- ferent tissues and cell types. Some researchers have summarized the of-words category and Latent Dirichlet Allocation (LDA) in the literature about MPs from different aspects of the functional diver- topic modeling category. Third, using four machine learning classi- sity, such as regulation circuits or signaling pathways. The Jeffery fiers, a text is classified to either MP or non-MP based on the text lab constructed a manually curated database MoonProt, which con- features. Finally, the text predictions for each protein are separately sists of over 200 MPs that have been experimentally verified (Mani summarized to predict which ones are MPs. et al., 2015). The structures and function information about the Long non-coding RNAs (lncRNAs) is a subclass of non-coding MPs can aid researchers to understand how proteins function in a RNAs with little coding potential whose transcript consists of no moonlighting manner and help in designing proteins with novel less than 200 nucleotides. lncRNAs are implicated in a variety of functions. Min et al. summarized the MPs from the perspective biological processes through diverse functional mechanisms such as of a coupled intracellular signaling pathway (Min et al., 2016). chromatin remodeling, chromatin interactions and functioning as Numerous proteins are localized in more than one compartment in competing endogenous RNAs (Ferre et al., 2016). Specific expres- cells and the aberrant translocation of proteins may cause cancer or sion patterns of lncRNA in cells correspond to certain cell develop- other disorders. Hence, it is necessary to study the localization dy- ment and disorder. Nuclear and cytoplasmic lncRNAs can regulate namic and trans localization activity of MPs. Monaghan et al. gene expression and function in multiple ways, e.g. i) affecting reviewed several MPs with dual mitochondrial and nuclear func- mRNA translation directly, ii) interfering with protein post- tions (Monaghan and Whitmarsh, 2015). It is pointed out that translational modifications to disturb signal transduction, iii) func- the nuclear moonlighting of mitochondrial proteins is part of a tioning as decoys for miRNAs and proteins, iv) acting as miRNA mitochondria-to-nucleus signaling pathway in cells. They also dis- sponges, v) interacting proteins to enhancer regions and vi) encoding cussed various mechanisms commanding the dual localization of small proteins and functional micro peptides, etc. (Cabili et al., proteins and indicated that the nuclear moonlighters perform as a 2015; Ferre et al., 2016; Quinn and Chang, 2016; Zhou et al., 2017; regulating loop to maintain homeostasis in mitochondria. Boukouris Zhu et al., 2016). Many lncRNAs diversely reside in the nucleus and et al. summarized the moonlighting functions of metabolic enzymes play an essential role as modulators for nuclear functions. Some in the nucleus (Boukouris et al., 2016). They proposed a new mech- others are translocated to the cytoplasm to enforce their regulatory anistic connection between metabolic flux and differential expres- roles. Moreover, lncRNAs have a variety of subcellular localization sion of genes, which is implemented via nuclear translocation or patterns, which are not limited to specific nuclear and cytoplasm lo- moonlighting of nuclear metabolic enzymes. This mechanistic link calization but also nonspecific localization in both the nucleus and aids cells in adapting a changing environment in normal and disease cytoplasm (Barabasi and Oltvai, 2004; Buxbaum et al., 2015). For states, such as cancer, and thus has the potential to be explored for the lncRNAs localized in multiple compartments, the intercommuni- the novel therapeutic target. cation can modulate the interaction pattern or expression abun- In parallel to the serendipitous findings of MPs through experi- dance, e.g. regulating the lncRNA abundance in one compartment ments, some computational approaches have been developed to pre- may influence the function of the other cell compartment. Also, in- dict MPs in recent years (Pritykin et al., 2015). Specifically, three appropriate moonlighting may trigger genetic diseases (Abumrad algorithms were proposed for moonlighting protein identification, and Lange, 2006; Espinosa-Cantu et al., 2015; Min et al., 2016). MoonGO (Chapple et al., 2015), MPFit (Khan and Kihara, 2016) Hence, it is necessary to study the localization dynamic and expres- and DextMP (Khan et al., 2017), executing statistic, machine sion activity of moonlighting lncRNAs (mlncRNAs) and to investi- learning and text mining techniques, respectively. These studies gate the mechanism of how the mlncRNAs modulate and switch the investigated different aspects of MPs such as conserved sequence functions in the metabolic processes, which is of vital importance domains, structural disorders, protein interaction patterns and net- for cancer therapeutics and will provide tremendous opportunities work topology. MoonGO first identifies overlapping protein clus- for anti-cancer purposes (Du et al., 2013; Liu et al., 2014; Wang ters in the human interactome (Chapple et al., 2015). Then, the et al., 2015; Zhu et al., 2016). clusters are annotated to the Gene Ontology (GO) terms of biologic- We have demonstrated that using clustering algorithms is able to al process. GO terms annotating more than half of a cluster’s pro- group proteins into functional modules allowing the identification teins are assigned to the cluster. Each individual protein then of MPs (Chapple et al., 2015; Khan et al., 2014; Pritykin et al., inherits the annotations of its clusters in addition to its own. 2015). A module corresponds to a functional unit, which is com- Finally, the proteins shared by dissimilar functional clusters are posed of several closely interacted proteins involved in specific tasks identified as MPs. MPFit uses a variety of protein features to address in the cell. Therefore, it is promising to use the functional module the diverse nature of MPs, including functional annotation, protein approach to identify mlncRNAs that exhibit multiple but distinct Identification of moonlighting ncRNAs 3521 functionalities. Our study is focused on the moonlighting of human and the rest share the negligible weight of 0.05. The workflow of lncRNAs, since lncRNAs are pervasively transcribed in the mamma- MoonFinder is explained in more detail in Figure 1. lian genome and several of them play the roles as oncogenic or tumor- suppressor genes in multiple cancers (Ning et al.,2016; Quinn and 2.2 Statistical methods Chang, 2016; Wahlestedt, 2013; Zhu et al.,2016). We first propose a 2.2.1 Module-function and lncRNA–module association novel algorithm MoonFinder to identify mlncRNAs that have mul- The hypergeometric test (HGT) was used to evaluate the statistical tiple but unrelated functions. Then, we characterize the sequence fea- significance of the association between lncRNAs and functional tures and the localization tendency of these mlncRNA candidates. modules. A lncRNA is considered to be interacting with a module After that, we construct an mlncRNA–module network and topo- if the proteins interacting with the lncRNA are significantly logically analyzed the mlncRNAs with regard to the association with enriched in the module, i.e. P-value <0.05, which is shown as cancer, diseases, drug targets and moonlighting proteins. We also pre- follows, dict two cancer lncRNAs exclusively interacted with functional mod- k1 i mi ules from the network. x nx p ¼ 1  (2.1) i¼0 n 2 Material and methods where n is the total number of all proteins in the protein interaction 2.1 The workflow of MoonFinder network, m is the number of proteins analyzed in a module, x is the Moonlighting lncRNAs (mlncRNAs) may perform their distinct number of proteins interacting with a specific lncRNA, and i is functions through interacting with different functional modules and the number of proteins in the module interacting with the lncRNA. module is a cluster of proteins interconnected densely and associated The P-value describes the probability of randomly select no less than with specific tasks in cells. Hence, it is promising to identify k interacted proteins in the module with size m. Similarly, the model mlncRNAs from the aspect of whether they are targeting functional- was also utilized to carry out the functional annotation for the iden- ly unrelated protein modules. MoonFinder integrates several statis- tified modules by functional categories (or GO terms), where in this tical models based on the RNA and protein interactome as well case n is still the total number of gene products in the protein as the protein functional annotations for the identification of moon- lighting macromolecules. The workflow of MoonFinder contains the following six steps, 1. Protein interaction network refinement. Refine the protein inter- action network by filtering out protein pairs sharing no identical cell compartments and only the co-localized interactions are considered for further analysis. 2. Protein module identification. Detect protein modules from the co-localized protein interaction network (using ClusterONE by default). Each of the detected modules is highly interconnected and expected to be implicated in a specific cellular process. 3. Functional annotation of modules. Establish the annotation of the modules with GO terms by performing the functional cat- egory enrichment using hypergeometric test (HGT). The pair- wise association between the modules and the GO terms are constructed. 4. Establish RNA–module interactions. Assess whether the target proteins of an RNA are significantly overrepresented in a mod- ule. If yes, we define the RNA functionally regulates the module. 5. Construct similarity matrix of modules. All pairwise functional similarities are pre-calculated by a given semantic similarity measurement using a BMA combination approach and eventual- ly a similarity matrix is produced. 6. Moonlighting determination. Calculate the principal compo- nents (PCs) and their contribution weights of the similarity ma- trix of the modules targeted by an RNA using principal component analysis (PCA). An RNA is determined as moon- lighting if more than one PC play an indispensable role (weight Fig. 1. Schematic diagram of MoonFinder. The orange, green and blue boxes > 0.1) and none of these PCs play a dominant role (weight > represent the input, generated, and output data, respectively. The clustering 0.7). The weight can be customized. The higher the value, the algorithm and statistical models are shown in the dotted boxes. The inputs more moonlighting candidates can be identified. are the four datasets, i.e. interactions of proteins, localization annotation of proteins, function annotation of proteins, and interactions between ncRNAs For instance, we say an RNA is moonlighting if the weights of the and proteins, respectively. The output is a list of predicted moonlighting top three PCs of the module similarity matrix are 0.4, 0.3 and 0.2, lncRNA candidates. The associations between modules and GO terms (or respectively, because three PCs (more than one) are substantial. On ncRNAs) are established using Hypergeometric Test. The similarity among the other hand, an RNA is multifunctional but not moonlighting if modules is quantified using semantic similarity. PCA is used to obtain the la- the weight of the first PC is 0.85, the weight of the second PC is 0.1, tent features of the input ncRNAs. Please refer to Section 2.1 for more details 3522 L.Cheng and K.-S. Leung interaction network, but m represents the module size and x repre- less than 0.1, so we also define k  0:1 in Eqt 2.4 and accordingly sents the term size. i is the number of gene products annotated in the the number of latent features with key contribution is term and involved in the module. The P-value shows the probability of randomly choosing no less than k proteins annotated by a GO n ¼ sgnðk  k Þ (2.6) i cutoff i¼1 term for a given module. Monte Carlo simulation was adopted to model the probability of 1; x  0 interacting mlncRNAs and MPs. N pairs of mlncRNAs and MPs where sgn is defined as sgnðÞ x ¼ and k ¼ 0:1. cutoff 0; x < 0 were randomly extracted from the lncRNA–protein interaction Consequently, the number of latent functions for an RNA is network and then we calculated the interacting pair number, T . The P-value is the ratio of the number of simulated interactions (T ) l ¼ minðk; nÞ (2.7) that is larger than the number of practical interactions (T), which is An RNA is determined as moonlighting RNA if l is larger than one. mathematically defined as: Namely, the RNA has more than one latent feature in terms of the sgnðT  TÞ i¼1 interacting functional modules. p ¼ (2.2) 1; x  0 3 Results where sgn is defined as sgnðÞ x ¼ . 0; x < 0 3.1 Experimental and parametric setups of MoonFinder To gain the statistical significance, all comparisons in this study Moonlighting lncRNAs (mlncRNAs) are assumed to execute multiple between two lists of genes or gene products were analyzed using distinct functions through interactions with proteins that are localized Wilcoxon Ranksum Test (WRT, R-3.4.1). in different cell compartments, so we identify mlncRNAs from the as- pect of whether they are targeting multiple but functionally unrelated 2.2.2 Decomposition of the functional similarity matrix protein modules in a co-localized protein interactome. The human Principal component analysis (PCA) is a statistical procedure used to protein interactome was first refined and only the co-localized interac- reduce the number of features used to represent data. We accom- tions were maintained for module identification, since proteins closely plish the reduction by projecting data from a higher dimension to a interacted with each other in a module are more likely to reside in the lower dimensional manifold such that the error incurred by recon- same cell compartment. Eventually, a compartment-specific protein structing the data in the higher dimension is minimized (Ma and interaction network with 210 410 interactions among 10 111 proteins Dai, 2011; Ma and Kosorok, 2009). Mathematically, we want to was constructed. After that, a total of 765 functional modules were p q map the features x 2 R to x ~ 2 R where q < p. Here, eigenvalue identified using ClusterONE, an algorithm that can detect overlap- decomposition of the similarity matrix was used to calculate the ping clusters of proteins highly connected inside but sparse outside principal components (PCs) and their weights for the modules inter- (Nepusz et al.,2012). We choose ClusterONE because not only it can acted by each lncRNA. According to the Spectral Theorem, detect biological relevant clusters that can be appropriately mapped to modules, but also its ability to softly identify the overlapping mod- Av~ ¼ k v~ (2.3) i i i ules considering the network topological structure. we can calculate the eigenvalues k þ þ k of the Similarity ma- 1 m Functional enrichment analysis was employed to annotate each trix (arranged in decreasing order) and accordingly the trace of the identified module with specific and significant function categories. matrix is We applied the hypergeometric test (HGT) to obtain the enrichment P-values and the FDR adjusted P-values of 0.01 were eventually T ¼ k þ þ k (2.4) 1 m used as the threshold. To establish the RNA–module interactions, Only the lncRNAs whose similarity matrix can be decomposed into similarly, we used HGT to assess whether there are significant con- PCs without a dominant role were selected as the candidates, such nections between the lncRNAs and the functional modules. The tar- as the sum of the weights of the first k PCs is less than a threshold s get proteins of a lncRNA are significantly overrepresented in a (s ¼ 0:7 by default) as follows, module if the FDR adjusted enrichment P-values were less than a given threshold of 0.01. Accordingly, we obtained a bipartite net- argmax < s (2.5) work with 2726 interactions among 538 lncRNA and 106 protein k¼0;1;2... i¼1 modules. The function similarity matrices were calculated using the semantic similarities among the modules of each lncRNA using five The threshold s is a user-defined parameter. The higher the value, semantic similarity measurements, i.e. Resnik, Lin, Jiang and the more mlncRNA can be identified. Here, we set s ¼ 0:7because Conrath, Schlicker and Wang (see Supplementary Section S4). Only a PC will play a dominant role with a weight distribution coeffi- the intersection of the five sets of mlncRNAs identified using the five cient less than 0.5. The weight distribution coefficient is defined measures was determined as the candidates (in total 155), because as wdc ¼ 1  , which captures the distribution of the one measure may outperform the others in different expression or weightsofPCs foralncRNA.If the weights are extremely interaction scenarios. Importantly, we utilized eigenvalue decompos- similar then the distribution coefficient is close to 1, while if ition to calculate the number of latent features of each lncRNA. The some weights play a dominant role the distribution coefficient lncRNAs whose module similarity matrix can be decomposed into approaches 0. Then the maximum k is the number of latent features. The eigen- the principal components without a dominant role (more than one value or the weight of a latent feature has minor influence when it is latent feature) were selected as the candidates. The workflow of Identification of moonlighting ncRNAs 3523 MoonFinder is described in more detail in the Materials and as 28% when using Lin’s measure, while the ratios increase to about methods Section 2.2 as well as in Figure 1. 60% when the target modules are randomly picked. Here, we take ANCR as an example to illustrate its moonlight- ing functions. ANCR is an anti-differentiation lncRNA that is 3.2 Overview of the mlncRNA candidates required to enforce the undifferentiated state in somatic progenitor As shown in the Venn diagram of Figure 2A, among the 1284 populations of epidermis. Using MoonFinder, we observed ANCR lncRNAs with interacted proteins (background lncRNAs), 538 mainly interacts with three functional modules with closely linked lncRNAs (flncRNAs) are annotated to at least one functional mod- proteins inside and no proteins are shared between any two mod- ule and eventually 155 out of them are determined as moonlighting ules, as shown in Figure 2D. Specifically, one module was enriched lncRNAs (mlncRNAs), whose target modules are functionally unre- in a variety of metabolic processes, such as estrogen, retinal and ster- lated. The mlncRNAs were identified using five semantic similarity oid, etc., whereas another module was enriched in functions like tis- (SS) measures to quantify the functional similarity between modules. sue morphogenesis and development. Signaling pathways for To obtain more reliable results, only lncRNAs in the intersection of enforcing intracellular receptors like toll-like receptor 5 and 10 were the sets of identified mlncRNAs using the five different measures highly represented for the other module (Fig. 2E–G). Consequently, were defined as mlncRNAs (Fig. 2B). These identified mlncRNA ANCR was determined as a mlncRNA who shows its functional di- candidates are displayed in Supplementary Table S1, including the versity via regulating protein modules taking part in distinct bio- number of interacting modules and the number of latent features for logical processes and it would serve as a highly reliable candidate of different SS measurements. The modules targeted by an identical moonlighting lncRNA. lncRNA have a higher probability of sharing the same functions than the randomly picked modules. Not surprisingly, significantly 3.3 Sequence features of mlncRNAs more mlncRNAs are detected when simulating the randomly To investigate whether the mlncRNAs form a distinct group of selected modules as the target modules. Specifically, Figure 2C lncRNAs, we analyzed the sequences of the corresponding non- shows that around 40% of the lncRNAs with target module can be coding genes to detect common features such as biotype, gene identified as mlncRNAs using either SS measures, the ratio is as low length, transcript length, transcript number, exon length and exon Fig. 2. Overview of the identified mlncRNAs. (A) Venn diagram of the lncRNAs. (B) Venn diagram of the mlncRNAs identified using five semantic similarity measures. (C) The ratio of real and randomly identified mlncRNAs. (D) An example of mlncRNA ANCR.(E–G) Gene Ontology functional enrichment of the three modules regulated by ANCR 3524 L.Cheng and K.-S. Leung number, as well as the evolutionary conservation and expression The ratios of flncRNAs are lower than mlncRNAs but still higher correlation with orthologous genomes. The candidates were com- than that of the background lncRNAs. Similarly, mlncRNAs have pared with other two groups of lncRNAs, i.e. the entire set of the largest proportion of expression-conserved ones and the ratio lncRNAs with protein targets (background lncRNAs) and the func- of flncRNAs is second to it. Consequently, the evolution and ex- tional module related lncRNAs (flncNRAs), which interacts with pression patterns of mlncRNAs are more conserved than those of functionally related or unrelated modules (Fig. 2A). Hence, it is im- the other lncRNAs, suggesting that the lncRNAs moonlighting in portant to identify the unique characteristics of mlncRNAs that the the cells may play more important biological roles. Besides, RNA other types of lncRNAs do not exhibit. Although no significant dif- species were officially grouped into several biotypes by their tran- ferences of gene length were detected for distinct categories of scriptional direction and exosome sensitivity (Hon et al.,2017). lncRNAs (Fig. 3A), the candidate mlncRNAs have a significantly Here we also examined the relationship between the functional longer transcript than the other types of lncRNAs. On average, the categories and biotypes, but no significant correlation was detected transcript length is about 19 500 compared with about 11 000 for (Supplementary Fig. S1A). the flncRNAs (WRT P-value ¼ 1.27e-3; Fig. 3B). But the number of transcripts in these lncRNA categories are similar to each other, 3.4 Subcellular localization features only a marginal significance was tested between the mlncRNAs and Next, we aimed to understand how the mlncRNAs behave relative the background lncRNAs (WRT P-value ¼ 4.7e-2; Fig. 3C). to the other lncRNAs in terms of subcellular localization. To investi- Importantly, we observed that the number of exon of mlncRNAs gate the spatial distribution of lncRNAs at a subcellular level, we is much larger than the regular lncRNAs (10 versus 1, WRT applied Relative Concentration Index (RCI) (Mas-Ponte et al., P-value ¼ 2.2-e16; Fig. 3D), but these exons are significantly shorter 2017), a ratio of a transcript’s concentration between two cellular than the other categories of lncRNAs (WRT P-value ¼ 3.4e-6 and compartments, to measure the localization tendency of non-coding P-value ¼ 8.3e-9; Fig. 3E), probably owing to the transcripts of RNAs. Essentially, RCI is the log2 transformed ratio of FPKM (frag- mlncRNAs are on average much longer than that of the regular ments per kilobase per million mapped) in two compartments like lncRNAs. In short, mlncRNAs have short but more exons, which is cytoplasm and nucleus. First, we calculated the cytoplasmic-nuclear a potential sequence feature for lncRNAs to moonlight in between RCI to measure the relative concentration of a lncRNA between multiple biological functions. Additionally, Tilgner et al. observed the cytoplasm and the nucleus in 15 cell lines. Figure 4A illustrates that the splicing efficiency of an exon is negatively correlated with the RCI distributions of mlncRNAs, flncRNAs, and background the length of the downstream intron, supposedly owing to the RNA- lncRNAs. It is apparent that mlncRNAs tend to have higher RCI seq reads spanning the junction between exon and intron can only values compared to the other two categories of lncRNAs in almost be detected when transcription occurs, which is a necessary condi- all these cell lines except SK.N.DZ, SK.MEL.5 and K562, suggesting tion for splicing to be carried out (Tilgner et al., 2012). So, we also a more variable localization pattern of mlncRNAs with much higher explored if there is a difference in the intron length or splicing effi- chance of residing in the cytoplasmic compartment in comparison to ciency compare to background lncRNAs and flncRNAs. As shown the other lncRNAs. Then, we further investigated the localization of in Figure 3F, unfortunately, no significant difference in the average mlncRNAs at the sub-compartment level, since LncATLAS also pro- intron length was observed among the three lncRNA categories. vides information about enrichment in the cytoplasmic and nuclear Phylogenetic conservation and expression correlation are strong sub-compartments of the K562 cells. As shown in Figure 4B, the evidence for inferring functions of coding or non-coding genes sub-nuclear RCI values of mlncRNAs are higher than that of the (Cheng et al., 2016a,b; Park et al., 2014). As lncRNAs are crucial other two groups of lncRNAs while the sub-cytoplasmic RCI values in biological processes if they are evolutionarily conserved or are relatively small. Namely, in the nucleus, the mlncRNAs are expression-correlated across species, we checked whether the prone to appear in the sub-compartments of nucleoplasm, nucleolus mlncRNAs tend to be conserved among orthologous genomes and and chromatin, whereas in the cytoplasm, the mlncRNAs are not whose expression patterns are highly correlated in orthologs. As likely to reside in insoluble and membrane relative to the other shown in Figure 3G, the mlncRNAs are prone to be evolutionarily con- lncRNAs. That is why the cytoplasmic-nuclear RCI values of served (>12%) than flncRNAs and background lncRNAs (<10%). mlncRNAs are almost the same as the background lncRNAs in the K562 cell line (Fig. 4A). Meanwhile, we also calculated the expres- sion value distribution of lncRNAs in each sub-compartment in the K562 cell line. More importantly, the expression values of the mlncRNAs are significantly higher in all the sub-compartments, re- vealing that the expression abundance of lncRNAs is crucial in exe- cuting the part-time functions (Fig. 4C). In addition, we also used another RNA localization resource RNALocate, which contains manually-curated localizations classifi- cations, to investigate the localization tendency of mlncRNAs. In this database, the lncRNAs were collected and annotated to differ- ent cell compartments, e.g. nucleus, cytosol and cytoplasm. We cal- culated the ratio of lncRNAs in these compartments separately for each category of lncRNAs. We found that mlncRNAs tend to appear Fig. 3. The distributions of lncRNAs in different functional groups with regard in more than one compartment and localize in the cytoplasmic com- to distinct sequence features. (A) Gene length. (B) Transcript length. (C) partments such as cytosol and cytoplasm (Fig. 4D–G). The ratio of Transcript number. (D) Exon number. (E) Exon length. (F) Intron length. (G) multilocation mlncRNAs is as high as 0.35, which is much higher Evolutionary conservation and expression correlation. Outliers are not than that of flncRNAs and the background lncRNAs (about 0.3 shown. The bars from left to right correspond to lncRNA, flncRNA, and mlncRNA, respectively and 0.26, respectively; Fig. 4D). More importantly, mlncRNAs were Identification of moonlighting ncRNAs 3525 Fig. 4. The mlncRNA localization and expression features. (A) The boxes from top to bottom correspond to lncRNA, flncRNA, and mlncRNA. (B) Sub-compartment expression value distribution of lncRNAs in the K562 cell line. (C) Sub-compartment RCI distribution of lncRNAs in the K562 cell line. (D) The ratios of different groups of lncRNAs residing in multiple locations. (E–G) The ratios of different groups of lncRNAs residing in the nucleus, the cytosol, and the cytoplasm, respect- ively. RCI, Relative Concentration Index found to be enriched in cytosol and cytoplasm with the ratios of process, gene expression and amniotic stem cell differentiation 0.55 and 0.3 (Fig. 4F, G), respectively, whereas the ratio is compar- (Fig. 5B). able to the other categories of lncRNAs in the nucleus (Fig. 4E). To determine whether the moonlighting of lncRNAs is impli- Consequently, we can draw the same conclusion that mlncRNAs cated in the formation and development of cancers and other dis- have a higher chance of residing in the cytoplasmic compartment eases, we associated the mlncRNAs with public available cancer and and therefore have a more variable spatial distribution. disease lncRNAs as well as cancer proteins and drug target proteins (see Materials and methods). Around 13% of the mlncRNA candi- dates have been studied involved in cancer processes, while the 3.5 Topological features of mlncRNAs and its roles in ratios of flncRNAs and background lncRNAs are less than 10% cancers (Fig. 5C). When considering other complex diseases, not surprising- lncRNA functions through its interacting partners. Accumulating ly, the ratio is as high as 23% for mlncRNAs, which is still much studies show that the multi-functionality of lncRNAs as interacting higher than that the other categories (16 and 17%, respectively; hubs with other molecules such as proteins, DNAs and RNAs. Fig. 5D). From the perspective of regulated functional modules, Apparently, the candidate mlncRNAs connect a significantly larger about 39% of the candidates are included in modules significantly number of proteins and modules than the other lncRNAs according enriching cancer proteins, whereas the ratios decrease to about 29 to the identification methodology. On average, the number of part- and 33% for the other two groups (Fig. 5E). Similarly, for the mod- ner proteins is 36.1 for mlncRNA while less than 20 for the others ules enriched of drug targets, they are more likely to interact with (WRT P-value ¼ 7.4e-8; Supplementary Fig. S1B). The number of mlncRNAs than the flncRNAs and the background lncRNAs (12% the interacted module is around 6.8 for mlncRNA compared with versus 7% versus 8%; Fig. 5F). Besides, we can draw the same con- 5 for the other lncRNAs (WRT P-value ¼ 8.8e-14; Supplementary clusion when concentrating on the proportion of cancer modules or Fig. S1C). To illustrate the combinatorial regulation and give a sys- tematic description, we constructed an mlncRNA–module regula- drug target modules regulated by the mlncRNAs (Supplementary Fig. S1D and E). Therefore, these results demonstrate that the tory network, in which the edges link the candidate mlncRNAs mlncRNAs exercise a great influence on cancer metastasis and pro- to their corresponding functional modules. This network contains gression through pairwise interactions and clustered modules of pro- in total 1055 predicted regulatory interactions between 155 mlncRNAs and 83 modules (Fig. 5A). Some modules connect more teins in the regulatory network. candidate mlncRNAs than others, indicating that they might be More importantly, we observed that the mlncRNAs and moon- engaged in a larger number of moonlighting regulations. In particu- lighting proteins (MPs) tend to be mutually exclusive in terms of lar, the largest module in the center of the network shows the high- interactions and interacted partners. Only five MPs directly inter- acts with the mlncRNAs, whereas on average the value is as high est degree of 70, suggesting that it is subjected to the regulations of 70 mlncRNAs. The proteins in this module were found to be mainly as 18.9 when randomly selected the same number of MPs (Monte implicated in biological processes such as nucleic acid metabolic Carlo P-value ¼ 1e-04, HGT P-value ¼ 5.4e-18, Fig. 5G). Also, we 3526 L.Cheng and K.-S. Leung Fig. 5. Association between mlncRNAs and diseases. (A) The mlncRNA–module regulation network. RNAs are represented in squares while modules are in circles. The bars from left to right correspond to lncRNA, flncRNA, and mlncRNA. The circle size corresponds to the module size. (B) Gene Ontology function enrichment of the module with the largest number of regulated mlncRNAs. (C, D) The ratio of cancer and disease lncRNAs among the three lncRNA categories. (E, F) The ratio of lncRNAs associated with cancer and drug target module. (G) mlncRNAs tend not to interact with MPs. T, the number of practical mlncRNA-MP interactions. (H) mlncRNAs and MPs tend to share less interacting partners. T, the number of common partners practically interacted by mlncRNAs and MPs. (I) Summary of the top ten mlncRNAs with the highest OR scores simulated the interacted partners of both mlncRNAs and MPs 10 4 Discussion 000 times and found the number of common partners between In this study, we introduced a computational framework them is 691 on average, which is significantly higher than the MoonFinder to systematically identify moonlighting lncRNAs practical value of 529 (Monte Carlo P-value ¼ 0, HGT (mlncRNAs) based on the integrated lncRNA and protein inter- P-value ¼ 2.1e-86, Fig. 5H). In other words, the number of action network as well as the protein functional annotations. In total common partner proteins that the moonlighting lncRNAs and 155 lncRNAs were determined as candidates with multiple but dis- proteins shared is significantly less than that of randomly tinct functions. Also, we characterized them from various aspects of selected ones. These results indicate the possible mechanism sequence features, evolutionary conservation, expression correl- that the cells make full use of the macromolecules to efficiently ation, expression abundance, localization tendency and interaction and systematically perform cellular tasks avoiding the redundant patterns, which will facilitate our further understanding of their implementations. functions and the mechanism of moonlighting. Additionally, from the mlncRNA–module network in Remarkably, we observed that the non-coding genes that tran- Figure 5A, we found the mlncRNAs that exclusively interact func- script mlncRNAs tend to have shorter but more exons, which is a tional modules tend to be cancer-related. Accordingly, we intro- potential sequence feature for lncRNAs to moonlight in between duced a score, Interactor Share Rate (ISR), to measure how likely multiple biological functions. Also, we found the evolution and ex- the interactors of a given lncRNA are shared by the other pression patterns of mlncRNAs are more conserved than the other lncRNAs (see Supplementary Section S11). We found that the lncRNAs, which in contrast with the conventional knowledge that cancer mlncRNAs have significantly higher ISRs than that of lncRNAs are generally less conserved than mRNAs and proteins the others (WRT P-value ¼ 3.2e-03, Supplementary Fig. S1F). For (Hon et al., 2017; Park et al., 2014), suggesting that mlncRNAs are the mlncRNAs with the top ten highest ISR scores, six out of central for the homeostasis maintenance of human. them are cancer lncRNAs (Fig. 5I). When strengthening the More importantly, we found that mlncRNAs have a more thresholdto0.5,six outofeight (75%)ofthe mlncRNAs are can- variable localization pattern with a high chance of residing in the cer genes and the other two, ANCR and LRRC75A-AS1,could be cytoplasmic compartment, although they display high expression considered as the candidates of cancer mlncRNA, where the dys- function or inappropriate switching of these RNAs in different across all the cell compartments. mlncRNAs are expressed signifi- cantly higher in all the sub-compartments of the K562 cell lines in cell compartments may result in the biological activity of cancer, comparison with the other lncRNAs, suggesting that the high ex- although further experimental works are needed to warrant this claim. pression abundance is necessary for executing the part-time Identification of moonlighting ncRNAs 3527 functions. We studied the localization tendency and translocation importance for cancer therapeutics and will provide tremendous activity of these mlncRNAs because lncRNAs are diversely resided opportunities for anti-cancer strategies. The moonlighting feature of in the cells and play a crucial role as modulators to regulate gene ex- the other types of RNAs, such as miRNA and circRNA (Chen, pression in multiple ways (Cabili et al., 2015; Ferre et al., 2016; 2016a,b), will also be studied and compared and eventually a moon- Quinn and Chang, 2016; Zhou et al., 2017; Zhu et al., 2016). It is lighting atlas of both RNAs and proteins will be provided. Similar to well known that the competing endogenous RNAs (ceRNAs) can animal cells, plant cells are heavily compartmentalized. Besides the protect the coding RNAs from repression by binding them to the co- typical organelles in animal cells, plant cells usually contain a large targeted miRNAs. mlncRNAs may function diversely in the same number of plastids and a central vacuole, which is critical in storing way by competing or shared miRNAs, which mainly reside in the organelles and involved in the regulation of cellular osmoregulation cytoplasm to execute their functions, leading to a more variable spa- (Gao et al., 2015; Zeng et al.,2015; Zhuang et al., 2017). tial distribution (Rashid et al., 2016). In the future, we will investi- Hence, we will explore and establish methodologies to identify gate whether the mlncRNAs act as ceRNAs to modulate their moonlighting ncRNAs and proteins for in-depth functional studies expression abundance and interaction pattern, e.g. regulating the in plants. abundance of lncRNAs in one compartment may influence the func- tion of the other cell compartment. Our result also shows that mlncRNAs and MPs are rather mutu- Funding ally exclusive in terms of their direct interactions and interacting This work was supported by The Chinese University of Hong Kong Direct partners. In other words, lncRNAs and proteins with moonlighting Grant; and the Research Grants Council of Hong Kong GRF Grant [414413]. functions are not likely to interact with each other and they even Conflict of Interest: none declared. tend to share fewer neighbors in the regulatory network. The reason might be that the macromolecules in cells are usually organized to be efficient to perform different cellular tasks without redundancy. References According to the mlncRNA–module bipartite network, we also pre- Abumrad,N.A. and Lange,A.J. (2006) The metabolism of cancer cells: moon- dicted eight cancer lncRNAs and six out of them were previously lighting proteins and growth control. Curr. Opin. Clin. Nutrition Metab. identified as cancer lncRNAs by different experimental assays. Care, 9, 337–338. A potential limitation is that the low value of semantic similarity Barabasi,A.L. and Oltvai,Z.N. (2004) Network biology: understanding the between a pair of modules might due to the incompleteness of GO. cell’s functional organization. Nat. Rev. Genet., 5, 101–113. For instance, one lncRNA is detected as a moonlighting candidate Boukouris,A.E. et al. (2016) Metabolic enzymes moonlighting in the nucleus: and it regulates some modules with low semantic similarity, which metabolic regulation of gene transcription. Trends Biochem. Sci., 41, may partially due to the proteins involved in these modules show no 712–730. evidence of association with GO terms, but the association might Buxbaum,A.R. et al. (2015) In the right place at the right time: visualizing have not been established yet. It is noted that the absence of annota- and understanding mRNA localization. Nat. Rev. Mol. Cell Biol., 16, tion does not imply the absence of function, because the information 95–109. of GO is still far from complete, although the creation of new anno- Cabili,M.N. et al. (2015) Localization and abundance analysis of human tations increased dramatically. Another potential bias is that the GO lncRNAs at single-cell and single-molecule resolution. Genome Biol., 16, is structured as a graph with unbalance of high and low levels of 20. Chapple,C.E. et al. (2015) Extreme multifunctional proteins identified from a details among branches (Pritykin et al., 2015). This has a strong im- human protein interaction network. Nat. Commun., 6, 7412. plication on measuring the semantic similarity of GO terms. For in- Chen,L.L. (2016a) The biogenesis and emerging roles of circular RNAs. Nat. stance, sibling GO terms can be semantically very similar or Rev. Mol. Cell Biol., 17, 205–211. extremely distinct in different branches of the GO structure. In this Chen,L.L. (2016b) Linking long noncoding RNA localization and function. study, we leveraged semantic similarity among the GO terms to Trends Biochem. Sci., 41, 761–772. identify moonlighting lncRNAs. Since similar processes may be cate- Cheng,L. et al. (2016a) CrossNorm: a novel normalization strategy for micro- gorized in distinct branches in GO, low values of semantic similarity array data in cancers. Sci. Rep., 6, 18898. do not always imply multifunctionality and hence our findings may Cheng,L. et al. (2016b) ICN: a normalization method for gene expression data include some false positive discoveries. Pritykin et al. proposed an considering the over-expression of informative genes. Mol. Biosyst., 12, approach to identify multifunctional genes based on GO, which 3057–3066. defined a multifunctional gene as it is annotated with at least ‘two Du,Z. et al. (2013) Integrative genomic analyses reveal clinically relevant sufficiently distinct functional terms of comparable specificity’ long noncoding RNAs in human cancer. Nat. Struct. Mol. Biol., 20, (Pritykin et al., 2015). The definition of multifunctionality might 908–913. Espinosa-Cantu, A. et al. (2015) Gene duplication and the evolution of moon- help mitigate the problem of false positive discovery in detecting lighting proteins. Front. Genet., 6, 227. moonlighting lncRNAs. In our future work, we will integrate Ferre,F. et al. (2016) Revealing protein–lncRNA interaction. Brief. this co-occurrence based definition of multifunctionality into Bioinform., 17, 106–116. MoonFinder. Gao,C. et al. (2015) Dual roles of an Arabidopsis ESCRT component FREE1 We believe our observations can aid our and other research in regulating vacuolar protein transport and autophagic degradation. Proc. groups to understand how they function in a moonlighting manner Natl. Acad. Sci. USA, 112, 1886–1891. and help in designing RNAs with novel functions and applications. Hon,C.C. et al. (2017) An atlas of human long non-coding RNAs with accur- Moreover, investigating the mechanisms that determine the func- ate 5’ ends. Nature, 543, 199–204. tional diversity of mlncRNAs has the potential to provide new Jeffery,C.J. (2015) Why study moonlighting proteins? Front. Genet., 6, 211. insights into their biogenesis, physical interaction, subcellular Khan,I. et al. (2014) Genome-scale identification and characterization of localization and therapeutic application in diseases. In the future, moonlighting proteins. Biol. Direct, 9, 30. we will investigate the mechanism of how the mlncRNAs modulate Khan,I.K. et al. (2017) DextMP: deep dive into text for predicting moonlight- and switch the functions in metabolic processes, which is of vital ing proteins. Bioinformatics, 33, i83–i91. 3528 L.Cheng and K.-S. Leung Khan,I.K. and Kihara,D. (2016) Genome-scale prediction of moonlighting proteins Piatigorsky,J. and Wistow,G.J. (1989) Enzyme/crystallins: gene sharing as an using diverse protein association information. Bioinformatics, 32, 2281–2288. evolutionary strategy. Cell, 57, 197–199. Liao,Q. et al. (2011) Large-scale prediction of long non-coding RNA functions Pritykin,Y. et al. (2015) Genome-wide detection and analysis of multifunc- in a coding-non-coding gene co-expression network. Nucleic Acids Res., 39, tional genes. PLoS Comput. Biol., 11, e1004467. 3864–3878. Quinn,J.J. and Chang,H.Y. (2016) Unique features of long non-coding RNA Liu,W. et al. (2014) Gene co-expression analysis identifies common modules biogenesis and function. Nat. Rev. Genet., 17, 47–62. related to prognosis and drug resistance in cancer cell lines. Int. J. Cancer, Rashid,F. et al. (2016) Long non-coding RNAs in the cytoplasm. Genomics 135, 2795–2803. Proteomics Bioinform., 14, 73–80. Ma,S. and Dai,Y. (2011) Principal component analysis based methods in bio- Sriram,G. et al. (2005) Single-gene disorders: what role could moonlighting informatics studies. Brief. Bioinform., 12, 714–722. enzymes play? Am. J. Hum. Genet., 76, 911–924. Ma,S. and Kosorok,M.R. (2009) Identification of differential gene pathways Tilgner,H. et al. (2012) Deep sequencing of subcellular RNA fractions shows with principal component analysis. Bioinformatics, 25, 882–889. splicing to be predominantly co-transcriptional in the human genome but in- Mani,M. et al. (2015) MoonProt: a database for proteins that are known to efficient for lncRNAs. Genome Res., 22, 1616–1625. moonlight. Nucleic Acids Res., 43, D277–D282. Wahlestedt,C. (2013) Targeting long non-coding RNA to therapeutically Mas-Ponte,D. et al. (2017) LncATLAS database for subcellular localization of upregulate gene expression. Nat. Rev. Drug Discov., 12, 433–446. long noncoding RNAs. RNA, 23, 1080–1087. Wang,P. et al. (2015) Identification of lncRNA-associated competing triplets Min,K.W. et al. (2016) Moonlighting proteins in cancer. Cancer Lett., 370, reveals global patterns and prognostic markers for cancer. Nucleic Acids 108–116. Res., 43, 3478–3489. Monaghan,R.M. and Whitmarsh,A.J. (2015) Mitochondrial proteins moon- Zeng,Y. et al. (2015) Unique COPII component AtSar1a/AtSec23a pair is lighting in the nucleus. Trends Biochem. Sci., 40, 728–735. required for the distinct function of protein ER export in Arabidopsis thali- Nepusz,T. et al. (2012) Detecting overlapping protein complexes in protein– ana. Proc. Natl. Acad. Sci. USA, 112, 14360–14365. protein interaction networks. Nat. Methods, 9, 471–472. Zhou,J. et al. (2017) LncFunNet: an integrated computational framework for Ning,S. et al. (2016) Lnc2Cancer: a manually curated database of experimen- identification of functional long noncoding RNAs in mouse skeletal muscle tally supported lncRNAs associated with various human cancers. Nucleic cells. Nucleic Acids Res., 45, e108. Acids Res., 44, D980–D985. Zhu,X. et al. (2016) A long non-coding RNA signature to improve prognosis Park,C. et al. (2014) lncRNAtor: a comprehensive resource for functional prediction of gastric cancer. Mol. Cancer, 15, 60. investigation of long non-coding RNAs. Bioinformatics, 30, 2480–2485. Zhuang,X. et al. (2017) ATG9 regulates autophagosome progression from the Piatigorsky,J. et al. (1988) Gene sharing by delta-crystallin and argininosucci- endoplasmic reticulum in Arabidopsis. Proc. Natl. Acad. Sci. USA, 114, nate lyase. Proc. Natl. Acad. Sci. USA, 85, 3479–3483. E426–E435.

Journal

BioinformaticsOxford University Press

Published: May 16, 2018

References