Quantification of non-coding RNA target localization diversity and its application in cancers

Abstract

Subcellular localization is pivotal for RNAs and proteins to implement biological functions. The localization diversity of protein interactions has been studied as a crucial feature of proteins, given that protein–protein interactions take place in various subcellular locations. Nevertheless, the localization diversity of non-coding RNA (ncRNA) target proteins has not been systematically studied, especially its characteristics in cancers. In this study, we present a new algorithm, non-coding RNA target localization coefficient (ncTALENT), to quantify the target localization diversity of ncRNAs based on ncRNA–protein interaction and protein subcellular localization data. ncTALENT can be used to calculate the target localization coefficient of ncRNAs and measure how diversely their targets are distributed among subcellular locations in various scenarios. We focus our study on long non-coding RNAs (lncRNAs), and our observations reveal that target localization diversity is a primary characteristic of lncRNAs in different biotypes. Moreover, we found that lncRNAs in multiple cancers, differentially expressed cancer lncRNAs, and lncRNAs with multiple cancer target proteins are prone to have high target localization diversity. Furthermore, our analysis of gastric cancer indicates that the target localization diversity of lncRNAs is an important feature closely related to clinical prognosis. Overall, we systematically studied the target localization diversity of lncRNAs and uncovered its association with cancer.
Keywords: long non-coding RNAs, RNA–protein interactions, target localization diversity, subcellular localization, gastric cancer

Introduction

Subcellular localization has been demonstrated to be a fundamental regulation mode in cells (Barabasi and Oltvai, 2004; Park et al., 2011; Buxbaum et al., 2015; Zhang et al., 2017; Cheng et al., 2017; Thul et al., 2017). Proteins are assigned to specific locations in a cell, enabling cells to carry out diverse biochemical processes and metabolic activities concurrently. Conversely, erroneous localization of proteins is a major determinant of cellular dysfunction and human diseases. Several examples of proteins in inappropriate locations were presented in our previous work (Cheng et al., 2017). In particular, the proteins targeted by non-coding RNAs (ncRNAs) are located in various cell compartments, where they reside in distinct physiological environments and interact with eligible partners. Hence, studying how ncRNAs regulate the spatial organization of their target proteins is critical for understanding the mechanisms of ncRNA–protein interactions in cells. Accumulating studies have reported that long ncRNAs (lncRNAs) participate in numerous cellular processes, such as post-transcriptional gene regulation through histone modification, or transcriptional gene silencing through chromatin remodeling, resulting in the differential expression of target coding genes (Cabili et al., 2015; Ferre et al., 2016; Quinn and Chang, 2016; Zhu et al., 2016; Zhou et al., 2017). Beyond regulating coding genes, lncRNAs also serve as cancer diagnostic or prognostic markers (Du et al., 2013; Liu et al., 2014; Wang et al., 2015; Zhu et al., 2016).
Growing lines of evidence indicate that the dysregulation of lncRNAs may cause disorders or even mediate oncogenic or tumor-suppressing effects in humans via interactions with other cellular macromolecules such as DNA and protein (Lukong et al., 2008; Licatalosi and Darnell, 2010; Wahlestedt, 2013; Li et al., 2014; Wang et al., 2015; Wu et al., 2015; Ning et al., 2016). The interactions of lncRNAs with proteins are central to a wide range of cellular processes such as transcriptional regulation, protein–protein interaction, complex assembly, or direct subcellular localization (Li et al., 2014; Chen, 2016b; Ferre et al., 2016). Consequently, a growing body of experimentally determined and computationally predicted lncRNA–protein interaction data has become available over the last decade (Liu et al., 2010; Li et al., 2014, 2015; Jiang et al., 2015; Liu and Chen, 2016; Liu and Miao, 2016; Yi et al., 2017). Importantly, a majority of proteins have a highly dynamic localization in human cells and are simultaneously localized in more than one cell compartment (Cheng et al., 2017; Thul et al., 2017). For example, REST, a transcription factor localized in the nucleoplasm and cytosol, acts as a transcriptional repressor, and its defects can cause Huntington disease, colon carcinomas, small-cell lung carcinomas, etc. (Huang and Bao, 2012; UniProt, 2014). Another example is SMARCB1, a nucleoplasm and nucleoli-related protein, which is a core component of the BAF complex engaging in cellular antiviral activities, cell proliferation, and cell differentiation (UniProt, 2014). Both are targets of the lncRNA FGFR1, most of whose target proteins function in different cell compartments and have multiple localizations (Thul et al., 2017; Yi et al., 2017). These examples show that the target localization diversity of a lncRNA is indicative of its functions.
Thus, exploration of the spatial distribution of their target proteins at the subcellular level is essential for understanding the mechanism of ncRNA regulation. Despite the identification of these ncRNA–protein interactions and protein subcellular locations, the target localization diversity of lncRNAs and its association with cancers remain unclear. To address these issues, we propose a novel method, non-coding RNA target localization coefficient (ncTALENT), to quantify the target localization diversity of ncRNAs by integrating ncRNA–protein interactions and protein subcellular localization. Our results reveal that target localization diversity is a key feature of lncRNAs in humans. lncRNAs in multiple cancers, differentially expressed (DE) cancer lncRNAs, and lncRNAs with multiple cancer target proteins are prone to have high localization diversity. Furthermore, our analysis of gastric cancer indicates that the target localization diversity of lncRNAs is an important feature closely related to clinical prognosis.

Results

The computational framework to quantify ncRNA target localization diversity

The target localization coefficient (TLC) was used to measure the localization diversity of ncRNA target proteins in the major cellular compartments. To obtain the TLC, two types of data were required: the associations between ncRNAs and their target proteins, and the localization information of these proteins. In this study, we focused on lncRNAs, non-protein-coding transcripts longer than 200 nucleotides (Wahlestedt, 2013; Hon et al., 2017). A total of 1245 lncRNAs with annotated target proteins were collected from the RAID v2.0 database (Yi et al., 2017). For each of them, as shown in Figure 1, we first constructed a target localization matrix (TLM) representing the associations between proteins and the compartments in which they are localized, and then calculated the TLC based on the matrix and Equation (3).
The TLCs and other detailed information of these lncRNAs, such as the biotype and the target number, are listed in Supplementary Table S1. For instance, H19, a lncRNA that usually promotes cancer cell proliferation, obtains a high TLC of 0.7431; it has 18 target proteins distributed among 10 compartments, including cytosol, Golgi apparatus, mitochondria, plasma membrane, and vesicles.

Figure 1. Calculation of the ncRNA TLC. ncRNA X has six targets with a total of 10 subcellular annotations, which can be represented in the TLM. The columns and rows of the TLM refer to the target proteins and the cell compartments, respectively. For example, TLM4,2 represents the second target protein of X localizing in the compartment of cytosol. Based on the TLM, we can compute the ncRNA TLC using the TLC function in the bottom left panel, where m, n, and k correspond to the compartment number, the target number, and the annotation record number, respectively. A simple example is shown in the right panel accordingly. The numeric data related to different compartments are marked in different colors.

Overview of the lncRNA TLC

lncRNAs are implicated in various biological functions via the interaction with their target proteins.
Figure 2A shows the number of proteins and the number of lncRNA targets in each cell compartment. The target proteins account for approximately one-third of the proteins in each compartment. Following the FANTOM classification scheme (Hon et al., 2017), the lncRNAs were categorized into six biotypes according to their sequence signatures. As shown in Figure 2B, a total of 1245 lncRNAs were analyzed in this study, as the proteins they target have localization information, and 789 of them were assigned to the six biotypes, i.e. antisense, divergent, sense intronic, intergenic, exonic, and pseudogenes. A large number of intergenic, divergent, and exonic lncRNAs are functionally annotated by FANTOM, while the number is smaller for antisense, sense intronic, and pseudogene lncRNAs (≤100).

Figure 2. Summary of lncRNAs and target proteins. (A) Bar chart showing the protein numbers and lncRNA target numbers in each cell compartment. (B) The distribution of lncRNAs in each biotype for all collected lncRNAs. (C) The box plot shows that lncRNAs in distinct biotypes have different TLC distributions.

Interestingly, we found that lncRNAs in distinct biotypes differ in their TLC distributions. The exonic and pseudogene lncRNAs tend to have high TLCs, whereas lncRNAs in the other biotypes show the opposite trend. Specifically, the median TLCs of antisense, divergent, sense intronic, and intergenic lncRNAs are close to zero, much lower than that of the background lncRNAs (Figure 2C).
In contrast, exonic and pseudogene lncRNAs achieve overall higher TLC distributions, with the exonic lncRNA set being the highest; both are significantly higher than the background distribution (P < 0.01, Mann–Whitney U test), indicating an association between the lncRNA expression level and the location diversity. Accumulating evidence suggests that mRNAs have substantially higher expression levels than lncRNAs. In particular, Derrien et al. (2012) observed that the expression of mRNAs tends to correlate more strongly with lncRNAs encoded by the same coding genes than with their intronic lncRNAs.

The lncRNA TLC in cancers

lncRNAs are pervasively transcribed in the mammalian genome, and several of them act as oncogenes or tumor-suppressor genes in multiple cancers (Wahlestedt, 2013; Ning et al., 2016; Quinn and Chang, 2016; Zhu et al., 2016). To determine whether the target localization diversity of lncRNAs is associated with cancers, cancer lncRNAs and proteins encoded by cancer genes of 26 cancers were investigated in the present study. Importantly, we found that cancer lncRNAs generally have high target localization diversity, whereas the regular proteins do not show the same trend. As shown in Figure 3A, each box indicates the TLC distribution of the lncRNAs involved in a specific type of cancer, and the boxes marked in red indicate TLCs that are significantly higher than those of all proteins (P < 0.05, Mann–Whitney U test). Specifically, the median TLCs of all lncRNAs and of the cancer lncRNAs are 0.38 and 0.5, respectively, but the value is as high as 0.6 for lncRNAs of gastric cancer (P < 0.01, Mann–Whitney U test). Remarkably, the TLCs of lncRNAs of all the 26 cancers are significantly higher than that of all the background lncRNAs.
Among these cancer lncRNAs, the DE cancer lncRNAs have significantly higher TLCs than the regular cancer lncRNAs (P < 0.01, Mann–Whitney U test, Figure 3B). Figure 3C shows that the lncRNA TLCs are positively correlated with the number of cancers in which the lncRNAs occur, with an R value of 0.4670 and a P-value of 1.43e–05 for the linear regression fit. Additionally, we assessed whether the number of cancer targets contributes to the localization diversity. As expected, the lncRNA TLCs rise clearly as the number of cancer targets increases (Figure 3D).

Figure 3. TLC distributions of lncRNAs in cancers. (A) Red box indicates that the TLC of the target proteins involved in the corresponding cancer is significantly higher compared with that of all proteins (P < 0.05, Mann–Whitney test). (B) The TLC distributions for different subtypes of cancer lncRNAs. No significant difference was found among subtypes of lncRNAs. (C) The association between the TLC of a lncRNA and the number of cancers in which the lncRNA occurs. The R value and P-value of the linear regression fit are 0.4670 and 1.43e–05, respectively. (D) The TLC distributions of lncRNAs with different numbers of target proteins. Up, upregulated cancer lncRNAs; Down, downregulated cancer lncRNAs; DE, differentially expressed cancer lncRNAs.

The lncRNA TLC in gastric cancer

Next, we explored the association between the expression pattern and spatial distribution of lncRNA target proteins in gastric cancer, since gastric cancer is associated with more lncRNAs than the other cancers, as shown in Figure 3A. Also, gastric cancer is currently the second leading cause of cancer-related deaths worldwide, and the importance of studying how lncRNAs function in gastric cancer is beyond doubt (Torre et al., 2016; Zhu et al., 2016). The RNA-seq data of gastric cancer from TCGA were used for survival analysis and identification of survival-associated (SA) genes. To identify the lncRNAs associated with clinical outcome in gastric cancer, we performed multivariate Cox regression analysis to evaluate the significance of the correlations between lncRNA target expression and overall survival. The detailed risk score or eigengene (EG) value, survival information, and expression of the nine targets of lncRNA FGFR1 are shown in Figure 4. Specifically, Figure 4A shows the sorted EG value distribution of lncRNA FGFR1 based on the expression values of its targets. Figure 4B illustrates the survival time and vital status of the patients, and Figure 4C is the heatmap of the target expression profiles. The dotted line across the three panels represents the median EG value dividing patients into two groups with high and low risk. We identified ~20 lncRNAs in gastric cancer whose expression is remarkably correlated with overall survival, and the top 10 most significant SA lncRNAs are listed in Figure 4D. The lncRNA FGFR1 is an example of a lncRNA that shows a negative correlation between its target expression and overall survival. Figure 4E illustrates the Kaplan–Meier curves for all the 415 patients according to the nine targets of lncRNA FGFR1.
Patients with low EG values show significantly shorter survival times than those with high EG values, indicating that the suppression of lncRNA FGFR1 in human gastric cancer is linked to poor prognosis.

Figure 4. Survival analysis of lncRNA FGFR1 in TCGA gastric cancer. (A) lncRNA EG score distribution. (B) The survival time and vital status of the patients. (C) Heatmap of the target gene expression profiles. Rows represent lncRNA targets, while columns represent patients. In A–C, the dotted line represents the median lncRNA EG score dividing patients into two groups of high and low risk. (D) Summary of 10 SA lncRNAs with the most significant Q-values. (E) Kaplan–Meier curves of two patient groups with higher or lower EG value of FGFR1. Red corresponds to higher EG value and blue corresponds to lower EG value. (F) The box plot shows that DE and SA lncRNAs have higher TLC than the background lncRNAs (Mann–Whitney U test). EG, eigengene; DE, differentially expressed lncRNA; SA, survival-associated lncRNA; Q-value, BH corrected P-value.
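The median split and Kaplan–Meier curves described for Figure 4 can be illustrated with a minimal Python sketch. It assumes the EG values have already been computed (the eigengene calculation itself is not reproduced here), assigns patients whose EG value equals the median to the low group (an assumption, since the text does not say how ties are handled), and uses a hypothetical six-patient cohort rather than the TCGA data. The function names are illustrative and are not part of ncTALENT.

```python
from statistics import median


def split_by_median(eg_values):
    """Split patient IDs into (high, low) groups at the median EG value."""
    cutoff = median(eg_values.values())
    high = [pid for pid, eg in eg_values.items() if eg > cutoff]
    # Ties with the median go to the low group (assumption of this sketch).
    low = [pid for pid, eg in eg_values.items() if eg <= cutoff]
    return high, low


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one death.

    times  -- follow-up time per patient
    events -- True if the patient died at that time, False if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ev in zip(times, events) if ti == t and ev)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical toy cohort: patient -> (EG value, follow-up months, died?)
cohort = {
    "p1": (0.9, 30, False),
    "p2": (0.7, 24, True),
    "p3": (0.6, 36, False),
    "p4": (0.2, 6, True),
    "p5": (0.1, 12, True),
    "p6": (0.3, 18, False),
}
high, low = split_by_median({pid: v[0] for pid, v in cohort.items()})
for name, group in (("high EG", high), ("low EG", low)):
    curve = kaplan_meier([cohort[p][1] for p in group],
                         [cohort[p][2] for p in group])
    print(name, curve)
```

A significance test between the two curves (for example a log-rank test) would be needed to reproduce the comparison in Figure 4E; it is omitted here to keep the sketch short.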
Furthermore, 19 DE lncRNAs were screened out from the gastric RNA-seq data, and 10 of them are also SA lncRNAs. A lncRNA is defined as a DE lncRNA if a significantly large fraction of the genes encoding its target proteins are differentially expressed (see Equation (4)). We found that both the SA and DE lncRNAs are prone to have high TLCs and that their overlap has the highest TLC. As shown in Figure 4F, the median TLCs of DE and SA lncRNAs are 0.58 and 0.59, respectively, both of which are significantly higher than that of the background lncRNAs, and the median is as high as 0.7 for their overlap. Consequently, the proteins targeted by DE and/or SA lncRNAs are prone to appear in diverse cell compartments, revealing that the target localization diversity of lncRNAs is closely related to their expression levels and prognosis in cancers.

Discussion

It is well known that the spatial partitioning of cells enables multiple cellular processes to proceed in parallel (Park et al., 2011; Cabili et al., 2015; Chen, 2016b; Cheng et al., 2017). Also, there is increasing recognition that ncRNA–protein interactions, like protein–protein interactions, are fundamental to cellular processes (Liu et al., 2010; Vidal et al., 2011; Jiang et al., 2015; Li et al., 2015; Ferre et al., 2016; Lievens et al., 2016; Liu and Chen, 2016; Liu and Miao, 2016; Yi et al., 2017). Therefore, we studied the target localization diversity of lncRNAs and uncovered its association with cancer. The target localization diversity of lncRNAs is found to differ among their biotypes. Remarkably, cancer lncRNAs are prone to have high TLCs, while cancer proteins are not. Moreover, DE and SA lncRNAs in gastric cancer tend to have high TLCs. We observed an association between lncRNA expression level and the target location diversity.
The pseudo transcriptions and exonic lncRNAs are prone to have significantly higher TLCs and both of them are from the protein-coding regions of genes with high expression abundance, since several studies demonstrated that mRNAs have substantially higher expression levels than lncRNAs (Derrien et al., 2012; Deveson et al., 2017). Pseudogenes have little potential to produce the same functional product as their parental coding genes, but their sequences are still highly similar (Milligan and Lipovich, 2014). Hence, like exonic lncRNAs, the pseudogene-derived lncRNAs have more chances to interact with their parental genes, which might be the reason why these pseudo transcriptions and exonic lncRNAs tend to have higher target localization diversity. Also, this could be due to the cellular strategy of keeping non-coding gene expression in tune with physiological needs. The lncRNAs with high target localization diversity tend to interact with more target proteins and hence a high-level expression of lncRNAs in human cells is required. Alterations in regulatory lncRNAs may directly influence their target proteins by interfering with protein post-translational modification or indirectly affect the transcripts translating their targets by decoying miRNAs and proteins to modulate their translation, leading to dysfunction of the cellular signaling cascades and downstream pathways. For instance, the lncRNA MALAT1, a well-known prognostic biomarker of lung adenocarcinoma and squamous cell carcinoma, has an extremely high TLC of 0.81 with its target proteins evenly distributed across different subcellular compartments. From the view of cancer development, also, MALAT1 is usually abnormally overexpressed in many other tissues, such as breast, prostate, colon, liver, cervical, and bladder, where it affects the regulation of cell proliferation, cell migration, invasion, and apoptosis (Wahlestedt, 2013; Cabili et al., 2015; Quinn and Chang, 2016). 
Specifically, MALAT1 has been shown to accelerate cell invasion and proliferation in cervical cancer by interacting with several protein partners, e.g. BAX, BCL2, Caspase-8, Caspase-3, and BCL-XL. Furthermore, MALAT1 interacts with the Polycomb 2 protein (Pc2) and activates E2F1 sumoylation to promote cancer progression. RNA-binding proteins such as TDP-43 have also been reported as MALAT1 interaction partners, which contribute substantially to cancer pathogenesis. Overall, lncRNAs may work as generalists in cancers by interacting with multiple proteins involved in distinct cellular compartments and diverse biological functions. Although thousands of lncRNAs have been identified in humans, only a fraction of them have been functionally characterized, because lncRNA sequences are not as conserved as those of regular protein-coding genes and the direct prediction of their functions and localization is a difficult task (Liao et al., 2011; Quinn and Chang, 2016; Chen, 2016b; Zhou et al., 2017). Due to the incomplete information on ncRNA localization and interaction, it is hard to generate a subcellular-specific RNA–protein interaction network, and only around a thousand lncRNAs were investigated in the current study compared to the large number of lncRNAs encoded by the human genome (Buxbaum et al., 2015; Zhang et al., 2017). As an alternative, we measure the localization diversity of the target proteins of ncRNAs, and the interacting ncRNAs and proteins are assumed to be in the same subcellular locations, since the localization annotation of ncRNAs is extremely incomplete. To the best of our knowledge, only two databases have collected the localization annotation of ncRNAs, i.e. RNALocate (Zhang et al., 2017) and LncATLAS (Mas-Ponte et al., 2017), and only a small portion of the lncRNAs we studied can be mapped to cell locations.
Besides, lncRNAs execute functions in cancers through their interactions with macromolecules, such as proteins that scatter in subcellular compartments (Lukong et al., 2008; Licatalosi and Darnell, 2010; Li et al., 2014; Jiang et al., 2015; Ferre et al., 2016), and, therefore, the target localization diversity may indirectly reflect the ncRNA functional diversity. To evaluate the inferred localization of ncRNAs, we adopted Monte Carlo simulation to model the probability of co-localized ncRNA–protein interactions based on the ncRNA localization information from RNALocate. Due to the limited annotation, we only validated the ncRNAs in the nucleus and the cytoplasm. Our finding shows that the interacting ncRNAs and proteins are significantly more likely to reside in the same subcellular locations in comparison to the simulated ncRNA–protein interactions. As shown in Supplementary Figure S3A, the ratios of nucleus-specific interactions and cytoplasm-specific interactions are significantly higher than the ratio of the simulated ones, respectively. Specifically, 1139 out of 1323 interactions (86.75%) are subcellular specific in either nucleus or cytoplasm, which cannot be achieved by chance based on the simulated subcellular-specific interactions (Supplementary Figure S3B). When merely focusing on the subcellular-specific interactions, the TLCs of cancer lncRNAs are still higher than that of the non-cancer lncRNAs (Supplementary Figure S3C), and the difference is large despite not statistically significant due to the small available sample size. Only 17 lncRNAs and their respective target proteins are both annotated to the same locations of the nucleus or the cytoplasm. 
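The co-localization check described above can be sketched as a simple Monte Carlo test in Python. The text does not specify how the simulated interactions were generated, so this sketch assumes one common choice: the protein partners are shuffled among the ncRNAs, and the observed fraction of co-localized interactions is compared with the shuffled fractions. All names and the toy annotations are hypothetical and do not come from RNALocate or RAID.

```python
import random


def colocalized_fraction(pairs, rna_loc, protein_locs):
    """Fraction of (ncRNA, protein) pairs whose protein is annotated
    to the ncRNA's compartment."""
    hits = sum(1 for rna, prot in pairs if rna_loc[rna] in protein_locs[prot])
    return hits / len(pairs)


def monte_carlo_p_value(pairs, rna_loc, protein_locs, n_iter=1000, seed=0):
    """Return (observed fraction, empirical one-sided P-value).

    Null model (assumption of this sketch): protein partners are
    shuffled at random among the ncRNAs.
    """
    rng = random.Random(seed)
    observed = colocalized_fraction(pairs, rna_loc, protein_locs)
    rnas = [rna for rna, _ in pairs]
    proteins = [prot for _, prot in pairs]
    at_least_as_high = 0
    for _ in range(n_iter):
        shuffled = proteins[:]
        rng.shuffle(shuffled)
        simulated = colocalized_fraction(list(zip(rnas, shuffled)),
                                         rna_loc, protein_locs)
        if simulated >= observed:
            at_least_as_high += 1
    # Add-one correction keeps the empirical P-value strictly positive.
    return observed, (at_least_as_high + 1) / (n_iter + 1)


# Hypothetical toy annotations restricted to nucleus and cytoplasm.
rna_loc = {"R1": "nucleus", "R2": "nucleus", "R3": "cytoplasm", "R4": "cytoplasm"}
protein_locs = {"P1": {"nucleus"}, "P2": {"nucleus"},
                "P3": {"cytoplasm"}, "P4": {"cytoplasm"}}
pairs = [("R1", "P1"), ("R2", "P2"), ("R3", "P3"), ("R4", "P4")]

observed, p_value = monte_carlo_p_value(pairs, rna_loc, protein_locs)
print(observed, p_value)
```

Other null models (for example, sampling proteins from the whole annotated proteome) would give different simulated ratios; the shuffle is used here only because it preserves the number of interactions per ncRNA.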
Hopefully, more RNA annotations can be confirmed with the rapid development of RNA localization and interaction techniques (Liu et al., 2010; Liu and Chen, 2016; Liu and Miao, 2016), and we will reanalyze the subcellular-specific localization diversity once more comprehensive ncRNA localization data are available. Meanwhile, we cannot neglect that cancer-associated lncRNAs are usually better studied and characterized than lncRNAs unrelated to cancer, and, therefore, more target proteins have been discovered for cancer-associated lncRNAs, which might be a latent reason for the higher TLCs in cancer. To address this problem, we also assessed whether the proteins targeted by cancer lncRNAs are prone to have high localization coefficients (LCs) in comparison with the other proteins. The LC of each protein was calculated separately. As shown in Supplementary Figure S1, the LCs of the cancer lncRNA target proteins are significantly higher than those of the others (P < 1.03e–02, Mann–Whitney U test), indicating that the target proteins themselves, rather than the target number, contribute substantially to the lncRNA target localization diversity. Furthermore, we also compared the TLC distributions between cancer and non-cancer lncRNAs using another database, NPInter v3.0 (Hao et al., 2016), from which only the interactions determined using high-throughput sequencing technologies, such as CLIP, HITS-CLIP, PAR-CLIP, and RIP, were adopted. This may to some extent reduce the data bias in cancer studies, because we used only high-throughput data in which no lncRNAs are specifically targeted. As shown in Supplementary Figure S2A, a very small proportion of lncRNAs are annotated in both databases. We found that the TLCs of cancer lncRNAs are also significantly higher than those of the non-cancer ones (P < 3.25e–03, Mann–Whitney U test, Supplementary Figure S2B), suggesting that there is little study bias in the calculation of TLC.
Also, it should be noted that an organelle can be further divided into sub-locations. Proteins located in different sub-locations play distinct biological roles, and likewise for ncRNAs. Some researchers have developed algorithms to predict protein sub-subcellular localizations, such as sub-mitochondria locations (Lin et al., 2013a), sub-chloroplast locations (Lin et al., 2013b), and sub-Golgi locations (Ding et al., 2011, 2013). An analysis at the sub-organelle level would undoubtedly provide a deeper understanding of disease development. Thus, we will perform more detailed studies at the sub-organelle level using these tools in the future. In summary, our results elucidate how lncRNAs obtain functional outcomes in cancer through diverse interactions with their protein partners, implying that the localization diversity of target proteins is a major determinant of lncRNA functionality, and shed light on the molecular mechanisms of these cancer-associated transcripts in human cancers. In addition to physical interactions, the quantified target localization diversity can be used to systematically study lncRNA–protein interactions on the basis of transcriptome data including both normal and cancer samples. For instance, we can identify lncRNAs that become either more diverse or less diverse in cancer samples in comparison to the normal ones, which may also facilitate the experimental identification of cancer-related lncRNAs and novel drug targets. To our knowledge, this is the first time that the target localization diversity of lncRNAs has been systematically and comprehensively investigated, especially its characteristics in cancers. Besides, the difference in target localization diversity among ncRNA subtypes may imply functional characteristics and biological mechanisms, so it will be valuable to use ncTALENT to analyze other ncRNA species in depth, such as circRNA (Chen, 2016a; Meng et al., 2017), although our study concentrated only on lncRNAs.
Materials and methods

Datasets

Subcellular localization of proteins

We obtained the experimental localization information of proteins from the Cell Atlas (Thul et al., 2017), a subpart of the Human Protein Atlas (Ponten et al., 2008). Using an antibody-based image profiling approach and transcriptomics data, the Cell Atlas delivers a comprehensive resource for human protein subcellular localization. In the current version, 12003 proteins were annotated to 32 subcellular locations on a single-cell level, and they can be further categorized into 14 major compartments based on the cellular substructures. All the lncRNA target proteins were analyzed according to these major compartments in this study.

lncRNA–protein interactions

The interactions between lncRNAs and proteins were obtained from RAID v2.0 (Yi et al., 2017). It is a comprehensive and high-confidence resource of interactions between RNA and other cell components, which integrates experimentally determined and computationally predicted RNA-associated interactions from 18 data sources, e.g. StarBase (Li et al., 2014) and LncRNA2Target (Jiang et al., 2015). A total of 12008 human RNA–protein interactions and a variety of RNAs are involved, including lncRNA, circRNA, miRNA, pseudogenes, etc.

lncRNA biotypes

According to their sequence features such as transcriptional directionality and exosome sensitivity, lncRNAs can be categorized into six biotypes in the FANTOM classification (Hon et al., 2017), i.e. antisense, divergent, sense intronic, intergenic, exonic, and pseudogenes.

Cancer-associated lncRNAs

The cancer-associated lncRNAs were collected from the Lnc2Cancer database (July 4, 2016, latest version) (Ning et al., 2016), which manually collected and integrated the lncRNA–cancer associations with experimental support from the published literature. The expression patterns of the lncRNAs (upregulated, downregulated, or differential expression) are also provided, referring to their original studies.
Only cancers with five or more lncRNAs were considered in this study, resulting in 284 associations with direct experimental evidence among 74 lncRNAs and 26 human cancers.
Gastric cancer expression data
The RNA sequencing data of stomach adenocarcinoma (gastric cancer) were collected from The Cancer Genome Atlas (https://cancergenome.nih.gov/). This dataset was produced on the poly(A)+ Illumina HiSeq platform and provides gene-level (Level 3) transcription estimates, as log2(x + 1)-transformed RSEM normalized counts, for a total of 450 gastric samples, 415 from cancer and 35 from normal tissue. We also downloaded the clinical matrix of these samples to retrieve phenotype information, such as overall survival, for survival analysis.
The workflow of ncTALENT
The workflow of ncTALENT includes three steps: (i) associating an input ncRNA with target proteins, (ii) constructing an ncRNA TLM, and (iii) calculating the ncRNA TLC. As shown in Figure 1, a given ncRNA X has six targets with a total of 10 subcellular annotations, which can be denoted in a TLM with the rows (i) representing compartments and the columns (j) representing target proteins. k is the total number of compartment annotations recorded for the target proteins of X, given as
$$k=\sum_{i=1}^{m}\sum_{j=1}^{n}\mathrm{TLM}_{i,j} \tag{1}$$
Based on the TLM, we can compute the TLC of X as follows:
$$\mathrm{TLC}(X)=1-\sum_{i=1}^{m}\left(\frac{\sum_{j=1}^{n}\mathrm{TLM}_{i,j}}{k}\right)^{2} \tag{2}$$
where m is the number of cell compartments and n is the number of annotated target proteins. Overall, we can calculate the TLC of a given ncRNA using the following function:
$$\mathrm{TLC}(X)=1-\sum_{i=1}^{m}\left(\frac{\sum_{j=1}^{n}\mathrm{TLM}_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n}\mathrm{TLM}_{i,j}}\right)^{2} \tag{3}$$
A simple example is shown in the right panel of Figure 1. The number of target proteins occurring in each compartment is 1, 2, 1, 4, 0, and 2 for nucleoli, mitochondria, Golgi apparatus, cytosol, microtubules, and vesicles, respectively, and k is their sum, equal to 10. Finally, we obtain a TLC score of 0.74 for the ncRNA using Equation (3).
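The worked example can be reproduced with a few lines of Python. This is a minimal sketch rather than the ncTALENT implementation: the function name `target_localization_coefficient` and the exact placement of the 1s in `example_tlm` are assumptions made for illustration. Only the row sums (1, 2, 1, 4, 0, 2) come from the text, and they are all that Equation (3) depends on.

```python
from typing import Sequence


def target_localization_coefficient(tlm: Sequence[Sequence[int]]) -> float:
    """Compute TLC = 1 - sum_i (row_sum_i / k)^2 for a binary TLM.

    Rows are compartments and columns are target proteins.
    """
    row_sums = [sum(row) for row in tlm]  # targets annotated per compartment
    k = sum(row_sums)                     # total annotation records, Equation (1)
    if k == 0:
        raise ValueError("TLM has no localization annotations")
    return 1.0 - sum((r / k) ** 2 for r in row_sums)


# Hypothetical TLM: six compartments (rows) by six targets (columns).
# The cell-level layout is assumed; the row sums match the worked example.
example_tlm = [
    [1, 0, 0, 0, 0, 0],  # nucleoli: 1
    [0, 1, 1, 0, 0, 0],  # mitochondria: 2
    [0, 0, 0, 1, 0, 0],  # Golgi apparatus: 1
    [1, 1, 0, 1, 1, 0],  # cytosol: 4
    [0, 0, 0, 0, 0, 0],  # microtubules: 0
    [0, 0, 0, 0, 1, 1],  # vesicles: 2
]

tlc = target_localization_coefficient(example_tlm)
print(round(tlc, 2))  # 0.74
```

The coefficient is 0 when every annotation falls in a single compartment and approaches 1 - 1/m when annotations are spread evenly over m compartments, which is why it can serve as a diversity measure.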
Statistical methods
In this study, the Mann–Whitney U test was used to assess the statistical significance of the difference between two vectors of TLCs, because it does not assume a particular distribution of the dependent variable. Student’s t-test was used to identify DE target genes of lncRNAs between cancer and normal samples. Genes with fold change >1 and FDR-corrected P-value <0.01 were considered DE genes (Consortium, 2014; Cheng et al., 2016a, b). If the DE genes coding the target proteins of a lncRNA are significantly overrepresented, we define the lncRNA as a DE lncRNA. A hypergeometric test was used to assess whether the proteins encoded by the DE genes are enriched in a set of target proteins, which is described as follows:
$$p=1-\sum_{i=0}^{k-1}\frac{\binom{x}{i}\binom{n-x}{m-i}}{\binom{n}{m}} \tag{4}$$
where n is the number of all target genes analyzed in the expression matrix, m is the number of DE genes, x is the number of target genes for a given lncRNA, k is the number of DE genes among those target genes, and i is the summation index. The P-value describes the probability of randomly picking k or more DE genes among the x target genes.
Survival-associated gene identification
For each lncRNA, the first principal component of its target genes is calculated and defined as the lncRNA eigengene (EG) (Du et al., 2013; Liu et al., 2014; Jin et al., 2015). We used the EG value as the risk score to assess the prognostic ability of a lncRNA by dichotomizing the patients into high- and low-risk groups based on the median EG value. We performed the survival analysis using the log-rank test and illustrated the results using Kaplan–Meier survival curves. Essentially, the EG value is the most representative gene expression of a cluster of genes. Given the expressions of m genes, $X=(X_1,\ldots,X_m)$, we first construct $\mathrm{Cov}_n(X)$, the m×m sample variance–covariance matrix among n samples.
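The eigengene risk score described in this subsection can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `eigengene`, the toy expression values, and the use of power iteration to find the eigenvector with the largest eigenvalue are all choices made for this example. The sign of a principal component is arbitrary, so which half of the median split corresponds to higher risk has to be settled by the survival analysis itself.

```python
import math
from statistics import median


def eigengene(expr, iterations=500):
    """Return one first-principal-component score per sample.

    expr: list of m genes, each a list of n expression values (one per sample).
    """
    m, n = len(expr), len(expr[0])
    # Center every gene on its mean across samples.
    centered = []
    for gene in expr:
        mean = sum(gene) / n
        centered.append([v - mean for v in gene])
    # m x m sample variance-covariance matrix among the n samples.
    cov = [[sum(a * b for a, b in zip(gi, gj)) / (n - 1) for gj in centered]
           for gi in centered]
    # Power iteration converges to the eigenvector with the largest eigenvalue
    # (assumes the start vector is not orthogonal to that eigenvector).
    vec = [1.0 / (i + 1) for i in range(m)]
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Project each sample onto the first principal component.
    return [sum(vec[g] * centered[g][s] for g in range(m)) for s in range(n)]


# Hypothetical expression of three target genes in six patients.
toy_expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 4.1, 5.9, 8.2, 9.8, 12.1],
    [6.0, 5.2, 3.9, 3.1, 2.0, 0.8],
]
scores = eigengene(toy_expr)
cutoff = median(scores)
high_group = [s > cutoff for s in scores]  # median split into two groups
print(high_group)
```

In a real analysis the two groups would then be compared with a log-rank test and plotted as Kaplan–Meier curves; that step is omitted here because it needs survival times and censoring information.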
Then the eigenvalues and eigenvectors of $\mathrm{Cov}_n(X)$ are calculated and the eigenvectors are sorted by their corresponding eigenvalues (Ma and Dai, 2011). The principal components are the eigenvectors, and the first principal component is the eigenvector with the largest eigenvalue.
Monte Carlo simulation
To evaluate the inferred localization of ncRNAs, Monte Carlo simulation was adopted to model the probability of co-localized ncRNA–protein interactions based on the ncRNA localization information from RNALocate (Zhang et al., 2017). Due to the limited annotation, only the ncRNAs in the nucleus and the cytoplasm were validated. In each run, m pairs of ncRNAs and proteins were randomly selected from the ncRNA–protein interactome data and the percentage of co-localized pairs ($T_i$) was calculated. We performed the experiment n times, and the P-value is the fraction of simulated percentages ($T_i$) that are greater than or equal to the observed one (T), which is mathematically defined as
$$p=\frac{\sum_{i=1}^{n}\operatorname{sgn}(T_i-T)}{n} \tag{5}$$
where $T_i=m_i/m$ and $m_i$ is the number of co-localized ncRNA–protein pairs in the i-th run. m and n were set to 1000 and 10000, respectively, in this study. sgn is defined as
$$\operatorname{sgn}(x)=\begin{cases}1, & x\ge 0\\ 0, & x<0\end{cases}$$
Supplementary material
Supplementary material is available at Journal of Molecular Cell Biology online.
Acknowledgements
We thank Prof. Dong Wang for the thoughtful discussion and the reviewers for their insightful comments on the paper.
Funding
This work was supported by The Chinese University of Hong Kong Direct Grant and the Research Grants Council of Hong Kong GRF Grant (414413).
Conflict of interest
None declared.
References
Barabasi, A.L., and Oltvai, Z.N. (2004). Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 5, 101–113.
Buxbaum, A.R., Haimovich, G., and Singer, R.H. (2015). In the right place at the right time: visualizing and understanding mRNA localization. Nat. Rev. Mol. Cell Biol. 16, 95–109.
Cabili, M.N., Dunagin, M.C., McClanahan, P.D., et al. (2015). Localization and abundance analysis of human lncRNAs at single-cell and single-molecule resolution. Genome Biol. 16, 20.
Chen, L.-L. (2016a). The biogenesis and emerging roles of circular RNAs. Nat. Rev. Mol. Cell Biol. 17, 205–211.
Chen, L.-L. (2016b). Linking long noncoding RNA localization and function. Trends Biochem. Sci. 41, 761–772.
Cheng, L., Fan, K., Huang, Y., et al. (2017). Full characterization of localization diversity in the human protein interactome. J. Proteome Res. 16, 3019–3029.
Cheng, L., Lo, L.Y., Tang, N.L., et al. (2016a). CrossNorm: a novel normalization strategy for microarray data in cancers. Sci. Rep. 6, 18898.
Cheng, L., Wang, X., Wong, P.K., et al. (2016b). ICN: a normalization method for gene expression data considering the over-expression of informative genes. Mol. Biosyst. 12, 3057–3066.
Consortium, S.M.-I. (2014). A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat. Biotechnol. 32, 903–914.
Derrien, T., Johnson, R., Bussotti, G., et al. (2012). The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 22, 1775–1789.
Deveson, I.W., Hardwick, S.A., Mercer, T.R., et al. (2017). The dimensions, dynamics, and relevance of the mammalian noncoding transcriptome. Trends Genet. 33, 464–478.
Ding, H., Guo, S.-H., Deng, E.-Z., et al. (2013). Prediction of Golgi-resident protein types by using feature selection technique. Chemom. Intell. Lab. Syst. 124, 9–13.
Ding, H., Liu, L., Guo, F.B., et al. (2011). Identify Golgi protein types with modified Mahalanobis discriminant algorithm and pseudo amino acid composition. Protein Pept. Lett. 18, 58–63.
Du, Z., Fei, T., Verhaak, R.G., et al. (2013). Integrative genomic analyses reveal clinically relevant long noncoding RNAs in human cancer. Nat. Struct. Mol. Biol. 20, 908–913.
Ferre, F., Colantoni, A., and Helmer-Citterich, M. (2016). Revealing protein-lncRNA interaction. Brief. Bioinform. 17, 106–116.
Hao, Y., Wu, W., Li, H., et al. (2016). NPInter v3.0: an upgraded database of noncoding RNA-associated interactions. Database 2016, baw057.
Hon, C.C., Ramilowski, J.A., Harshbarger, J., et al. (2017). An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204.
Huang, Z., and Bao, S. (2012). Ubiquitination and deubiquitination of REST and its roles in cancers. FEBS Lett. 586, 1602–1605.
Jiang, Q., Wang, J., Wu, X., et al. (2015). LncRNA2Target: a database for differentially expressed genes after lncRNA knockdown or overexpression. Nucleic Acids Res. 43, D193–D196.
Jin, N., Wu, H., Miao, Z., et al. (2015). Network-based survival-associated module biomarker and its crosstalk with cell death genes in ovarian cancer. Sci. Rep. 5, 11566.
Li, J.H., Liu, S., Zhou, H., et al. (2014). starBase v2.0: decoding miRNA–ceRNA, miRNA–ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 42, D92–D97.
Li, Y., Wang, C., Miao, Z., et al. (2015). ViRBase: a resource for virus-host ncRNA-associated interactions. Nucleic Acids Res. 43, D578–D582.
Liao, Q., Liu, C., Yuan, X., et al. (2011). Large-scale prediction of long non-coding RNA functions in a coding–non-coding gene co-expression network. Nucleic Acids Res. 39, 3864–3878.
Licatalosi, D.D., and Darnell, R.B. (2010). RNA processing and its regulation: global insights into biological networks. Nat. Rev. Genet. 11, 75–87.
Lievens, S., Van der Heyden, J., Masschaele, D., et al. (2016). Proteome-scale binary interactomics in human cells. Mol. Cell. Proteomics 15, 3624–3639.
Lin, H., Chen, W., Yuan, L.F., et al. (2013a). Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor. 61, 259–268.
Lin, H., Ding, C., Yuan, L.-F., et al. (2013b). Predicting subchloroplast locations of proteins based on the general form of Chou’s pseudo amino acid composition: approached from optimal tripeptide composition. Int. J. Biomath. 6, 1350003.
Liu, Z.P., and Chen, L. (2016). Prediction and dissection of protein–RNA interactions by molecular descriptors. Curr. Top. Med. Chem. 16, 604–615.
Liu, W., Li, L., and Li, W. (2014). Gene co-expression analysis identifies common modules related to prognosis and drug resistance in cancer cell lines. Int. J. Cancer 135, 2795–2803.
Liu, Z.-P., and Miao, H. (2016). Prediction of protein–RNA interactions using sequence and structure descriptors. Neurocomputing 206, 28–34.
Liu, Z.P., Wu, L.Y., Wang, Y., et al. (2010). Prediction of protein–RNA binding sites by a random forest method with combined features. Bioinformatics 26, 1616–1622.
Lukong, K.E., Chang, K.W., Khandjian, E.W., et al. (2008). RNA-binding proteins in human genetic disease. Trends Genet. 24, 416–425.
Ma, S., and Dai, Y. (2011). Principal component analysis based methods in bioinformatics studies. Brief. Bioinform. 12, 714–722.
Mas-Ponte, D., Carlevaro-Fita, J., Palumbo, E., et al. (2017). LncATLAS database for subcellular localization of long noncoding RNAs. RNA 23, 1080–1087.
Meng, S., Zhou, H., Feng, Z., et al. (2017). CircRNA: functions and properties of a novel potential biomarker for cancer. Mol. Cancer 16, 94.
Milligan, M.J., and Lipovich, L. (2014). Pseudogene-derived lncRNAs: emerging regulators of gene expression. Front. Genet. 5, 476.
Ning, S., Zhang, J., Wang, P., et al. (2016). Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 44, D980–D985.
Park, S., Yang, J.S., Shin, Y.E., et al. (2011). Protein localization as a principal feature of the etiology and comorbidity of genetic diseases. Mol. Syst. Biol. 7, 494.
Ponten, F., Jirstrom, K., and Uhlen, M. (2008). The Human Protein Atlas—a tool for pathology. J. Pathol. 216, 387–393.
Quinn, J.J., and Chang, H.Y. (2016). Unique features of long non-coding RNA biogenesis and function. Nat. Rev. Genet. 17, 47–62.
Thul, P.J., Akesson, L., Wiking, M., et al. (2017). A subcellular map of the human proteome. Science 356, pii: eaal3321.
Torre, L.A., Siegel, R.L., Ward, E.M., et al. (2016). Global cancer incidence and mortality rates and trends—an update. Cancer Epidemiol. Biomarkers Prev. 25, 16–27.
UniProt, C. (2014). Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 42, D191–D198.
Vidal, M., Cusick, M.E., and Barabasi, A.L. (2011). Interactome networks and human disease. Cell 144, 986–998.
Wahlestedt, C. (2013). Targeting long non-coding RNA to therapeutically upregulate gene expression. Nat. Rev. Drug Discov. 12, 433–446.
Wang, P., Ning, S., Zhang, Y., et al. (2015). Identification of lncRNA-associated competing triplets reveals global patterns and prognostic markers for cancer. Nucleic Acids Res. 43, 3478–3489.
Wu, D., Huang, Y., Kang, J., et al. (2015). ncRDeathDB: A comprehensive bioinformatics resource for deciphering network organization of the ncRNA-mediated cell death system. Autophagy 11, 1917–1926.
Yi, Y., Zhao, Y., Li, C., et al. (2017). RAID v2.0: an updated resource of RNA-associated interactions across organisms. Nucleic Acids Res. 45, D115–D118.
Zhang, T., Tan, P., Wang, L., et al. (2017). RNALocate: a resource for RNA subcellular localizations. Nucleic Acids Res. 45, D135–D138.
Zhou, J., Zhang, S., Wang, H., et al. (2017). LncFunNet: an integrated computational framework for identification of functional long noncoding RNAs in mouse skeletal muscle cells. Nucleic Acids Res. 45, e108.
Zhu, X., Tian, X., Yu, C., et al. (2016). A long non-coding RNA signature to improve prognosis prediction of gastric cancer. Mol. Cancer 15, 60.
© The Author (2018). Published by Oxford University Press on behalf of Journal of Molecular Cell Biology, IBCB, SIBS, CAS. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Journal of Molecular Cell Biology, Oxford University Press

Publisher
Oxford University Press
Copyright
© The Author (2018). Published by Oxford University Press on behalf of Journal of Molecular Cell Biology, IBCB, SIBS, CAS. All rights reserved.
ISSN
1674-2788
eISSN
1759-4685
D.O.I.
10.1093/jmcb/mjy006

Abstract

Subcellular localization is pivotal for RNAs and proteins to implement biological functions. The localization diversity of protein interactions has been studied as a crucial feature of proteins, considering that the protein–protein interactions take place in various subcellular locations. Nevertheless, the localization diversity of non-coding RNA (ncRNA) target proteins has not been systematically studied, especially its characteristics in cancers. In this study, we provide a new algorithm, non-coding RNA target localization coefficient (ncTALENT), to quantify the target localization diversity of ncRNAs based on the ncRNA–protein interaction and protein subcellular localization data. ncTALENT can be used to calculate the target localization coefficient of ncRNAs and measure how diversely their targets are distributed among the subcellular locations in various scenarios. We focus our study on long non-coding RNAs (lncRNAs), and our observations reveal that the target localization diversity is a primary characteristic of lncRNAs in different biotypes. Moreover, we found that lncRNAs in multiple cancers, differentially expressed cancer lncRNAs, and lncRNAs with multiple cancer target proteins are prone to have high target localization diversity. Furthermore, the analysis of gastric cancer helps us to obtain a better understanding that the target localization diversity of lncRNAs is an important feature closely related to clinical prognosis. Overall, we systematically studied the target localization diversity of the lncRNAs and uncovered its association with cancer.
Keywords: long non-coding RNAs, RNA–protein interactions, target localization diversity, subcellular localization, gastric cancer
Introduction
Subcellular localization has been demonstrated to be a fundamental regulation mode in cells (Barabasi and Oltvai, 2004; Park et al., 2011; Buxbaum et al., 2015; Zhang et al., 2017; Cheng et al., 2017; Thul et al., 2017).
Proteins are assigned in specific locations of a cell, enabling cells to implement diverse biochemical processes and metabolic activities in a concurrent manner. On the other hand, erroneous localization of proteins is a major determinant of cellular dysfunction and human diseases. Several examples of proteins in inappropriate locations were presented in our previous work (Cheng et al., 2017). In particular, the proteins targeted by non-coding RNAs (ncRNAs) locate in various cell compartments to reside in distinct physiological environments and interact with eligible partners. Hence, studying the dynamics of how ncRNAs regulate the spatial organization of their target proteins is critical to understand the mechanisms of ncRNA–protein interactions in cells. Accumulated studies have reported that long ncRNAs (lncRNAs) participate in numerous cellular processes, such as the post-transcriptional gene regulation through histone modification, or the transcriptional gene silencing through chromatin remodeling, resulting in the differential expression of target coding genes (Cabili et al., 2015; Ferre et al., 2016; Quinn and Chang, 2016; Zhu et al., 2016; Zhou et al., 2017). Not limited to coding genes, moreover, lncRNAs also serve as cancer diagnostic or prognostic markers (Du et al., 2013; Liu et al., 2014; Wang et al., 2015; Zhu et al., 2016). Growing lines of evidence indicate that the dysregulation of lncRNAs may cause disorder or even mediate oncogenic or tumor-suppressing effects in human via the interactions with other cellular macromolecules such as DNA and protein (Lukong et al., 2008; Licatalosi and Darnell, 2010; Wahlestedt, 2013; Li et al., 2014; Wang et al., 2015; Wu et al., 2015; Ning et al., 2016). The lncRNA interactions with proteins are central to a wide range of cellular processes such as transcriptional regulation, protein–protein interaction, complex assembly, or direct subcellular localization (Li et al., 2014; Chen, 2016b; Ferre et al., 2016). 
Consequently, an increasing amount of experimental and computationally predicted lncRNA–protein interaction data has become available during the last decade (Liu et al., 2010; Li et al., 2014, 2015; Jiang et al., 2015; Liu and Chen, 2016; Liu and Miao, 2016; Yi et al., 2017). Importantly, a majority of proteins have a highly dynamic localization in human cells and are simultaneously localized in more than one cell compartment (Cheng et al., 2017; Thul et al., 2017). For example, REST, a transcription factor localized in the nucleoplasm and cytosol, acts as a transcriptional repressor, and its defects can cause Huntington disease, colon carcinomas, small-cell lung carcinomas, etc. (Huang and Bao, 2012; UniProt, 2014). Another example is SMARCB1, a nucleoplasm- and nucleoli-related protein, which is a core component of the BAF complex engaged in cellular antiviral activities, cell proliferation, and cell differentiation (UniProt, 2014). Both are targets of the lncRNA FGFR1, and most of its target proteins function in different compartments of the cell and have multiple localizations (Thul et al., 2017; Yi et al., 2017). These observations show that the target localization diversity of a lncRNA is indicative of its functions. Thus, exploration of the spatial distribution of their target proteins at the subcellular level is essential for understanding the mechanism of ncRNA regulation. Despite the identification of these ncRNA–protein interactions and protein subcellular locations, the target localization diversity of lncRNAs and its association with cancers remain unclear. To address these issues, we propose a novel method, non-coding RNA target localization coefficient (ncTALENT), to quantify the target localization diversity of ncRNAs by integrating the ncRNA–protein interactions and the protein subcellular localization. Our results reveal that target localization diversity is a key feature of lncRNAs in humans.
lncRNAs in multiple cancers, differentially expressed (DE) cancer lncRNAs, and lncRNAs with multiple cancer target proteins are prone to have high localization diversity. Furthermore, the analysis of gastric cancer helps us to better understand that the target localization diversity of lncRNAs is an important feature closely related to clinical prognosis.
Results
The computational framework to quantify ncRNA target localization diversity
The target localization coefficient (TLC) was used to measure the localization diversity of ncRNA target proteins in the major cellular compartments. To obtain the TLC, two types of data were required: the associations between ncRNAs and their target proteins, and the localization information of these proteins. In this study, we focused on lncRNAs, a type of non-protein-coding transcript longer than 200 nucleotides (Wahlestedt, 2013; Hon et al., 2017). A total of 1245 lncRNAs with annotated target proteins were collected from the RAID v2.0 database (Yi et al., 2017). For each of them, as shown in Figure 1, we first constructed a target localization matrix (TLM) representing the associations between proteins and their localized compartments, and then calculated the TLC based on the matrix and Equation (3). The TLCs and other detailed information of these lncRNAs, such as the biotype and the target number, are listed in Supplementary Table S1. For instance, H19, a lncRNA that usually promotes cancer cell proliferation, obtains a high TLC of 0.7431; it has 18 target proteins distributed among 10 compartments, including cytosol, Golgi apparatus, mitochondria, plasma membrane, and vesicles.
Figure 1 Calculation of the ncRNA TLC. ncRNA X has six targets with a total of 10 subcellular annotations, which can be represented in the TLM. The columns and rows of the TLM refer to the target proteins and the cell compartments, respectively. For example, $\mathrm{TLM}_{4,2}$ represents the second target protein of X localizing in the compartment of cytosol. Based on the TLM, we can compute the ncRNA TLC using the TLC function in the bottom left panel, where m, n, and k correspond to the compartment number, the target number, and the annotation record number, respectively. A simple example is shown in the right panel accordingly. The numeric data related to different compartments are marked in different colors.
Overview of the lncRNA TLC
lncRNAs are implicated in various biological functions via the interaction with their target proteins. Figure 2A shows the number of proteins and the number of lncRNA targets in each cell compartment. The target proteins account for approximately one-third of the proteins in each compartment. Referring to the FANTOM classification scheme (Hon et al., 2017), the lncRNAs were categorized into six biotypes according to their sequence signatures. As shown in Figure 2B, a total of 1245 lncRNAs were analyzed in this study, as the proteins they target have localization information, and 789 of them were assigned to the six biotypes, i.e. antisense, divergent, sense intronic, intergenic, exonic, and pseudogenes.
A large number of intergenic, divergent, and exonic lncRNAs are functionally annotated by FANTOM, while the number is smaller for antisense, sense intronic, and pseudogene lncRNAs (≤100).
Figure 2 Summary of lncRNAs and target proteins. (A) Bar chart showing the protein numbers and lncRNA target numbers in each cell compartment. (B) The distribution of lncRNAs in each biotype for all collected lncRNAs. (C) The box plot shows that lncRNAs in distinct biotypes have different TLC distributions.
Interestingly, we found that lncRNAs in distinct biotypes behave differently in TLC distributions. The exonic and pseudogene lncRNAs tend to have high TLCs, whereas lncRNAs in the other biotypes show the opposite trend. Specifically, the median TLCs of antisense, divergent, sense intronic, and intergenic lncRNAs are close to zero, which is much lower than that of the background lncRNAs (Figure 2C). In contrast, exonic and pseudogene lncRNAs achieve overall higher TLC distributions, with the exonic lncRNA set having the highest distribution; both are significantly higher than the background distribution (P < 0.01, Mann–Whitney U test), indicating an association between the lncRNA expression level and the localization diversity. Accumulating evidence suggests that mRNAs have substantially higher expression levels than lncRNAs. In particular, Derrien et al. (2012) observed that mRNAs tend to be more strongly correlated in expression with lncRNAs encoded by the same coding genes than with their intronic lncRNAs.
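The biotype comparisons in this section rest on the Mann–Whitney U test. As a rough illustration of what that test computes, the sketch below ranks two TLC samples jointly and derives a two-sided P-value from the normal approximation. The function name `mann_whitney_u` and the TLC values are hypothetical, the tie correction and continuity correction are omitted, and the normal approximation is crude for samples this small; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used instead.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic of x, approximate two-sided P-value).

    Uses average ranks for ties and the normal approximation without
    tie or continuity correction.
    """
    values = sorted(list(x) + list(y))
    # Assign each distinct value the mean of the ranks it occupies.
    rank = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(rank[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided


# Hypothetical TLC values, for illustration only (not taken from the study).
exonic_tlc = [0.62, 0.71, 0.55, 0.68, 0.74]
background_tlc = [0.10, 0.38, 0.00, 0.45, 0.29]
u_stat, p_value = mann_whitney_u(exonic_tlc, background_tlc)
print(u_stat, p_value)
```

Because the test uses only ranks, it makes no assumption about the shape of the TLC distribution, which matters here since many lncRNAs have a TLC of exactly zero.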
The lncRNA TLC in cancers
lncRNAs are pervasively transcribed in the mammalian genome, and several of them act as oncogenic or tumor-suppressor genes in multiple cancers (Wahlestedt, 2013; Ning et al., 2016; Quinn and Chang, 2016; Zhu et al., 2016). To determine whether the target localization diversity of lncRNAs is associated with cancers, cancer lncRNAs and proteins encoded by cancer genes of 26 cancers were investigated in the present study. Importantly, we uncovered that cancer lncRNAs generally have high target localization diversity, whereas the regular proteins do not show the same trend. As shown in Figure 3A, each box indicates the TLC distribution of the lncRNAs involved in a specific type of cancer, and the boxes marked in red are those whose TLCs are significantly higher than those of all proteins (P < 0.05, Mann–Whitney U test). Specifically, the median TLCs of all lncRNAs and of the cancer lncRNAs are 0.38 and 0.5, respectively, but the value is as high as 0.6 for lncRNAs of gastric cancer (P < 0.01, Mann–Whitney U test). Remarkably, the TLCs of lncRNAs of all 26 cancers are significantly higher than those of the background lncRNAs. Among these cancer lncRNAs, the DE cancer lncRNAs have significantly higher TLCs than the regular cancer lncRNAs (P < 0.01, Mann–Whitney U test, Figure 3B). Figure 3C shows that the lncRNA TLCs are positively correlated with the number of cancers in which the lncRNAs occur, with an R and a P-value of the linear regression model of 0.4670 and 1.43e–05, respectively. Additionally, we assessed whether the number of cancer targets contributes to the localization diversity. In Figure 3D, as expected, the lncRNA TLCs rise clearly as the number of cancer targets increases.
Figure 3 TLC distributions of lncRNAs in cancers. (A) Red box indicates that the TLC of the target proteins involved in the corresponding cancer is significantly higher compared with that of all proteins (P < 0.05, Mann–Whitney test). (B) The TLC distributions for different subtypes of cancer lncRNAs. No significant difference was found among subtypes of lncRNAs. (C) The association between the TLC of a lncRNA and the number of cancers in which the lncRNA occurs. The R and P-value of the linear regression model are 0.4670 and 1.43e–05, respectively. (D) The TLC distributions of lncRNAs with different numbers of target proteins. Up, upregulated cancer lncRNAs; Down, downregulated cancer lncRNAs; DE, differentially expressed cancer lncRNAs.
The lncRNA TLC in gastric cancer
Next, we explored the association between the expression pattern and spatial distribution of lncRNA target proteins in gastric cancer, as gastric cancer is associated with more lncRNAs than the other cancers, as shown in Figure 3A. Also, gastric cancer is currently the second leading cause of cancer-related deaths worldwide, and the importance of studying how lncRNAs function in gastric cancer is beyond doubt (Torre et al., 2016; Zhu et al., 2016).
The RNA-seq data of gastric cancer from TCGA were used for survival analysis and identification of survival-associated (SA) genes. To identify the lncRNAs associated with clinical outcome in gastric cancer, we performed multivariate Cox regression analysis to evaluate the significance of the correlations between lncRNA target expression and overall survival. The risk score, or eigengene (EG) value, the survival information, and the expression of the nine targets of lncRNA FGFR1 are shown in Figure 4. Specifically, Figure 4A shows the sorted EG value distribution of lncRNA FGFR1 based on the expression values of its targets, Figure 4B shows the survival time and vital status of the patients, and Figure 4C is the heatmap of the target expression profiles. The dotted line across the three panels marks the median EG value, which divides the patients into high- and low-risk groups. We identified ~20 lncRNAs in gastric cancer whose expression is markedly correlated with overall survival, and the 10 most significant SA lncRNAs are listed in Figure 4D. The lncRNA FGFR1 is an example of a lncRNA that shows a negative correlation between its target expression and the overall survival. Figure 4E shows the Kaplan–Meier curves for all 415 patients according to the nine targets of lncRNA FGFR1. Patients with a low EG value show significantly shorter survival times than those with a high EG value, indicating that the suppression of lncRNA FGFR1 in human gastric cancer is linked to poor prognosis.

Figure 4. Survival analysis of lncRNA FGFR1 in TCGA gastric cancer. (A) lncRNA EG score distribution. (B) The survival time and vital status of the patients. (C) Heatmap of the target gene expression profiles. Rows represent lncRNA targets, while columns represent patients. In A–C, the dotted line represents the median lncRNA EG score dividing patients into two groups of high and low risk.
(D) Summary of the 10 SA lncRNAs with the most significant Q-values. (E) Kaplan–Meier curves of two patient groups with higher or lower EG value of FGFR1. Red corresponds to higher EG value and blue corresponds to lower EG value. (F) The box plot shows that DE and SA lncRNAs have higher TLC than the background lncRNAs (Mann–Whitney U test). EG, eigengene; DE, differentially expressed lncRNA; SA, survival-associated lncRNA; Q-value, BH corrected P-value.

Furthermore, 19 DE lncRNAs were screened out from the gastric RNA-seq data, and 10 of them are also SA lncRNAs. A lncRNA is defined as a DE lncRNA if a significantly large fraction of the genes encoding its target proteins are differentially expressed (see Equation (4)). We found that both the SA and the DE lncRNAs are prone to have high TLCs and that their overlap has the highest TLC. Specifically, as shown in Figure 4F, the median TLCs of DE and SA lncRNAs are 0.58 and 0.59, respectively, both of which are significantly higher than that of the background lncRNAs, and the median is as high as 0.7 for their overlap.
Consequently, the proteins targeted by DE and/or SA lncRNAs tend to be distributed across diverse cell compartments, revealing that the target localization diversity of lncRNAs is closely related to their expression levels and prognosis in cancers.

Discussion

It is well known that the spatial partition of cells enables multiple cellular processes to proceed in parallel (Park et al., 2011; Cabili et al., 2015; Chen, 2016b; Cheng et al., 2017). It is also increasingly recognized that ncRNA–protein interactions, like protein–protein interactions, are fundamental to cellular processes (Liu et al., 2010; Vidal et al., 2011; Jiang et al., 2015; Li et al., 2015; Ferre et al., 2016; Lievens et al., 2016; Liu and Chen, 2016; Liu and Miao, 2016; Yi et al., 2017). Therefore, we studied the target localization diversity of lncRNAs and uncovered its association with cancer. The target localization diversity of lncRNAs differs among their biotypes. Remarkably, cancer lncRNAs are prone to have high TLCs, while cancer proteins are not. Moreover, DE and SA lncRNAs in gastric cancer tend to have high TLCs.

We observed an association between lncRNA expression level and target localization diversity. The pseudogene-derived and exonic lncRNAs are prone to have significantly higher TLCs, and both originate from the protein-coding regions of genes with high expression abundance, since several studies demonstrated that mRNAs have substantially higher expression levels than lncRNAs (Derrien et al., 2012; Deveson et al., 2017). Pseudogenes have little potential to produce the same functional product as their parental coding genes, but their sequences are still highly similar (Milligan and Lipovich, 2014). Hence, like exonic lncRNAs, the pseudogene-derived lncRNAs have more chances to interact with their parental genes, which might be the reason why these two biotypes tend to have higher target localization diversity.
Also, this could be due to the cellular strategy of keeping non-coding gene expression in tune with physiological needs. The lncRNAs with high target localization diversity tend to interact with more target proteins, and hence a high level of lncRNA expression in human cells is required. Alterations in regulatory lncRNAs may directly influence their target proteins by interfering with protein post-translational modification, or indirectly affect the transcripts encoding their targets by decoying miRNAs and proteins to modulate their translation, leading to dysfunction of the cellular signaling cascades and downstream pathways. For instance, the lncRNA MALAT1, a well-known prognostic biomarker of lung adenocarcinoma and squamous cell carcinoma, has an extremely high TLC of 0.81, with its target proteins evenly distributed across different subcellular compartments. From the view of cancer development, MALAT1 is also abnormally overexpressed in many other tissues, such as breast, prostate, colon, liver, cervix, and bladder, where it affects the regulation of cell proliferation, cell migration, invasion, and apoptosis (Wahlestedt, 2013; Cabili et al., 2015; Quinn and Chang, 2016). Specifically, MALAT1 has been shown to accelerate cell invasion and proliferation in cervical cancer by interacting with several protein partners, e.g. BAX, BCL2, Caspase-8, Caspase-3, and BCL-XL. Furthermore, MALAT1 interacts with the Polycomb 2 protein (Pc2) and activates E2F1 sumoylation to promote cancer progression. RNA-binding proteins such as TDP-43 have also been reported as MALAT1 interaction partners that contribute to cancer pathogenesis. Overall, lncRNAs may work as generalists in cancers by interacting with multiple proteins involved in distinct cellular compartments and diverse biological functions.
Although thousands of lncRNAs have been identified in human, only a fraction of them are functionally characterized, because lncRNA sequences are not as conserved as regular protein-coding genes and the direct prediction of their functions and localization is a difficult task (Liao et al., 2011; Quinn and Chang, 2016; Chen, 2016b; Zhou et al., 2017). Owing to the incomplete information on ncRNA localization and interaction, it is hard to generate a subcellular-specific RNA–protein interaction network, and only around a thousand lncRNAs were investigated in the current study, compared with the large number of lncRNAs encoded by the human genome (Buxbaum et al., 2015; Zhang et al., 2017). Instead, we measured the localization diversity of the target proteins of ncRNAs and assumed that interacting ncRNAs and proteins reside in the same subcellular locations, since the localization annotation of ncRNAs is extremely incomplete. To the best of our knowledge, merely two databases have collected the localization annotation of ncRNAs, i.e. RNALocate (Zhang et al., 2017) and LncATLAS (Mas-Ponte et al., 2017), and only a small portion of the lncRNAs we studied can be mapped to cell locations. Besides, lncRNAs execute functions in cancers through their interactions with macromolecules, such as proteins that are scattered across subcellular compartments (Lukong et al., 2008; Licatalosi and Darnell, 2010; Li et al., 2014; Jiang et al., 2015; Ferre et al., 2016), and, therefore, the target localization diversity may indirectly reflect the ncRNA functional diversity. To evaluate the inferred localization of ncRNAs, we adopted Monte Carlo simulation to model the probability of co-localized ncRNA–protein interactions based on the ncRNA localization information from RNALocate. Due to the limited annotation, we only validated the ncRNAs in the nucleus and the cytoplasm.
Our findings show that interacting ncRNAs and proteins are significantly more likely to reside in the same subcellular locations than the simulated ncRNA–protein interactions. As shown in Supplementary Figure S3A, the ratios of nucleus-specific and cytoplasm-specific interactions are each significantly higher than the corresponding ratios of the simulated ones. Specifically, 1139 out of 1323 interactions (86.75%) are subcellular specific in either the nucleus or the cytoplasm, which cannot be achieved by chance based on the simulated subcellular-specific interactions (Supplementary Figure S3B). When focusing only on the subcellular-specific interactions, the TLCs of cancer lncRNAs are still higher than those of the non-cancer lncRNAs (Supplementary Figure S3C); the difference is large, although it is not statistically significant because of the small sample size available. Only 17 lncRNAs and their respective target proteins are both annotated to the same locations in the nucleus or the cytoplasm. Hopefully, more RNA annotations can be confirmed with the rapid development of RNA localization and interaction techniques (Liu et al., 2010; Liu and Chen, 2016; Liu and Miao, 2016), and we will reanalyze the subcellular-specific localization diversity once appropriate and more comprehensive ncRNA localization data are available.

Meanwhile, we cannot neglect that cancer-associated lncRNAs are usually better studied and characterized than lncRNAs unrelated to cancer, and, therefore, more target proteins have been discovered for cancer-associated lncRNAs, which might be a latent reason for the higher TLCs in cancer. To address this problem, we also assessed whether the proteins targeted by cancer lncRNAs are prone to have high localization coefficients (LCs) in comparison with the other proteins. The LC of each protein was calculated separately.
As shown in Supplementary Figure S1, the LCs of the cancer lncRNA target proteins are significantly higher than those of the others (P < 1.03e–02, Mann–Whitney U test), indicating that the target proteins themselves, apart from the number of targets, contribute substantially to the lncRNA target localization diversity. Furthermore, we also compared the TLC distributions between cancer and non-cancer lncRNAs using another database, NPInter v3.0 (Hao et al., 2016), from which only the interactions determined using high-throughput sequencing technologies, such as CLIP, HITS-CLIP, PAR-CLIP, and RIP, were adopted. This may to some extent reduce the data bias in cancer studies, because we used only high-throughput data in which no lncRNAs are specifically targeted. As shown in Supplementary Figure S2A, a very small proportion of lncRNAs are annotated in both databases. Unsurprisingly, we found that the TLCs of cancer lncRNAs are also significantly higher than those of the non-cancer ones (P < 3.25e–03, Mann–Whitney U test, Supplementary Figure S2B), suggesting that there is little study bias in the calculation of the TLC.

Also, it should be noted that an organelle can be further divided into sub-locations. Proteins located in different sub-locations play distinct biological roles, and likewise for ncRNAs. Some researchers have developed algorithms to predict protein sub-subcellular localizations, such as sub-mitochondria locations (Lin et al., 2013a), sub-chloroplast locations (Lin et al., 2013b), and sub-Golgi locations (Ding et al., 2011, 2013). An analysis at the sub-organelle level would undoubtedly provide a deeper understanding of disease development. Thus, we will perform more detailed studies at the sub-organelle level by using these tools in the future.
In summary, our results elucidate how lncRNAs achieve functional outcomes in cancer through diverse interactions with their protein partners, implying that the localization diversity of target proteins is a major determinant of lncRNA functionality and shedding light on the molecular mechanisms of these cancer-associated transcripts in human cancers. In addition to physical interactions, the quantified target localization diversity can be used to systematically study lncRNA–protein interactions on the basis of transcriptome data including both normal and cancer samples. For instance, we can identify lncRNAs that become either more diverse or less diverse in cancer samples in comparison to the normal ones, which may also facilitate the experimental identification of cancer-related lncRNAs and novel drug targets. To our knowledge, this is the first time that the target localization diversity of lncRNAs has been systematically and comprehensively investigated, especially its characteristics in cancers. Besides, the difference in target localization diversity among ncRNA subtypes may imply functional characteristics and biological mechanisms, so it will be valuable to use ncTALENT to analyze other ncRNA species in depth, such as circRNA (Chen, 2016a; Meng et al., 2017), although our study concentrated only on lncRNAs.

Materials and methods

Datasets

Subcellular localization of proteins

We obtained the experimental localization information of proteins from the Cell Atlas (Thul et al., 2017), a subpart of the Human Protein Atlas (Ponten et al., 2008). Using an antibody-based image profiling approach and transcriptomics data, the Cell Atlas delivers a comprehensive resource for human protein subcellular localization. In the current version, 12003 proteins were annotated to 32 subcellular locations on a single-cell level, and they can be further categorized into 14 major compartments based on the cellular substructures.
All the lncRNA target proteins were analyzed according to these major compartments in this study.

lncRNA–protein interactions

The interactions between lncRNAs and proteins were obtained from RAID v2.0 (Yi et al., 2017). It is a comprehensive and high-confidence resource of interactions between RNA and other cell components, which integrates experimentally and computationally predicted RNA-associated interactions from 18 data sources, e.g. StarBase (Li et al., 2014) and LncRNA2Target (Jiang et al., 2015). A total of 12008 human RNA–protein interactions and a variety of RNAs are involved, including lncRNA, circRNA, miRNA, pseudogenes, etc.

lncRNA biotypes

According to their sequence features such as transcriptional directionality and exosome sensitivity, lncRNAs can be categorized into six biotypes in FANTOM (Hon et al., 2017), i.e. antisense, divergent, sense intronic, intergenic, exonic, and pseudogenes.

Cancer-associated lncRNAs

The cancer-associated lncRNAs were collected from the Lnc2Cancer database (July 4, 2016, latest version) (Ning et al., 2016), which manually collected and integrated the lncRNA–cancer associations with experimental support from the published literature. The expression patterns of the lncRNAs (upregulated, downregulated, or differential expression) are also provided, referring to their original studies. Only the cancers having five or more lncRNAs were considered in this study, resulting in 284 associations with direct experimental evidence among 74 lncRNAs and 26 human cancers.

Gastric cancer expression data

The RNA sequencing data of stomach adenocarcinoma (gastric cancer) were collected from The Cancer Genome Atlas (https://cancergenome.nih.gov/). This dataset was produced on the poly(A)+ Illumina HiSeq platform and provides the gene-level (Level 3) transcription estimates, as log2(x + 1) transformed RSEM normalized counts, for a total of 450 gastric samples, 415 from cancer and 35 from normal.
We also downloaded the clinical matrix of these samples to retrieve phenotype information, such as overall survival, for survival analysis.

The workflow of ncTALENT

The workflow of ncTALENT includes three steps: (i) associating an input ncRNA with target proteins, (ii) constructing a ncRNA TLM, and (iii) calculating the ncRNA TLC. As shown in Figure 1, a given ncRNA X has six targets annotated in 10 subcellular compartments, which can be denoted in a TLM with the rows (i) representing compartments and the columns (j) representing target proteins. k is the total number of recorded compartment annotations of the target proteins of X, given as

$$k=\sum_{i=1}^{m}\sum_{j=1}^{n}\mathrm{TLM}_{i,j} \tag{1}$$

Based on the TLM, we can compute the TLC of X as follows:

$$\mathrm{TLC}(X)=1-\sum_{i=1}^{m}\left(\frac{\sum_{j=1}^{n}\mathrm{TLM}_{i,j}}{k}\right)^{2} \tag{2}$$

where m is the number of cell compartments and n is the number of annotated target proteins. Overall, we can calculate the TLC of a given ncRNA using the following function:

$$\mathrm{TLC}(X)=1-\sum_{i=1}^{m}\left(\frac{\sum_{j=1}^{n}\mathrm{TLM}_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n}\mathrm{TLM}_{i,j}}\right)^{2} \tag{3}$$

A simple example is shown in the right panel of Figure 1. The number of target proteins occurring in each compartment is 1, 2, 1, 4, 0, and 2 for nucleoli, mitochondria, Golgi apparatus, cytosol, microtubules, and vesicles, respectively, and k is their sum, which equals 10. Finally, we obtain a TLC score of 0.74 for the ncRNA using the TLC function (Equation (3)).

Statistical methods

In this study, the Mann–Whitney U test was used to assess the statistical significance of the difference between two vectors of TLC, because it makes no assumption about the distribution of the dependent variable in the analysis. Student’s t-test was used to identify DE target genes of lncRNAs between cancer and normal samples. The genes with fold change >1 and FDR corrected P-value <0.01 were considered as DE genes (Consortium, 2014; Cheng et al., 2016a, b).
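To make the TLC definition in Equations (1)–(3) concrete, the following Python sketch recomputes the worked example from Figure 1. It is an illustration under stated assumptions rather than the ncTALENT implementation: the function name and the individual 0/1 entries of the matrix are hypothetical, and only the per-compartment counts (1, 2, 1, 4, 0, 2) are taken from the text.

```python
def target_localization_coefficient(tlm):
    """Compute TLC = 1 - sum_i (row_sum_i / k)^2 for a 0/1 matrix.

    tlm: list of rows, one per compartment; each row holds one 0/1 entry
    per target protein (1 = the protein is annotated to that compartment).
    """
    row_sums = [sum(row) for row in tlm]  # annotated targets per compartment
    k = sum(row_sums)                     # Equation (1): all recorded annotations
    if k == 0:
        raise ValueError("no localization annotations for this ncRNA")
    return 1.0 - sum((s / k) ** 2 for s in row_sums)  # Equation (3)


# Hypothetical TLM: the individual entries are made up for illustration;
# only the row sums (1, 2, 1, 4, 0, 2) follow the example in the text.
example_tlm = [
    [1, 0, 0, 0, 0, 0],  # nucleoli
    [0, 1, 1, 0, 0, 0],  # mitochondria
    [0, 0, 0, 1, 0, 0],  # Golgi apparatus
    [1, 1, 0, 1, 1, 0],  # cytosol
    [0, 0, 0, 0, 0, 0],  # microtubules
    [0, 0, 0, 0, 1, 1],  # vesicles
]

tlc = target_localization_coefficient(example_tlm)
print(round(tlc, 2))  # 0.74, the value reported for this example
```

With this definition, an ncRNA whose targets all fall in a single compartment has a TLC of 0, and annotations spread evenly over c compartments give 1 − 1/c, so the coefficient grows with both the number of compartments and the evenness of the spread.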
If the DE genes encoding the target proteins of a lncRNA are significantly overrepresented, we define the lncRNA as a DE lncRNA. A hypergeometric test was used to assess whether the proteins encoded by the DE genes are significantly enriched in a set of target proteins, which is described as follows:

$$p=1-\sum_{i=0}^{k-1}\frac{\binom{x}{i}\binom{n-x}{m-i}}{\binom{n}{m}} \tag{4}$$

where n is the number of all target genes analyzed in the expression matrix, m is the number of DE genes, x is the number of target genes for a given lncRNA, i is the number of DE target genes, and k is the observed number of DE target genes of the lncRNA. The P-value describes the probability of randomly picking up k or more DE genes from the n target genes.

Survival-associated gene identification

For each lncRNA, the first principal component of the expression of its target genes is calculated and defined as the lncRNA eigengene (Du et al., 2013; Liu et al., 2014; Jin et al., 2015). We use the EG value as the risk score to assess the prognostic ability of a lncRNA, dichotomizing the patients into high- and low-risk groups at the median EG value. We performed the survival analysis using the log-rank test and illustrated the results with Kaplan–Meier survival curves. Essentially, the EG value is the most representative gene expression of a cluster of genes. Given the expressions of m genes, $X=(X_1,\ldots,X_m)$, we first construct $\mathrm{Cov}_n(X)$, the m×m sample variance–covariance matrix among n samples. Then the eigenvalues and eigenvectors of $\mathrm{Cov}_n(X)$ are calculated, and the eigenvectors are sorted by their corresponding eigenvalues (Ma and Dai, 2011). The principal components are the eigenvectors, and the first principal component is the eigenvector with the largest eigenvalue.

Monte Carlo simulation

To evaluate the inferred localization of ncRNAs, Monte Carlo simulation was adopted to model the probability of co-localized ncRNA–protein interactions based on the ncRNA localization information from RNALocate (Zhang et al., 2017).
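A minimal Python sketch of such a simulation is given here. It rests on assumptions that the text does not state: each simulated pair combines one ncRNA and one protein drawn independently and uniformly at random from the annotated pools, and a pair counts as co-localized when the two share at least one annotated location. The function names and the toy annotations are hypothetical; the P-value follows Equation (5) of this section, with sgn(x) = 1 for x ≥ 0.

```python
import random


def is_colocalized(rna_locs, protein_locs):
    """Assumed rule: a pair is co-localized if it shares at least one location."""
    return bool(set(rna_locs) & set(protein_locs))


def monte_carlo_p_value(observed_t, rna_locations, protein_locations,
                        m=1000, n=10000, seed=0):
    """Fraction of n simulations whose co-localized share T_i is >= observed_t.

    rna_locations / protein_locations: dicts mapping an identifier to a set
    of location names. m is the number of random pairs per simulation.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rna_pool = list(rna_locations.values())
    protein_pool = list(protein_locations.values())
    at_least_observed = 0
    for _ in range(n):
        m_i = sum(
            is_colocalized(rng.choice(rna_pool), rng.choice(protein_pool))
            for _ in range(m)
        )
        t_i = m_i / m              # T_i = m_i / m
        if t_i >= observed_t:      # sgn(T_i - T) = 1 when T_i - T >= 0
            at_least_observed += 1
    return at_least_observed / n   # Equation (5)


# Toy annotations (hypothetical identifiers, nucleus/cytoplasm only).
toy_rnas = {"rna_a": {"nucleus"}, "rna_b": {"cytoplasm"}}
toy_proteins = {
    "prot_a": {"nucleus"},
    "prot_b": {"cytoplasm"},
    "prot_c": {"nucleus", "cytoplasm"},
}

# Smaller m and n than in the study, to keep the example fast.
p = monte_carlo_p_value(0.95, toy_rnas, toy_proteins, m=100, n=500)
print(p)
```

In the toy data a random pair is co-localized about two times out of three, so an observed share of 0.95 is almost never reached by chance and the estimated P-value is close to zero.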
Due to the limited annotation, only the ncRNAs in the nucleus and the cytoplasm were validated. m pairs of ncRNAs and proteins were randomly selected from the ncRNA–protein interactome data, and we then calculated the percentage of co-localized pairs ($T_i$). We repeated the experiment n times, and the P-value is the fraction of repetitions in which the simulated percentage ($T_i$) is at least as large as the observed one ($T$), which is mathematically defined as

$$p=\frac{\sum_{i=1}^{n}\mathrm{sgn}(T_i-T)}{n} \tag{5}$$

where $T_i=m_i/m$ and $m_i$ is the number of co-localized ncRNA–protein pairs. m and n were set to 1000 and 10000, respectively, in this study. sgn is defined as

$$\mathrm{sgn}(x)=\begin{cases}1, & x\geq 0\\ 0, & x<0\end{cases}$$

Supplementary material

Supplementary material is available at Journal of Molecular Cell Biology online.

Acknowledgements

We thank Prof. Dong Wang for the thoughtful discussion and the reviewers for their insightful comments on the paper.

Funding

This work was supported by The Chinese University of Hong Kong Direct Grant and the Research Grants Council of Hong Kong GRF Grant (414413).

Conflict of interest: none declared.

References

Barabasi, A.L., and Oltvai, Z.N. (2004). Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 5, 101–113.
Buxbaum, A.R., Haimovich, G., and Singer, R.H. (2015). In the right place at the right time: visualizing and understanding mRNA localization. Nat. Rev. Mol. Cell Biol. 16, 95–109.
Cabili, M.N., Dunagin, M.C., McClanahan, P.D., et al. (2015). Localization and abundance analysis of human lncRNAs at single-cell and single-molecule resolution. Genome Biol. 16, 20.
Chen, L.-L. (2016a). The biogenesis and emerging roles of circular RNAs. Nat. Rev. Mol. Cell Biol. 17, 205–211.
Chen, L.-L. (2016b). Linking long noncoding RNA localization and function. Trends Biochem. Sci. 41, 761–772.
Cheng, L., Fan, K., Huang, Y., et al. (2017). Full characterization of localization diversity in the human protein interactome. J. Proteome Res. 16, 3019–3029.
Cheng, L., Lo, L.Y., Tang, N.L., et al. (2016a). CrossNorm: a novel normalization strategy for microarray data in cancers. Sci. Rep. 6, 18898.
Cheng, L., Wang, X., Wong, P.K., et al. (2016b). ICN: a normalization method for gene expression data considering the over-expression of informative genes. Mol. Biosyst. 12, 3057–3066.
Consortium, S.M.-I. (2014). A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat. Biotechnol. 32, 903–914.
Derrien, T., Johnson, R., Bussotti, G., et al. (2012). The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 22, 1775–1789.
Deveson, I.W., Hardwick, S.A., Mercer, T.R., et al. (2017). The dimensions, dynamics, and relevance of the mammalian noncoding transcriptome. Trends Genet. 33, 464–478.
Ding, H., Guo, S.-H., Deng, E.-Z., et al. (2013). Prediction of Golgi-resident protein types by using feature selection technique. Chemom. Intell. Lab. Syst. 124, 9–13.
Ding, H., Liu, L., Guo, F.B., et al. (2011). Identify Golgi protein types with modified Mahalanobis discriminant algorithm and pseudo amino acid composition. Protein Pept. Lett. 18, 58–63.
Du, Z., Fei, T., Verhaak, R.G., et al. (2013). Integrative genomic analyses reveal clinically relevant long noncoding RNAs in human cancer. Nat. Struct. Mol. Biol. 20, 908–913.
Ferre, F., Colantoni, A., and Helmer-Citterich, M. (2016). Revealing protein-lncRNA interaction. Brief. Bioinform. 17, 106–116.
Hao, Y., Wu, W., Li, H., et al. (2016). NPInter v3.0: an upgraded database of noncoding RNA-associated interactions. Database 2016, baw057.
Hon, C.C., Ramilowski, J.A., Harshbarger, J., et al. (2017). An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204.
Huang, Z., and Bao, S. (2012). Ubiquitination and deubiquitination of REST and its roles in cancers. FEBS Lett. 586, 1602–1605.
Jiang, Q., Wang, J., Wu, X., et al. (2015). LncRNA2Target: a database for differentially expressed genes after lncRNA knockdown or overexpression. Nucleic Acids Res. 43, D193–D196.
Jin, N., Wu, H., Miao, Z., et al. (2015). Network-based survival-associated module biomarker and its crosstalk with cell death genes in ovarian cancer. Sci. Rep. 5, 11566.
Li, J.H., Liu, S., Zhou, H., et al. (2014). starBase v2.0: decoding miRNA–ceRNA, miRNA–ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 42, D92–D97.
Li, Y., Wang, C., Miao, Z., et al. (2015). ViRBase: a resource for virus-host ncRNA-associated interactions. Nucleic Acids Res. 43, D578–D582.
Liao, Q., Liu, C., Yuan, X., et al. (2011). Large-scale prediction of long non-coding RNA functions in a coding–non-coding gene co-expression network. Nucleic Acids Res. 39, 3864–3878.
Licatalosi, D.D., and Darnell, R.B. (2010). RNA processing and its regulation: global insights into biological networks. Nat. Rev. Genet. 11, 75–87.
Lievens, S., Van der Heyden, J., Masschaele, D., et al. (2016). Proteome-scale binary interactomics in human cells. Mol. Cell. Proteomics 15, 3624–3639.
Lin, H., Chen, W., Yuan, L.F., et al. (2013a). Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor. 61, 259–268.
Lin, H., Ding, C., Yuan, L.-F., et al. (2013b). Predicting subchloroplast locations of proteins based on the general form of Chou’s pseudo amino acid composition: approached from optimal tripeptide composition. Int. J. Biomath. 6, 1350003.
Liu, Z.P., and Chen, L. (2016). Prediction and dissection of protein–RNA interactions by molecular descriptors. Curr. Top. Med. Chem. 16, 604–615.
Liu, W., Li, L., and Li, W. (2014). Gene co-expression analysis identifies common modules related to prognosis and drug resistance in cancer cell lines. Int. J. Cancer 135, 2795–2803.
Liu, Z.-P., and Miao, H. (2016). Prediction of protein–RNA interactions using sequence and structure descriptors. Neurocomputing 206, 28–34.
Liu, Z.P., Wu, L.Y., Wang, Y., et al. (2010). Prediction of protein–RNA binding sites by a random forest method with combined features. Bioinformatics 26, 1616–1622.
Lukong, K.E., Chang, K.W., Khandjian, E.W., et al. (2008). RNA-binding proteins in human genetic disease. Trends Genet. 24, 416–425.
Ma, S., and Dai, Y. (2011). Principal component analysis based methods in bioinformatics studies. Brief. Bioinform. 12, 714–722.
Mas-Ponte, D., Carlevaro-Fita, J., Palumbo, E., et al. (2017). LncATLAS database for subcellular localization of long noncoding RNAs. RNA 23, 1080–1087.
Meng, S., Zhou, H., Feng, Z., et al. (2017). CircRNA: functions and properties of a novel potential biomarker for cancer. Mol. Cancer 16, 94.
Milligan, M.J., and Lipovich, L. (2014). Pseudogene-derived lncRNAs: emerging regulators of gene expression. Front. Genet. 5, 476.
Ning, S., Zhang, J., Wang, P., et al. (2016). Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 44, D980–D985.
Park, S., Yang, J.S., Shin, Y.E., et al. (2011). Protein localization as a principal feature of the etiology and comorbidity of genetic diseases. Mol. Syst. Biol. 7, 494.
Ponten, F., Jirstrom, K., and Uhlen, M. (2008). The Human Protein Atlas—a tool for pathology. J. Pathol. 216, 387–393.
Quinn, J.J., and Chang, H.Y. (2016). Unique features of long non-coding RNA biogenesis and function. Nat. Rev. Genet. 17, 47–62.
Thul, P.J., Akesson, L., Wiking, M., et al. (2017). A subcellular map of the human proteome. Science 356, eaal3321.
Torre, L.A., Siegel, R.L., Ward, E.M., et al. (2016). Global cancer incidence and mortality rates and trends—an update. Cancer Epidemiol. Biomarkers Prev. 25, 16–27.
UniProt, C. (2014). Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 42, D191–D198.
Vidal, M., Cusick, M.E., and Barabasi, A.L. (2011). Interactome networks and human disease. Cell 144, 986–998.
Wahlestedt, C. (2013). Targeting long non-coding RNA to therapeutically upregulate gene expression. Nat. Rev. Drug Discov. 12, 433–446.
Wang, P., Ning, S., Zhang, Y., et al. (2015). Identification of lncRNA-associated competing triplets reveals global patterns and prognostic markers for cancer. Nucleic Acids Res. 43, 3478–3489.
Wu, D., Huang, Y., Kang, J., et al. (2015). ncRDeathDB: A comprehensive bioinformatics resource for deciphering network organization of the ncRNA-mediated cell death system. Autophagy 11, 1917–1926.
Yi, Y., Zhao, Y., Li, C., et al. (2017). RAID v2.0: an updated resource of RNA-associated interactions across organisms. Nucleic Acids Res. 45, D115–D118.
Zhang, T., Tan, P., Wang, L., et al. (2017). RNALocate: a resource for RNA subcellular localizations. Nucleic Acids Res. 45, D135–D138.
Zhou, J., Zhang, S., Wang, H., et al. (2017). LncFunNet: an integrated computational framework for identification of functional long noncoding RNAs in mouse skeletal muscle cells. Nucleic Acids Res. 45, e108.
Zhu, X., Tian, X., Yu, C., et al. (2016). A long non-coding RNA signature to improve prognosis prediction of gastric cancer. Mol. Cancer 15, 60.

© The Author (2018). Published by Oxford University Press on behalf of Journal of Molecular Cell Biology, IBCB, SIBS, CAS. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Journal of Molecular Cell Biology, Oxford University Press

Published: Mar 2, 2018
