Access the full text.
Sign up today, get DeepDyve free for 14 days.
Motivation: Although the amount of small non-coding RNA-sequencing data is continuously increasing, it is still unclear to which extent small RNAs are represented in the human genome. Results: In this study we analyzed 303 billion sequencing reads from nearly 25 000 datasets to answer this question. We determined that 0.8% of the human genome are reliably covered by 874 123 regions with an average length of 31 nt. On the basis of these regions, we found that among the known small non-coding RNA classes, microRNAs were the most prevalent. In subse- quent steps, we characterized variations of miRNAs and performed a staged validation of 11 877 candidate miRNAs. Of these, many were actually expressed and significantly dysregulated in lung cancer. Selected candidates were finally validated by northern blots. Although isolated miRNAs could still be present in the human genome, our presented set likely contains the largest fraction of human miRNAs. Contact: c.backes@mx.uni-saarland.de or andreas.keller@ccb.uni-saarland.de Supplementary information: Supplementary data are available at Bioinformatics online. pathophysiological processes. Generally, non-coding RNAs can be 1 Introduction divided into long- (lncRNAs) and small non-coding RNAs The part of the human genome which is transcribed into RNAs (sncRNAs). Among the many different categories in the latter class but does not encode for proteins contains many elements with are tRNAs, snoRNAs, miRNAs, snRNAs, piRNAs and others regulatory function that are central for physiological and (Kowalczyk et al., 2012). Altogether, over 85 000 human transcripts V The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com 1621 1622 T.Fehlmann et al. not coding for proteins have been annotated so far (Harrow et al., capture these different modification types, we performed our analy- 2006). To discover further non-coding transcripts and to detect cor- sis as follows: For each human miRNA annotated in the miRBase relations to pathologies various specimens have been deeply (Kozomara and Griffiths-Jones, 2014) we mapped the reads to the sequenced and evaluated by a heterogeneous set of bioinformatics respective precursor, while allowing up to two non-template addi- tools (Akhtar et al., 2016). Many studies list large collections of tions to the 5’ and 3’ ends and up to one mismatch in between. We ensured that all precursors have at least 15 bases flanks on both putatively novel sequences that are frequently validated either to a ends so that no isoforms are missed. We then defined the mature 3’ limited extent or not at all (Backes et al., 2016; Friedla ¨ nder et al., 2011; Londin et al., 2015). The proposed candidates do not only and 5’ form covered by the largest fraction of reads per million consist of new representatives of the respective molecule class, but mapped to miRNA (RPMMM) normalized reads as the canonical also consist of artifacts that are often also deposited in central data- form. Mappings were allowed in a window of 10 bases up- and bases. In a recent study, Vitsios et al. (2016) reanalyzed hundreds of downstream of the annotated miRNAs. To avoid a bias through potentially mis-annotated miRNAs, we required that at least 80% small RNA-sequencing samples and revealed that many miRNAs in of normalized reads mapped with at most one mismatch in an up- miRBase (Kozomara and Griffiths-Jones, 2014) are potentially mis- and downstream window of 2 and respectively 5 bases to the anno- annotated. Another challenge arises from the substantial redundan- tated miRNAs, or in-between. If only one miRNA was annotated, cies between the different studies. Since not all candidates are stored the other was derived from the read with the highest RPMMM, in a central repository, it is likely that a ‘new’ candidate has already shorter than 25 bases and with at least 3 bases distance to the anno- been discovered in the same or very similar manner by others in pre- tated miRNA. We determined the fraction of reads mapping to vious studies. The heterogeneity and magnitude of small RNA- potential isoforms from the determined canonical forms, while sequencing studies calls for a sophisticated meta-analysis of the allowing up to 5 bases variability at the 5’ end and 7 bases at the 3’ available data. We have carried out such a meta-analysis and present end. Isoforms that were longer than 25 bases or shorter than 17 our results in this paper including a high-resolution map of bases were discarded. If a potential isoform was covered by at least sncRNAs with a focus on miRNAs by an integrative analysis of 2% of all RPMMM normalized reads the isoform was considered to thousands of small RNA-sequencing datasets. be present. 2 Materials and methods 2.3 Single nucleotide variants in mature miRNAs Single nucleotide variants were detected for each previously deter- 2.1 Sample collection mined canonical miRNA form (see above) by mapping the reads Our sample collection stems from three different sources. First, we against the respective precursor while allowing up to one mismatch. downloaded sequencing data likely to contain small RNAs from the We allowed a variability of two nucleotides at the 5’ end and five sequence read archive (SRA) (Kodama et al., 2012) using the follow- nucleotides at the 3’ end. ing query: (“small rna” OR srna OR mirna OR microrna) AND “Homo 2.4 Prediction of novel miRNAs sapiens”[orgn:__txid9606] For the prediction of novel miRNAs, we used our web-based tool This query resulted in 18 367 SRA Runs, of which we kept only miRMaster (Fehlmann et al., 2017). For each of the 18 035 samples those sequenced using an Illumina platform, with single-end reads, we performed the prediction of novel miRNAs. Afterwards, the pre- public access and assay type annotated as miRNA-, ncRNA- or dicted precursors were merged and all reads aligned to the potential RNA-seq. This resulted in 10 233 Runs. Since multiple runs can be new precursors. From these mappings the candidate miRNAs were performed for the same experiment, we merged them, leading to derived. To exclude candidates overlapping with already known 8985 samples. We determined for these the presence or absence of 5’ ncRNAs, we mapped the predicted miRNAs to the human non- barcodes and 3’ adapter, as described in the section below. coding RNAs of Ensembl (release 85) (Yates et al., 2016) and to Second, we collected 10 999 miRNA-seq samples from the can- NONCODE 2016 (Zhao et al., 2016) using BLASTþ (Camacho cer genome atlas project (TCGA) (http://cancergenome.nih.gov/) et al., 2009). We considered a candidate as not novel when the can- (accessed on April 7, 2017). Since the raw files are only available as didate miRNA sequence had an overlap of at least 90% with the mapped BAM files, we transformed them back into FASTQ format aligned sequences while allowing at most one mismatch. All precur- for subsequent analysis. sors containing mapping (i.e. non-novel) miRNAs were then dis- As third data source, we used the collapsed reads users uploaded carded. To further asses the likelihood that miRNAs are true in our tool miRMaster (Fehlmann et al., 2017), if they gave us con- positives and to rank miRNA candidates we subsequently applied sent for usage of their data in an aggregated manner, which summed the novoMiRank tool (Backes et al., 2016). This tool compares fea- up to 4570 samples. These different data sources provided in total tures of new miRNAs to a set of high-confidence miRNAs from 24 554 samples. early miRBase versions and ranks those miRNA candidates highest More information about the pre-processing of the samples can that match best to the known high-confidence miRNAs. be found in the Supplementary Material. 2.5 Validation of novel miRNAs 2.2 Isoform analysis It is generally known that high-throughput approaches can lead to For miRNAs different types of isoforms have been described in the artifacts. This also holds for miRNA analysis where a substantial literature (Guo and Chen, 2014). Most common are isoforms that bias depending on the underlying measurement approach is known only differ in the length of the sequence, but non-template additions (Backes et al., 2016). Since even different sequencing approaches at the 5’ or 3’ end have also been detected (Guo et al., 2014). can have different bias (Fehlmann et al., 2016), we decided to vali- Variants within the miRNA sequence could also potentially stem date new miRNA candidates by another technique. Since we wanted from either sequencing errors or ‘normal’ genetic variability. To to avoid PCR bias, we decided in favor of amplification free High-res. map of the human small nc transcriptome 1623 microarrays. Specifically, we collected a set of miRNAs from their tissue of origin to create a map representing all datasets. This miRBase, a set of miRNA candidates from two other studies (Backes was achieved by applying t-distributed stochastic neighbor embed- et al., 2016; Londin et al., 2015) and those that achieved a high ding. The respective map—color coded by the tissue of origin—high- rank in the above mentioned novoMiRank analysis. Altogether, lights that in the majority of cases tissues of the same type build dense clusters (see Fig. 1B). 11 877 sequences were used for the custom microarray analysis. Previous studies (e.g. Ludwig et al., 2016) suggested reasonable results with 20 technical replicates per miRNA, thus 237 540 fea- 3.2 Coverage of the human genome tures were required. Since the used Agilent microarrays do not pro- Given the large collection of small RNA-sequencing reads, a natural vide sufficient feature numbers per array, the features were split first question to ask is how densely the human genome is covered by across five arrays. Each of these arrays contains a fifth of the fea- the sncRNA-sequencing reads. Mapping the 602 million unique tures, so that we get the full feature set when combining the different reads without allowing mismatches we calculate a total (1-fold) cov- expression data from five arrays for one sample. A set of RNA sam- erage of 64%. Increasing the coverage threshold leads to a rapid ples (plasma and PAXGene blood pools, a reference sample and drop of the covered part of the genome. For example, we observe a brain, kidney, liver, testis and heart tissue) was hybridized with the decrease from 13 to 6% when considering the 10- and 20-fold cov- microarrays as previously described for standard miRBase v21 erage of the human genome by unique reads. Detailed analysis microarrays (Hecksteden et al., 2016). immediately highlights that large fractions of the human genome are From all (candidate) miRNAs found expressed in the blood we covered only by single samples (16.3%) or few samples (41.1% are generated a new custom microarray entitled ‘all human blood represented by at most five of the 18 035 samples). Although such miRNA array’. This array contains 2305 new and known miRNAs regions covered by very few samples could still contain true non- that are found in human blood. The microarray is manufactured by coding RNAs, we excluded these regions because our aim was to Agilent, handling is facilitated by standard Agilent equipment and present a collection of reliable regions present in a reasonable num- the microarray is available for research use from Hummingbird ber of samples containing potentially new sncRNAs. We thus calcu- Diagnostics. This array was hybridized with 53 patients, 25 controls lated the fraction of the genome covered by at least n samples and at and 28 small-cell lung carcinoma patients (SCLC). Microarray data least m reads. The diagonal of the matrix, the percentage covered by were evaluated according to manufacturer’s instructions and data at least n reads in at least n samples, is presented in Figure 2A. For were processed using R. P-values were determined using a two- n ¼ m ¼ 20, the fraction already decreases to 7.5%, for n ¼ m ¼ tailed t-test and the Benjamini–Hochberg procedure (Benjamini and 50–3% and for n ¼ m ¼ 100–1.5%. Heatmaps for the complete Hochberg, 1995) was used to adjust for multiple testing. matrix are shown in Supplementary Figure S1. For determining regions that are reliably covered by small sequencing reads, we set the lower coverage threshold to 180, corresponding to 1% of the 18 3 Results 035 samples. We denote this part of the genome that consists of 3.1 Dataset collection and processing 0.8% of the genome covered by 25 million bases as ‘high confi- The important first step in the meta-analysis is the collection, quality dence’ or ‘reliable’ regions. The length distribution of the roughly control, and curation of available small RNA-sequencing datasets. 900 000 different regions is presented in Figure 2B as histogram. On top of the histogram, the length distribution of known and anno- As major sources for sequencing data we included the SRA tated non-coding RNAs is presented. The peak of the regions is at a (Kodama et al., 2012), the cancer genome atlas project (http://cancer length of 22 nucleotides, fitting best to annotated human miRNAs. genome.nih.gov/) (Cancer Genome Atlas Research, 2008), and data Especially in this range, many reliable regions fall into not- that have been analyzed through our miRMaster workbench annotated regions of the human genome. However, also longer (Fehlmann et al., 2017). The initial sample collection covered 24 554 individual NGS samples and 303 billion sequencing reads stretches exceeding 150 bases were detected among the reliable regions. These usually match to lncRNAs or even genes. The high covering together 13 009 billion bases (corresponding roughly to the confidence regions that represent one of the central results of our number of nucleic acids in 4200 human genomes). In a stringent work are provided as GFF3 file in Supplementary Material S1. quality control step, we removed samples having <1 million reads, samples without known sequencing adapters, samples not mapping to the human genome or those mapping with over 1% to coding 3.3 The most prevalent RNA species exons. Further, samples covering >1% of the human genome were In a next step, we asked how many percent of the considered RNA excluded, since high-quality small ncRNA-seq datasets usually cover classes are covered by reliable regions on each chromosome a much smaller fraction of the human genome. More details about (Supplementary Fig. S2). The reliable regions contain on average per the quality filtering process can be found in the Methods section. chromosome 81% of miRNAs that are annotated in miRBase v21. Following this stringent QC process, 27% of all samples were Best concordance was computed for Chromosome 14 (91%). Since removed for quality issues, leaving 18 035 samples containing 162 the data may not only contain miRNA-sequencing reads, but poten- billion reads. The list of samples from SRA and TCGA is provided tially also other non-coding RNAs or mRNA contaminations, we in Supplementary Table S1 and online at https://mircarta.cs.uni-saar calculated the fraction of these classes covered by reliable regions. land.de/data_sources/. As a matter of fact, a substantial percentage Only 0.5% of all piRNAs and 0.9% of all lncRNAs matched to reli- of all small RNA-sequencing reads is identical—collapsing the set of able regions. For coding exons, the number however increased to all reads leaves 1.9 billion unique reads. Mapping them to the 20%. Although this distribution in principle shows that our quality human genome and excluding those that match to more than five filtering successfully extracted samples primarily containing miRNA unique positions in order to avoid unspecific reads, we found hits data, we also observed many regions corresponding to protein-cod- for 602 million unique reads in the human genome. A detailed ing genes. These can potentially also contain non-coding elements; breakdown of the numbers in the different steps is presented in however, contamination due to fragmented mRNA is more likely. In Figure 1A. Next, the 18 035 samples were annotated with respect to the following, we focus on miRNAs as molecule class and highlight 1624 T.Fehlmann et al. Fig. 1. Overview of the sample collection. Panel (A) shows a detailed breakdown of the reads during our quality filtering process into different categories. We determined for the remaining 18 035 samples to which different genomic annotations they map (using zero mismatches and allowing at least five mappings in the genome per read). Panel (B) shows a t-SNE map for samples with known tissue of origin. We computed the pairwise distances of these samples using the software Mash and plotted a visual representation with the Rtsne package. For almost all tissues distinct clusters can be discovered. Some clusters contain outly- ing samples from other tissues that could partially be due to wrong annotations in the original datasets interesting findings about potential isoforms, variations and novel isoforms for 2258 miRNA/precursor pairs, thus corresponding to miRNA predictions. seven isoforms on average. An example of a miRNA with only two isoforms is hsa-miR-153-5p (Fig. 3B). The canonical form of this miRNA was covered by 92% of all RPMMM, while the most abun- 3.4 Variability in mature miRNAs dant isoform—two bases longer at the 3’ end—was found in 3% of Frequently, miRNAs show variability in their length, start and end reads. Although this miRNA is an example with a clearly most position, which can influence their regulatory function. In addition abundant mature form, the canonical form as annotated in miRBase to the canonical form, for each mature miRNA several dozens of so- v21 was one base shorter at the 5’ end and two bases longer at the 3’ called isomiRs can exist. Importantly, the mature miRNAs as anno- end. An example of a miRNA with many detected isoforms, in total tated in the miRBase (considered as the canonical form) represented 14 isoforms, is hsa-miR-330-3p (Fig. 3C). The corresponding canon- in <43% of cases the most expressed mature miRNA across our ical form is represented only by 19% of all RPMMM, while its sec- 18 035 samples. In particular, we observed shifts at the 5’ end in ond and third most expressed isoforms are represented by 14–15% 23% of cases, thereby influencing the expected miRNA seed region. of all RPMMM. Again, the annotation in miRBase does not match This effect is due to the fact that many canonical miRNAs in the the canonical form in our study—it is one base longer at the 3’ end. miRBase are derived only from few samples usually only in one tis- A complete list of the 15 437 isoforms for human miRNAs for sue type or cell fraction while we aim to identify the overall most which we detected a canonical form with the relative RPMMM fre- abundant mature form of a miRNA. We thus set the mature form quencies and absolute number of reads mapping to this isoform is with highest read count to be the canonical form for each miRNA available in Supplementary Table S3. and calculated the variation frequency for respective isoforms. To reduce the influence of potentially mis-annotated precursors on our analysis we considered only precursors that passed a basic signature 3.5 Single nucleotide variants in mature miRNAs check (details are provided in Section 2), leaving 1415 precursors Variability in miRNAs is not limited to the length of the mature out of the original 1881 extracted from miRBase v21. The deter- miRNAs. Single point variants in miRNAs, especially in the seed mined canonical forms are presented in Supplementary Table S2.As region, can significantly influence miRNA-target gene interactions. presented in Figure 3A, we observed a substantial variability in the Therefore, we investigated the mutation frequency of all miRNAs length distribution of miRNAs. Comparing the 3’ and the 5’ mature considered in the previous sub-section. As shown in Figure 4A, forms of all miRNAs suggests that the 3’ forms have a slightly mutations at the 3’ end are the most frequent ones, similar to the iso- increased variability. Most significantly, for both, the 3’ and 5’ forms that are also mostly observed at the 3’ end. The most frequent mature forms the frequency of variations at the 3’ end were signifi- variations are adenylation and uridylations. This is in agreement cantly higher as compared to the frequency of variation at the 5’ with previous findings where post-transcriptional adenylation by end. A factor that may compromise the considerations is the depth GLD-2 (Katoh et al., 2009) and uridylation by TUTases (Heo et al., of coverage per miRNA. For high-abundant miRNAs, the likelihood 2012) were reported. Interestingly, also a wide variability of modifi- to discover an isoform is higher as compared to low abundant cations at the 5’ end was observed. However, the relative frequency miRNAs. Therefore, we used a threshold relative to the expression of the different alterations was usually low. One example with a of the miRNA and considered an isoform as detected if at least 2% high frequency is hsa-miR-376a-2-5p for which we observed the of all RPMMMs matched to the variant. We observed 15 437 sequence with an A->G mutation at the third base in 48% of the High-res. map of the human small nc transcriptome 1625 Modified at 5‘ end Modified at 3‘ end Fig. 3. Panel (A) shows the frequency of canonical and iso-miRNAs. For each human miRNA we calculated the canonical form as the form with highest read stack. Then we likewise calculated the isoforms. The analyses were per- formed for 3’ and 5’ mature miRNAs separately. Below the box plots, the respective isoform is presented schematically. The horizontal black lines Fig. 2. Panel (A) Distribution of coverage/samples against percentage of mark the canonical form. The dots above the isoform flag those with 5’ modi- human genome covered. With 1% of all samples, we cover almost 1% of the fications, those below the isoforms flag those with 3’ modifications. The iso- human genome (represented by the dot). Panel (B) Length distribution of the forms are sorted with respect to decreasing median frequency and only the high confidence regions. There is a clear peak at 18–22 nt, which falls in the top 20 are shown. Higher abundant miRNA isoforms are dominated by modi- known length distribution of mature miRNAs (shaded region). In the peak fications at the 3’ end. Panels (B) and (C) represent examples for miRNAs region we see an enriched fraction of regions where no annotation is known with few isoforms (panel B) and many isoforms (panel C) yet. We also added the length distribution of selected other RNA entities on top of the plot for completeness. However, only piRNAs and miRNAs have For the other modifications there seems to be no observable bias. lengths distributions that match the observed peak Figure 4B shows the mean relative expression of the observed muta- tions per position for all known miRNAs of miRBase v21. A list normalized reads of this miRNA. Interestingly, no other sequence in containing all detected mismatches including their absolute number the human genome matched the mutated sequence (with up to one of reads and relative RPMMM frequency is provided in mismatch), minimizing the possibility of a false detection. This Supplementary Table S4. mutation could be the result of an adenosine to inosine RNA editing, since the inosine is reported as guanine during sequencing. We also 3.6 Discovery of new miRNAs detected the same A->G mutation for the other arm, hsa-miR-376a- As described earlier and highlighted in Figures 1A and 2B, not all 3p, at the sixth base. Here, the variant was observed even in 67% of reads match to known RNA resources. Respective regions that are the total reads. This variability is not limited to a certain cell type or covered but are not annotated yet contain potentially novel tissue: the miRNA was detected with at least 10 reads per sample in miRNAs. We applied the algorithms implemented in miRMaster a total of 5744 samples. The high frequency of these modifications (Fehlmann et al., 2017) to predict new miRNA candidates. From its and their expression across a large number of samples suggest that basic principles the approach is similar to miRDeep2 (Friedla ¨ nder the RNA editing of this miRNA is important in a wide range of dis- et al., 2011). The very large number of samples however required a eases or tissue contexts, confirming and extending previous findings completely re-implemented and optimized version in Cþþ to facili- by other researchers (Kawahara et al., 2007; Zheng et al., 2016). tate the joint analysis of the billions of reads in reasonable comput- When comparing the relative frequencies on the level of mature arms, we can see that the most abundant modifications at the 3’ end ing time. From all sequencing reads, we predicted a total of 135 290 seem to be slightly more frequent for the 3’ arm than for the 5’ arm. new miRNA candidates. It is expected that these contain many false 1626 T.Fehlmann et al. Fig. 5. Frequency of known and predicted precursor clusters across the differ- ent human chromosomes. A cluster was defined as a region containing at least two precursors with a distance of at most 10 kb between the middle positions of at least two members two precursors with a distance of at most 10 kb between the middle positions of at least two members. Altogether, 1802 clusters with an average of 2.35 precursors were observed. The chromosomes with the largest number were the largest chromosomes: Chromosome 1 (176 clusters) and Chromosome 2 (131 clusters). In contrast, Fig. 4. Panel (A) details the frequency of the different variants in the 3’ and 5’ mature form of the miRNAs. The variants are shown relative to the 3’ (T) or 5’ Chromosome 21 (18 clusters) and Chromosome 13 (25 clusters) had (F) end of the miRNAs. The variants are sorted according to decreasing significantly fewer representatives (Fig. 5). This figure shows that median frequency and only the top 20 are shown. Panel (B) shows the fre- the clusters are not uniformly distributed across the genome and quency of different single nucleotide variants across the position in the only partially reflect the chromosome sizes. Although Chromosome miRNA 4 with 190 million bases contains 62 clusters (1 precursor per 3 million bases), and Chromosome 13 (110 million bases) contains positives. Thus, in a first step to reduce the number of potential pre- 25 precursors (1 precursor per 4.4 million bases), especially cursors and to increase the precision (the fraction of true positive Chromosomes 17 (80 million bases, 121 clusters, 1 cluster per prediction on all positive predictions), we matched the predicted 0.66 million bases) and 19 are enriched for miRNA clusters (60 precursors to the previously annotated high confidence regions. This million bases, 100 clusters, 1 cluster per 0.8 million bases). The analysis yielded 17 400 miRNA candidates overlapping with our full list of clusters with regions, and the number of known as well as reliable regions that are not equal or similar to known ncRNAs. predicted miRNAs is presented in Supplementary Table S5. These candidates are available as GFF3 in Supplementary Material S2 and are stored in our new online repository miRCarta (http:// 3.8 Validation of new miRNAs www.ccb.uni-saarland.de/mircarta) (Backes et al., 2017). MiRCarta As described earlier miRNA candidates can only be considered real (v1.0) is a comprehensive database that allows for browsing and fil- miRNAs once they have been experimentally validated. Among the tering the miRNA candidates from this article, the candidates from core criteria for validating a miRNA is to report expression by using the publications of Backes et al. (2016) and Londin et al. (2015), as a hybridization-based technique. We consider the detection of proc- well as the latest miRBase data. In addition, it visualizes the expres- essed mature forms from cloned precursors via Northern Blotting as sion data from the 18 035 samples as pileup plots for the stem-loops gold standard for experimental validation. However, due to the to facilitate the assessment if the expression profile belongs to a true large number of candidates we performed a first validation step via positive finding. To have a consistent naming of the novel candi- a custom microarray. Since the predicted miRNAs could be influ- dates and known miRNAs we implemented a new scheme. This enced by technological bias, e.g. the library preparation or sequenc- scheme, which is detailed in the Methods section, is also suited to unify future findings and is valid across species. In brief, identical ing, we selected a subset of 11 877 miRNAs containing the annotated miRNAs from miRBase v21 and further candidates, with mature miRNAs have the same identifier independent of the species the focus on blood miRNA candidates. Using these candidates, we (e.g. m-17) and the precursors are named by an organism tag (simi- lar to miRBase) followed by the 5’ and 3’ mature miRNAs they con- built a custom microarray and hybridized the arrays with plasma sist of (e.g. hsa-17-22 contains the 5’ mature miRNA m-17 and the and PAXGene blood pools, a reference sample, brain, kidney, liver, 3’ mature miRNA m-22). Although this new naming scheme is the testis and heart tissue samples. By this amplification free procedure, primary identifier in miRCarta, the well-known miRNA name and we measured signals for 1146 (44%) known miRNAs and 3151 miRBase ID is contained for all currently available miRNAs. (34%) miRNA candidates. The results of the microarray analysis are presented as heat map in Figure 6A. The pileup plot and secon- dary structure of one novel predicted precursor is shown exempla- 3.7 Genomic miRNA clusters Frequently, miRNA genes are not isolated and uniformly distributed rily in Figure 6B. Another example of a predicted precursor is shown across the genome but accumulate in clusters. Thus, we searched for in Supplementary Figure S3. This precursor is already annotated in such clusters in the human genome for the set of known and pre- two other species (oan-mir-2985 and tgu-mir-2985-1), but not yet dicted miRNAs. A cluster was defined as a region containing at least as human miRNA precursor. High-res. map of the human small nc transcriptome 1627 detect the processed/mature forms of miRNAs on Northern Blots. SCLC Expression For 59 of 103 tested candidates that have previously not been Control reported by Northern Blots, bands in the expected size of around 22 nucleotides were observed (manuscript in preparation). This set is to our knowledge the largest collection of miRNAs that have jointly been validated using respective experimental methods. The precur- sor presented in Figure 6B was among the ones for which both miRNAs was validated. 3.9 Evolutionary conservation of new miRNAs MiRNAs are frequently highly conserved. To verify if we can also observe this for our novel candidates, we mapped the sequences of B the candidates without mismatches against the genomes of 148 200K organisms that are also contained in miRCarta and counted a hit for 150K an organism if we could find the sequence at least one time in its 100K genome. We find about 85% of our candidates in at least one other 0 5000 10000 15000 organism than Homo sapiens. A novel candidate occurs on average 50K #Experiments in 4.5 organisms. The sequences of the top five miRNA candidates AACCAGCUCUGGUUCCCCAGCCUACUGGAGGAUAAGAGGAUAUAAAGGUCUCUUAUCCUCCAGUAGACUAGGGAGCCAGAGCUGGUAAU (sorted by the sum of hits in different organisms) can even be found on average in 37 species. The miRNA candidate with the most hits (m-7214) was detected in 58/148 organisms. These 58 organisms Fig. 6. Panel (A) describes the first-pass validation of human (candidate) are from various taxonomy classes such as Mammalia, Insecta, miRNAs by microarrays. Cluster heat map of all 4297 mature (candidate) Amphibia etc. For the remaining top five candidates, the variety in miRNAs that showed signals in high-quality RNA samples based on amplifi- species is smaller such that we can find the lowest common ancestor cation free hybridization. Panel (B) presents a representative example with Craniata for these taxa at subphylum level. Furthermore, some of the mapping distribution as presented in Figure 3. In addition to the pileup our candidates are already annotated in miRBase for other organ- plot, the secondary structure is shown. Panel (C) shows the result of the human blood miRNA array hybridized with lung cancer (blue) and control isms than human. For example, m-3155 corresponds to the known samples (orange). Results are presented as heat map resulting from hierarch- miRNAs mmu-miR-3085-3p and rno-miR-3085. Of course, these ical clustering with dendrograms on top (clustering of samples) and at the left findings only illustrate that identical sequences are contained in side (clustering of miRNAs) other organisms, not that they necessarily function as miRNAs. Among the specimens with the most complex miRNomes were blood samples. Since they can be gathered minimally invasive, blood 4 Discussion and conclusions samples represent also an important source of new biomarkers for Since the advent of next-generation sequencing, thousands of small various human pathologies. Many studies have shown that known RNA-sequencing datasets have been created and also been partially miRNAs can be differentially expressed comparing disease and con- deposited in public databases. However, depending on which pipe- trol samples. In an additional validation step, we therefore asked lines were used for the evaluation of the datasets in the different whether we can also observe expression differences for novel blood study setups, these are not directly comparable to each other. To miRNA candidates. To this end, we built a microarray containing make maximal use of the available datasets, a consistent analysis of 2305 (candidate) miRNAs from the first validation that were the different samples with the same pipeline is necessary. The aim of expressed in blood. The resulting human blood miRNA microarray our study was to remove redundancies due to different study setups can be used to facilitate the discovery and validation of circulating and to provide a reliable map of high-quality small RNA annota- miRNAs for many human pathologies. Among the diseases with the tions, particularly for miRNAs. largest effect size and the highest reproducibility in miRNA bio- Starting with a collection of 24 554 human small RNA NGS markers is lung cancer. With the new array, we hybridized 53 indi- samples, we performed stringent quality controls, which left us with viduals, 25 controls and 28 SCLC patients. Following adjustment 18 035 usable samples for down-stream analysis. for multiple testing, 695 miRNAs had an adjusted t-test P-value Depending on the coverage thresholds, up to 64% of the genome below 0.05 and were considered statistically significant. Although are covered (1-fold coverage). However, this number does not reflect the six most significant features are known from the miRBase, the true complexity of the human non-coding transcriptome but is already the seventh marker was a new candidate miRNA (raw and affected by different sources of bias. By defining those regions that 9 6 adjusted P-value 10 and 10 ). Altogether, 457 candidate are covered by at least 1% (180) of all samples as reliable, we still miRNAs that are not included in the miRBase were among the 695 obtained about 900 000 such regions with variable lengths. significant markers for SCLC. As the cluster heat map in Figure 6C Although they contain true positive sncRNAs, we still expect a rea- details, the most significant markers were predominantly down- sonable number of false positive hits. Since solid low-throughput regulated in SCLC patients. These results suggest that new miRNAs validation of such large sets of potential non-coding miRNAs is not are not only detectable by hybridization-based techniques but also feasible, we performed a large first pass validation by using microar- bear a substantial diagnostic information content. rays. We were able to detect expression signals for 34% of our novel Of course, the hybridization on arrays does not replace a thor- candidates in high-quality samples, minimizing the risk of false posi- ough analysis of the expression and biogenesis of miRNAs followed tives e.g. by degradation of mRNAs. Further, we found that some of by Northern Blotting. We thus cloned the precursors of miRNAs these miRNAs have considerable potential as biomarkers in SCLC. from different miRBase versions and new candidates and tried to Still, these are not necessarily functional miRNAs but remain #Total reads Plasma Blood Kidney Reference Brain Lung Heart Tess Reads supplied (%) 1628 T.Fehlmann et al. Fehlmann,T. et al. (2016) cPAS-based sequencing on the BGISEQ-500 to candidates until a detailed validation has been carried out. Thus, we explore small non-coding RNAs. Clin. Epigenet., 8, 123. performed such a validation for selected candidates using Northern Friedla ¨ nder,M.R. et al. (2011) miRDeep2 accurately identifies known Blotting. and hundreds of novel microRNA genes in seven animal clades. Nucleic With our sncRNA study, we performed to our knowledge the Acids Res., 40, 37–52. most complete analysis of human sncRNAs with a focus on Guo,L. and Chen,F. (2014) A challenge for miRNA: multiple isomiRs in miRNAs. The set of reported reliable regions, which is covering miRNAomics. Gene, 544, 1–7. 0.8% of the human genome, likely contains a very substantial frac- Guo,L. et al. (2014) A genome-wide screen for non-template nucleotides tion of all small non-coding elements in the human genome. and isomiR repertoires in miRNAs indicates dynamic and versatile microRNAome. Mol. Biol. Rep., 41, 6649–6658. Harrow,J. et al. (2006) GENCODE: producing a reference annotation for Funding ENCODE. Genome Biol., 7 (Suppl 1), S4 1–S9. Hecksteden,A. et al. (2016) miRNAs and sports: tracking training status and The different aspects of this work have been supported by the following sour- potentially confounding diagnoses. J. Transl. Med., 14, 219. ces: The generation of the own datasets was supported by Siemens Heo,I. et al. (2012) Mono-uridylation of pre-microRNA as a key step in the Healthineers. The in silico analysis was supported by the EU FP7 project biogenesis of group II let-7 microRNAs. Cell, 151, 521–532. BestAgeing and the Michael J. Fox Foundation. The validation of new Katoh,T. et al. (2009) Selective stabilization of mammalian microRNAs by 3’ miRNAs by microarrays was supported by Hummingbird Diagnostics. The adenylation mediated by the cytoplasmic poly(A) polymerase GLD-2. validation of new miRNAs by Northern Blots was supported by the Michael Genes Dev., 23, 433–438. J. Fox Foundation. Kawahara,Y. et al. (2007) Redirection of silencing targets by adenosine-to-inosine editing of miRNAs. Science, 315, 1137–1140. Conflict of Interest: none declared. Kodama,Y. et al. (2012) The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res., 40, D54–D56. Kowalczyk,M.S. et al. (2012) Molecular biology: RNA discrimination. References Nature, 482, 310–311. Akhtar,M.M. et al. (2016) Bioinformatic tools for microRNA dissection. Kozomara,A. and Griffiths-Jones,S. (2014) miRBase: annotating high confi- Nucleic Acids Res., 44, 24–44. dence microRNAs using deep sequencing data. Nucleic Acids Res., 42, Backes,C. et al. (2017) miRCarta: a central repository for collecting miRNA D68–D73. candidates. Nucleic Acids Res., doi: 10.1093/nar/gkx851. Londin,E. et al. (2015) Analysis of 13 cell types reveals evidence for the expres- Backes,C. et al. (2016) Prioritizing and selecting likely novel miRNAs from sion of numerous novel primate- and tissue-specific microRNAs. Proc. Natl. NGS data. Nucleic Acids Res., 44, e53. Acad. Sci. USA, 112, E1106–E1115. Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a Ludwig,N. et al. (2016) Distribution of miRNA expression across human tis- practical and powerful approach to multiple testing. J. Roy. Stat. Soc. Ser. B sues. Nucleic Acids Res., 44, 3865–3877. Methodol., 57, 289–300. Vitsios,D.M. et al. (2016) Large-scale analysis of microRNA expression, Camacho,C. et al. (2009) BLASTþ: architecture and applications. BMC epi-transcriptomic features and biogenesis. Nucleic Acids Res., 45, Bioinformatics, 10, 421. 1079–1090. Cancer Genome Atlas Research, N. (2008) Comprehensive genomic character- Yates,A. et al. (2016) Ensembl 2016. Nucleic Acids Res., 44, D710–D716. ization defines human glioblastoma genes and core pathways. Nature, 455, Zhao,Y. et al. (2016) NONCODE 2016: an informative and valuable data 1061–1068. source of long non-coding RNAs. Nucleic Acids Res., 44, D203–D208. Fehlmann,T. et al. (2017) Web-based NGS data analysis using miRMaster: a Zheng,Y. et al. (2016) Accurate detection for a wide range of mutation and large-scale meta-analysis of human miRNAs. Nucleic Acids Res., 45, editing sites of microRNAs from small RNA high-throughput sequencing 8731–8744. profiles. Nucleic Acids Res., 44, e123.
Bioinformatics – Oxford University Press
Published: Dec 21, 2017
You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
Access the full text.
Sign up today, get DeepDyve free for 14 days.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.