Background: The main bottleneck for genomic studies of tumors is the limited availability of fresh frozen (FF) samples collected from patients, coupled with comprehensive long-term clinical follow-up. This shortage could be alleviated by using existing large archives of routinely obtained and stored Formalin-Fixed Paraffin-Embedded (FFPE) tissues. However, since these samples are partially degraded, their RNA sequencing is technically challenging. Results: In an effort to establish a reliable and practical procedure, we compared three protocols for RNA sequencing using pairs of FF and FFPE samples, both taken from the same breast tumor. In contrast to previous studies, we compared the expression profiles obtained from the two matched sample types, using the same protocol for both. Three protocols were tested on low initial amounts of RNA, as little as 100 ng, to represent the possibly limited availability of clinical samples. For two of the three protocols tested, poly(A) selection (mRNA-seq) and ribosomal-depletion, the total gene expression profiles of matched FF and FFPE pairs were highly correlated. For both protocols, differential gene expression between two FFPE samples was in agreement with their matched FF samples. Notably, although expression levels of FFPE samples by mRNA-seq were mainly represented by the 3′-end of the transcript, they yielded very similar results to those obtained by ribosomal-depletion protocol, which produces uniform coverage across the transcript. Further, focusing on clinically relevant genes, we showed that the high correlation between expression levels persists at higher resolutions. Conclusions: Using the poly(A) protocol for FFPE exhibited, unexpectedly, similar efficiency to the ribosomal- depletion protocol, with the latter requiring much higher (2–3 fold) sequencing depth to compensate for the relative low fraction of reads mapped to the transcriptome. The results indicate that standard poly(A)-based RNA sequencing of archived FFPE samples is a reliable and cost-effective alternative for measuring mRNA-seq on FF samples. Expression profiling of FFPE samples by mRNA-seq can facilitate much needed extensive retrospective clinical genomic studies. Keywords: RNA sequencing, FFPE, Poly-a capturing, Ribosomal depletion, Breast cancer Background of frozen tumor samples that were obtained from Gene expression profiling of tumor samples is a power- patients with long-term clinical follow up [1, 2]. On the ful technique for identifying prognostic and predictive other hand, there exist large diagnostic repositories of biomarkers. To date, all large scale transcriptomic profil- archived tissues, with matched clinical records on ing of cancer were performed on frozen tissue samples disease progression and outcome, which could poten- and required international efforts; nevertheless, analysis tially comprise invaluable resources for comprehensive of the results was limited by availability of the numbers genomic studies of cancer. Exploiting archived reservoirs could allow unprecedented large-scale retrospective in- * Correspondence: Maya.firstname.lastname@example.org vestigations of longitudinal samples, focusing on specific Chaim Sheba Medical Center, Cancer Research Center, 5262100 subtypes or ethnic groups of interest, without the need Tel-Hashomer, Israel for long-term prospective collection of samples. Such Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 2 of 11 informative datasets could facilitate biomarker discovery Results and personalized drug development. Three pairs of matched FF/FFPE tumor samples were col- Archived tissues are preserved as formalin-fixed lected, with a moderate archival time of about 4–5years paraffin-embedded (FFPE) blocks, which is the standard (T1-T3). To compare the efficiency of the various RNA- preparation format for pathological diagnosis. The qual- seq protocols we used the same library preparation ity of FFPE-extracted RNA is highly variable due to frag- protocol for both FF and FFPE samples. Three protocols mentation of RNA transcripts, chemical modifications were tested: illumina Truseq RNA after poly(A) selection and cross-linking of nucleic acids and proteins [3, 4]as (mRNA-seq); Truseq after ribosomal depletion (Ribo- well as variability in tissue handling and processing . Zero); and Nugen Ovation with Ribosomal depletion. As These factors are influenced by various methods for sometimes archived biopsies are limited in size, we also tissue fixation and by archiving time of tissue samples. compared 2 different amounts of starting material, 500 Hence, due to the relative low quality of FFPE-derived and 100 ng of RNA, to assess the lower starting material RNA, applying standard protocols for whole transcrip- limitations of the protocols. In addition, to evaluate the tome analysis is a challenging task. limitation of the mRNA-seq method for highly degraded In recent years the field has moved from microarray RNA, we analyzed additional 3 FFPE tumor samples based profiling to RNA-sequencing (RNA-seq), an archived for more than 10 years (with no matching FF sam- unbiased method which provides greater analytical depth ples; T4-T6). RNA integrity was high for the FF samples and increased dynamic range for gene expression meas- (RIN number 6–7) and low for the FFPE samples (1.8 or urement [6, 7]. RNA-seq protocols commonly include below detection limit) (see Additional file 1:Table S1). steps for enrichment of exonic RNA sequences and removal of ribosomal RNA. The conventional protocol, Transcriptome mapping and coverage known as mRNA-Seq, is based on capturing polyadeny- Sequencing reads were aligned to the human genome to lated (poly(A)) RNA transcripts with oligo(dT) primers, assess the fraction of mapped transcripts. Percentage of thereby depleting the highly abundant ribosomal RNA uniquely mapped reads varied between the sample types (rRNA) and all other non-poly(A) fragments . The in favor of the FF samples (Table 1 and Additional file 2: concern in utilizing mRNA-Seq for FFPE samples is that Table S2). Older FFPE samples exhibited lower fraction degraded or modified poly(A) tail of transcripts will not of aligned reads, probably due to lower RNA quality, perfectly anneal to oligo(dT) and the captured short however the fraction of exonic reads in the old samples transcripts will not comprise an unbiased, accurate was relatively high, so that the number of exonic reads representation of the transcriptome. Other protocols sufficed for gene expression analysis (Table 1 and eliminate rRNA by capturing these highly abundant Additional file 2: Table S2). The fraction of unmapped transcripts and removing them by either magnetic beads reads obtained from the NuGEN Ovation for the FFPE (i.e. RiboZero by Illumina) or by enzymatic digestion, (i. samples protocol was very high (~ 60%), resulting in very e., RNAse H or Ovation by NuGEN) [9–12]. low percentage of exonic reads out of the total number Several previous studies evaluated the feasibility of of reads (~ 4%); therefore, we excluded this protocol RNA-seq to reliably profile gene expression comparing from further analysis (Table 1 and Additional file 2: matched FF and FFPE samples. However, most studies Table S2). mRNA-seq and RiboZero exhibited compar- used the gold-standard poly(A) mRNA-seq protocol for able efficiency of rRNA removal for FFPE samples but FF samples compared to ribosomal-depletion protocol showed high variance for the FF samples (1.3% versus for FFPE [10, 12–14], out of the assumption that 24.7%, see Table 1). mRNA-seq will not optimally capture degraded mRNAs. To accurately and sensitively estimate gene expression, Such comparisons mix effects of the different sample high coverage of exonic regions is needed. Comparing types with those of the protocol used. In the current the fractions of exonic reads out of uniquely mapped study we perform an unbiased evaluation of RNA-seq of reads demonstrated that mRNA-seq was superior to archived tumor tissues by comparing the same library RiboZero (Fig. 1a). While both protocols showed lower preparation methods for both FF and FFPE matched fraction of exonic regions for FFPE compared to FF, tumor samples and for small amounts of total RNA mRNA-seq resulted in a total of 42–61% exonic regions starting material. In addition to comparing the coverage compared to 2–14% for RiboZero. Larger fractions of and mapping parameters of FF and FFPE, we addressed reads were mapped to intronic regions in the RiboZero the question of the reproducibility characteristics of the libraries, and this significantly varied between samples. sample types and the methods, when trying to estimate To estimate the total number of reads one needs in differential gene expression, with particular focus on order to get expression data for a satisfactory number of quantifying expression of clinically relevant genes, de- genes from each protocol, we compared the fraction of rived from the METABRIC dataset . exonic reads out of total reads (Fig. 1b). Again, for all Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 3 of 11 Table 1 Mapping percentages for the different RNA-seq methods and samples mRNA-seq RiboZero Nugen FF FFPE FFPE old FF FFPE FF FFPE n= 6 n= 5 n= 3 n= 3 n= 3 n= 3 n= 6 Exons 58 (51–63) 29.2 (16.6–38.8) 20.1 (13.8–32.2) 21.4 (14.9–30) 8.4 (0.6–12.7) 20.3 (16–22.7) 3.9 (0.8–7) Intronic/intergenic 25 (20–32) 28.2 (22.3–32.6) 23.3 (18.9–27.4) 44.4 (35.1–55.5) 70.2 (61.7–75.2) 65.1 (63.1–67.9) 32.5 (24.5–43.1) rRNA 1.3 (0.9–1.6) 12.7 (4.17–17.8) 6.4 (2.7–12.9) 24.7 (1.15–45.3) 5.3 (0.9–13.4) 0.07 (0.05–0.08) 1.2 (0.3–2.2) Multiple alignment 11.9 (9.5–15) 5.9 (3.8–7.2) 4.2 (2.63–6) 6.3 (2–10) 6.5 (6.2–6.6) 9.6 (8.7–10.3) 3.6 (2.7–4.7) Unmapped 3.8 (3.7–4) 24 (10.9–44.9) 45.7 (31.6–54) 3 (2.7–3.3) 9.6 (5.3–17.8) 4.8 (4.2–6.1) 58.7 (43–67.8) Values presented as mean percentage (range) n = number of measurements (#of tumor samples X number of initial RNA amounts) protocols the percentages of reads mapping to exons FFPE both for 500 and 100 ng. Overall, it appears that (out of the total number of reads) were lower in FFPE the mRNA-seq protocol provides consistent correlation RNA libraries compared to FF (about 1/2). However, between matched FF and FFPE total gene expression, focusing on FFPE samples reveals that while the exonic even for samples that the RiboZero protocol failed to fraction is around 30% for the mRNA-seq protocol, only show with similar number of reads. These results about 10% of the total reads remain for analysis in the suggest that FFPE samples can be a reliable alternative RiboZero protocol (Fig. 1b). This key point is further for FF samples by mRNA-seq also at low initial amounts demonstrated when calculating the estimated total of 100 ng FFPE-derived RNA. number of reads required to detect approximately 11,000 genes (threshold is set according to the saturation Differential gene expression of the curves in Fig. 1c). While with mRNA-seq the total The ideal quality check for RNA-seq data is to evaluate number of reads needed to reach this endpoint is 26–42 its reliability to quantify differential gene expression. To million for FFPE samples, depending on their quality estimate this, we compared the fold changes (FC) be- and age (range between ~ 4 to ~ 10 years old), the tween two different FFPE samples to the FCs between RiboZero protocol requires at least 70 million reads for their matched FF samples (Fig. 3a-b). Both mRNA-seq the ~ 4 years old samples (Fig. 1c). and RiboZero protocols yielded high correlation between As the mRNA-seq protocol captures transcripts based the FCs of two FFPE tumor samples and the FCs on their poly(A) tail, a 3′-end bias is expected, predom- obtained from their matched FF samples. Since the inantly for degraded transcripts. To check the extent to RiboZero protocol failed for tumor T3, we were able to which the library preparation protocol generates a 3′- compare only the differences between tumors T1 and T2, bias, we plotted the coverage along the normalized tran- whilefor themRNA-seq protocolwe were abletocompare script length (Fig. 1d). A small 3′-end bias is evident for all three comparisons between T1-T3 (Additional file 5: FF samples in both protocols. In the RiboZero protocol, Figure S2). To evaluate the extent of agreement in dif- no difference was observed between FF and FFPE. For ferential expression between FFPE and FF samples, mRNA-seq libraries, the 3′-end bias is much more evi- we calculated the fraction of genes with FC ≥2be- dent for FFPE samples, indicating that with this protocol tween two FFPE samples, which change in the same fragmented RNA transcripts are mainly represented at direction in the matched FF samples (see methods for their 3′-end. details). As demonstrated in Fig. 3c, the percentage of agreement between FFPE samples and matched FF Total gene expression correlation samples is comparable between mRNA-seq and Ribo- The correlation of gene expression between matched FF Zero protocols (around 80%), with a slight advantage and FFPE samples for mRNA-seq and RiboZero proto- for the latter. We further checked the dependence of cols are shown in Fig. 2a-c. For all the three tumors, the this percentage of genes with FC in agreement on the mRNA-seq protocol resulted in high correlation between value of the FC threshold used (between 1 to 10 fold, FF/FFPE pairs (correlation coefficient around 0.9) Fig. 3d), yielding similar results. Notably, in addition whereas the riboZero protocol yielded lower correlations to the technical differences between the protocols, the for tumors T1 and T2 and no correlation for tumor T3, observed variability between differential gene expres- which failed exome mapping. A color-coded correlation sions of matched FF and FFPE samples can be due to matrix comparing all methods and the various initial expression noise and tumor heterogeneity, as the two RNA amounts is shown in Fig. 2d-f. mRNA-seq protocol sample types were taken from two regions of the resulted in highest correlation between matched FF and tumor. Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 4 of 11 bc Fig. 1 (See legend on next page.) Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 5 of 11 (See figure on previous page.) Fig. 1 Mapping and coverage information. (a) The percentage of reads mapped to exons and to introns/intergenic regions out of the total number of uniquely mapped reads per sample. Color-code: purple for mRNA-seq protocol, orange for RiboZero protocol and blue for NuGEN ovation protocol. Black: FF samples (T1-T3), gray: ~ 4 years old FFPE samples (T1-T3), and light gray: ~ 10 years old FFPE samples (T4-T6, done only with mRNA-seq protocol). The amount of starting material, in ng, is indicated below the bars corresponding to each sample. (b) Box-plot of the percentages of reads uniquely mapped to exons out of total number of reads, for each sample type (FF mRNA-seq (T1-T3; n = 6); ~ 4 years old FFPE mRNA-seq (T1-T3; n = 5); ~ 10 years old FFPE mRNA-seq (T4-T6; n = 3); FF RiboZero (T1-T3; n= 3); FFPE RiboZero (T1-T3; n= 3); FFPE NuGEN ovation (T1-T3; n= 6)). n= number of measurements (#of tumor samples x number of initial RNA amounts). (c) The estimated number of detected genes as a function of the total number of reads for each sample type (FF mRNA-seq samples: purple solid line; FFPE mRNA-seq samples: purple dashed line, shown separately for ~ 4 and ~ 10 years old samples; FFPE RiboZero samples: orange dashed line). The horizontal dashed line represents the estimated coverage required for each sample type to get ~ 11,000 genes (see the numbers at the bottom (15 M, etc) for the estimated number of reads required for each sample type, and see methods for more details). (d) The average coverage along the relative genomic region from the 5′ end (Transcription Start Site) to the 3′ end (Transcription End Site) for each sample. mRNA-seq protocol at the left (purple; solid line for FF samples (T1-T3) and dashed line for FFPE samples (T1-T6)). RiboZero protocol to the right (orange; solid line for FF samples (T1-T3) and dashed line for FFPE samples (T1-T3)) Biological and clinical validity of gene expression levels of the PAM50 genes (Fig. 4b). At this To estimate the biological and clinical significance of the resolution the high correlation between the expression results to breast cancer, we tested the correlation be- levels of both methods for these clinically relevant tween the expression levels of the various protocols for genes is more noticeable. relevant gene lists. First, we focused on 701 genes that Additional biological relevance of the expression data is are differentially expressed between normal breast and demonstrated by comparing the immunohistochemistry breast cancer, derived from the METABRIC dataset  score of estrogen receptor to its expression level (Fig. 4c). (Fig. 4a). Comparing the standard FF mRNA-seq to the There was a very good agreement between the ESR1 FFPE RiboZero, which is the comparison done so far in expression level and its pathological score, even for the other studies, results in high correlation for these genes older FFPE samples from ~ 10 years ago (T4-T6; marked (cc = 0.92) (left, orange). Notably, a general shift of by stars in Fig. 4c). higher expression in the mRNA-seq is presented by Importantly, the correlation between the mRNA-seq higher fraction of genes below the diagonal (× 2.5 fold). and the RiboZero method is very high despite the Comparing expression levels between FF and FFPE sam- difference in distribution of the reads across the genes. ples, both with the mRNA-seq protocol (middle, purple) A representative plot of the reads obtained for each removes this shift (fractions of genes below and above method is illustrated at the chromosomal map of the diagonal are similar) and results in higher correlation ESR1 gene (Fig. 4d). It is clear that the number of reads levels (cc = 0.94). Here, the method is the same but the at the 3′-end of the transcript is much higher for the small variability may stem from expression noise of the mRNA-seq protocol than the RiboZero protocol and this two adjacent tumor regions. The ideal comparison is much more evident in the FFPE sample. The RiboZero between the two methods applies both to the same library is distributed across the entire exome as well as FFPE sample (right, grey) and results in high correl- at intergenic regions. Nevertheless, as shown in Fig. 4a-c, ation (cc = 0.94). But similarly to the first comparison expression levels are very similar despite the difference in between the 2 methods, the mRNA-seq protocol re- distribution of reads across the gene. sults in higher expression levels versus the RiboZero, as shown by higher fraction of genes bellow the diag- onal. This indicates that the higher expression levels Discussion are related to the mRNA-seq method rather than to Genome profiling datasets of many tumors types are the the sample quality, as could be assumed on the basis basis for a large number of studies, in both basic re- of the first comparison, performed in previous studies. search and clinically relevant diagnostic developments. This is further emphasized in the comparison of the Most studies are performed on fresh frozen biopsies, two sample types by the same mRNA-seq method, assuming that the much more readily available FFPE showing no advantage for the FF sample (similar samples are less suitable for large-scale genomic profiling. amount of genes bellow and above diagonal; middle, The availability of FFPE tumors at pathological archives purple). Notably, the high correlation between the enables researchers to expand the current expression methods is evident across all expression levels, except databases to specific tumor subtypes, to perform extensive at the very low expressed genes below 5 (log2 scale). retrospective analyses and to execute longitudinal studies A further zoom-in into the correlation between the on individual patients over the course of disease. In this methods is presented by comparing the expression study, we demonstrate that standard mRNA-seq of FFPE Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 6 of 11 ad be cf Fig. 2 Comparisons of matched FF and FFPE samples prepared with different RNA-seq protocols. (a) Scatter plot of gene expression levels as measured from FFPE sample T1 by mRNA-seq protocol (purple; left) and RiboZero protocol (orange; right), compared to the gene expression data obtained from the FF sample of the same tumor . Gene expression data presented in log2 scale; r-square and correlation coefficients are presented for each plot. Total number of reads for each library is indicated at the x and y labels (M = million) (b)Same as(a) for the T2 samples (c)Same as(a)for T3. (d) Pearson correlation coefficient matrix between T1 gene expression data as measured by the different protocols: mRNA-seq in purple, RiboZero in orange and NuGEN ovation in blue, on the FF (black) and FFPE (gray) samples (colorbar at the bottom), measured for the indicated RNA quantities (in ng). The colorbar to the right is for the correlation coefficient values from 0 to 1. (e)Same as (d) for T2 (f)Same as (d)for T3 samples is a consistent and efficient method for basic of RNA-seq for the latter [10, 12–14, 16–18]. However, transcriptome profiling. most studies compared the standard poly(A) mRNA-seq RNA-seq is increasingly gaining ground as the preferred protocol for FF samples to ribosomal depletion, used for unbiased method for genome profiling. Indeed, several FFPE samples [10, 12–14]. Alternatively, FF samples were previous studies have already showed high correlation be- compared to matched FFPE using ribosomal depletion tween matched FF and FFPE samples, indicating feasibility methods for both sample types [16–18] (Additional file 3: Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 7 of 11 a b c d Fig. 3 Comparison of fold-changes measured for FFPE samples versus matched FF samples. (a) Scatter plot for the expression fold changes (log2 scale) of ~ 23,000 genes measured in T1 vs. T2, obtained from FF samples (x-axis) compared to matched FFPE samples (y-axis) by mRNA-seq protocol (purple). R and correlation coefficient are presented at the plot. (b) Scatter plot for the expression fold changes (log2 scale) of ~ 23,000 genes measured in T1 vs. T2, obtained from FF samples (x-axis) compared to matched FFPE samples (y-axis) by RiboZero protocol (orange). R and correlation coefficient are presented at the plot. (c) For each comparison in (a)-(b) the percentages of differentiating genes by FFPE samples (2fold) that are in agreement or disagreement with the FC of matched FF samples are presented (see methods for more information). In addition, the same percentages are presented for the FC comparing T1 vs. T3 and T2 vs. T3 for the mRNA-seq protocol. Purple bars represent the mRNA-seq protocol and orange bar for RiboZero protocol. (d) Same as (c), presented for fold-change threshold values (imposed on the FFPE samples) varying between 1 to 10 (x-axis). The y-axis shows the percentage of differentiating genes in FFPE that changed in the same direction as in matched FF samples. Purple for mRNA-seq protocol and orange for RiboZero protocol Table S3). When FF samples are compared to matched modified transcripts, and therefore gene expression will FFPE, using different protocols for the two sample types, be underestimated in an unknown transcript-dependent it is difficult to interpret the results since difference can be manner. These studies concluded that RiboZero is a caused by both factors – different samples and different satisfactory method for RNA-seq of FFPE, compared to protocols. Nevertheless, the “ideal” comparison, of results the gold standard mRNA-seq of matched FF. But the obtained for matched samples of the two types, using ribosomal-depletion methods frequently resulted in lower mRNA-seq for both, was not yet performed. The reason is fraction of exonic transcripts out of total reads (10–24%, that it was assumed that oligo(dT) primers would not depending on the initial starting material, Additional file 3: reliably capture the partially degraded and chemically Table S3), requiring significantly higher numbers of reads Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 8 of 11 c d Fig. 4 Expression of clinically relevant genes in breast cancer, and comparison to IHC. (a) Scatter plot of the expression levels of a set of 701 differentially expressed genes in the METABRIC dataset, between normal and tumor samples, as measured on T1 FFPE sample by RiboZero (orange; left) or mRNA-seq (purple; middle) vs. their mRNA-seq derived expression in the T1 FF sample. At the right we compare the expression of these genes on the T1 FFPE sample alone, as measured by mRNAs-seq (x-axis) and RiboZero (y-axis). (b) Same as (a) for PAM50 genes (“intrinsic subtype”). ESR1 (Estrogen Receptor) is indicated in the scatter plots. (c) Comparison between the expression levels of Estrogen Receptor (ESR1) as measured by the different RNA-seq protocols on T1-T6 (mRNA-seq in purple, RiboZero in orange, diamond for FF samples, circles for ~ 4 years old FFPE samples, and stars for ~ 10 years old FFPE samples), relative to IHC levels. (d) Chromosomal view of ESR1 with the reads mapped to this location (from IGV). Data shown for T1 FF sample, done with the mRNA-seq protocol; T1 FFPE sample, done with mRNA-seq protocol, and T1 FFPE sample done with RiboZero protocol. For each panel a histogram of the mapped reads to this genomic location is presented. At the bottom the gene model and the chromosomal location are shown (55 M–65 M reads) ([10, 12–14, 16–19]). A large fraction the RiboZero method, used for both tissue types. The re- of reads obtained by the ribosomal-depletion methods map sults indicate that gene expression measurement of FFPE to intronic regions and to unspliced mRNA . tissues using the standard poly(A) protocol is feasible and Here we compared matched FF and FFPE samples using represents similar differential expression as obtained for both the standard protocol for mRNA-seq (Truseq) and FF tissues. Our results are consistent with a study by Beck Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 9 of 11 et al., showing that their customized 3’end sequencing is require 2–3 times more reads to compensate for higher an effective technique for expression profiling of FFPE fraction of intergenic reads. mRNA-seq method has lower samples . library as well as sequencing costs due to higher fraction Although we observed a 3′-end bias in the mRNA-seq of exonic reads. Our results demonstrate that conven- libraries of FFPE compared to the RiboZero libraries, tional transcriptome profiling of FFPE tissues is feasible as this bias was not reflected in the expression levels. Gene an alternative for frozen samples. Although mRNA-seq quantification was highly equivalent for a wide range of libraries are mainly represented by the 3′-end relative to expression levels. Estimation of expression fold changes the uniform distribution of reads by the RiboZero method, between FF and FFPE samples from both protocols was the resulting gene expression levels are highly comparable also very consistent, resulting in high overlap, in differ- and obtained at much lower total numbers of reads. entially expressed genes and their fold changes, between the two methods and sample types. Methods The main advantage of the mRNA-seq protocol over Tumor samples the RiboZero protocol is the higher percentage of exonic FF breast cancer tumor samples, collected for the Sheba reads out of the total number of reads, which enables to Medical Center institutional tumor bank at time of sur- quantify gene expression with much lower coverage, and gery were included (n = 3). Matched FFPE blocks were hence it is a cost-effective approach. In addition, in the obtained from the same resected tumors, archived at the protocol we select the desired genomic regions (i.e. Sheba Pathology Institute (preservation time of 4–5years). exons, mRNA), as opposed to the RiboZero protocol, in All patients had estrogen receptor positive tumors. Three which we deplete some of the undesired regions (i.e. the additional FFPE tumor samples, with longer preservation rRNA). Thus, mRNA-seq requires less preliminary times (~ 10 years), were included. Tissue slides were calibrations (e.g. calibration of the beads required for examined by expert breast pathologists to include a efficient depletion of rRNA in the RiboZero protocol). minimum of 70% cancer cells. The study was approved by Our results suggest that for standard transcriptome the Institutional Review Board (IRB). profiling, mRNA-seq of FFPE is a reliable and cost- effective method. With this said, it is important to keep RNA extraction in mind the limitations of the mRNA-seq methods. FFPE tumor samples were sectioned to 5 μm slices and Since this method sequences mainly the 3’end of the deparafinized at 90 °C for 5 min. Total RNA was transcript, it excludes transcripts that do not have extracted using nucleic acid isolation kit (AllPrep® poly(A) tails. Notably, for the non-coding lincRNAs that Qiagen) according to the protocol instructions, with the are mostly polyadenylated we observed a very good correl- following modifications: DNAse treatment was at 37° for ation between the RiboZero and mRNAseq libraries of the 20 min and the column was incubated for 1 min before FFPE samples (Additional file 6: Figure S3A). Both each washing step with buffer RPE. Samples were eluted methods, however, are not ideal for short non-coding in 30 μL purified water. RNA from FF tumor samples RNAs, such as miRNAs, that require size selection steps, was extracted using Trizol. RNA concentrations were yet the very few detected miRNAs were similarly repre- determined by Qubit™ fluorometer (Thermofisher sented in the two methods (Additional file 6: Figure S3B). scientific) and the RNA quality was determined by meas- In the case of highly degraded samples, mRNA-seq uring the RNA integrity number (RIN) using Agilent will not be ideal for accurate sequence-based discoveries TapeStation. such as identification of novel transcripts, alternative splicing or of single nucleotide polymorphisms. Low Library preparation quality samples can be related to archival time as well as Library preparation and sequencing were performed at to preservation methods. All the FFPE samples in this the Nancy and Stephen Grand Israel National Center for study were of low quality, as measured by the RNA in- Personalized Medicine at the Weizmann Institute of tegrity number. For a relatively moderate time of archiv- Science. Total RNA from FFPE and FF samples was ing, starting material of 100 ng of RNA is sufficient and processed using three protocols: comparable to 500 ng of RNA. For highly degraded samples, such as very old archived samples, the minimal 1. Truseq RNA Sample preparation kit v2 (illumina) amount of RNA is 500 ng or higher. (cat# RS-122-2002) for mRNA-seq libraries. Briefly, polyA fraction (mRNA) was purified from 500 ng Conclusions or 100 ng of total RNA following by fragmentation In summary, when considering methods for transcriptome and generation of double stranded cDNA. Then, profiling, practice and cost are important factors. Riboso- end repair, A base addition, adapter ligation and mal depletion protocols have higher cost per sample and PCR amplification steps were performed. Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 10 of 11 2. Truseq Stranded total RNA with Ribo-Zero Gold. Estimation of total number of reads required for each Library Protocol: TruSeq® Stranded Total RNA protocol and sample type (Illumina) (Cat # RS-122-2301, RS-122-2302) for The number of uniquely mapped reads to exonic regions ribosomal-depletion libraries. Briefly, After rRNA varied significantly between samples and protocols (from depletion from 500 ng total RNA with rRNA 1 to 20 million). For each library (total of 29 libraries from Ribo-Zero Gold removal mix, cDNA was performed the different sample types and protocols), we counted the followed by second strand synthesis with dUTP number of exonic reads and the number of detected instead of dTTP. Then, A base addition, adapter genes; the resulting 29 points lied, to a good approxima- ligation, UDG treatment and PCR amplification tion, on a smooth monotonic curve (Additional file 4: steps were performed. Figure S1). For each protocol and sample type we 3. Ovation Human FFPE RNA-Seq Library (Cat # estimated the percentage of exonic reads out of the total 0340–32, NuGEN) for ribosomal-depletion libraries. number of reads. Using the curve and this percentage, we Starting material for the libraries was: 100 ng for extrapolated, for each protocol and sample type, the corre- fresh frozen samples whereas 100 and 250 ng for sponding curve of the number of detected genes for a FFPE samples. given total number of reads (see Fig. 1c). Libraries concentrations were evaluated by Qubit Fold change agreement between FF and FFPE samples (Thermofisher scientific) and their size was evaluated by We estimated the level of agreement between the Agilent Tapestation. Primer dimers were eliminated differential expression (Fold Change, FC) of a gene, as using 1× Agencourt RNAClean XP Beads after library measured in two FFPE samples, and the FC obtained preparation. Sequencing libraries were constructed with from their matched FF samples. This was done for all TruSeq SBS Kit using barcodes to allow multiplexing of genes with FC of at least 2 in the two FFPE samples (up- few samples in one lane with a read length of 60 bp or down-regulated), to calculate how many of these single-end run in Illumina HiSeq V4 instrument. genes exhibited FC in the same direction in the matched FF samples. To control the dependence of this measure of the FC agreement, the same procedure was repeated Immunohistochemisty for FC thresholds varying between 1 to 10. Diagnostic slides were immunostained for Estrogen receptor (ER) on the Ventana Discovery autostainer (Ventana) using commercial ER antibody. ER levels were METABRIC dataset determined by a dedicated breast pathologist in accord- The METABRIC dataset  contains mRNA expression ance with the clinical guidelines by the American Society data for ~ 2000 breast tumors and 126 normal breast of Clinical Oncology (ASCO) and the College of Ameri- samples. Two-samples t-test was performed on all can Pathology (CAP). expressed genes to compare their expression levels in normal versus tumor samples. Genes with 0.1% FDR were defined as differentially expressed between normal Data analysis and tumor samples. Processing and alignment Fragments were mapped to the human genome (hg19) using TopHat version V2.0.5 . Only uniquely Additional files mapped fragments to the genome were considered and Additional file 1: Table S1. RNA integrity number for all samples. exonic and intronic signals were calculated using HTSeq (PDF 212 kb) . The signals of the same sample from all lanes were Additional file 2: Table S2. Number of exonic and intronic reads. summed. The number of fragments obtained for each (PDF 167 kb) gene in each sample was normalized to the total number Additional file 3: Table S3. Comparison of the mean % of reads of fragments obtained from this sample, and compari- mapped on exons in other studies. (PDF 147 kb) sons were made between the expression levels of a gene Additional file 4: Figure S1. Number of detected genes as a function of exonic reads. The number of detected genes as a function of exonic across all samples and not between different genes. The reads is shown for each library (total of 29 libraries from the different minimum expression level threshold was set to 5 (log2 sample types and protocols, see legend). The black line is a smooth scale) to reduce noise (based on data distribution). monotonic curve extrapolated from all 29 data points. (PDF 722 kb) Additional file 5: Figure S2. Comparison of fold-changes measured for FFPE samples vs. matched FF samples using mRNA-seq. (A) Scatter plot Coverage along the genomic region for the expression fold changes (log2 scale) of genes measured in T1 The average coverage along the genomic region from 5′ vs.T3, obtained from FF samples (x-axis) compared to matched FFPE samples (y-axis) by mRNA-seq protocol (purple). r-square and correlation transcription start site to 3′ transcription end site was coefficient are presented at the plot. B) Scatter plot for the expression calculated for each sample using ngs.plot . Bossel Ben-Moshe et al. BMC Genomics (2018) 19:419 Page 11 of 11 3. Masuda N, Ohnishi T, Kawamoto S, Monden M, Okubo K. Analysis of fold changes (log2 scale) of genes measured in T2 vs.T3, obtained from chemical modification of RNA from formalin-fixed samples and optimization FF samples (x-axis) compared to matched FFPE samples (y-axis) by of molecular biology applications for such samples. Nucleic Acids Res. mRNA-seq protocol (purple). r-square and correlation coefficient are 1999;27:4436–43. presented at the plot. (PDF 936 kb) 4. von Ahlfen S, Missel A, Bendrat K, Schlumpberger M. Determinants of RNA Additional file 6: Figure S3. Expression of non-coding RNAs in FFPE quality from FFPE samples. PLoS One. 2007;2:e1261. samples by mRNAseq and RiboZero protocols. (A) Scatter plot of the 5. Chung J-Y, Braunschweig T, Williams R, Guerrero N, Hoffmann KM, Kwon M, expression levels of annotated lincRNAs as measured on T1 FFPE sample et al. Factors in tissue handling and processing that impact RNA obtained by mRNAseq (x-axis) versus RiboZero protocol (y-axis). Correlation from formalin-fixed, paraffin-embedded tissue. J Histochem Cytochem. coefficients between the two protocols for the expression of these 2008:1033–42. lincRNAs are indicated at the fig. (B) Same as (A) for miRNAs expression. 6. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and (PDF 185 kb) quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–8. 7. Guo Y, Sheng Q, Li J, Ye F, Samuels DC, Shyr Y. Large scale comparison of Acknowledgements gene expression levels by microarrays and RNAseq using TCGA data. We thank the Cancer Research Center at Sheba Medical Center for the PLoS One. 2013;8:e71462. support. We thank Goni Hut-Siloni for helping with the tissue bank samples. 8. Aviv H, Leder P. Purification of biologically active globin messenger RNA by We acknowledge Gilgi Freidlander for her assistance with the bioinformatics. chromatography on oligothymidylic acid-cellulose. Proc Natl Acad Sci U S A. 1972;69:1408–12. Availability of data and materials 9. Huang R, Jaritz M, Guenzl P, Vlatkovic I, Sommer A, Tamir IM, et al. An RNA- The data have been deposited in NCBI’s Gene Expression Omnibus Seq strategy to detect the complete coding and non-coding transcriptome (GEO) and are available through GEO accession number GSE113976 including full-length imprinted macro ncRNAs. PLoS One. 2011;6:e27288. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE113976). 10. Morlan JD, Qu K, Sinicropi DV. Selective depletion of rRNA enables whole transcriptome profiling of archival fixed tissue. PLoS One. 2012;7:e42882. Financial support 11. O’Neil D, Glowatz H, Schlumpberge M. Ribosomal RNA depletion for This research was supported by a grant from the Israel Ministry of Science (# efficient use of RNA-seq capacity. Biol: Curr. Protoc. Mol; 2013. 3–11175) and funds for Research and Development of the Nancy and 12. Adiconis X, Borges-Rivera D, Satija R, DeLuca DS, Busby MA, Berlin AM, et al. Stephen Grand Israel National Center for Personalized Medicine. ED and NBB Comparative analysis of RNA sequencing methods for degraded or low- were supported by the Leir Charitable Foundation. input samples. Nat Methods. 2013;10:623–9. 13. Zhao W, He X, Hoadley KA, Parker JS, Hayes DN, Perou CM. Comparison of Authors’ contributions RNA-Seq by poly (a) capture, ribosomal RNA depletion, and DNA microarray MD, NBB and SG designed the study. NBB performed the bioinformatics and for expression profiling. BMC Genomics. 2014;15:419. data analysis. SG and SB carried out the library preparation and the RNA 14. Graw S, Meier R, Minn K, Bloomer C, Godwin AK, Fridley B, et al. Robust sequencing. GP, AP and SH collected the samples, and were responsible gene expression and mutation analyses of RNA-sequencing of formalin- for their macrodissection and extraction of RNA. NBL and AY are the fixed diagnostic tumor samples. Sci Rep. 2015;5:12335. pathologists that scored the samples. BM contributed to bioinformatics 15. Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, et al. analysis. BK and IB participated in the study coordination. MD, NBB, ED, SG Supervised risk predictor of breast Cancer based on intrinsic subtypes. and ENG participated in paper writing and editing. All authors read and J Clin Oncol. 2009;27:1160–7. approved the final manuscript. 16. Hedegaard J, Thorsen K, Lund MK, Hein A-MK, Hamilton-Dutoit SJ, Vang S, et al. Next-generation sequencing of RNA and DNA isolated from paired Ethics approval and consent to participate fresh-frozen and formalin-fixed paraffin-embedded samples of human The study was approved by the Sheba Institutional Review Board (IRB). Full cancer and normal tissue. PLoS One. 2014;9:e98187. exemption for consent form for anonymized samples was approved by the 17. Esteve-Codina A, Arpi O, Martinez-Garcia M, Pineda E, Mallo M, Gut M, et al. Sheba IRB. A comparison of RNA-Seq results from paired formalin-fixed paraffin- embedded and fresh-frozen glioblastoma tissue samples. PLoS One. Competing interests 2017;12:e0170632. The authors declare that they have no competing interests. 18. Li P, Conley A, Zhang H, Kim HL. Whole-transcriptome profiling of formalin- fixed, paraffin-embedded renal cell carcinoma by RNA-seq. BMC Genomics. 2014;15:1087. Publisher’sNote 19. Sinicropi D, Qu K, Collin F, Crager M, Liu M-L, Pelham RJ, et al. Whole Springer Nature remains neutral with regard to jurisdictional claims in transcriptome RNA-Seq analysis of breast cancer recurrence risk using published maps and institutional affiliations. formalin-fixed paraffin-embedded tumor tissue. PLoS One. 2012;7:e40092. 20. Beck AH, Weng Z, Witten DM, Zhu S, Foley JW, Lacroute P, et al. 3′-end Author details sequencing for expression quantification (3SEQ) from archival tumor Department of Physics of Complex Systems, Weizmann Institute of Science, samples. PLoS One. 2010;5:e8768. Rehovot, Israel. The Nancy and Stephen Grand Israel National Center for 21. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with Personalized Medicine, Weizmann Institute of Science, Rehovot, Israel. RNA-Seq. Bioinformatics. 2009;25:1105–11. Chaim Sheba Medical Center, Cancer Research Center, 5262100 22. Anders S, Pyl PT, Huber W. HTSeq—a Python framework to work with Tel-Hashomer, Israel. Chaim Sheba Medical Center, Institute of Pathology, high-throughput sequencing data. Bioinformatics. 2015;31:166–9. Tel-Hashomer, Israel. Chaim Sheba Medical Center, Institute of Oncology, 23. Shen L, Shao N, Liu X, Nestler E. Ngs.Plot: quick mining and visualization of Tel-Hashomer, Israel. Sackler Faculty of Medicine, Tel Aviv University, Tel next-generation sequencing data by integrating genomic databases. Aviv, Israel. BMC Genomics. 2014;15:284. Received: 5 December 2017 Accepted: 4 May 2018 References 1. TCGA. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490:61–70. 2. Curtis C, Shah SP, Chin S-F, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486:346–52.
BMC Genomics – Springer Journals
Published: May 30, 2018