TAD-free analysis of architectural proteins and insulators
Mourad, Raphaël; Cuvier, Olivier
doi: 10.1093/nar/gkx1246; pmid: 29272504
Abstract The three-dimensional (3D) organization of the genome is intimately related to numerous key biological functions including gene expression and DNA replication regulations. The mechanisms by which molecular drivers functionally organize the 3D genome, such as topologically associating domains (TADs), remain to be explored. Current approaches consist in assessing the enrichments or influences of proteins at TAD borders. Here, we propose a TAD-free model to directly estimate the blocking effects of architectural proteins, insulators and DNA motifs on long-range contacts, making the model intuitive and biologically meaningful. In addition, the model allows analyzing the whole Hi-C information content (2D information) instead of only focusing on TAD borders (1D information). The model outperforms multiple logistic regression at TAD borders in terms of parameter estimation accuracy and is validated by enhancer-blocking assays. In Drosophila, the results support the insulating role of simple sequence repeats and suggest that the blocking effects depend on the number of repeats. Motif analysis uncovered the roles of the transcriptional factors pannier and tramtrack in blocking long-range contacts. In human, the results suggest that the blocking effects of the well-known architectural proteins CTCF, cohesin and ZNF143 depend on the distance between loci, where each protein may participate at different scales of the 3D chromatin organization. INTRODUCTION In higher eukaryotes, chromosomes are packed in three dimensions and form complex structures (1). Such three-dimensional (3D) structure has recently been investigated by chromosome conformation capture combined with high-throughput sequencing technique (Hi-C) at an unprecedented resolution (2–4). Hi-C experiments reveal multiple levels of genome organization including compartments A/B (5) and topologically associating domains (TADs) (2,3). 
Most notably, TADs are relatively constant between different cell types and are highly conserved across species. These TADs play important roles in key cell processes such as long-range regulation of genes by enhancers (4) or replication-timing regulation (6). The identification of architectural proteins and functional elements involved in shaping the genome in 3D represents an intensive field of research (7). Seminal works using enhancer-blocking assays (EBAs) revealed that functional elements called insulators (or boundary elements) can suppress the activation of a promoter by a distant enhancer when interposed (8,9). Multiple evidence actually supports the role of insulator binding proteins (IBPs) such as CTCF, and co-factors like cohesin, as mediators of long-range chromatin contacts (3,10–13), which may in turn result in blocking enhancers from contacting promoters by forming alternative DNA loops. In mammals, high-resolution mapping of long-range contacts has recently revealed that loops occur at domain boundaries and bind CTCF in a convergent orientation where cohesin is recruited (12,14). Depletion of CTCF and cohesin decreased chromatin contacts (13). However, the impact of those depletions was limited suggesting that other proteins might be involved in shaping the chromosome in 3D. Accordingly, other IBPs, co-factors and functional elements were also shown to colocalize at TAD borders (11,15). A classical approach to identify proteins involved in shaping the 3D genome structure consists in assessing their enrichments at TAD borders (2,3,12). Among a set of enriched proteins, multiple logistic regression (MLR) can be further used to characterize which proteins are more likely to influence the presence of borders (15). 
However, an important drawback of the enrichment test and MLR is that they rely on accurate TAD mapping, which is problematic for multiple reasons: (i) TAD mapping strongly depends on the algorithm used (16), (ii) TADs only capture a fraction of the information from Hi-C data, and other important 3D domains including A/B compartments (5), loop domains (12) and subTADs (4) were discovered and (iii) TAD borders are blurry (11). Here, we propose a model named ‘blocking model’, to systematically analyze the roles of architectural proteins and functional elements in blocking long-range contacts between loci. The proposed model does not rely on TAD mapping from Hi-C data. Thus, the model’s outcome is not affected by the blurriness of borders. Instead of testing the enrichment/influence of protein binding at TAD borders, the model directly estimates the blocking effect of proteins on long-range contacts between flanking loci, making the model intuitive and biologically meaningful. The model only depends on a simple biological parameter: the distance between insulated loci. The model directly analyzes the Hi-C contact matrix, thus taking advantage of the whole Hi-C information content (2D information) instead of only focusing on TAD borders (1D information). Moreover, the model successfully predicts in silico the outcomes from low-throughput enhancer blocking assays, thus enabling genome-wide analyses. Using recent Drosophila and human Hi-C data at high resolution, combined with a large number of ChIP-seq and DNA motif data, we revealed numerous combinations of proteins, functional elements and DNA motifs that block long-range contacts depending on scale and synergistic/antagonistic effects. MATERIALS AND METHODS Hi-C data For Drosophila data analysis, we used publicly available high-throughput chromatin conformation capture (Hi-C) data of embryonic Kc167 cells from Gene Expression Omnibus (GEO) accession GSE62904 (17). 
We also used Kc167 Hi-C data from GEO accession GSE89112 (18). Hi-C data were binned at 1, 2 and 5 kb resolutions. For human data analysis, we used publicly available Hi-C data of lymphoblastoid GM12878 cells from GEO accession GSE63525 (12). We used Hi-C data binned at 10, 40 and 100 kb resolution. ChIP-seq data For Drosophila data analysis, we used publicly available protein-binding profiles of Kc167 cells (except for Pnr whose data were from 6–8 h embryos). ChIP-seq data for CP190, Su(Hw), dCTCF and BEAF-32 were obtained from GEO accession GSE30740 (19). ChIP-seq data for Barren (condensin I), Cap-H2 (condensin II), Chromator, Rad21 (cohesin), GAF and dTFIIIC were obtained from GEO accession GSE54529 (11). ChIP-seq data for Fs(1)h-L were obtained from GEO accession GSE42086 (20). ChIP-seq data for Ttk69k were obtained from GEO accession GSE34698 (21). ChIP-seq peak calling was done using MACS 2.1.0 with default parameters for all proteins (https://github.com/taoliu/MACS). ChIP-chip peaks for Pnr were directly downloaded from (22). For human data analysis, we used publicly available binding peaks of 73 chromatin proteins (Rad21, CTCF, YY1, ZBTB33, MAZ, JUND, ZNF143, EZH2, ATF2, ATF3, BATF, BCL11A, BCL3, BCLAF1, BHLHE40, BRCA1, CEBPB, CFOS, CHD1, CHD2, CMYC, COREST, E2F4, EBF1, EGR1, ELF1, ELK1, FOXM1, GABP, IKZF1, IRF4, MAX, MEF2C, MTA3, MXI1, NFATC1, NFE2, NFIC, NFKB, NFYA, NFYB, NRF1, NRSF, P300, PAX5, PBX3, PML, POL2, POL3, POU2F2, RFX5, RUNX3, RXRA, SIN3A, SIX5, SMC3, SP1, SPI1, SRF, STAT1, STAT3, STAT5, TBLR1, TBP, TCF12, TCF3, TR4, USF1, USF2, WHIP, ZEB1, ZNF274 and ZZZ3) of GM12878 cells from ENCODE (23). We downloaded peaks that were uniformly processed (Uniform Peaks). DNA motifs To scan the genome for motif occurrences, we used Find Individual Motif Occurrences (FIMO) with default parameters and with position-specific priors (PSPs) to improve the identification of true motif occurrences (24). GM12878 DNase data from ENCODE were used as PSPs (23). 
The motif information was taken either from the literature (using consensus motifs) or from the JASPAR database (http://jaspar.genereg.net/). For Drosophila data analysis, we used transcription factor-binding site (TFBS) motifs from the JASPAR database. For some proteins, we instead used motif consensuses from the literature: BEAF-32 (CGATA) (25), dCTCF (AGGTGGCG) (26), Su(Hw) (TGCATATTT) (27), GAF (GAGAGA) (28), ZW5 (GCTGMG) (29), DREF (TATCGATA) (30), M1BP (GGTCACACT) (31), Ttk69k (GGTCCTGC) (32), dTFIIIC A box (TGGNNNAGNNG), Pita (GGTTNNNNNNNNNGCT) (29), ZIPIC (AGGGNTG) (29), Ibf (ATGTANAA) (33), Elba (CCAATAAG) (34) and Zelda (CAGGTAG) (35). For human data analysis, we also used TFBS motifs from the JASPAR database. In human, motifs with <2000 occurrences were removed from the analysis to reduce uncertainty in the β estimation.

The blocking model

To illustrate the blocking model, we first plotted the example of a Drosophila genomic region with embryonic Kc167 cell Hi-C heatmap and ChIP-seq peaks of well-known architectural proteins (Figure 1A). We observed that all architectural proteins BEAF-32, dCTCF, dTFIIIC, GAF and Su(Hw) accumulated on a specific locus (green frame) that acted as an insulator of long-range contacts between flanking regions. This observation suggested that the binding of those proteins blocked long-range contacts (Figure 1B), thereby contributing to the formation of 3D domains.

Figure 1. Illustration of the blocking model. (A) Example showing that the accumulation of insulator-binding proteins (IBPs) is associated with a blocking effect of long-range contacts between flanking loci in Drosophila (see green frame). (B) Schema representing the blocking effect of protein binding on long-range contacts between two loci, such as between an enhancer and a promoter.

By integrating Hi-C data with ChIP-seq data or DNA motif data, we propose to model the blocking effects of protein binding with a generalized linear model:

\begin{equation*} \log \big ( \mathrm{E} \big [ {\bf y} | {\bf d},{\bf B},{\bf I} \big ] \big ) = \beta _0 + \beta _d{\bf d} + \boldsymbol{\beta }_B{\bf B} - \boldsymbol{\beta }_I{\bf I} \end{equation*} (1)

where variable y denotes the Hi-C count for any pair of bins on the same chromosome. The log-distance variable d accounts for the background polymer effect (power law decay relation between distance and Hi-C count modeled by a log–log linear relation) (36). Bias variables B = {len, GC, map} are known Hi-C biases including fragment length (len), GC-content (GC) and mappability (map) that are computed as in (37). Including those bias variables in the model allows correcting for biases in Hi-C data. Note that bias variables do not need to be included in the model if Hi-C counts were previously normalized by matrix balancing (38). Variable set I = {i1, ..., ip} represents the p blocking variables of interest. A blocking variable stores a value corresponding to a ‘blocking region’ (Figure 1B), which is the region in-between two bins whose Hi-C contacts are measured. For ChIP-seq data, a blocking variable is defined as the average of the base coverage computed from the log2 fold-enrichments of peaks found within the blocking region divided by the length of the blocking region. A base within a peak has a coverage value equal to the log2 fold-enrichment of the peak and a base outside a peak has a coverage value equal to zero.
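As an illustration of the ChIP-seq blocking variable just defined, the following minimal Python sketch sums the per-base coverage over a blocking region and divides by the region length. It assumes that "average of the base coverage" means exactly this sum divided by the length, that coordinates are half-open, and that peaks do not overlap each other; the function name and the example peaks are hypothetical and are not taken from the HiCblock package.

```python
def chipseq_blocking_variable(peaks, region_start, region_end):
    """Average base coverage of the blocking region [region_start, region_end).

    peaks: iterable of (start, end, log2_fold_enrichment), half-open coordinates.
    A base inside a peak carries that peak's log2 fold-enrichment, any other
    base carries 0. Peaks are assumed not to overlap each other.
    """
    length = region_end - region_start
    if length <= 0:
        raise ValueError("blocking region must have positive length")
    total = 0.0
    for start, end, log2_fe in peaks:
        # Number of bases of this peak that fall inside the blocking region.
        overlap = min(end, region_end) - max(start, region_start)
        if overlap > 0:
            total += overlap * log2_fe
    return total / length


# Hypothetical peaks: two inside an 8 kb blocking region, one outside it.
example_peaks = [(1_000, 1_500, 3.0), (4_000, 4_200, 2.0), (9_000, 9_500, 5.0)]
value = chipseq_blocking_variable(example_peaks, 0, 8_000)
print(value)  # (500 * 3.0 + 200 * 2.0) / 8000
```

For motif data the same idea applies with a count of occurrences in place of the summed coverage.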
For DNA motif data, a blocking variable is defined as the number of motif occurrences found into the blocking region divided by the length of the blocking region. The corresponding βi parameter value reflects the blocking effect of the protein on Hi-C counts. A positive value (βi > 0) reveals a blocking effect on long-range contacts. Conversely, a negative value (βi < 0) shows a facilitating effect on contacts. A null value (βi = 0) means that the protein does not have any effect in blocking or facilitating contacts. Using the model, one can also assess the co-blocking effects of two or more proteins using statistical interaction terms: \begin{eqnarray*} \log \big ( \mathrm{E} \big [ {\bf y} | {\bf d},{\bf B},{\bf i}_1,{\bf i}_2 \big ] \big ) & = \beta _0 + \beta _d{\bf d} + \boldsymbol{\beta }_B{\bf B} \nonumber \\ & - \beta _{i_1}{\bf i}_1 - \beta _{i_2}{\bf i}_2 - \beta _{i_{12}}{\bf i}_1 {\bf i}_2 \end{eqnarray*} (2) where, variables i1 and i2 are two blocking variables. The product i1i2 is a second-order statistical interaction. The corresponding parameter |$\beta _{i_{12}}$| reflects the co-blocking effect of the two proteins on contacts. A positive value (|$\beta _{i_{12}}>0$|) reveals a synergistic effect of the two proteins in blocking contacts. Conversely, a negative value (|$\beta _{i_{12}}<0$|) shows an antagonistic effect of the two proteins in blocking contacts. In equation (2), a second-order interaction was included, but higher-order interactions (products of more than two variables) can be included to model co-blocking effects of more than two proteins. The model only depends on a single parameter: the distance range between insulated loci. This parameter has a strong biological meaning since it reflects the analysis scale of hierarchical 3D genome organization. For instance, in Drosophila, we will focus on Hi-C data for 20–50 kb distances which are below the median size of TADs (median size of 60 kb (3)), therefore allowing TAD-scale analyses. 
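To make the sign convention of equations (1) and (2) concrete, the sketch below evaluates the expected Hi-C count for a pair of bins with and without two bound proteins. It is only an illustration: the bias terms are omitted (as would be the case for matrix-balanced counts), and all parameter values are hypothetical placeholders rather than estimates from the paper.

```python
import math


def expected_count(log_distance, i1, i2, beta0, beta_d, beta_1, beta_2, beta_12):
    """Expected Hi-C count under equation (2), bias terms omitted."""
    eta = (beta0 + beta_d * log_distance
           - beta_1 * i1 - beta_2 * i2 - beta_12 * i1 * i2)
    return math.exp(eta)


# Hypothetical parameter values, chosen only to show the arithmetic.
params = dict(beta0=10.0, beta_d=-1.0, beta_1=0.5, beta_2=0.2, beta_12=0.3)
log_d = math.log(30_000)  # two loci 30 kb apart

free = expected_count(log_d, 0.0, 0.0, **params)     # nothing bound in between
blocked = expected_count(log_d, 1.0, 1.0, **params)  # both proteins bound
ratio = blocked / free
print(ratio)  # equals exp(-(beta_1 + beta_2 + beta_12)) for i1 = i2 = 1
```

Because the blocking terms enter with a minus sign, positive βs shrink the expected count multiplicatively, and a positive interaction β shrinks it further than the two main effects alone.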
But we will also vary the scale of analysis in human (see below). In some situations, we standardize the blocking variables before computing the model. Standardization reduces the effect of very large differences in the blocking variables between different proteins when estimating the βs and makes the latter more comparable in magnitude. In fact, these blocking variable differences might be due to very large differences in the ChIP-seq signal and the number of peaks that might not be linked to the real blocking activity of proteins. For instance, when analyzing human ChIP-seq data, we found that the highest βs were often associated with proteins with few binding sites when no standardization was used, and that these βs were strongly reduced after standardization (see below). Because of Hi-C count overdispersion, we use negative binomial regression as the most appropriate specification of the generalized linear model. However, Poisson regression with lasso shrinkage can also be used. We believe that the choice between the two depends mainly on the number of variables to analyze. On the one hand, if there are few candidate variables (<10), it is useful to estimate β parameters together with the corresponding P-values to assess significance using negative binomial regression. On the other hand, if there are a large number of variables (10 or more), it is more convenient to use Poisson lasso regression in order to select the key variables and to account for correlations among the variables (frequent in ChIP-seq and motif occurrence data). The model is available in the R package ‘HiCblock’, which can be downloaded from the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/HiCblock/index.html). For the negative binomial regression, model βs are learned by iteratively reweighted least squares (glm.nb function from the MASS R package with default parameters).
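The least-squares iteration mentioned above can be sketched in a few lines of plain Python. Note the simplification: this sketch fits an ordinary Poisson log-linear model, not the negative binomial model fitted by glm.nb, so it has no dispersion parameter and yields no P-values. The helper names and the toy data are hypothetical. The blocking variable enters the design matrix with a minus sign so that the fitted coefficient is directly the β of equation (1).

```python
import math


def solve(a, b):
    """Solve the small dense system a x = b (Gaussian elimination, pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def poisson_irls(X, y, n_iter=50, tol=1e-8):
    """Fit log(E[y]) = X beta by iteratively reweighted least squares."""
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))  # assumes column 0 is the intercept
    for _ in range(n_iter):
        eta = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
        mu = [math.exp(e) for e in eta]
        # Working response; the IRLS weights are the current means mu.
        z = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        xtwx = [[sum(w * row[j] * row[k] for w, row in zip(mu, X))
                 for k in range(p)] for j in range(p)]
        xtwz = [sum(w * row[j] * zi for w, row, zi in zip(mu, X, z))
                for j in range(p)]
        new_beta = solve(xtwx, xtwz)
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# Toy data: noise-free "counts" generated from known, hypothetical parameters.
true_beta = [10.0, -1.0, 0.8]                        # beta_0, beta_d, beta_I
distances = [20_000 + 1_000 * k for k in range(31)]  # 20-50 kb
blocking = [((7 * k) % 10) / 10 for k in range(31)]  # hypothetical values
X = [[1.0, math.log(d), -i] for d, i in zip(distances, blocking)]
y = [math.exp(sum(x * b for x, b in zip(row, true_beta))) for row in X]

fit = poisson_irls(X, y)
print(fit)  # should be close to true_beta
```

With real, overdispersed counts a negative binomial fit adds a dispersion parameter that is estimated alongside the βs.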
For the Poisson lasso regression, model βs are learned by cyclical coordinate descent and lambda parameter is estimated with 10-fold cross-validation (cv.glmnet function from glmnet R package with default parameters). Simulation of random protein-binding sites and motif occurrences For Poisson lasso regression in human, we simulated protein binding sites by randomly drawing genomic regions from the genome whose numbers and fold-enrichments were similar to those observed from real proteins. We then used these random proteins to compute associated β coefficients with the Poisson lasso regression. We expected these βs to be close to zero but with a certain standard deviation |$\hat{\sigma }$|. We then used this standard deviation to compute a confidence interval as |$0 \pm 1.96\times \hat{\sigma }$| under the null hypothesis that a random protein did not have any blocking or facilitating effect on long-range contacts. For DNA motifs, we used a slightly different approach. We randomly draw 14 base DNA sequences (random motifs) whose number of occurrences over the genome were similar to those of real DNA motifs. We scanned the genome for random motif occurrences. Then, we used these random motif occurrences to compute associated β coefficients with the Poisson lasso regression. As for random proteins, we used these βs to compute a confidence interval under the null hypothesis. RESULTS Model validation with enhancer-blocking assays We first sought to validate our model using EBAs from Drosophila. EBA is a classical low-throughput method that can be used to show the ability of an insulator sequence to block the activation of a promoter by a distant enhancer when interposed between them (39) (Figure 2A). We used the model to predict the blocking effect of an insulator region depending on protein binding. For this purpose, we used a compilation of EBA results from (11). 
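The null confidence interval described in the simulation subsection above can be sketched as follows. This is a minimal illustration that assumes the sample standard deviation is used for the estimate of σ; the function names and the list of random-protein βs are hypothetical placeholders, not values from the paper.

```python
import statistics


def null_interval(random_betas, z=1.96):
    """Interval 0 +/- z * sigma_hat built from betas of simulated random proteins."""
    sigma_hat = statistics.stdev(random_betas)  # sample standard deviation
    return (-z * sigma_hat, z * sigma_hat)


def outside_null(beta, interval):
    """True when beta lies outside the interval expected for random proteins."""
    low, high = interval
    return beta < low or beta > high


# Hypothetical betas obtained from randomly drawn binding sites.
random_betas = [0.01, -0.02, 0.015, -0.005, 0.0, 0.02, -0.01, -0.015]
interval = null_interval(random_betas)
print(interval)
print(outside_null(0.5, interval), outside_null(0.0, interval))
```

A β that falls inside the interval is compatible with the behavior of a random protein or motif, whichever sign it has.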
It consisted of 32 regions with varying reported insulating activity (15 regions with insulating activity and 17 regions with no insulating activity). In the first benchmark, we selected the 15 regions with insulating activity (positive class). In order to have a large set of regions with no insulating activity, we generated >100 control regions (negative class) by randomly drawing from the Drosophila genome with sizes, GC and repeat contents similar to those of the abovementioned 15 regions (40). For each region, we computed blocking variables I = {i1, ..., ip} using p ChIP-seq data from Kc167 cells. We also used |$\hat{\beta }_I=\lbrace \hat{\beta }_{i_1},...,\hat{\beta }_{i_p} \rbrace$| model parameters independently learned from Kc167 Hi-C data from Li et al. (17) at 2 kb resolution and for 20–50 kb distances, for which Hi-C coverage was high. Model parameters were not learned from EBAs to prevent overestimation of predictive performance. We predicted insulating activities of the regions by the matrix product |$\hat{\beta }_I{\bf I}$|. We then assessed the accuracy of our model’s predictions using the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). We found that predicted insulating activity was very close to the observed insulator activity from EBA (AUC = 0.981; Figure 2B). In the second benchmark, we did not use generated controls but instead the 17 regions reported to have no insulating activity as negative class. We again predicted insulating activity, and found that predictions were still good (AUC = 0.808; Figure 2C). We found that changing Hi-C data resolution to 1 or 5 kb only slightly affected predictions for the two benchmarks (Supplementary Figure S1). In the third benchmark, we assessed the blocking effect of simple sequence repeats (SSRs) of GATA that were shown to have an insulating activity by EBAs in both Drosophila and human (41).
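The scoring and evaluation used in the first two benchmarks above can be sketched as follows: each region is scored by the product of the learned βs with its blocking variables, and the AUC is computed as the probability that an insulating region outranks a non-insulating one. All numbers below are hypothetical placeholders, and the function names are illustrative only.

```python
def insulation_score(beta_hat, blocking_vars):
    """Predicted insulating activity: dot product of betas and blocking variables."""
    return sum(b * i for b, i in zip(beta_hat, blocking_vars))


def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the fraction of (positive, negative) pairs
    in which the positive scores higher; ties count for one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical betas for three proteins and hypothetical blocking variables.
beta_hat = [0.86, 0.20, -0.10]
positives = [[1.0, 0.5, 0.0], [0.8, 0.2, 0.1], [0.1, 0.0, 0.9]]
negatives = [[0.0, 0.1, 0.2], [0.2, 0.0, 0.0], [0.3, 0.4, 0.0], [0.0, 0.0, 0.0]]

pos_scores = [insulation_score(beta_hat, r) for r in positives]
neg_scores = [insulation_score(beta_hat, r) for r in negatives]
auc_value = auc(pos_scores, neg_scores)
print(auc_value)
```

The pairwise definition is equivalent to integrating the ROC curve and avoids choosing score thresholds explicitly.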
In Drosophila, we estimated a blocking effect for SSRs that comprised >4 repeats (Figure 2D and Supplementary Table S1). In particular, we found a significant blocking effect for SSRs with five to six repeats (|$\hat{\beta }=0.046$|, P = 2 × 10−8). SSRs with >6 repeats were too few to detect any significant blocking effect (only 8 SSRs with 7 to 8 repeats and 9 SSRs with >11 repeats). In human, we detected significant blocking effects for all GATA repeat counts (P < 10−20) at short distances (100–250 kb at 10 kb resolution; Figure 2E and Supplementary Table S2). Most notably, we found the highest blocking effects for SSRs with 9 to 10 repeats (|$\hat{\beta }>0.07$|, P < 10−20), revealing that the blocking effect depends on the number of repeats. For larger distances (950–1000 kb), we could only detect a slight blocking effect for eight repeats, suggesting that SSR blocking effect acted at short distance (Supplementary Figure S2 and Table 3). Using EBAs, we thus concluded that the model was successfully validated.

Figure 2. Validation of the model with enhancer-blocking assays (EBAs) from Drosophila and human. (A) Illustration of the EBAs. (B) ROC curves of the prediction of insulating regions (positives) as compared to randomly drawn regions (negatives) in Drosophila. Area under the ROC curve (AUC) is plotted. (C) ROC curves of the prediction of insulating regions (positives) as compared to non-insulating regions (negatives) in Drosophila. (D) Blocking effects of GATA SSRs depending on the repeat count in Drosophila. (E) Blocking effects of GATA SSRs depending on the repeat count in human.

Analysis of insulator proteins and comparison with current approaches

A major problem of testing protein enrichment at TAD borders is that different algorithms have been developed for TAD mapping, which can yield large differences in enrichment for the same protein (42). Accordingly, we observed that the enrichments of BEAF-32, dCTCF, dTFIIIC, GAF and Su(Hw) could greatly vary depending on the TAD algorithm used in Drosophila (Figure 3A). For instance, GAF presented an odds ratio (OR) of 4.3 with HiCseg (43), an OR of 4 with Arrowhead (12), whereas it only showed an OR of 2.5 with TopDom TADs (16). Conversely, dCTCF presented an OR of 3.7 with HiCseg, and ORs around 5 with Arrowhead and TopDom.

Figure 3. Analysis of IBPs in Drosophila. (A) Enrichment of IBPs at TAD borders, depending on the TAD mapping algorithm used. (B) Blocking effect (β) estimated separately. (C) Blocking effect (β) estimated jointly. (D) MLR βs estimated from TAD borders (15). (E) Parameter estimation accuracy of the proposed model compared to MLR.

Instead of testing protein enrichments at TAD borders, we used our model to directly assess the blocking effect of protein binding on long-range contacts. We first estimated separately the blocking effects of IBPs, by including only one IBP in the model at a time.
This allowed comparison with the enrichments at TAD borders described above. We used Kc167 Hi-C data from Li et al. (17) at 2 kb resolution and focused on 20–50 kb distances. Using our model, we found that BEAF-32, dCTCF and dTFIIIC showed the strongest blocking effects (Figure 3B), which was similar to the enrichments observed at TAD borders (Figure 3A) and previously observed by Sexton et al. (3). Because the blocking effect might be influenced by the number of protein-binding sites, we sampled different numbers of peaks from BEAF-32 and estimated the corresponding βs. As expected, we found that β accuracy was lower for smaller numbers of peaks (Supplementary Figure S3). We also observed that the blocking effect was inflated, but such inflation remained reasonable (+63%), even for 1000 sampled peaks which represented only 15% of all BEAF-32 peaks. Because IBPs often colocalize linearly (i.e. correlate) on the chromosome, one might estimate a blocking effect for a protein, although the protein does not directly impede long-range contacts (15). Hence, we re-estimated blocking effects of IBPs jointly (i.e. by including all IBPs within the same model). BEAF-32 presented the highest blocking effect (|$\hat{\beta }=0.86$|, P < 10−20) compared to the other proteins (Figure 3C), similarly to previously published MLR analysis at TAD borders (15) (Figure 3D). Our model also estimated a negative β for dTFIIIC, suggesting that the protein could in fact facilitate long-range contacts between flanking regions, contrary to what is found by the separate estimation (previous paragraph). This meant that the dTFIIIC blocking effect estimated by separate estimation was in fact due to the colocalization (correlation) of dTFIIIC with other IBPs such as BEAF-32 (correlation between dTFIIIC and BEAF-32 blocking variables equals 0.59, P < 10−20). Our model outperformed MLR in terms of parameter estimation accuracy.
Standard errors of beta parameters were dramatically lower than the ones from MLR, revealing the higher performance of our model in assessing blocking effects of proteins (Figure 3E). To further compare our new model with MLR, we assessed the ability to discriminate between known architectural proteins (11 true positives including IBPs and co-factors) and random protein peaks (200 false positives) using ROC curves (Supplementary Figure S4). Based on the absolute values of βs, we found that our blocking model was highly accurate (AUC = 0.991) and performed better than MLR (AUC = 0.827). Moreover, we performed the joint analysis of IBPs for different binning resolutions (1 and 5 kb) and found similar results with 2 kb, revealing that the resolution did not have a big impact on the estimation of blocking effects (Supplementary Figure S5). In addition, we analyzed recent Hi-C data with higher coverage from Eagen et al. (18) at 1 kb resolution and obtained results that were close to those obtained from Li et al. data (Supplementary Figure S6). Thus, by processing the whole Hi-C matrix information, instead of focusing only on TAD borders, the proposed model was more accurate than MLR. Numerous protein-binding DNA motifs act as blockers We next sought to analyze the blocking effects of protein-binding DNA-motifs (Figure 4A and Supplementary Table S4). Interestingly, our model found motif 1-binding protein (M1BP) as the motif with the strongest blocking effect (|$\hat{\beta }=1.46$|), which was recently found to be enriched at TAD borders during development (35) and was implicated in transcriptional pausing of genes (31). Such transcriptional pausing was recently shown to be involved in long-range contacts (44). When we looked at Hi-C heatmaps, we observed that M1BP motifs accumulated at the borders of 3D domains (Figure 4B; DNase I hypersensitivity is shown to represent the potential activity of the motifs). 
We also identified other motifs with strong blocking effects including bcd (|$\hat{\beta }=0.65$|), Pita (|$\hat{\beta }=0.63$|), vis (|$\hat{\beta }=0.60$|), Pnr (|$\hat{\beta }=0.59$|) and Ttk69k (|$\hat{\beta }=0.55$|). Among those proteins, Pita was a recently discovered insulator protein able to target CP190 to chromatin (45) and was found at 3D domain borders (Figure 4C). When we used Ttk69k ChIP-seq and Pnr ChIP-chip data, we found that both Ttk69k and Pnr colocalized at or near architectural protein peaks (Supplementary Figure S7a). For instance, Pnr was enriched at condensin I (Barren), CP190, BEAF-32 and Chromator peaks (Supplementary Figure S7b). Interestingly, Ttk69k was mostly enriched near architectural proteins but did not overlap them, except for condensin I, suggesting that Ttk69k might participate in the formation of 3D domains in a very specific way (Supplementary Figure S7c). Accordingly, we found numerous Pnr and Ttk69k motifs located between 3D domains (Figure 4C and D). We also identified architectural proteins ZW5 (|$\hat{\beta }=0.33$|), dCTCF (|$\hat{\beta }=0.32$|) and Ibf (|$\hat{\beta }=0.29$|). Of note, Ibf was shown to be a novel CP190 interacting protein with insulating activity (33). When we compared with MLR, we also found that M1BP presented a very high positive influence on TAD borders (|$\hat{\beta }=8.65$|; Supplementary Table S5). However, another motif, Zelda, presented the highest positive influence (|$\hat{\beta }=9.32$|), whereas the same motif was identified as a long-range contact facilitator with the blocking model (|$\hat{\beta }=-0.41$|; Supplementary Table S4). This suggests that the blocking model can capture effects on long-range contacts that could not be assessed by the analysis at the TAD border level.
Using the blocking model, we could conclude that many proteins, including pannier, a transcriptional regulator involved in several developmental processes (46), and tramtrack 69k, a widely expressed transcriptional factor (TF) related to cell fate specification, cell proliferation and cell-cycle regulation (47), might represent novel candidate architectural proteins in Drosophila.

Figure 4. Analysis of protein-binding DNA motifs in Drosophila. (A) Blocking effect (β) as a function of motif abundance (motifs with |$|\hat{\beta }|>0.2$| are shown in red; known architectural proteins are written in blue). (B) Example showing the accumulation of M1BP motifs and DNase I hypersensitive sites between 3D domains. (C) Example showing the accumulation of Pita and Pnr motifs between 3D domains. (D) Example showing the accumulation of Ttk69k motifs between 3D domains.

Co-blocking effects of insulator-binding proteins and co-factors

Long-range contacts not only involve IBPs but also co-factors that regulate or stabilize them (11,12,48). Hence, we sought to analyze potential effects of IBPs and co-factors in co-blocking long-range contacts. We first modeled the co-blocking effects of protein pairs using second-order statistical interactions (for every protein pair, we estimated a co-blocking effect). We detected 38/55 significant interactions after Bonferroni correction.
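The count of 55 tested pairs and the Bonferroni threshold can be reproduced with a short sketch. It assumes that the pairs are formed from the 11 Kc167 architectural proteins and co-factors listed in the ChIP-seq data subsection (which matches the 55 pairs reported) and that the family-wise significance level is 0.05; both are assumptions, and the P-values below are hypothetical placeholders.

```python
from itertools import combinations

# Assumed set of 11 proteins; 11 choose 2 gives the 55 pairs reported above.
PROTEINS = ["BEAF-32", "dCTCF", "Su(Hw)", "CP190", "GAF", "dTFIIIC",
            "Barren", "Cap-H2", "Chromator", "Rad21", "Fs(1)h-L"]


def bonferroni_significant(p_values, alpha=0.05):
    """Keep the tests whose P-value is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return {pair: p for pair, p in p_values.items() if p < threshold}


pairs = list(combinations(PROTEINS, 2))
threshold = 0.05 / len(pairs)  # assumed family-wise level of 0.05

# Hypothetical P-values: most pairs non-significant, three set by hand.
p_values = {pair: 0.5 for pair in pairs}
p_values[("BEAF-32", "CP190")] = 1e-21
p_values[("dCTCF", "Cap-H2")] = 4e-13
p_values[("GAF", "Rad21")] = 1e-3  # just above the corrected threshold

significant = bonferroni_significant(p_values)
print(len(pairs), threshold, sorted(significant))
```

Bonferroni correction is conservative when the tests are correlated, as is likely for colocalizing proteins, so the number of detected interactions is, if anything, a lower bound.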
Among the significant interactions, the model identified 19 positive co-blocking effects (|$\hat{\beta }>0$|), reflecting protein pairs that synergistically blocked long-range contacts (Supplementary Table S6). We represented these synergistic blocking effects by a network of proteins (Figure 5A). In agreement with (49), CP190 co-blocked contacts with BEAF-32 (|$\hat{\beta }=0.76$|, P < 10−20) and with GAF (|$\hat{\beta }=0.67$|, P < 10−20). Interestingly, we found that Condensin II (Cap-H2) played a central role in helping other proteins to block contacts, including dCTCF (|$\hat{\beta }=1.33$|, P = 4 × 10−13), Barren (|$\hat{\beta }=0.78$|, P < 10−20), dTFIIIC (|$\hat{\beta }=0.70$|, P = 10−6) and GAF (|$\hat{\beta }=0.68$|, P = 2 × 10−10). dTFIIIC also represented an important protein for co-blocking effects. Conversely, Fs(1)h-L had only one co-blocking partner, dTFIIIC. The model also estimated 19 negative co-blocking effects (|$\hat{\beta }<0$|), reflecting protein pairs that had antagonistic effects in blocking long-range contacts (Figure 5B and Supplementary Table S6). Most notably, we found numerous antagonistic effects of CP190 in blocking contacts with other proteins, such as dTFIIIC (|$\hat{\beta }=-2.33$|, P < 10−20), Su(Hw) (|$\hat{\beta }=-1.78$|, P < 10−20), Chromator (|$\hat{\beta }=-1.68$|, P < 10−20), dCTCF (|$\hat{\beta }=-0.87$|, P < 10−20) and Fs(1)h-L (|$\hat{\beta }=-0.53$|, P = 4 × 10−6). Interestingly, Su(Hw) had a slight blocking effect on long-range contacts (|$\hat{\beta }= 0.20$|, P < 10−20; Figure 3C), but when combined with CP190, they presented a strong antagonistic effect which reduced its blocking effect (|$\hat{\beta }=-1.78$|, P < 10−20; Figure 5B). Among the synergistic and antagonistic effects, we found that many corresponded to physical interactions reported in Flybase and previous studies (49), supporting the idea that physical interactions may account for some of them. 
Analysis of second-order interactions thus revealed the complexity behind the establishment of 3D domains. This may notably depend on numerous synergistic and antagonistic effects of IBPs with key architectural co-factors such as the structural maintenance of chromosomes (SMC) family of proteins, including cohesin and condensin (50,51).
Figure 5. Effects of IBPs and co-factors in co-blocking long-range contacts. (A) Synergistic blocking effects estimated by positive second-order interaction βs. An edge between two protein nodes i and j means |$\hat{\beta }_{ij}>0.5$|. (B) Antagonistic blocking effects estimated by negative second-order interaction βs. An edge between two protein nodes i and j means |$\hat{\beta }_{ij}<-0.5$|. Blue cross: physical interaction reported in Flybase.
Analysis in human
We then analyzed blocking effects of proteins and DNA motifs in human, depending on the scale of 3D genome organization. For this purpose, we used GM12878 Hi-C data for varying distance ranges: [200–400 kb], [400–600 kb], [600–800 kb], [800–1000 kb], [1000–1300 kb], [1700–2000 kb], [2700–3000 kb], [3700–4000 kb] and [4700–5000 kb]. We performed analyses at 40 kb resolution to have sufficient coverage at long distance (even though a higher resolution could be used for short distances). By varying the distance range, we could assess blocking effects at different scales, thus allowing the analysis of the well-known hierarchical nature of 3D domains (52).
Because of the large number of variables (>50), we used Poisson lasso regression. Moreover, for ChIP-seq data analysis, we scaled the blocking variables because the ChIP-seq peak numbers and fold-enrichments varied greatly between proteins, which would otherwise have prevented comparison of βs. For each analysis, we also computed confidence intervals under the null hypothesis that a protein or DNA motif did not have any blocking or facilitating effect on long-range contacts (see ‘Materials and Methods’ section, simulation of random protein-binding sites and motif occurrences). We first focused on known architectural proteins CTCF, Rad21 (cohesin subunit) and ZNF143. Remarkably, we observed that the blocking effects of architectural proteins strongly depended on the distance between loci (Figure 6A and Supplementary Table S7), a question that could not be addressed by previous enrichment or MLR analyses at TAD borders. For instance, CTCF blocking effects peaked around 3 Mb. Interestingly, the main looping partner of CTCF, cohesin, had a blocking effect that peaked at a shorter distance, from 1000 to 2000 kb. Another partner of CTCF, ZNF143, also showed a different blocking effect that strikingly peaked at 800–900 kb. This means that although CTCF, cohesin and ZNF143 were known to act together in establishing chromatin loops (7), they might participate at different scales. We next studied the blocking effects of TFs (Figure 6B and Supplementary Table S7). Compared to architectural proteins, TFs were less abundant over the genome (around a few thousand peaks, compared to tens of thousands of peaks for architectural proteins). Among the strongest blockers, we found ATF2, FOXM1, PML and POU2F2, whose effects also depended on distance. POU2F2 effect peaked at 3800 kb, and FOXM1 and PML both peaked at 3 Mb. Interestingly, some TFs, such as ATF2, presented high blocking effects at very large distances (>5 Mb).
Thus, although TFs were less frequent over the genome than architectural proteins, they might collectively contribute significantly to the establishment or maintenance of 3D organization. Lastly, we analyzed protein-binding DNA motifs (Figure 6C and Supplementary Table S8). The CTCF motif showed a strong blocking effect that peaked from 1000 to 2000 kb, at a shorter distance than found using ChIP-seq data. However, another motif, TFAP2C, presented the strongest blocking effect, especially at long distance. TFAP2C has been implicated in breast cancer oncogenesis, and was previously shown to be a collaborative factor in estrogen-mediated long-range interaction and transcription (53). We also identified ELK4 and PAX1 as strong blockers at long distance. ELK4 is a member of the Ets family of transcription factors, and PAX1 is essential during fetal development. We thus concluded that architectural proteins, but also transcription factors, shaped the 3D human genome at different genomic scales.
Figure 6. Analysis of protein binding and DNA motif in human. (A) Blocking effects of architectural proteins depending on the distance between loci. (B) Blocking effects of TFs depending on the distance between loci. (C) Blocking effects of protein binding motifs depending on the distance between loci. For all three subfigures, we also plotted confidence intervals under the null hypothesis that a random protein or DNA motif did not have any effect on long-range contacts.
DISCUSSION
In this paper, we propose a model to comprehensively study the roles of architectural proteins, insulators and DNA motifs in blocking long-range contacts between flanking loci at different scales, thereby demarcating the genome into functional 3D domains. The proposed approach is TAD-free: it does not rely on any TAD mapping algorithm, it does not focus on TADs but instead on all possible 3D domains at all scales, and it is not affected by the blurriness of TAD borders. The model is validated by numerous EBAs. It outperformed previous MLR of TAD borders (15) in terms of blocking effect estimation accuracy. The model is flexible and can identify both synergistic and antagonistic effects of architectural proteins depending on the presence of specific IBPs and co-factors. The proposed model also uncovers a number of results. In Drosophila, we find that the blocking effect for the GATA SSRs depends on the number of repeats, and in particular, we estimate a significant blocking effect for 5–6 repeats. In human, we find that GATA repeat effect peaks for 9–10 repeats. Moreover, analysis of motifs identifies pannier and tramtrack as two novel candidate architectural proteins. Interestingly, the protein pannier is a member of the GATA family known to bind to GATA motifs (46), which may explain the insulating activity of GATA repeats by recruiting multiple pannier proteins contiguously to DNA. Moreover, tramtrack has a homomeric dimerization BTB/POZ domain that could help bridging two distant proteins through long-range contacts (54) and that is known to interact with GAF (55). Analysis of co-blocking effects between architectural proteins further suggests a role for co-factor condensin II in helping other proteins to block contacts.
Conversely, CP190 presents numerous antagonistic effects with other proteins, meaning that it reduces their blocking activities. Such co-blocking analyses thus reveal the modulating effects of specific proteins in blocking contacts with other proteins. In human, analyses for varying distance ranges uncover strong distance-dependent blocking effects depending on the protein or DNA motif, which could not be addressed by enrichment tests or MLR at TAD borders. For instance, we find that CTCF, cohesin and ZNF143 blocking effects peak at different distances, although the three proteins are known to act together in establishing chromatin loops (7). This suggests that they may participate at different 3D chromatin scales, or alternatively that their mechanisms of action are not always associated with their binding. Supporting this idea, recent results showed that cohesin is recruited at transcription start sites and positioned to CTCF sites by transcription-mediated translocation (56). In addition, we observed changes of the β sign depending on the distance. For instance, ZNF143 presented a blocking effect at short distance (<2500 kb) and a facilitating effect at longer distance. This may be due to ZNF143-mediated loops at short distance that have allosteric effects on long distance interactions (57). There are different reasons why we restricted our analysis to a limited distance range, e.g. 20–50 kb in Drosophila (and not 20–1000 kb, for instance). First, at the high resolution of 2 kb, most of the Hi-C signal is observed within short distance (20–50 kb). Second, our model assumes a power law decay between Hi-C count and distance (equivalent to a log–log linear relation between Hi-C count and distance) which only holds for a limited distance range. Third, not restricting the analysis to a limited distance range can lead to a heavy computational burden. One simple way to analyze Hi-C data within a wider distance range would be to analyze data at 10–20 kb resolutions.
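The power-law assumption mentioned above (a log–log linear relation between Hi-C count and distance) can be checked on a chosen distance range with an ordinary least-squares fit on the log scale. The following is a minimal Python sketch under stated assumptions, not the HiCblock implementation: the function name `fit_power_law` and the synthetic counts are hypothetical and serve only as an illustration.

```python
import numpy as np

def fit_power_law(distances, counts):
    """Fit counts ~ scale * distance**alpha by least squares on log-log axes.

    Hypothetical helper for illustration. Pairs with non-positive distance
    or count are dropped because their logarithm is undefined.
    Returns (alpha, scale).
    """
    distances = np.asarray(distances, dtype=float)
    counts = np.asarray(counts, dtype=float)
    keep = (distances > 0) & (counts > 0)
    # A power law is a straight line in log-log space:
    # log(count) = alpha * log(distance) + log(scale)
    alpha, log_scale = np.polyfit(np.log(distances[keep]), np.log(counts[keep]), 1)
    return alpha, np.exp(log_scale)

# Synthetic example (assumed values): counts decaying as 1/distance
# over a 20-50 kb range sampled every 2 kb.
d = np.arange(20e3, 52e3, 2e3)
c = 5e6 * d ** -1.0
alpha, scale = fit_power_law(d, c)
print(f"estimated exponent: {alpha:.3f}")
```

On real data, a clear curvature of the log–log plot over the chosen range would indicate that the range is too wide for the power-law assumption to hold.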
There are several limitations of the proposed approach. First, model learning can be computationally demanding in time and memory depending on the distance range or Hi-C data resolution. New big data learning algorithms could be used to process the data at a higher resolution that would allow in-depth analysis of 3D chromatin drivers (58). Second, the model makes the assumption that the accumulation of protein binding blocks long-range contacts, but other scenarios could explain the formation of borders. For instance, attraction/repulsion forces between histone marks can predict the folding of chromatin (59). Third, in human, we observed large changes of βs over distance, for instance for protein ZNF143 and DNA motif TFAP2C(var.3). Because lasso regression is not designed to estimate beta standard deviations, the significance of the difference between two βs obtained for two different distances cannot be tested. Instead, one could use a standard regression with selected variables to assess the significance.
AVAILABILITY
The model is available in the R package ‘HiCblock’ which can be downloaded from the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/HiCblock/index.html).
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
The authors are grateful to Corces lab (Emory University, USA) for data.
FUNDING
University of Toulouse; CNRS. Funding for open access charge: Fondation pour la Recherche Médicale (DEQ20160334940) to our team (R.M. and O.C.).
Conflict of interest statement. None declared.
REFERENCES
1. Halverson J.D., Smrek J., Kremer K., Grosberg A.Y. From a melt of rings to chromosome territories: the role of topological constraints in genome folding. Rep. Prog. Phys. 2014; 77:022601.
2. Dixon J.R., Selvaraj S., Yue F., Kim A., Li Y., Shen Y., Hu M., Liu J.S., Ren B.
Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012; 485:376–380.
3. Sexton T., Yaffe E., Kenigsberg E., Bantignies F., Leblanc B., Hoichman M., Parrinello H., Tanay A., Cavalli G. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell. 2012; 148:458–472.
4. Jin F., Li Y., Dixon J.R., Selvaraj S., Ye Z., Lee A.Y., Yen C.A., Schmitt A.D., Espinoza C.A., Ren B. A high-resolution map of the three-dimensional chromatin interactome in human cells. Nature. 2013; 503:290–294.
5. Lieberman-Aiden E., van Berkum N.L., Williams L., Imakaev M., Ragoczy T., Telling A., Amit I., Lajoie B.R., Sabo P.J., Dorschner M.O. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009; 326:289–293.
6. Pope B.D., Ryba T., Dileep V., Yue F., Wu W., Denas O., Vera D.L., Wang Y., Hansen R.S., Canfield T.K. et al. Topologically associating domains are stable units of replication-timing regulation. Nature. 2014; 515:402–405.
7. Cubenas-Potts C., Corces V.G. Architectural proteins, transcription, and the three-dimensional organization of the genome. FEBS Lett. 2015; 589:2923–2930.
8. Kellum R., Schedl P. A position-effect assay for boundaries of higher order chromosomal domains. Cell. 1991; 64:941–950.
9. Kellum R., Schedl P. A group of scs elements function as domain boundaries in an enhancer-blocking assay. Mol. Cell. Biol. 1992; 12:2424–2431.
10. Phillips-Cremins J.E., Sauria M.E.G., Sanyal A., Gerasimova T.I., Lajoie B.R., Bell J.S., Ong C.T., Hookway T.A., Guo C., Sun Y. et al. Architectural protein subclasses shape 3D organization of genomes during lineage commitment. Cell. 2013; 153:1281–1295.
11. Van Bortle K., Nichols M.H., Li L., Ong C.-T., Takenaka N., Qin Z.S., Corces V.G. Insulator function and topological domain border strength scale with architectural protein occupancy. Genome Biol. 2014; 15:R82.
12. Rao S.S.P., Huntley M.H., Durand N.C., Stamenova E.K., Bochkov I.D., Robinson J.T., Sanborn A.L., Machol I., Omer A.D., Lander E.S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2015; 159:1665–1680.
13. Zuin J., Dixon J.R., van der Reijden M.I.J.A., Ye Z., Kolovos P., Brouwer R.W., van de Corput M.P., van de Werken H.J., Knoch T.A., van IJcken W.F. et al. Cohesin and CTCF differentially affect chromatin architecture and gene expression in human cells. Proc. Natl. Acad. Sci. U.S.A. 2014; 111:996–1001.
14. Vietri-Rudan M., Barrington C., Henderson S., Ernst C., Odom D., Tanay A., Hadjur S. Comparative Hi-C reveals that CTCF underlies evolution of chromosomal domain architecture. Cell Rep. 2015; 10:1297–1309.
15. Mourad R., Cuvier O. Computational identification of genomic features that influence 3D chromatin domain formation. PLoS Comput. Biol. 2016; 12:e1004908.
16. Shin H., Shi Y., Dai C., Tjong H., Gong K., Alber F., Zhou X.J.
TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res. 2016; 44:e70.
17. Li L., Lyu X., Hou C., Takenaka N., Nguyen H.Q., Ong C.T., Cubeñas-Potts C., Hu M., Lei E.P., Bosco G. et al. Widespread rearrangement of 3D chromatin organization underlies Polycomb-mediated stress-induced silencing. Mol. Cell. 2015; 58:216–231.
18. Eagen K.P., Lieberman Aiden E., Kornberg R.D. Polycomb-mediated chromatin loops revealed by a sub-kilobase resolution chromatin interaction map. Proc. Natl. Acad. Sci. U.S.A. 2017; 114:8764–8769.
19. Wood A.M., Van Bortle K., Ramos E., Takenaka N., Rohrbaugh M., Jones B.C., Jones K.C., Corces V.G. Regulation of chromatin organization and inducible gene expression by a Drosophila insulator. Mol. Cell. 2011; 44:29–38.
20. Kellner W.A., Van Bortle K., Li L., Ramos E., Takenaka N., Corces V.G. Distinct isoforms of the Drosophila Brd4 homologue are present at enhancers, promoters and insulator sites. Nucleic Acids Res. 2013; 41:9274–9283.
21. Negre N., Brown C.D., Ma L., Bristow C.A.A., Miller S.W., Wagner U., Kheradpour P., Eaton M.L., Loriaux P., Sealfon R. et al. A cis-regulatory map of the Drosophila genome. Nature. 2011; 471:527–531.
22. Junion G., Spivakov M., Girardot C., Braun M., Gustafson E.H., Birney E., Furlong E.E. A transcription factor collective defines cardiac cell fate and reflects lineage history. Cell. 2012; 148:473–486.
23. The ENCODE Consortium An integrated encyclopedia of DNA elements in the human genome. Nature.
2012; 489:57–74.
24. Cuellar-Partida G., Buske F.A., McLeay R.C., Whitington T., Noble W.S., Bailey T.L. Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics. 2012; 28:56–62.
25. Zhao K., Hart C.M., Laemmli U.K. Visualization of chromosomal domains with boundary element-associated factor BEAF-32. Cell. 1995; 81:879–889.
26. Holohan E.E., Kwong C., Adryan B., Bartkuhn M., Herold M., Renkawitz R., Russell S., White R. CTCF genomic binding sites in Drosophila and the organisation of the Bithorax complex. PLoS Genet. 2007; 3:e112.
27. Adryan B., Woerfel G., Birch-Machin I., Gao S., Quick M., Meadows L., Russell S., White R. Genomic mapping of suppressor of hairy-wing binding sites in Drosophila. Genome Biol. 2007; 8:R167.
28. Negre N., Brown C.D., Shah P.K., Kheradpour P., Morrison C.A., Henikoff J.G., Feng X., Ahmad K., Russell S., White R.A. et al. A comprehensive map of insulator elements for the Drosophila genome. PLoS Genet. 2010; 6:e1000814.
29. Zolotarev N., Fedotova A., Kyrchanova O., Bonchuk A., Penin A.A., Lando A.S., Eliseeva I.A., Kulakovskiy I.V., Maksimenko O., Georgiev P. Architectural proteins Pita, Zw5 and ZIPIC contain homodimerization domain and support specific long-range interactions in Drosophila. Nucleic Acids Res. 2016; 44:7228–7241.
30. Hart C.M., Cuvier O., Laemmli U.K. Evidence for an antagonistic relationship between the boundary element-associated factor BEAF and the transcription factor DREF. Chromosoma. 1999; 108:375–383.
31. Li J.
, Gilmour D.S. Distinct mechanisms of transcriptional pausing orchestrated by GAGA factor and M1BP, a novel transcription factor. EMBO J. 2013; 32:1829–1841.
32. Read D., Manley J.L. Alternatively spliced transcripts of the Drosophila tramtrack gene encode zinc finger proteins with distinct DNA binding specificities. EMBO J. 1992; 11:1035–1044.
33. Cuartero S., Fresan U., Reina O., Planet E., Espinas M.L. Ibf1 and Ibf2 are novel CP190-interacting proteins required for insulator function. EMBO J. 2014; 33:637–647.
34. Dai Q., Ren A., Westholm J.O., Duan H., Patel D.J., Lai E.C. Common and distinct DNA-binding and regulatory activities of the BEN-solo transcription factor family. Genes Dev. 2015; 29:48–62.
35. Hug C.B., Grimaldi A.G., Kruse K., Vaquerizas J.M. Chromatin architecture emerges during zygotic genome activation independent of transcription. Cell. 2017; 169:216–228.
36. Dekker J., Marti-Renom M.A., Mirny L.A. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 2013; 14:390–403.
37. Hu M., Deng K., Selvaraj S., Qin Z., Ren B., Liu J.S. HiCNorm: removing biases in Hi-C data via Poisson regression. Bioinformatics. 2012; 28:3131–3133.
38. Imakaev M., Fudenberg G., McCord R.P., Naumova N., Goloborodko A., Lajoie B.R., Dekker J., Mirny L.A. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods. 2012; 9:999–1003.
39. Gaszner M., Felsenfeld G.
Insulators: exploiting transcriptional and epigenetic mechanisms. Nat. Rev. Genet. 2006; 7:703–713.
40. Ghandi M., Mohammad-Noori M., Ghareghani N., Lee D., Garraway L., Beer M.A. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics. 2016; 32:2205–2207.
41. Kumar R.P., Krishnan J., Singh N.P., Singh L., Mishra R.K. GATA simple sequence repeats function as enhancer blocker boundaries. Nat. Commun. 2013; 4:1844.
42. Dali R., Blanchette M. A critical assessment of topologically associating domain prediction tools. Nucleic Acids Res. 2017; 45:2994–3005.
43. Levy-Leduc C., Delattre M., Mary-Huard T., Robin S. Two-dimensional segmentation for analyzing Hi-C data. Bioinformatics. 2014; 30:i386–i392.
44. Ghavi-Helm Y., Klein F.A., Pakozdi T., Ciglar L., Noordermeer D., Huber W., Furlong E.E. Enhancer loops appear stable during development and are associated with paused polymerase. Nature. 2014; 512:96–100.
45. Maksimenko O., Bartkuhn M., Stakhov V., Herold M., Zolotarev N., Jox T., Buxa M.K., Kirsch R., Bonchuk A., Fedotova A. et al. Two new insulator proteins, Pita and ZIPIC, target CP190 to chromatin. Genome Res. 2015; 25:89–99.
46. Herranz H., Morata G. The functions of pannier during Drosophila embryogenesis. Development. 2001; 128:4837–4846.
47. Wang C., Xi R. Keeping intestinal stem cell differentiation on the Tramtrack. Fly. 2015; 9:110–114.
48. Djekidel M.N., Liang Z., Wang Q., Hu Z., Li G., Chen Y., Zhang M.Q.
3CPET: finding co-factor complexes from ChIA-PET data using a hierarchical Dirichlet process. Genome Biol. 2015; 16:288.
49. Liang J., Lacroix L., Gamot A., Cuddapah S., Queille S., Lhoumaud P., Lepetit P., Martin P.G.P., Vogelmann J., Court F. et al. Chromatin immunoprecipitation indirect peaks highlight functional long-range interactions among insulator proteins and RNAII pausing. Mol. Cell. 2014; 53:672–681.
50. Hirano T. Condensins: organizing and segregating the genome. Curr. Biol. 2005; 15:R265–R275.
51. Hirano T. At the heart of the chromosome: SMC proteins in action. Nat. Rev. Mol. Cell Biol. 2006; 7:311–322.
52. Gibcus J., Dekker J. The hierarchy of the 3D genome. Mol. Cell. 2013; 49:773–782.
53. Tan S.K., Lin Z.H., Chang C.W., Varang V., Chng K.R., Pan Y.F., Yong E.L., Sung W.K., Cheung E. AP-2γ regulates oestrogen receptor-mediated long-range chromatin interaction and gene transcription. EMBO J. 2011; 30:2569–2581.
54. Vogelmann J., Le Gall A., Dejardin S., Allemand F., Gamot A., Labesse G., Cuvier O., Nègre N., Cohen-Gonsaud M., Margeat E. et al. Chromatin insulator factors involved in long-range DNA interactions and their role in the folding of the Drosophila genome. PLoS Genet. 2014; 10:e1004544.
55. Pagans S., Ortiz-Lombardia M., Espinas M.L., Bernues J., Azorin F. The Drosophila transcription factor tramtrack (TTK) interacts with Trithorax-like (GAGA) and represses GAGA-mediated activation. Nucleic Acids Res. 2002; 30:4406–4413.
56. Busslinger G.A., Stocsits R.R.
, van der Lelij P., Axelsson E., Tedeschi A., Galjart N., Peters J.-M. Cohesin is positioned in mammalian genomes by transcription, CTCF and Wapl. Nature. 2017; 544:503–507.
57. Doyle B., Fudenberg G., Imakaev M., Mirny L.A. Chromatin loops as allosteric modulators of enhancer-promoter interactions. PLoS Comput. Biol. 2014; 10:e1003867.
58. Facchinei F., Scutari G., Sagratella S. Parallel selective algorithms for nonconvex big data optimization. IEEE Trans. Sig. Process. 2015; 63:1874–1889.
59. Jost D., Carrivain P., Cavalli G., Vaillant C. Modeling epigenome folding: formation and dynamics of topologically associated chromatin domains. Nucleic Acids Res. 2014; 42:9553–9561.
© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
(Po)STAC (Polycistronic SunTAg modified CRISPR) enables live-cell and fixed-cell super-resolution imaging of multiple genes
Neguembor, Maria V; Sebastian-Perez, Ruben; Aulicino, Francesco; Gomez-Garcia, Pablo A; Cosma, Maria P; Lakadamyali, Melike
doi: 10.1093/nar/gkx1271 pmid: 29294098
Abstract
CRISPR/dCas9-based labeling has allowed direct visualization of genomic regions in living cells. However, poor labeling efficiency and signal-to-background ratio have limited its application to visualize genome organization using super-resolution microscopy. We developed (Po)STAC (Polycistronic SunTAg modified CRISPR) by combining CRISPR/dCas9 with SunTag labeling and polycistronic vectors. (Po)STAC enhances both labeling efficiency and fluorescence signal detected from labeled loci, enabling live cell imaging as well as super-resolution fixed-cell imaging of multiple genes with high spatiotemporal resolution.
INTRODUCTION
Visualization of endogenous gene loci in living cells is highly valuable for studying dynamic changes to genome organization during any cellular process. Programmable DNA-binding proteins such as clustered regularly interspaced short palindromic repeats (CRISPR)-associated protein 9 (Cas9) have recently been adopted to visualize endogenous repetitive and non-repetitive genomic sequences in living cells (1). This approach relies on the use of a deactivated version of Cas9 lacking enzymatic activity (dCas9) fused to a fluorescent protein (FP), which can be targeted to a number of genomic sequences by using guide RNAs (sgRNAs). The programmable nature of this approach is particularly attractive as it allows targeting a large number of genomic loci. In recent years, this approach has also been adopted to multi-color labeling using orthogonal Cas9 proteins or by introducing RNA aptamers into the sgRNAs (2–5). However, the efficiency of labeling (percentage of cells with fluorescently detectable loci) as well as the amount of fluorescence signal detected from individual loci using this approach have been traditionally low (5,6). The labeling efficiency is limited partially by the efficiency of delivery of many plasmids and partially by the level of expression (5,6).
The amount of fluorescence signal detected is limited by the small copy number of dCas9-FPs specifically bound to the locus over a high background introduced by the unbound dCas9-FPs in the nucleoplasm. The low efficiency and low signal combined limit the general applicability of this method for long term, fast imaging of genome dynamics as well as super-resolution imaging of gene architecture. Therefore, a strategy that can boost the detected signal as well as the labeling efficiency is essential. To overcome these limitations, we took advantage of two separate and complementary strategies: the use of SunTag and polycistronic vectors. SunTag is a repeating peptide array that can be used to recruit multiple copies of an antibody-fusion protein to the target of interest (7). Using this strategy, up to 24 copies of superfolder GFP (sfGFP) fused to the antibody have been recruited to single protein molecules fused to a repeated SunTag array, which is targeted by the antibody. Polycistronic vectors allow the expression of multiple sgRNAs from a single synthetic gene including tRNA–sgRNA modules in tandem. The insertion of the tRNA in between the sgRNAs allows the precise excision of transcripts by the endogenous RNases (8,9). This system has previously been used to demonstrate efficient genome editing in plant and Drosophila cells with up to eight sgRNAs (10,11) but has never been validated in mammalian cells and for imaging applications. Here, we fully characterized the labeling efficiency of SunTag combined with CRISPR/dCas9, which we termed STAC for simplicity (SunTAg modified CRISPR). Further, we developed PoSTAC (Polycistronic SunTAg-modified CRISPR) for enhanced genome visualization by combining SunTag alone or SunTag and polycistronic vectors with CRISPR/dCas9. 
MATERIALS AND METHODS
Plasmid synthesis
pHRdSV40-dCas9–10xGCN4_v4-P2A-BFP (Addgene # 60903), pHRdSV40-NLS-dCas9–24xGCN4_v4-NLS-P2A-BFP-dWPRE (Addgene # 60910) and pHR-scFv-GCN4-sfGFP-GB1-NLS-dWPRE (Addgene # 60906) were a gift from Ron Vale (7). pSLQ1658-dCas9-EGFP (Addgene # 51023), pSLQ1651-sgTelomere(F+E) (Addgene # 51024) and pSLQ1661-sgMUC4-E3(F+E) (Addgene # 51025) were a gift from Bo Huang and Stanley Qi (1). To eliminate the red fluorescence from sgRNA plasmids, the mCherry gene was truncated from pSLQ1651-sgTelomere(F+E) and pSLQ1661-sgMUC4-E3(F+E) plasmids by digestion with AgeI + SgrAI and ligation of compatible ends. pSLQ1661-sgMUC1-E1(F+E) with truncated mCherry was generated by Gibson assembly using pSLQ1661-sgMUC4-E3(F+E) with truncated mCherry and replacing the sgMUC4-E3 sequence with the sgMUC1-E1 sequence using the following primers:
Forward Fragment A: CCGCGCCACATAGCAGAACTTTAAA
Reverse Fragment A: tgggctgggggggcggtggagcCAACAAGGTGGTTCTCCAAGGGA
Forward Fragment B: gctccaccgcccccccagcccaGTTTAAGAGCTATGCTGGAAACA
Reverse Fragment B: TTTAAAGTTCTGCTATGTGGCGCGG
The underlined sequence (shown here in lowercase) corresponds to the sgMUC1-E1 sequence (1). Polycistronic vectors were generated by gene synthesis and cloned into a pUC57 backbone by GenScript.
Sequences are listed below: >hU6 Promoter_Plant tRNAGly_sgRNA MUC1-E1(F+E)_ Plant tRNAGly_sgRNA MUC4-E3(F+E)_Terminator: TTTCCCATGATTCCTTCATATTTGCATATACGATACAAGGCTGTTAGAGAGATAATTGGAATTAATTTGACTGTAAACACAAAGATATTAGTACAAAATACGTGACGTAGAAAGTAATAATTTCTTGGGTAGTTTGCAGTTTTAAAATTATGTTTTAAAATGGACTATCATATGCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAACAAAGCACCAGTGGTCTAGTGGTAGAATAGTACCCTGCCACGGTACAGACCCGGGTTCGATTCCCGGCTGGTGCAGCTCCACCGCCCCCCCAGCCCAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCAACAAAGCACCAGTGGTCTAGTGGTAGAATAGTACCCTGCCACGGTACAGACCCGGGTTCGATTCCCGGCTGGTGCAGTGGCGTGACCTGTGGATGCTGGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTGTTT >hU6 Promoter_Human tRNAGly_sgRNA MUC1-E1(F+E)_ Human tRNAGly_sgRNA MUC4-E3(F+E)_Terminator: TTTCCCATGATTCCTTCATATTTGCATATACGATACAAGGCTGTTAGAGAGATAATTGGAATTAATTTGACTGTAAACACAAAGATATTAGTACAAAATACGTGACGTAGAAAGTAATAATTTCTTGGGTAGTTTGCAGTTTTAAAATTATGTTTTAAAATGGACTATCATATGCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAACAAAGCATTGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCAATGCAGCTCCACCGCCCCCCCAGCCCAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCAACAAAGCATTGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCAATGCAGTGGCGTGACCTGTGGATGCTGGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTGTTT Cell culture and transgene expression HeLa, HeLa 1.3 (kindly provided by Titia De Lange, The Rockefeller University, USA), C2C12 (kindly provided by Pura Muñoz-Canoves, UPF Barcelona, Spain) and HEK293T cells were cultured in Dulbecco’s modified Eagle’s medium (DMEM) (#41965062, Gibco) supplemented with 10% Fetal Bovine Serum (FBS) (#10270106, Gibco), 1× penicillin/streptomycin (#15140122, Gibco). 
mES cells were cultured on gelatin (#ES-006-B, Merck) coated dishes in sLif medium composed of DMEM supplemented with 15% FBS, 1× penicillin/streptomycin, 1× GlutaMax (#35050061, Gibco), 1× sodium pyruvate (#11360070, Gibco), 1× MEM non-essential amino acids (#11140050, Gibco), 0.2% 2-mercaptoethanol (#31350010, Gibco) and 1000 U/ml LIF ESGRO (#ESG1107, Merck). Transfections were performed in suspension with Fugene HD (#E2311, Promega) for HeLa, HEK293T and C2C12 cells and with the Mouse ES Cell Nucleofector® Kit (#VAPH-1001, Lonza) for mESCs, following the manufacturers' conditions and using equimolar amounts of plasmids. Transfected cells were directly plated on 8-well Lab-Tek I borosilicate chambers (#155411, Nunc) at a density of 3.5 × 10⁵ cells/cm².

Live cell imaging and analysis

Transfected cells were imaged at 48 h post-transfection in DMEM without Phenol red (#21063029, Gibco) supplemented with 10% FBS and 1× penicillin/streptomycin. Images were acquired on a Leica TCS SP5 II confocal microscope with a 63×, 1.4 NA HCX PL APO lambda blue oil immersion objective. To quantify the labeling efficiency and the intensity of loci, the complete nuclear volume of living cells was imaged in z-stacks with 0.1 μm steps at 700 Hz bidirectional scanning with a 95.55 μm pinhole. Images were analyzed in ImageJ. Maximum intensity Z projections of the entire imaged volume were generated for each file. For signal-to-noise ratio (SNR) analysis, telomeres and MUC4 loci were manually segmented using a circle with a diameter of 3 pixels to quantify the average intensity within loci. Five circles of the same dimension were randomly placed in background regions to obtain the average background intensity and the noise (standard deviation of the background). SNR was calculated by dividing the signal (locus intensity minus average background intensity) by the noise.
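The SNR definition above can be written out as a short calculation. The following is a minimal sketch rather than the authors' ImageJ workflow: the function name `signal_to_noise` and the example intensities are hypothetical, and it assumes the noise is the sample standard deviation across the mean intensities of the background circles (the text does not say whether the standard deviation is taken across circles or across pixels).

```python
from statistics import mean, stdev


def signal_to_noise(locus_intensity, background_intensities):
    """SNR = (locus intensity - mean background) / SD of background.

    locus_intensity: average intensity inside the circle drawn on a locus.
    background_intensities: average intensities of the background circles.
    """
    if len(background_intensities) < 2:
        raise ValueError("need at least two background samples to estimate noise")
    background = mean(background_intensities)
    # Assumption: noise is the sample SD across the background circles.
    noise = stdev(background_intensities)
    if noise == 0:
        raise ValueError("background has zero variance; SNR is undefined")
    return (locus_intensity - background) / noise


# Hypothetical 8-bit intensities: one locus and five background circles.
example_snr = signal_to_noise(110.0, [10.0, 12.0, 8.0, 11.0, 9.0])
print(round(example_snr, 2))
```

Subtracting the mean background before dividing keeps the ratio comparable between images acquired with different offsets.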
Signal over background was measured to compare polycistronic vectors by generating a plot of intensity values across a 2 μm line along every locus. Every intensity curve was background corrected, and the normalized curves were averaged to obtain signal over background curves for each condition. The area under the curves was calculated by the trapezoidal method and normalized to h_tRNA. For dynamic studies, living cells were imaged at 1400 Hz bidirectional scanning speed with a 191.1 μm pinhole for 2000 frames at a frame rate of 5 or 10 Hz. See the Telomere tracking section for detailed analysis information.

Chromatin immunoprecipitation (ChIP)

Chromatin was prepared from HeLa cells 48 h post-transfection as previously described (12). ChIP was performed as previously published (12) with the following modifications. A total of 25 μg of chromatin was incubated with 50 μl Protein G dynabeads (#10003D, ThermoFisher) previously bound for 5 h at 4°C to 2 μl of rabbit anti-Cas9 (#C15310258–20, Diagenode) or 2 μl of whole molecule rabbit IgG as negative control (#ab37415, Abcam). qPCR was performed with Lightcycler 480 SYBR green I master (#4887352001, Roche) and the primers listed below using a Lightcycler 480 (Roche) qPCR instrument.
MUC1 sgRNA targeted region A Fw: AGGCTCTGCATCAGGCTCAG
MUC1 sgRNA targeted region A Rv: TCTTGGTGCTATGGCTGGCA
MUC1 sgRNA targeted region B Fw: AGCCCACGGTGTCACCTC
MUC1 sgRNA targeted region B Rv: CGGGGCCGGCCTGGTGT
MUC4 sgRNA targeted region Fw: GCCACCCCTCTTCCTGTCAC
MUC4 sgRNA targeted region Rv: GTGACCTGTGGATGCTGAGG
Untargeted region MUC4 A (Negative control) Fw: TCCACACAGAGCAGGCACTC
Untargeted region MUC4 A (Negative control) Rv: CACTGCAAGGGGTCCAGGAA
Untargeted region MUC4 B (Negative control) Fw: TCAATGGTGGTCGTGTGATT (1)
Untargeted region MUC4 B (Negative control) Rv: AAGTCGGTGCAGCTGTCTCT (1)

Immunolabeling for STORM

For STORM imaging, transfected cells (48 h post-transfection) were incubated in DMEM with NileRed 240 nm beads (#FP-0256–2, Spherotech) at 1:450 dilution for 30 min in the incubator. After washing three times for 5 min with growth medium, cells were kept in the incubator with growth medium for one additional hour. Cells were then washed once with phosphate-buffered saline (PBS), fixed with 10% PFA (#43368, Alfa Aesar) for 10 min at room temperature (RT) and washed three times in PBS for 5 min each. For immunolabeling with anti-GFP nanobodies, cells were permeabilized with PBS containing 0.3% Triton X-100 for 15 min at RT. Blocking was performed in PBS containing 4% horse serum (#26050088, Gibco) and 1% bovine serum albumin (#A7906, Sigma) for 45 min at RT. Cells were incubated with AlexaFluor-647 labeled anti-GFP nanobodies (NHS conjugated) (13) in blocking buffer at 1:100 dilution for 30 min at RT in the dark. Cells were washed three times in PBS for 5 min each at RT. AF-647 anti-GFP nanobodies were a kind gift from Jonas Ries (EMBL Heidelberg, Germany).

STORM imaging and data analysis

STORM images were acquired on an N-STORM 4.0 microscope (Nikon) equipped with a CFI HP Apochromat TIRF 100×, 1.49 NA oil objective and an iXon Ultra 897 camera (Andor).
Cells stained with anti-GFP nanobodies were imaged for 60 000 frames at 16 ms per frame using continuous 405 nm illumination, which was gradually increased over the imaging duration. Conventional fluorescence images were taken at the beginning of each imaging cycle. The imaging buffer was composed of 150 mM Tris–HCl pH 8.8, 100 mM cysteamine MEA (#30070, Sigma-Aldrich), 1% Glox solution (0.5 mg/ml glucose oxidase, 40 mg/ml catalase (#G2133 and #C100, Sigma-Aldrich)) and 5% glucose (#G8270, Sigma-Aldrich). STORM images were analyzed and rendered in Insight3 as previously described (14,15). Localizations were identified based on an intensity threshold (minimum intensity 2000) and fit to a simple Gaussian with a width between 200 and 400 nm to determine the x and y positions. Images were rendered with localizations represented as uniform Gaussian peaks having a width of 9 nm. The same contrast parameters were applied to each image to allow one-to-one comparison and proper visualization of both the background and foreground signals, reflecting the overall signal density of the image. Identified loci were verified by overlaying super-resolution images with conventional fluorescence images of GFP and AF647, where telomeres are enriched for GFP and AF647 signals. Individual loci were manually selected in Insight3 to quantify the number of localizations per telomere and the ellipticity of telomeres from σx and σy. Voronoi tessellation (16,17) was performed with customized Matlab code based on ClusterViSu (16). A common threshold (maximum Voronoi polygon area of 99.84 nm²) was applied to all nuclei analyzed, and an additional threshold for the minimum number of localizations per cluster was adjusted to account for the variability of background signal and labeling density observed between experiments, in order to obtain optimal telomere identification.
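The area threshold on Voronoi polygons can be illustrated with a short sketch. This is not the authors' Matlab code or ClusterViSu: the function names are hypothetical, the polygons are assumed to be already computed (one polygon per localization, vertices in nm), and the grouping of neighbouring polygons into clusters and the minimum-localization threshold are deliberately left out.

```python
def polygon_area(vertices):
    """Area of a simple polygon (shoelace formula); vertices are (x, y) in nm."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def dense_polygons(polygons, max_area_nm2):
    """Keep polygons whose area does not exceed the threshold (dense regions)."""
    return [p for p in polygons if polygon_area(p) <= max_area_nm2]


def cluster_area_and_density(polygons):
    """Return (area in nm^2, localizations per um^2) for one cluster.

    Assumes one localization per Voronoi polygon.
    """
    area_nm2 = sum(polygon_area(p) for p in polygons)
    if area_nm2 == 0:
        raise ValueError("cluster has zero area")
    density_per_um2 = len(polygons) / (area_nm2 * 1e-6)  # 1 um^2 = 1e6 nm^2
    return area_nm2, density_per_um2


# Hypothetical polygons: two small (dense) cells and one large (sparse) cell.
small = [(0, 0), (8, 0), (8, 8), (0, 8)]
large = [(0, 0), (10, 0), (10, 10), (0, 10)]
kept = dense_polygons([small, large, small], max_area_nm2=99.84)
print(len(kept), cluster_area_and_density(kept))
```

Because a small polygon means the nearest neighbouring localizations are close by, thresholding on polygon area selects regions of high localization density without choosing a pixel size.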
Overlay of the Voronoi tessellation on conventional fluorescence images showed good correlation of the identified clusters with telomeres. The area of telomeres in nm² was calculated as the sum of the Voronoi polygons that form a telomeric cluster. The density (number of localizations/μm²) of telomeres was obtained by dividing the number of localizations of the telomeric cluster by its area. The density of the background was obtained by sampling three 1 μm² areas within every nucleus analyzed.

Telomere tracking

Videos were analyzed with custom-written Matlab-based software that combines TrackRecord (18) and @msdanalyzer (19) with custom algorithms. Loci with intensity values above a threshold of 30 (8-bit images) and dimensions within 5 × 5 pixels (subROI) were automatically identified by TrackRecord at every frame, and a 2D Gaussian fit was then performed to calculate the x and y coordinates of the loci. Tracks were generated using a nearest-neighbors approach. If a merging or splitting event occurred, the track was discarded. Further computational analysis was carried out to estimate the diffusion coefficient (D, in μm²/s) of telomeres, the SNR and the duration of the tracks. The Time Mean Squared Displacement curves (T-MSD) were used for a linear fit, using the first four points of the MSD curve of each individual track (20). Only tracks longer than 10 frames were analyzed. Diffusion coefficients were obtained for every locus, data were plotted as a histogram of the logarithm of the diffusion coefficients of all the tracks, and the diffusion coefficient of the whole population was reported as the mean of all the coefficients ± the standard deviation. The global behaviour of the particles was studied using the Time-Ensemble Mean Squared Displacement curve (TE-MSD). In all cases, a confined type of motion was observed and the radius of confinement that represents the whole population of tracks was obtained.
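The per-track diffusion estimate described above can be sketched as follows. This is an illustrative pure-Python version, not the TrackRecord/@msdanalyzer pipeline: the function names are hypothetical, and it assumes two-dimensional free diffusion at short lag times, so that MSD(t) ≈ 4Dt + offset and D is the fitted slope divided by 4.

```python
def time_averaged_msd(track, max_lag):
    """Time-averaged MSD of one track for lags 1..max_lag (in frames).

    track: list of (x, y) positions in um, one per frame.
    """
    msd = []
    for lag in range(1, max_lag + 1):
        squared = [
            (track[i + lag][0] - track[i][0]) ** 2
            + (track[i + lag][1] - track[i][1]) ** 2
            for i in range(len(track) - lag)
        ]
        msd.append(sum(squared) / len(squared))
    return msd


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def diffusion_coefficient(track, frame_interval_s, n_points=4, min_frames=10):
    """D in um^2/s from the first n_points of the MSD curve.

    Returns None for tracks that are not longer than min_frames.
    Assumption: 2D diffusion, so D = slope / 4.
    """
    if len(track) <= min_frames:
        return None
    msd = time_averaged_msd(track, n_points)
    lags_s = [frame_interval_s * (k + 1) for k in range(n_points)]
    slope, _ = linear_fit(lags_s, msd)
    return slope / 4.0
```

Restricting the fit to the first few lag times limits the influence of confinement, which flattens the MSD curve at longer lags.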
This radius was estimated as the square root of the MSD value of the horizontal asymptote, and the error was calculated from the weighted standard deviation of the TE-MSD (21). The photobleaching kinetics were estimated by fitting the evolution of the number of counts per frame with a two-component exponential decay: N(t) = f1*exp(-kb1*t) + (1-f1)*exp(-kb2*t) (22).

Statistical analysis

Statistical analysis was performed in Graphpad Prism (v5.04) and in Matlab 2014b. An unpaired two-tailed t-test was applied for statistical comparison of two experimental conditions. One-way ANOVA with Tukey’s multiple comparison test was applied for statistical comparison of datasets with more than two conditions. Statistical significance is represented in the following manner: ns P > 0.05, *P ≤ 0.05, **P ≤ 0.01, ***P ≤ 0.001, ****P ≤ 0.0001.

RESULTS AND DISCUSSION

(Po)STAC improves labeling efficiency and signal to background ratio for genome visualization

We tested the effectiveness of STAC by labeling high-repeat sequences in telomeres (Figure 1A) as well as moderate-to-low repeat sequences in the Mucin 1 and 4 (MUC1 and MUC4) loci (Figure 1D and Supplementary Figure S1A). Cells were transfected with equal total amounts of dCas9-STAC and dCas9-GFP. SunTag containing 24 repeats of the peptide array (24× STAC) improved labeling efficiency by 2.5-fold (percentage of cells with detectable loci = 35, 69 and 87% for dCas9 alone, 10× STAC and 24× STAC, respectively; n ≥ 4 experiments) (Figure 1B) and the SNR by 5.4-fold in the case of telomeres compared to dCas9-GFP (SNR calculated as the signal divided by the standard deviation of the background = 7.2 ± 3.7 SD averaged over 248 telomeres, 26.7 ± 12.9 SD averaged over 620 telomeres and 38.6 ± 8.92 SD averaged over 638 telomeres for dCas9 alone, 10× STAC and 24× STAC, respectively; n = 3 experiments) (Figure 1C).
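As a quick arithmetic check, the fold changes for 24× STAC relative to dCas9 alone follow directly from the values quoted above (percentage of cells with detectable loci, and mean telomere SNR):

```latex
\[
\frac{87\%}{35\%} \approx 2.5,
\qquad
\frac{38.6}{7.2} \approx 5.4
\]
```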
The fluorescent puncta corresponding to the gene loci were detected only in cells expressing the sgRNA, and no fluorescent puncta were detectable in control cells lacking sgRNA (Supplementary Figure S1A). The labeling of the Mucin loci was further validated by ChIP followed by qPCR (ChIP-qPCR) using an antibody against dCas9. dCas9 was enriched in regions of the MUC1 and MUC4 genes containing the sgRNA target sequences, and this enrichment was not observed for non-target regions (Supplementary Figure S2). Importantly, STAC was applicable to label genes in several cell lines including HeLa cells, HEK 293T cells, C2C12 cells and mouse embryonic stem cells (Supplementary Figure S3A).

Figure 1. (A) Maximum intensity projections of confocal images of HeLa cells transfected with sgRNA Telomere and dCas9-GFP, dCas9–10xSTAC or dCas9–24xSTAC. (B) Labeling efficiency as percentage of transfected HeLa cells with detectable loci. n ≥ 4 experiments. (C) SNR of telomeres and MUC4 loci measured in HeLa cells transfected with dCas9-GFP, dCas9–10xSTAC or dCas9–24xSTAC. n = 3 experiments. (D) Maximum intensity projections of confocal images of HeLa cells transfected with dCas9–24xSTAC and sgRNAs for MUC1, MUC4, MUC1+MUC4, plant_tRNA_MUC1_MUC4 or human_tRNA_MUC1_MUC4. (E) Labeling efficiency as percentage of transfected HeLa cells with detectable loci. n = 3 experiments. (F) Mean number of loci detected per nucleus. n = 3 experiments. For all plots Mean ± SD is displayed. Stars indicate P-values (ns P > 0.05, *P ≤ 0.05, **P ≤ 0.01, ***P ≤ 0.001) for one-way ANOVA with Tukey’s multiple comparison test.

In order to extend this labeling approach to multiple genes, we further designed a mammalian cell optimized polycistronic vector including a human tRNAGly (h_tRNA vector) interspersed between the sgRNAs to enable expression of multiple sgRNAs from a single plasmid (PoSTAC, Supplementary Figure S1B). PoSTAC further allowed multiple genes, in this case MUC1 and MUC4, to be labeled simultaneously in single cells (Figure 1D) with an efficiency similar to single gene labeling and higher than co-transfection with plasmids containing individual sgRNAs (Figure 1E). ChIP-qPCR experiments confirmed that the detected loci indeed corresponded to MUC1 and MUC4 (Supplementary Figure S2). Labeling efficiency was higher with the polycistronic vector optimized for human (h_tRNA) than with the plant tRNAGly vector (p_tRNA) (82.5 versus 65.9% of cells with fluorescently detectable loci with h_tRNA and p_tRNA, respectively; n ≥ 4 experiments) (Figure 1E). Further, the fluorescence intensity of each locus with h_tRNA was similar to single gene labeling (AUC = 1 a.u. for single gene labeling averaged over 217 MUC4 loci versus 0.98 a.u. for multi-gene labeling averaged over 413 MUC1/4 loci; n = 3 experiments), while it was slightly reduced for the p_tRNA vector (0.76 a.u. averaged over 322 MUC1/4 loci; n = 3 experiments).
HeLa cells are triploid for both MUC1 and MUC4; therefore, we expected to visualize up to 6 or 12 loci depending on whether the cells were in the G1 or G2/S phase of the cell cycle, respectively. The number of loci indeed ranged from 1 to 12, with an average of 5.5 ± 0.69 SD and 4.4 ± 0.91 SD detected loci per cell for the h_tRNA and p_tRNA vectors, respectively (Figure 1F and Supplementary Figure S1C). In the case of the h_tRNA vector, this number is close to the expected value of 6, since the large majority of cells were imaged in G1 with three copies of each gene. Cells containing fewer than six loci are likely due either to occasional inefficient cleavage of the tRNA or to HeLa cells containing fewer than three copies of one or both of the genes. Importantly, the average number of loci detected per cell was higher with the polycistronic vectors than with the co-transfection strategy using single sgRNAs, further indicating the superior performance of this imaging strategy (Figure 1E and Supplementary Figure S1C).

STAC enables super-resolution imaging of telomere compaction in fixed cells

Telomere length is tightly regulated in mammalian cells and is key for cell survival. During normal cell homeostasis, telomere length is rigorously controlled during DNA replication (23). Short telomeres induce cellular senescence, apoptosis (23) and age-associated diseases (24). Telomere length is maintained in cancer cells, which can elongate telomeres to escape senescence and proliferate indefinitely (23). Adult stem cells have long telomeres, which are also elongated during somatic cell reprogramming (25). The mechanisms by which telomeres protect chromosome ends from double-stranded break repair and solve the end protection problem are subject to intense debate (26–28).
Recent super-resolution studies of telomere compaction have relied either on harsh sample preparation methods such as DNA fluorescence in situ hybridization (FISH) (26–28) or on indirect visualization of telomere-binding proteins (26). Importantly, these methods are not compatible with visualizing genome dynamics in living cells. For these reasons, an imaging method to visualize the dynamics and compaction of genomic regions like telomeres with high spatiotemporal resolution is highly valuable for many biological applications. However, the poor labeling efficiency and the low signal-to-background ratio of the CRISPR-dCas9 label have made its application to super-resolution studies highly challenging. We therefore applied STAC to image telomeres at high resolution in fixed cells in two HeLa cell lines that have different telomere lengths (HeLa and HeLa 1.3, with average telomere lengths of 6 and 23 kb, respectively) (29). The labeling efficiency was similar between HeLa (83.4% of cells with detectable telomeres) and HeLa 1.3 cells (87.9% of cells with detectable telomeres), and telomeres appeared brighter in the HeLa 1.3 cells, as expected from their longer lengths (Supplementary Figure S3B and C). We next analyzed the telomeres in the two cell lines using stochastic optical reconstruction microscopy (STORM) by labeling the dCas9-STAC/scFv-GFP complex with an AlexaFluor647-tagged anti-GFP nanobody (13) (Figure 2A). The telomere loci detected by STORM overlapped with those detected in the conventional fluorescence images but appeared much smaller (Figure 2A). The detected localization density was much higher in the telomere loci than in background regions not containing the specific loci (27393 ± 2875 SD and 399.3 ± 166.4 SD localizations per μm² in telomere loci versus background, respectively).
On average, 3.7-fold more fluorophore localizations were detected from the longer telomeres in HeLa 1.3 cells (average number of localizations = 320 ± 326.7 SD per telomere for HeLa and 1179 ± 931.3 SD per telomere for HeLa 1.3 cells; n = 3 experiments and n = 46 and 44 cells for HeLa and HeLa 1.3, respectively) (Figure 2B). This increase in the number of detected localizations per telomere correlates well with the previously determined 3.8-fold increase in the number of repeats for telomeres in HeLa 1.3 cells (29). Telomeres were slightly elliptical in shape, with a minor-to-major axis ratio of 0.82 ± 0.1 SD and 0.85 ± 0.1 SD for HeLa and HeLa 1.3 cells, respectively. We used Voronoi tessellation (16,17) to segment the telomeres in the super-resolution images and determined their area by summing the areas of the individual Voronoi cells for each telomere (Figure 2C and D). Telomere area was increased by 2.8-fold in HeLa 1.3 cells (Figure 2C), corresponding to a volume increase of roughly 4.7-fold if volume scales as area to the power 3/2 (2.8^(3/2) ≈ 4.7). The increase in volume was only slightly larger than the increase observed for the number of localizations per telomere and the linear length of the telomeres. These results therefore suggest that telomere compaction is not very different in the two cell lines. Super-resolution imaging was possible not only for high-repeat sequences such as telomeres but also for moderate-to-low repeat sequences such as MUC4, allowing visualization and discrimination of nearby alleles too close to be properly resolved by conventional microscopy (Supplementary Figure S4).

Figure 2. (A) Super resolution images (STORM) of fixed HeLa (top) and HeLa 1.3 (bottom) cells transfected with dCas9–24xSTAC and sgRNA Telomere, immunolabeled with AlexaFluor647 anti-GFP nanobody (NB).
From left to right: conventional AF647 fluorescence image (green), STORM image (orange), zoom-in overlay of conventional and STORM image (white square) and high magnification zoom-in (red square). The numbers represent the standard deviation of the localizations along the x- and y-axis. (B) Number of localizations of telomeres in HeLa and HeLa 1.3 cells. n = 3 experiments, n = 46 (HeLa) and 44 (HeLa 1.3) cells. (C) Area of telomeres identified by Voronoi tessellation analysis expressed in nm². (D) Example of Voronoi tessellation in HeLa and HeLa 1.3; the telomeres shown correspond to the high magnification panels in (A). For all plots Mean ± SD is displayed. Stars indicate P-values (****P ≤ 0.0001) for two-tailed unpaired t-test.

STAC enables live-cell imaging of telomere dynamics

Live cell compatibility of the STAC labeling approach further enabled visualization of telomere dynamics at high temporal resolution.
Tracking of individual telomeres in HeLa cells (Supplementary Videos S1–2 and Figure 3A) showed that telomere dynamics were unaffected by the increased GFP recruitment to the gene locus with STAC, since the diffusion coefficients of telomeres labeled with STAC were comparable to those obtained with the dCas9-GFP alone tagging strategy (Figure 3B). Both populations showed comparable confined movement (diffusion coefficient of 10.4 × 10⁻⁴ ± 6.1 × 10⁻⁴ μm²/s (SD) and 9.92 × 10⁻⁴ ± 5.51 × 10⁻⁴ μm²/s (SD); P = 0.03; radius of confinement of 123 ± 43 nm (SD) and 109 ± 38 nm (SD) for dCas9 and 24× STAC, respectively). However, compared to dCas9-GFP alone, telomeres labeled with STAC showed reduced photobleaching over time (Supplementary Videos S1–2 and Figure 3C) and more telomeres could be tracked per cell with STAC (Figure 3D). Therefore, STAC enables genome dynamics to be followed for longer times. In addition, signal intensity is typically a limitation for fast image acquisition; by enhancing the SNR, STAC enables faster acquisition speeds. However, as for any other live-cell imaging method, care must be taken to minimize phototoxicity, which is often a limiting factor in live-cell imaging experiments. Finally, telomeres in HeLa 1.3 cells had a lower diffusion coefficient (4.32 × 10⁻³ ± 4 × 10⁻³ μm²/s (SD) and 7.23 × 10⁻³ ± 5 × 10⁻³ μm²/s (SD) for HeLa 1.3 and HeLa, respectively; P = 9.8 × 10⁻⁴¹) and a lower radius of confinement (142 ± 40 nm (SD) and 181 ± 50 nm (SD) for HeLa 1.3 and HeLa, respectively) compared to HeLa cells, likely due to the increased volume they occupy in physical space (Figure 3E and F). These results further demonstrate the power of STAC, as it can be readily used for correlative studies of gene dynamics in living cells and super-resolution imaging of the architecture of these genes in fixed cells.

Figure 3. (A) Representative image showing telomere tracks over 2000 frames.
(B) LogD (Logarithm of Diffusion coefficient) plot for HeLa telomere tracks imaged at 5 Hz with dCas9-GFP and dCas9–24xSTAC. n = 7103 and 14450 tracks for dCas9-GFP and dCas9–24xSTAC, respectively. (C) Photobleaching kinetics for HeLa telomere loci imaged at 5 Hz with dCas9-GFP and dCas9–24xSTAC for the same cells and tracks as in (B), n = 7103 and 14450 tracks for dCas9-GFP and dCas9–24xSTAC, respectively. The plot shows the normalized number of localizations per frame (crosses) and the corresponding two-component exponential fit (lines). Photobleaching constants: k₂⁻¹ = 282 s for HeLa dCas9-GFP and k₂⁻¹ = 415 s for HeLa dCas9–24xSTAC. (D) Track length plots for telomere tracks imaged at 5 Hz with dCas9-GFP and dCas9–24xSTAC in equal numbers of cells. n = 6 cells per condition with n = 7103 and 14450 tracks for dCas9-GFP and dCas9–24xSTAC, respectively. (E) Diffusion coefficient comparison of HeLa and HeLa 1.3 telomere tracks imaged at 10 Hz. n = 1260 and 1068 tracks for HeLa and HeLa 1.3, respectively. Stars indicate P-values (****P ≤ 0.0001) for two-tailed unpaired t-test. (F) Time Ensemble Mean Squared Displacement (TE-MSD) for HeLa and HeLa 1.3 telomere tracks imaged at 10 Hz. n = 1260 and 1068 tracks for HeLa and HeLa 1.3, respectively (same cells as panel E). Radius of confinement of R = 142 ± 40 nm and R = 181 ± 50 nm for HeLa 1.3 and HeLa, respectively. The error bars were calculated as the weighted standard deviation of the MSD values divided by the square root of the number of degrees of freedom in the weighted mean.

CONCLUSION

Overall, we demonstrate increased labeling efficiency and enhanced signal in visualizing multiple genomic loci in both living and fixed cells by combining CRISPR/dCas9 with SunTag and polycistronic vectors. STAC and PoSTAC are enhanced labeling strategies that enabled both longer term and faster imaging of genome dynamics in living cells as well as super-resolution imaging of their spatial organization in fixed cells (Supplementary Figure S5). These advancements overcome limitations of previous methods such as DNA FISH in visualizing genomic sequences with high spatiotemporal resolution.
The use of polycistronic vectors is not limited to gene visualization but can also be applied to activating or repressing multiple genes simultaneously with CRISPR-Cas9 and to multiple gene editing in mammalian cells. In addition, PoSTAC should enhance visualization of unique, non-repetitive sequences by enabling efficient delivery of multiple sgRNAs into single cells using a minimal number of plasmids and by enhancing the signal over background via the use of SunTag. In the case of multi-gene labeling, the identity of different loci labeled by PoSTAC can potentially be determined by carrying out correlative live-cell imaging and DNA FISH after fixation, as recently demonstrated (30). In the future, PoSTAC can potentially be used for multi-color imaging by encoding sgRNAs specifically recognized by different Cas9 orthologs such as Streptococcus pyogenes Sp dCas9, Neisseria meningitidis Nm dCas9 and Streptococcus thermophilus St1 dCas9 (2) or even Cpf1 orthologs.

DATA AVAILABILITY

Data are available upon request.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors acknowledge Prof. Titia De Lange (The Rockefeller University), Prof. Pura Muñoz-Cánoves (Pompeu Fabra University), Dr Jonas Ries (EMBL Heidelberg) and Dr Bo Huang (UCSF) for kindly sharing HeLa 1.3 cells, C2C12 cells, GFP nanobodies and Insight3, respectively. The authors acknowledge Dr Jason Otterstrom for his help with Voronoi tessellation analysis.
FUNDING

European Union’s Horizon 2020 Research and Innovation Programme [CellViewer No 686637 to M.L., M.P.C.]; Ministerio de Economia y Competitividad [BFU2013–49867-EXP to M.L., M.P.C.]; Fundació Cellex Barcelona (to M.L.); European Union Seventh Framework Programme under the European Research Council Grants [337191-MOTORS to M.L.]; ‘Severo Ochoa’ Programme for Centres of Excellence in R&D [SEV-2015-0522 to M.L.]; Ministerio de Economia y Competitividad and FEDER Funds [BFU2014–54717-P, BFU2015–71984-ERC to M.P.C.]; AGAUR Grant [2014 SGR1137 to M.P.C.]; Spanish Ministry of Economy and Competitiveness (to M.P.C.); Centro de Excelencia Severo Ochoa [2013–2017 to M.P.C.]; CERCA Programme/Generalitat de Catalunya (to M.P.C.); Ministerio de Ciencia e Innovacion FPI (to F.A.); People Program (Marie Curie Actions) FP7/2007–2013 under REA grant [608959 to M.V.N.]. Funding for open access charge: European Union’s Horizon 2020 Research and Innovation Programme [CellViewer No 686637].

Conflict of interest statement. None declared.

REFERENCES

1. Chen B., Gilbert L.A., Cimini B.A., Schnitzbauer J., Zhang W., Li G.W., Park J., Blackburn E.H., Weissman J.S., Qi L.S. et al. Dynamic imaging of genomic loci in living human cells by an optimized CRISPR/Cas system. Cell. 2013; 155:1479–1491.
2. Ma H., Naseri A., Reyes-Gutierrez P., Wolfe S.A., Zhang S., Pederson T. Multicolor CRISPR labeling of chromosomal loci in human cells. Proc. Natl. Acad. Sci. U.S.A. 2015; 112:3002–3007.
3. Shechner D.M., Hacisuleyman E., Younger S.T., Rinn J.L. Multiplexable, locus-specific targeting of long RNAs with CRISPR-Display. Nat. Methods. 2015; 12:664–670.
4. Shao S., Zhang W., Hu H., Xue B., Qin J., Sun C., Sun Y., Wei W., Sun Y.
Long-term dual-color tracking of genomic loci by modified sgRNAs of the CRISPR/Cas9 system. Nucleic Acids Res. 2016; 44:e86.
5. Qin P., Parlak M., Kuscu C., Bandaria J., Mir M., Szlachta K., Singh R., Darzacq X., Yildiz A., Adli M. Live cell imaging of low- and non-repetitive chromosome loci using CRISPR-Cas9. Nat. Commun. 2017; 8:14725.
6. Chen B., Guan J., Huang B. Imaging specific genomic DNA in living cells. Annu. Rev. Biophys. 2016; 45:1–23.
7. Tanenbaum M.E., Gilbert L.A., Qi L.S., Weissman J.S., Vale R.D. A protein-tagging system for signal amplification in gene expression and fluorescence imaging. Cell. 2014; 159:635–646.
8. Schiffer S., Rosch S., Marchfelder A. Assigning a function to a conserved group of proteins: the tRNA 3′-processing enzymes. EMBO J. 2002; 21:2769–2777.
9. Evans D., Marquez S.M., Pace N.R. RNase P: interface of the RNA and protein worlds. Trends Biochem. Sci. 2006; 31:333–341.
10. Xie K., Minkenberg B., Yang Y. Boosting CRISPR/Cas9 multiplex editing capability with the endogenous tRNA-processing system. Proc. Natl. Acad. Sci. U.S.A. 2015; 112:3570–3575.
11. Port F., Bullock S.L. Augmenting CRISPR applications in Drosophila with tRNA-flanked sgRNAs. Nat. Methods. 2016; 13:852–854.
12. Neguembor M.V., Xynos A., Onorati M.C., Caccia R., Bortolanza S., Godio C., Pistoni M., Corona D.F., Schotta G., Gabellini D. FSHD muscular dystrophy region gene 1 binds Suv4-20h1 histone methyltransferase and impairs myogenesis. J. Mol. Cell Biol. 2013; 5:294–307.
Google Scholar Crossref Search ADS PubMed WorldCat 13. Ries J. , Kaplan C. , Platonova E. , Eghlidi H. , Ewers H. A simple, versatile method for GFP-based super-resolution microscopy via nanobodies . Nat. Methods . 2012 ; 9 : 582 – 584 . Google Scholar Crossref Search ADS PubMed WorldCat 14. Rust M.J. , Bates M. , Zhuang X. Sub-diffraction-limit imaging by stochastic optical reconstruction microscopy (STORM) . Nat. Methods . 2006 ; 3 : 793 – 795 . Google Scholar Crossref Search ADS PubMed WorldCat 15. Bates M. , Huang B. , Dempsey G.T. , Zhuang X. Multicolor super-resolution imaging with photo-switchable fluorescent probes . Science . 2007 ; 317 : 1749 – 1753 . Google Scholar Crossref Search ADS PubMed WorldCat 16. Andronov L. , Orlov I. , Lutz Y. , Vonesch J.L. , Klaholz B.P. ClusterViSu, a method for clustering of protein complexes by Voronoi tessellation in super-resolution microscopy . Sci. Rep. 2016 ; 6 : 24084 . Google Scholar Crossref Search ADS PubMed WorldCat 17. Levet F. , Hosy E. , Kechkar A. , Butler C. , Beghin A. , Choquet D. , Sibarita J.B. SR-Tesseler: a method to segment and quantify localization-based super-resolution microscopy data . Nat. Methods . 2015 ; 12 : 1065 – 1071 . Google Scholar Crossref Search ADS PubMed WorldCat 18. Mazza D. , Ganguly S. , McNally J.G. Monitoring dynamic binding of chromatin proteins in vivo by single-molecule tracking . Methods Mol. Biol. 2013 ; 1042 : 117 – 137 . Google Scholar Crossref Search ADS PubMed WorldCat 19. Tarantino N. , Tinevez J.Y. , Crowell E.F. , Boisson B. , Henriques R. , Mhlanga M. , Agou F. , Israel A. , Laplantine E. TNF and IL-1 exhibit distinct ubiquitin requirements for inducing NEMO-IKK supramolecular structures . J. Cell Biol. 2014 ; 204 : 231 – 245 . Google Scholar Crossref Search ADS PubMed WorldCat 20. Michalet X. , Berglund A.J. Optimal diffusion coefficient estimation in single-particle tracking . Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2012 ; 85 : 061916 . 
Google Scholar Crossref Search ADS PubMed WorldCat 21. Kusumi A. , Sako Y. , Yamamoto M. Confined lateral diffusion of membrane receptors as studied by single particle tracking (nanovid microscopy). Effects of calcium-induced differentiation in cultured epithelial cells . Biophys. J. 1993 ; 65 : 2021 – 2040 . Google Scholar Crossref Search ADS PubMed WorldCat 22. Mazza D. , Abernathy A. , Golob N. , Morisaki T. , McNally J.G. A benchmark for chromatin binding measurements in live cells . Nucleic Acids Res. 2012 ; 40 : e119 . Google Scholar Crossref Search ADS PubMed WorldCat 23. Shay J.W. Role of Telomeres and Telomerase in Aging and Cancer . Cancer Discov. 2016 ; 6 : 584 – 593 . Google Scholar Crossref Search ADS PubMed WorldCat 24. Stanley S.E. , Armanios M. The short and long telomere syndromes: paired paradigms for molecular medicine . Curr. Opin. Genet. Dev. 2015 ; 33 : 1 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 25. Marion R.M. , Strati K. , Li H. , Tejera A. , Schoeftner S. , Ortega S. , Serrano M. , Blasco M.A. Telomeres acquire embryonic stem cell characteristics in induced pluripotent stem cells . Cell Stem Cell . 2009 ; 4 : 141 – 154 . Google Scholar Crossref Search ADS PubMed WorldCat 26. Bandaria J.N. , Qin P. , Berk V. , Chu S. , Yildiz A. Shelterin protects chromosome ends by compacting telomeric chromatin . Cell . 2016 ; 164 : 735 – 746 . Google Scholar Crossref Search ADS PubMed WorldCat 27. Vancevska A. , Douglass K.M. , Pfeiffer V. , Manley S. , Lingner J. The telomeric DNA damage response occurs in the absence of chromatin decompaction . Genes Dev. 2017 ; 31 : 567 – 577 . Google Scholar Crossref Search ADS PubMed WorldCat 28. Timashev L.A. , Babcock H. , Zhuang X. , de Lange T. The DDR at telomeres lacking intact shelterin does not require substantial chromatin decompaction . Genes Dev. 2017 ; 31 : 578 – 589 . Google Scholar Crossref Search ADS PubMed WorldCat 29. Takai K.K. , Hooper S. , Blackwood S. , Gandhi R. , de Lange T. 
In vivo stoichiometry of shelterin components . J. Biol. Chem. 2010 ; 285 : 1457 – 1467 . Google Scholar Crossref Search ADS PubMed WorldCat 30. Guan J. , Liu H. , Shi X. , Feng S. , Huang B. Tracking multiple genomic elements using correlative CRISPR imaging and sequential DNA FISH . Biophys. J. 2017 ; 112 : 1077 – 1084 . Google Scholar Crossref Search ADS PubMed WorldCat © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
ExoCET: exonuclease in vitro assembly combined with RecET recombination for highly efficient direct DNA cloning from complex genomes
Wang, Hailong; Li, Zhen; Jia, Ruonan; Yin, Jia; Li, Aiying; Xia, Liqiu; Yin, Yulong; Müller, Rolf; Fu, Jun; Stewart, A. Francis; Zhang, Youming
doi: 10.1093/nar/gkx1249 pmid: 29240926
Abstract The exponentially increasing volumes of DNA sequence data highlight the need for new DNA cloning methods to explore the new information. Here, we describe ‘ExoCET’ (Exonuclease Combined with RecET recombination) to directly clone any chosen region from bacterial and mammalian genomes with nucleotide precision into operational plasmids. ExoCET combines in vitro exonuclease and annealing with the remarkable capacity of full length RecET homologous recombination (HR) to retrieve specified regions from genomic DNA preparations. Using T4 polymerase (T4pol) as the in vitro exonuclease for ExoCET, we directly cloned large regions (>50 kb) from bacterial and mammalian genomes, including DNA isolated from blood. Employing RecET HR or Cas9 cleavage in vitro, the directly cloned region can be chosen with nucleotide precision to position, for example, a gene into an expression vector without the need for further subcloning. In addition to its utility for bioprospecting in bacterial genomes, ExoCET presents straightforward access to mammalian genomes for various applications such as region-specific DNA sequencing that retains haplotype phasing, the rapid construction of optimal, haplotypic, isogenic targeting constructs or a new way to genotype that presents advantages over Southern blotting or polymerase chain reaction. The direct cloning capacities of ExoCET present new freedoms in recombinant DNA technology. INTRODUCTION Recombinant DNA cloning, mutagenesis and engineering are central to molecular biology and biotechnology. Although relatively short DNA segments can be obtained by polymerase chain reaction (PCR) or de novo synthesis, DNA cloning of regions longer than about 10 kb has traditionally relied on the construction of DNA libraries followed by screening protocols to find the desired cloned sequences. 
This classical approach to DNA cloning has been powerful but is laborious, and the target DNA segment, once obtained, usually must be recloned, reduced or reassembled from the piece(s) identified by the library screen(s) for functional studies. Recently we established a new path for direct cloning from genomic DNA samples that bypasses DNA library construction, screening and subcloning (1). This direct DNA cloning breakthrough was based on the discovery that the full-length Rac prophage protein RecE, with its partner RecT, mediates highly efficient homologous recombination (HR) between two linear DNA substrates. To promote bioprospecting, we applied full length RecE/RecT to directly clone secondary metabolite gene clusters up to 50-kb long from prokaryotic genomes into expression vectors (1–6). However, the application of full length RecET direct cloning to mammalian genomes proved to be more challenging. Bacterial and mammalian genomes differ in size by approximately three orders of magnitude (∼5 × 10^6 versus ∼3 × 10^9 bp) and the efficiency of RecET direct cloning is obviously constrained by the chance that the linear cloning vector and the target genomic fragment simultaneously enter into an Escherichia coli host cell upon electroporation. The work described here began with the idea that direct cloning efficiencies could be improved by annealing the linear vector and target genomic fragment together in vitro prior to transformation into E. coli for HR in vivo by full length RecE/RecT. We first evaluated a variety of protocols and reagents for the in vitro assembly step. Several exonucleases were found to be suitable and we settled on the 3′ exonuclease activity of T4 polymerase (T4pol) as the best for direct cloning from genomic DNA preparations. Having established an efficient T4pol protocol, we explored mechanistic aspects of the in vitro annealing and in vivo HR combination before pursuing various challenging applications including direct cloning from mammalian genomes. 
MATERIALS AND METHODS Bacteria strains and pSC101 expression plasmids Escherichia coli GB2005 was derived from DH10B by deleting fhuA, ybcC and recET (7). GB05-dir was derived from GB2005 by integrating the PBAD-ETgA operon (full length recE, recT, redγ and recA under the arabinose-inducible PBAD promoter) at the ybcC locus (1). GB08-red was derived from GB2005 by integrating the PBAD-gbaA operon (redγ, redβ, redα and recA under the arabinose-inducible PBAD promoter) at the ybcC locus (8). pSC101-BAD-ETgA-tet (1) conveys tetracycline resistance and carries the PBAD- full length ETgA operon and a temperature sensitive pSC101 replication origin which replicates at 30°C but not at 37°C so it can be easily eliminated from the host by temperature shift in the absence of selection (9). Genomic DNA isolation and digestion Gram-negative Photobacterium phosphoreum ANT-2200 and Photorhabdus luminescens DSM15139 were cultured overnight in 50 ml medium. After centrifugation, the cells were resuspended thoroughly in 8 ml of 10 mM Tris–Cl (pH 8.0). Five hundred microliters of 20 mg ml−1 proteinase K and 1 ml of 10% sodium dodecyl sulphate (SDS) were added and incubated at 50°C for 2 h until the solution became clear. Genomic DNA was recovered from the lysate by phenol-chloroform-isoamyl alcohol (25:24:1, pH 8.0) extraction and ethanol precipitation. The DNA was dissolved in 10 mM Tris–Cl (pH 8.0) and digested with BamHI + KpnI for cloning of the 14-kb lux gene cluster. Gram-positive Streptomyces albus DSM41398 was cultured in 50 ml of tryptic soy broth at 30°C for 2 days. The genomic DNA was isolated according to the method described in ref. (10) with slight modification. After centrifugation, the cells were resuspended thoroughly in 8 ml of SET buffer (75 mM NaCl, 25 mM ethylenediaminetetraacetic acid (EDTA), 20 mM Tris, pH 8.0) and 10 mg lysozyme was added. 
After incubation at 37°C for 1 h, 500 μl of 20 mg ml−1 proteinase K and 1 ml of 10% SDS were added and incubated at 50°C for 2 h until the solution became clear. Three and a half milliliters of 5 M NaCl was added into the lysate. Genomic DNA was recovered from the lysate by phenol-chloroform-isoamyl alcohol (25:24:1, pH 8.0) extraction and ethanol precipitation. The DNA was dissolved in 10 mM Tris–Cl (pH 8.0). Genomic DNA was purified from mouse melanoma B16 cells, human embryonic kidney 293T cells and human blood using Qiagen Blood & Cell Culture DNA Kits according to the manufacturer’s instructions, except DNA was recovered from the Proteinase K treated lysate by phenol–chloroform–isoamyl alcohol (25:24:1, pH 8.0) extraction and ethanol precipitation. The DNA was dissolved in 10 mM Tris–Cl (pH 8.0). Restriction digested genomic DNA was extracted with phenol–chloroform–isoamyl alcohol (25:24:1, pH 8.0) and precipitated with ethanol. The DNA was dissolved in 10 mM Tris–Cl (pH 8.0). End cut pipette tips were used to avoid shearing genomic DNA. The genomic DNA of P. luminescens DSM15139 was digested with XbaI for plu3535-plu3532 cloning, and XbaI + XmaI for plu2670 cloning. The genomic DNA of S. albus was digested with EcoRV or Cas9–gRNA complexes for cloning of the salinomycin gene cluster. The mouse genomic DNA was digested with HpaI for Prkar1a cloning, BamHI + KpnI for Dpy30 cloning, and SwaI for Wnt4 or Lmbr1l-Tuba1a cloning. The human genomic DNA was digested with SpeI for DPY30 cloning, NdeI + BstZ17I for IGFLR1-LIN37 cloning, BstZ17I for IGFLR1-ARHGAP33 cloning and NdeI for ZBTB32-LIN37 cloning. Digested genomic DNA was extracted with phenol–chloroform–isoamyl alcohol (25:24:1, pH 8.0) and precipitated with ethanol. The DNA was dissolved in ddH2O and concentrated to 1 μg μl−1. End cut pipette tips were used to avoid shearing genomic DNA. Ten micrograms of digested genomic DNA was used for ExoCET cloning. Cas9 digestion of S. albus genomic DNA The S. 
pyogenes Cas9 protein was provided by David N. Drechsel (Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany). To obtain the best Cas9 in vitro cleavage efficiency, four CRISPR guide sequences were selected for each of the two regions flanking the salinomycin gene cluster (Figure 3A). The eight CRISPR guide sequences were incorporated into the oligos used to amplify the CRISPR minigenes (Supplementary Table S1). The CRISPR minigenes were amplified with PCR using BstZ17I-linearized pBR322-U6-ccdB-cm-tracrRNA (11) as a template and the proof reading PrimeSTAR Max DNA polymerase (Takara). The eight CRISPR minigenes were extracted from agarose gels after electrophoresis and purified using the QIAquick gel extraction kit (Qiagen). The HiScribe™ T7 High Yield RNA Synthesis Kit (NEB, cat. no. E2040S) was used for in vitro transcription of CRISPR gRNAs and 500 ng minigenes were used as the template. gRNAs were purified with an RNA clean Kit (Tiangen, Beijing, China; cat. no. DP412) and eluted into DNase/RNase-free ddH2O. The cleavage efficiency of Cas9–gRNA complexes was tested with the PCR amplified DNA fragment containing the Cas9 cutting sites (Figure 3B). The primers used to amplify the target DNA fragments, A (1602 bp) and B (1500 bp), are listed in Supplementary Table S1. The cleavage efficiency of Cas9–gRNAs was evaluated by digesting 100 ng of target DNA fragments in a 10-μl reaction containing 150 ng of Cas9, 400 ng of gRNA and 1 μl of Cas9 nuclease reaction buffer (NEB). After incubating at 37°C for 2 h, 4 μg RNase A (Thermo Scientific) was added and incubated at 37°C for 15 min. After addition of 1 μl stop solution (30% glycerol, 1.2% SDS, 250 mM EDTA pH 8.0) and incubation at 37°C for 15 min, an aliquot was analyzed on an agarose gel. Cas9 digestion of S. albus genomic DNA was carried out in an 800 μl reaction containing 80 μl of 10 × Cas9 nuclease reaction buffer (NEB), 80 μg of genomic DNA, 40 μg of gRNA-2, 40 μg of gRNA-7 and 20 μg of Cas9. 
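The two Cas9 reactions described above (the 10-μl test digest and the 800-μl genomic digest) are easier to compare as concentrations than as absolute masses. The following is a minimal sketch of that bookkeeping; the class and field names are illustrative and not from the original protocol, and the gRNA mass of the 800-μl reaction is taken as the sum of gRNA-2 and gRNA-7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cas9Reaction:
    """Component masses (ng) of one in vitro Cas9 digest. Names are illustrative."""
    volume_ul: float
    dna_ng: float
    cas9_ng: float
    grna_ng: float  # total guide RNA mass in the reaction

    def per_ul(self) -> dict:
        """Return each component as ng per microliter of reaction."""
        return {
            "dna": self.dna_ng / self.volume_ul,
            "cas9": self.cas9_ng / self.volume_ul,
            "grna": self.grna_ng / self.volume_ul,
        }


# Test digest: 100 ng target fragment, 150 ng Cas9, 400 ng gRNA in 10 ul.
pilot = Cas9Reaction(volume_ul=10, dna_ng=100, cas9_ng=150, grna_ng=400)

# Genomic digest: 80 ug DNA, 20 ug Cas9, 40 ug gRNA-2 + 40 ug gRNA-7 in 800 ul.
preparative = Cas9Reaction(volume_ul=800, dna_ng=80_000, cas9_ng=20_000,
                           grna_ng=40_000 + 40_000)

for name, rxn in (("pilot", pilot), ("preparative", preparative)):
    print(name, rxn.per_ul())
```

Under these numbers the genomic digest is not a linear scale-up of the test digest: DNA is ten times more concentrated, while Cas9 and gRNA concentrations rise by smaller factors.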
The genomic DNA used for Cas9 digestion was extracted three times with phenol–chloroform–isoamyl alcohol (25:24:1, pH 8.0) and dissolved in DNase/RNase-free ddH2O. After incubating at 37°C for 6 h, 100 μg of RNase A (Thermo Scientific) was added and incubated at 37°C for 1 h. Then 100 μg of proteinase K (Roche) was added and incubated at 50°C for 1 h. Then, the digested genomic DNA was extracted with phenol–chloroform–isoamyl alcohol (25:24:1, pH 8.0) and precipitated with ethanol. The genomic DNA was dissolved in ddH2O and concentrated to 1 μg μl−1. End cut pipette tips were used to avoid shearing genomic DNA. Ten micrograms of digested genomic DNA was used for ExoCET cloning. Preparation of linear cloning vectors The p15A-cm vector was amplified with PCR (Supplementary Figure S3A) using the PrimeSTAR Max DNA Polymerase (Takara) according to the manufacturer's instructions. Oligonucleotides containing the 80 nt homology arms and ∼20 nt standard PCR primers at the 3′ end (Supplementary Tables S2 and S3) were polyacrylamide gel electrophoresis (PAGE) purified. The PCR products were extracted from agarose gels after electrophoresis and purified using the QIAquick gel extraction kit (Qiagen) according to the manufacturer's instructions, except that DNA was eluted from the column with ddH2O and concentrated to 200 ng μl−1. Two hundred nanograms of p15A-cm vectors were used for ExoCET cloning unless otherwise stated. The pBeloBAC11 vector used to clone the salinomycin gene cluster and the pBAC2015 vector used to clone plu3535-3532 were previously described (6,12). Bacterial artificial chromosome (BAC) vectors were linearized with BamHI to expose both homology arms, and extracted with phenol–chloroform–isoamyl alcohol (25:24:1, pH 8.0) and precipitated with isopropanol. The DNA was dissolved in ddH2O and concentrated to 1 μg μl−1. One microgram of linear BAC vector was used for ExoCET cloning. 
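The vector oligonucleotides described above combine an 80 nt homology arm at the 5′ end with a ∼20 nt PCR primer at the 3′ end. The sketch below shows one way such an oligo could be composed and checked in silico. It also screens the two homology arms for shared direct repeats, because the Results note that microrepeats larger than 6 bp shared between the homology arms promote self-circularization of the vector. The function names, the default threshold of 7 bp and the toy sequences are assumptions for illustration; arms must be supplied in the orientation required for each oligo.

```python
DNA_ALPHABET = set("ACGT")


def build_oligo(homology_arm: str, primer: str, arm_len: int = 80) -> str:
    """Return a 5'->3' oligo: homology arm followed by the vector PCR primer."""
    arm, pr = homology_arm.upper(), primer.upper()
    if set(arm + pr) - DNA_ALPHABET:
        raise ValueError("sequences may contain only A, C, G and T")
    if len(arm) != arm_len:
        raise ValueError(f"homology arm is {len(arm)} nt, expected {arm_len} nt")
    return arm + pr


def shared_direct_repeats(arm_a: str, arm_b: str, min_len: int = 7) -> list:
    """List maximal substrings of at least min_len shared by both arms.

    min_len=7 encodes "larger than 6 bp"; only same-strand (direct)
    repeats are considered in this sketch.
    """
    a, b = arm_a.upper(), arm_b.upper()
    hits = set()
    prev = [0] * (len(b) + 1)  # prev[j]: match length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                # Report the match only where it cannot be extended rightwards.
                extendable = i < len(a) and j < len(b) and a[i] == b[j]
                if not extendable and cur[j] >= min_len:
                    hits.add(a[i - cur[j]:i])
        prev = cur
    return sorted(hits, key=lambda s: (-len(s), s))


# Toy sequences, not taken from the Supplementary Tables.
toy_arm = "A" * 80
toy_primer = "ACGT" * 5
oligo = build_oligo(toy_arm, toy_primer)
repeats = shared_direct_repeats("AAAAGATTACAGGCCCC", "TTTTGATTACAGGTTTT")
print(len(oligo), repeats)
```

An empty list from `shared_direct_repeats` would suggest that the chosen arm pair carries no microrepeat above the threshold; it says nothing about repeats within the target region itself.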
Preparation of the mVenus-PGK-neo cassette for recombineering The mVenus-PGK-neo cassette was amplified from pR6K-2Ty1-2PreS-mVenus-Biotin-PGK-em7-neo (13) with PCR using the proof reading PrimeSTAR Max DNA Polymerase (Takara) according to the manufacturer’s instructions. The primers are listed in Supplementary Table S4. The PCR products were purified with QIAquick PCR Purification Kit (Qiagen) according to the manufacturer’s instructions, except that DNA was eluted from the column with ddH2O and concentrated to 100 ng μl−1. Two hundred nanograms of the cassette was used for recombineering. In vitro assembly Ten micrograms of genomic DNA and 200 ng of 2.2-kb p15A-cm linear vector (or 1 μg of 8-kb linear BAC vector) were assembled in 20 μl reactions consisting of 2 μl of 10 × NEBuffer 2.1 and 0.13 μl of 3 U μl−1 T4pol (NEB, cat. no. M0203). Assembly reactions were prepared in 0.2 ml PCR tubes and cycled in a thermocycler as follows: 25°C for 1 h, 75°C for 20 min, 50°C for 30 min, then held at 4°C. Assembly reactions with other exonucleases were cycled as follows: T5 exonuclease (T5exo; NEB, cat. no. M0363): 50°C for 30 min, then held at 4°C; T7 exonuclease (T7exo; NEB, cat. no. M0263): 25°C for 20 min, 50°C for 30 min, then held at 4°C; DNA polymerase I Klenow fragment (Kle; NEB, cat. no. M0210), T7 DNA polymerase (T7pol; NEB, cat. no. M0274) and λ exonuclease (λexo; NEB, cat. no. M0262): 25°C for 20 min, 75°C for 20 min, 50°C for 30 min, then held at 4°C; Exonuclease III (ExoIII; NEB, cat. no. M0206): 37°C for 20 min, 75°C for 20 min, 50°C for 30 min, then held at 4°C; Phusion DNA polymerase (Phu; NEB, cat. no. M0530): 37°C for 20 min, 50°C for 30 min, then held at 4°C. Gibson assembly was performed at 50°C for 1 h with Gibson Assembly Master Mix (NEB, cat. no. E2611). For reactions performed in triplicate, we used single genomic DNA preparations with the same vector preparation and the same batch of electrocompetent cells. 
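As a quick consistency check on the T4pol assembly reaction described above, the final enzyme and buffer concentrations follow from simple dilution arithmetic, and the thermocycler program can be written down as data. This is a sketch only: the helper name and the list layout are illustrative, and the steps are left unlabeled because the protocol gives temperatures and times without naming each step.

```python
def final_concentration(stock: float, added_ul: float, total_ul: float) -> float:
    """Dilution arithmetic: stock concentration x added volume / reaction volume."""
    if total_ul <= 0:
        raise ValueError("reaction volume must be positive")
    return stock * added_ul / total_ul


# 0.13 ul of 3 U/ul T4pol in a 20 ul reaction.
t4pol_u_per_ul = final_concentration(stock=3.0, added_ul=0.13, total_ul=20.0)

# 2 ul of 10x NEBuffer 2.1 in a 20 ul reaction.
buffer_fold = final_concentration(stock=10.0, added_ul=2.0, total_ul=20.0)

# T4pol program as (temperature in deg C, minutes); the final 4 deg C hold is omitted.
T4POL_PROGRAM = [(25, 60), (75, 20), (50, 30)]
total_minutes = sum(minutes for _, minutes in T4POL_PROGRAM)

print(f"T4pol: {t4pol_u_per_ul:.4f} U/ul, buffer: {buffer_fold:g}x, "
      f"cycling time: {total_minutes} min")
```

The computed enzyme concentration, 0.0195 U μl−1, rounds to the 0.02 U μl−1 quoted in the Figure 1 legend.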
Three in vitro reactions, subsequent electroporations, plating and counting were performed in parallel. The in vitro assembly products were desalted at room temperature for 30 min by drop dialysis against ddH2O using Millipore Membrane Filters (Merck-Millipore, cat. no. VSWP01300), then 5 μl of each reaction product was electroporated into E. coli cells. Electroporation Forty microliters of E. coli overnight culture (OD600 = 3∼4) was inoculated into 1.4 ml of LB medium supplemented with appropriate antibiotics and incubated at 30°C for 2 h with shaking at 950 r.p.m. in an Eppendorf thermomixer (OD600 = 0.35∼0.4). Thirty five microliters of 10% L-arabinose (Sigma-Aldrich, cat. no. A3256; w/v, in ddH2O) was added to induce the expression of ETgA or gbaA. The cells were grown at 37°C for another 40 min (OD600 = 0.7∼0.8), then centrifuged at 9400 × g for 30 s, 2°C. The supernatant was discarded and the cell pellet was resuspended in 1 ml of ice-cold ddH2O. The cell suspension was centrifuged again to repeat the washing once more. Cells were resuspended in 20 μl of ice-cold ddH2O. Five microliters of desalted in vitro assembly products were added into cell suspensions in ExoCET cloning experiments. In recombineering exercises, 200 ng plasmid and 200 ng PCR products were used. Electroporation was performed using chilled 1-mm cuvettes (Bio-Rad, cat. no. 1652089) and an Eppendorf electroporator 2510 at 1350 V, 10 μF, 600Ω. After electroporation, 1 ml of LB medium was added into the cuvette to suspend the cells and the cell suspension was transferred into a 1.5-ml Eppendorf tube with a puncture in its cap for aeration. The culture was incubated at 37°C for 1 h with shaking at 950 r.p.m. in an Eppendorf thermomixer. Appropriate volumes of the cell suspension were taken from the culture and spread on LB plates with antibiotics (chloramphenicol, 15 μg ml−1; kanamycin, 15 μg ml−1). 
The number of colonies was counted after overnight incubation and the colony number per ml (c.f.u./ml) was calculated. RESULTS Concerted action of in vitro assembly and full length RecE/RecT improves the efficiency of direct cloning The efficiency of RecET direct cloning is limited by the need for the linear vector and the target genomic segment to simultaneously enter one E. coli cell before productive HR can take place. To associate the two DNA molecules in vitro before electroporation into E. coli, we evaluated a variety of exonucleases and annealing protocols using a model experiment based on direct cloning of a 14kb fragment encoding the lux gene cluster from the Gram-negative luminous piezophile marine bacterium P. phosphoreum ANT-2200 (14) (Figure 1A). The basic protocol involved mixing restriction digested genomic DNA (10 μg) with a linear direct cloning vector (2.2 kb p15A-cm; 200 ng) that was flanked by short sequence regions (homology arms) identical to the ends of the 14 kb lux restriction fragment. For reaction and process optimization, the cloning efficiency was compared using the total colony number on selection plates from 1 ml culture (c.f.u. ml−1) and the fidelity was assessed by restriction analysis (Figures 1 and 2). Various exonucleases were tested and several were satisfactory (Supplementary Figure S1) before we settled on the 3′ exonuclease activity of T4 polymerase (T4pol) because it consistently gave the highest efficiency and fidelity. For annealing after exonuclease digestion, results were largely indifferent to the cooling rate so we opted for the convenience of the default rate of the Eppendorf MC nexus gradient thermocycler (2°C s−1) (Supplementary Figure S2). We next evaluated the contribution of homology arm length to the outcome of ExoCET direct cloning. Not unexpectedly, the longest homology arm length tested (80 bp) was the most efficient (Figure 1B). 
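The c.f.u. ml−1 readout used for these comparisons is a simple normalization of the colony count to the plated volume, summarized over triplicates as mean and standard deviation. The sketch below illustrates the calculation; the colony counts and plated volume are made-up numbers, and the use of the sample standard deviation is an assumption, since the text states only "s.d.; n = 3".

```python
from statistics import mean, stdev


def cfu_per_ml(colonies: int, plated_ul: float) -> float:
    """Scale a plate count to colonies per ml of recovery culture."""
    if plated_ul <= 0:
        raise ValueError("plated volume must be positive")
    return colonies * 1000 / plated_ul


# Hypothetical triplicate: colonies counted after plating 25 ul of each culture.
counts = [120, 122, 124]
replicates = [cfu_per_ml(c, plated_ul=25) for c in counts]

summary_mean = mean(replicates)
summary_sd = stdev(replicates)  # sample s.d. (n - 1 denominator), an assumption

print(replicates, summary_mean, summary_sd)
```

Working in microliters keeps the arithmetic exact for typical plating volumes and avoids small floating-point errors from dividing by fractions of a milliliter.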
Because synthesis of 100mer oligonucleotides is convenient and reliable, all further experiments employed 100mers (i.e. 5′ 80 nucleotide (nt) homology arm plus 3′ 20 nt PCR primer to the p15A plasmid; Supplementary Figure S3A). We also evaluated the impact of T4pol input (Figure 1C), time of exonuclease digestion (Figure 1D) and RecE/RecT expression level (Figure 1E) on ExoCET. For the latter experiment, we expressed full length RecE/RecT from either one integrated copy of the arabinose-inducible BADrecE/recT/redγ/recA (ETgA) operon in GB05-dir or a pSC101 plasmid borne version of the same operon (approximately five copies per cell) or both together (approximately six copies per cell). Expression from pSC101 delivered better direct cloning efficiency than from the chromosomal single copy in GB05-dir, which was further increased when both the plasmid and chromosomal operons were employed (Figure 1E), thereby indicating that the results were reflective of ETgA expression levels and also that the toxicity associated with overexpressed RecT had not been provoked. However, ETgA expression from a pR6K-pir plasmid (∼15 copies per cell) appeared to provoke toxicity (data not shown). Regarding the other genes in the ETgA operon: Redγ inhibits the major E. coli exonuclease RecBCD (15) so linear DNA molecules persist for much longer and hence their ability to promote recombination is greater; and RecA promotes transformation efficiency (7).
Figure 1. Concerted action of in vitro assembly and full length RecE/RecT improves the efficiency of direct cloning. (A) A schematic diagram illustrating direct cloning of the 14-kb lux gene cluster from Photobacterium phosphoreum ANT-2200. The linear p15A-cm vector and target genomic segment have identical sequences at both ends. (B) Longer homology arms increase the cloning efficiency of ExoCET. The linear vector flanked by 25-, 40- or 80-bp homology arms was mixed with genomic DNA and treated with 0.02 U μl−1 T4pol at 25°C for 20 min before annealing and electroporation into arabinose induced Escherichia coli GB05-dir. Error bars, s.d.; n = 3. (C) Titration of T4pol amount for ExoCET. The linear vector with 80-bp homology arms and genomic DNA were treated as in (B) except the amount of T4pol was altered as indicated. (D) Incubation time of T4pol on cloning efficiency. As for (C) using 0.02 U μl−1 T4pol except the incubation time was altered as indicated. (E) Higher copy number of ETgA increases ExoCET cloning efficiency. As for (D) using 1 h and electroporation into arabinose induced E. coli GB05-dir (one copy of ETgA on the chromosome), GB2005 harboring pSC101-BAD-ETgA-tet (approximately five copies of ETgA on pSC101 plasmids) or GB05-dir harboring pSC101-BAD-ETgA-tet (approximately six copies of ETgA) as indicated. (F) ExoCET increases direct cloning efficiency. As for (E) using E. coli GB05-dir harboring pSC101-BAD-ETgA-tet (ExoCET) or omission of T4pol from the in vitro assembly (ETgA) or omission of the arabinose induction of pSC101-BAD-ETgA-tet (T4pol). (G) As for (F) except the 53 kb plu2670 gene cluster was directly cloned. Accuracy denotes the success of direct cloning as evaluated by restriction digestions (Supplementary Figure S4). Each experiment was performed in triplicate (n = 3) and error bars show standard deviation (s.d).
Figure 2. ExoCET mechanism. (A) Juxtaposition of the 80-bp homology arms between the p15A-cm (chloramphenicol) vector and the 14-kb lux genomic segment is illustrated: (a) both homology arms were located at the termini; (b and c) one homology arm was located at a terminus and the other 1 kb from the other end; (d) both homology arms were 1 kb from each end. (B) Number of colonies obtained from ETgA, T4pol or ExoCET using the homology arm combinations (a–d) as indicated. Reaction conditions were the same as for Figure 1F. (C) Protein combinations as indicated expressed from pSC101 plasmids in GB2005 were tested for direct cloning of the 14-kb lux gene cluster using terminal homology arms and ExoCET conditions except for the omission of RecA (ETg); RecA and RecT (Eg), RecA and RecE (Tg) and all (pSC101-tet). Error bars, s.d.; n = 3. Corresponding DNA analyses are shown in Supplementary Figure S5.
Having established optimal reaction conditions (Supplementary Figure S3B), we compared direct cloning frequencies using full length RecE/RecT alone, the T4pol in vitro annealing protocol alone or both together (Figure 1F). As expected from our previous work, direct cloning of the 14kb lux gene cluster by RecE/RecT alone was successful at high fidelity (Supplementary Figure S4). However, T4pol in vitro assembly followed by electroporation into a standard E. coli recombinant host was much better (4880 versus 427). This success indicates (i) the endogenous E. 
coli machinery is highly adept at completing and sealing plasmid scaffolds; and (ii) our T4pol in vitro assembly protocol is efficient. However the ExoCET combination of T4pol in vitro assembly and full length RecE/RecT in vivo HR was significantly better than either process alone. To further validate our conclusions, we compared RecET, T4pol alone and ExoCET for direct cloning of a 53 kb gene cluster, plu2670. As expected for a much larger target region, overall yield was reduced. However ExoCET was clearly superior (Figure 1G). Mechanistic aspects of ExoCET To analyze the synergism between combining in vitro annealing with full length RecET HR, we designed the experiment illustrated in Figure 2A. The experiments in Figure 1 employed a direct cloning vector with 80nt homology arms to the termini of the 14 kb lux gene cluster. Three variations were generated by moving either 80 nt homology arm, or both, to 80 nt regions located 1 kb from the termini. ExoCET efficiencies were again compared to T4pol alone or RecET alone (Figure 2B). The highest ExoCET and T4pol alone efficiencies were achieved when both homology arms were terminal. When one homology arm (or both) was internal, T4pol alone was ineffective indicating that direct cloning using annealing after T4pol exonuclease action depends on terminal complementarities and that annealing at only one end is insufficient to promote direct cloning without subsequent RecET HR. In contrast, RecET HR efficiently utilized the internal homology arms and did not require terminal homologies. Together, the combination of T4pol annealing in vitro at one end with RecET HR at the other promoted direct cloning yields about 12× over RecET alone. These data indicate that the major contribution of T4pol to ExoCET is due to the in vitro annealing of an end, which greatly increases the co-transformation efficiency. RecET HR then promotes recombination at the other end in vivo. 
Notably, when both homology arms were positioned at the very end of the target genomic fragment, ExoCET was about six to eight times more efficient than T4pol alone (Figures 1F and 2B). This indicates that most (>85%; Supplementary Note S1) of the in vitro assembly products with two terminal homology arms were associated at only one end.

Because RecT is a single strand annealing protein (SSAP), we also considered the possibility that RecT annealing of the single stranded regions exposed by T4pol could contribute to ExoCET. To test this possibility, T4pol annealed products were electroporated into E. coli cells induced for expression of RecT without RecE (from pSC101-Tg). No co-operation between T4pol and RecT was observed (Figure 2C). RecE/RecT is a 5′-3′ exonuclease/SSAP syn/exo pair (16) and a specific protein–protein interaction is required for HR using double stranded DNAs (17). No recombination was observed when RecT was omitted (pSC101-Eg). Therefore ExoCET requires both RecE and RecT. Omission of RecA had only a modest impact (Figure 2C), consistent with the previous finding that RecA increased transformation efficiency rather than recombination (7).

Validation of ExoCET and combination with CRISPR/Cas9 cleavage in vitro

To validate ExoCET, we applied it to several tasks that had previously proven challenging with RecET alone. First, we directly cloned the 38-kb plu3535-3532 gene cluster from Photorhabdus luminescens into an expression plasmid, a task for which we had previously achieved an efficiency of 2/12 (1). With ExoCET we achieved 12/12 (Table 1). Similarly, our previous attempts to directly clone the 106-kb salinomycin biosynthesis gene cluster from S. albus in one step had been fruitless, and to succeed we had to break the task into three pieces (6). To apply ExoCET to this task, we introduced PCR generated homology arms from each end of the 106-kb EcoRV restriction fragment that encompasses the gene cluster into a BAC vector.
After linearization of the BAC between the homology arms and mixing with EcoRV digested genomic DNA, 2/24 clones examined were correct (Figure 3 and Table 1). This exercise benefitted from a fortuitous disposition of EcoRV sites. Potentially, the use of a programmable nuclease, specifically the RNA guided endonuclease Cas9 (18,19), can obviate the reliance on similar good fortune for other exercises. To evaluate this idea, we repeated the 106-kb salinomycin direct cloning exercise with the same BAC vector and genomic DNA digested in vitro using Cas9 programmed with guide RNAs to deliver cleavages very close to the EcoRV sites. A similar success was achieved (Figure 3 and Table 1). Compared to our previous experiences with RecET, ExoCET delivered significantly improved performance for direct cloning of large regions, and this advantage can also be coupled with the use of Cas9 as a programmable endonuclease.

As with RecET alone, most of the wrong ExoCET products arose from self-circularization of the empty cloning vector. Microrepeats larger than 6 bp shared between the homology arms promote self-circularization, and there were five pairs of 8–10 bp direct repeats between the two homology arms used to clone the 106-kb region. Potentially, Cas9 cleavages can be chosen to minimize or avoid the presence of microrepeats and hence reduce unwanted background. For optimal efficiency, the homology arms should be adjacent to the restriction sites used to liberate the target genomic DNA fragment. Restriction analysis showed several clones that were neither empty vector, which is the usual source of unwanted product, nor the correct product (lanes 9 and 23 in Figure 3D and lane 11 in Figure 3E). These clones contained some of the intended DNA region rearranged by intramolecular recombination among repetitive sequences in the salinomycin gene cluster to delete segments.
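The observation that microrepeats longer than 6 bp shared between the two homology arms favour vector self-circularization suggests a simple pre-screen of candidate arms. The following Python sketch is one way such a check might look; the function name `shared_direct_repeats`, the threshold default (7 nt, read here as "larger than 6 bp") and the example sequences are assumptions made for illustration, and only same-strand (direct) repeats are considered.

```python
def shared_direct_repeats(arm_a, arm_b, min_len=7):
    """Return maximal substrings of length >= min_len present in both arms.

    Uses a standard longest-common-suffix table, keeping one row at a time.
    Hypothetical helper: it screens direct repeats only, not inverted ones.
    """
    a, b = arm_a.upper(), arm_b.upper()
    previous = [0] * (len(b) + 1)
    found = set()
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                # A match is maximal when it cannot be extended to the right.
                extends = i < len(a) and j < len(b) and a[i] == b[j]
                if current[j] >= min_len and not extends:
                    found.add(a[i - current[j]:i])
        previous = current
    return sorted(found)


# Made-up arm sequences that share a 9-nt direct repeat.
left_arm = "ACGTACGTTTGACCA"
right_arm = "GGGTTTGACCAGG"
repeats = shared_direct_repeats(left_arm, right_arm)
print(repeats)
```

Arms that return an empty list at the chosen threshold would, under the assumption above, be less prone to the empty-vector background described in the text; where Cas9 is used, cleavage sites could be shifted until the list is empty.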
Table 1. Large genomic segments directly cloned from bacteria, mammalian cells and human blood with ExoCET

Target | Source | Genome (Mb) | Digestion enzymes | Fragment (kb) | Vector | c.f.u. (/ml) | Correct/checked
plu3535-3532 | P. luminescens DSM15139 | 5.69 | XbaI | 38 | pBAC2015 | 1815±132 | 12/12
plu2670 | P. luminescens DSM15139 | 5.69 | XbaI+XmaI | 53 | p15A | 787±194 | 10/12
salinomycin cluster | S. albus DSM41398 | 8.38 | EcoRV | 106 | pBeloBAC11 | 425±91 | 2/24
salinomycin cluster | S. albus DSM41398 | 8.38 | Cas9 | 106 | pBeloBAC11 | 260±14 | 1/24
Wnt4 | Mouse melanoma B16 cell | 2800.06 | SwaI | 45 | p15A | 76±16 | 8/25
Lmbr1l-Tuba1a | Mouse melanoma B16 cell | 2800.06 | SwaI | 53 | p15A | 52±6 | 1/12
Prkar1a | Mouse melanoma B16 cell | 2800.06 | HpaI | 8 | p15A | 205±17 | 10/12
IGFLR1-ARHGAP33 | Human blood | 3221.49 | BstZ17I | 41 | p15A | 275±76 | 5/48
ZBTB32-LIN37 | Human blood | 3221.49 | NdeI | 45 | p15A | 115±35 | 2/48
Dpy30 | Mouse melanoma B16 cell | 2800.06 | BamHI+KpnI | 8.7 | p15A | 273±18 | 9/12
DPY30 | HEK 293T cell | 3221.49 | SpeI | 9.1 | p15A | 40±10 | 17/24
DPY30 | Human blood | 3221.49 | SpeI | 9.1 | p15A | 45±2 | 5/24
Oct4-Venus | Mouse R1 ES cells | 2800.06 | EcoRV+PacI | 9.6 | p15A | 34±1 | 9/36
Nanog-Cherry | Mouse R1 ES cells | 2800.06 | NdeI | 13 | p15A | 49±12 | 17/54
Gata2-Venus | Mouse GM8 ES cells | 2800.06 | BstZ17I | 16.8 | p15A | 212±27 | 5/45
Mll4 (1) | Mouse R1 ES cells | 2800.06 | SspI+SpeI | 17.1 | p15A | 127±38 | 7+3/24
Mll4 (2) | | | | | | 323±65 | 2+2/36
Mll4 (3) | | | | | | 142±27 | 6+9/72
Mll4 (4) | | | | | | 483±91 | 3+5/36

All experiments were done in triplicate; c.f.u. includes standard deviation and fidelity was monitored by restriction analysis of the indicated number of colonies. For the Mll4 experiments, fidelity shows the targeted allele + wt allele/colonies examined. DNA analyses are shown in Supplementary Figure S6.

Figure 3. Direct cloning of the 106-kb salinomycin gene cluster from EcoRV or Cas9 digested genomic DNA of Streptomyces albus. (A) Positions of EcoRV sites and eight Cas9 guide sequences on the salinomycin gene cluster. (B) In vitro cleavage with the eight Cas9–gRNAs to evaluate gRNA efficiency on PCR products amplified from the cleavage sites; gRNAs 2 and 7 were selected. cB (lane 9) and cA (lane 10) are negative controls with Cas9 and without gRNA. (C) The salinomycin gene cluster released from genomic DNA with EcoRV or Cas9–gRNA2/Cas9–gRNA7 was cloned into the pBeloBAC11 vector using ExoCET. Homology arms (blue) had been inserted into the BAC as previously described (6) and then cleaved with BamHI to generate the illustrated direct cloning vector.
The amount of sequence overlap between the ends of the genomic DNA and vector is indicated at the ends of the genomic DNA. (D and E) PvuII restriction analysis of the recombinant DNA obtained with ExoCET cloning. Correct clones are indicated with arrows.

Direct cloning from mammalian genomes using ExoCET

We next evaluated whether the added efficiencies of ExoCET would deliver the additional reach required for direct cloning from mammalian genomic DNA. A 45-kb restriction fragment containing the mouse Wnt4 gene was selected as a suitable challenge. Of 25 colonies checked, 8 were correct (Table 1). Illumina sequencing revealed the single-nucleotide polymorphism (SNP) haplotype linkage patterns of both alleles of this 45-kb region (Supplementary Table S5), thereby indicating that application of ExoCET to SNP analysis has the potential to bypass haplotype scrambling inherent in PCR-based approaches.
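To put the mammalian results in proportion, the entries of Table 1 can be converted into two simple quantities: the fraction of the genome that the target represents, and the fraction of checked colonies that were correct. The Python sketch below does this for a few rows copied from Table 1. The helper names `target_fraction` and `success_rate` are hypothetical, and summing the two counts in entries such as `7+3/24` is an interpretation adopted for this sketch based on the table footnote.

```python
def target_fraction(fragment_kb, genome_mb):
    """Fraction of one haploid genome occupied by the target fragment."""
    return (fragment_kb * 1e3) / (genome_mb * 1e6)


def success_rate(correct_over_checked):
    """Convert a 'correct/checked' entry such as '8/25' into a fraction.

    Entries of the form 'a+b/n' (used for Mll4 in Table 1) are handled by
    summing both counts; this is an assumption of the sketch.
    """
    numerator, checked = correct_over_checked.split("/")
    correct = sum(int(part) for part in numerator.split("+"))
    return correct / int(checked)


# Rows copied from Table 1: (target, fragment kb, genome Mb, correct/checked).
rows = [
    ("plu3535-3532", 38, 5.69, "12/12"),
    ("Wnt4", 45, 2800.06, "8/25"),
    ("Gata2-Venus", 16.8, 2800.06, "5/45"),
]

for name, fragment_kb, genome_mb, entry in rows:
    fraction = target_fraction(fragment_kb, genome_mb)
    rate = success_rate(entry)
    print(f"{name}: {fraction:.2e} of the genome, {rate:.0%} correct")
```

The comparison illustrates why mammalian targets are the harder case: the 45-kb Wnt4 fragment is a far smaller share of its genome than the 38-kb plu3535-3532 cluster is of the P. luminescens genome, yet a usable proportion of colonies was still correct.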
Given the Wnt4 success, we explored the potential for SNP analysis further by directly cloning another large mouse segment (53 kb) that includes the Lmbr1l-Tuba1a genes. Then, using purified genomic DNA from human blood, we directly cloned a 41-kb region including the IGFLR1-ARHGAP33 genes and a 45-kb region including the ZBTB32-LIN37 genes (Table 1). These results indicate that ExoCET offers an unrivalled capacity for haplotype phasing in SNP analyses.

ExoCET mammalian applications: haplotypic isogenic targeting constructs

To exploit ExoCET access to mammalian genomes, we developed two further applications. The first is the generation of isogenic targeting constructs for mammalian genome engineering. Drawing upon experience gained in mouse embryonic stem (ES) cells (20), we aimed to directly clone 8–10 kb sections to provide the optimal length for gene targeting employing isogenic homology arms. Notably, with ExoCET these sections are not only isogenic but also maintain haplotypic linkages of polymorphisms. Hence we call them ‘HIT’ (haplotypic isogenic targeting) constructs. Approximately 9-kb HIT segments for two genes, Dpy30/DPY30 and Prkar1a, were directly cloned from mouse or human genomic DNA, isolated either from a cultured cell line or from blood (Table 1). Thereafter they were modified by Red recombineering to insert a selectable gene cassette for targeting in mammalian cell lines (8) (Figure 4). The cassette insertion site was selected to destroy a Cas9 guide RNA recognition site so that the HIT construct can be used with Cas9-assisted targeting (11).

Figure 4. Generation of HIT constructs from mammalian genomic DNA applied to DPY30. (A) Scheme illustrating the DPY30 stop codon region cloned from human genomic DNA after SpeI digestion. Once directly cloned, the C-terminus of DPY30 was tagged with a mVenus cassette using Redαβ recombineering and standard cassettes (8,13).
(B) EcoRI restriction analysis of the recombinant clones obtained by ExoCET using genomic DNA isolated from human blood. (C) EcoRI restriction analysis of the recombinant clones obtained by ExoCET using genomic DNA isolated from human embryonic kidney 293T cells. (D) PvuII restriction analysis after mVenus cassette insertion using Redαβ recombineering. As expected, every clone was correct. Lane 11 is the original plasmid without mVenus cassette insertion. Correct clones are indicated with arrows.

ExoCET mammalian applications: an alternative to Southern analysis with the option to sequence

In a second application, we aimed to develop an alternative to Southern blotting for genotyping. To pilot the exercise, we used three mouse ES cell lines carrying previously validated targeted genes, namely Oct4, Nanog and Gata2. All three targeted alleles included a neomycin resistance gene with a bacterial promoter that conveys kanamycin resistance in E. coli (8).
After appropriate restriction digestion of genomic DNA preparations and ExoCET, chloramphenicol resistant candidate colonies were evaluated for co-resistance to kanamycin and by restriction digestion. The success rate, that is the number of correctly retrieved targeted alleles compared to the total number of chloramphenicol resistant colonies, ranged from 11 to 31% (Table 1). We then applied ExoCET to genotyping candidate ES cell clones from a Cas9 assisted RAC tag targeting exercise (11) to insert a tag at the N-terminus of Mll4. For four successfully targeted ES clones, both targeted and untargeted Mll4 alleles were retrieved by ExoCET at about the same frequency (Table 1), and the remaining ExoCET candidates contained empty vectors and so were readily discounted (Supplementary Figure S6). The above exercises all used 10 μg genomic DNA, except for Gata2, for which 1.5 μg was used. ExoCET was also successful using 0.5 and 1.0 μg genomic DNA inputs (Supplementary Table S6), indicating that higher throughput applications can be developed.

ExoCET metagenomic applications

Environmental samples usually contain more than 10^4 species (21–23). To test if ExoCET can retrieve a gene cluster from metagenomic samples, we diluted 10, 5, 2 and 1 ng of P. phosphoreum genomic DNA into 10 μg of Bacillus subtilis genomic DNA. We managed to directly clone the 14-kb lux gene cluster with a reasonable efficiency even from the 10^-4 diluted genomic samples (Table 2).

Table 2. Direct cloning of the 14 kb lux gene cluster from diluted P. phosphoreum genomic DNA with ExoCET

P. phosphoreum (ng) (BamHI + KpnI) | B. subtilis (μg) (BamHI) | Vector | c.f.u. (/ml) | Correct/checked
10 | 10 | p15A-cm | 200 ± 2 | 7/12
5 | 10 | p15A-cm | 142 ± 22 | 5/12
2 | 10 | p15A-cm | 102 ± 8 | 2/12
1 | 10 | p15A-cm | 104 ± 18 | 2/24

DISCUSSION

Functional analysis of the exponentially increasing volumes of genome sequencing data will benefit from faster and simpler ways to build expression constructs and targeting vectors. Here we extend the reach of direct DNA cloning by adding exonuclease digestion and annealing in vitro to full length RecE/RecT recombination in E. coli. Now regions up to 50 kb and beyond can be readily retrieved from genomes with complexities up to at least 3.0 × 10^9 bp. Furthermore, the ability to retrieve DNA segments from a metagenomic mimic DNA preparation (Table 2) points towards the application of ExoCET for direct cloning in metagenomics.

The promise that direct cloning can bypass library construction and screening has driven various exercises in the past, including our recent description of the properties of full length RecE/RecT. Larionov et al. developed the yeast-based transformation-associated recombination (TAR) cloning approach (24). Ongoing improvements to TAR include a recent combination with Cas9 cleavage. Although only one example was shown, the combination with Cas9 enhances the possibility that TAR will be less reliant on skills and could become more widely applicable (25,26).
Recently a method called ‘CATCH’ (Cas9-Assisted Targeting of CHromosome Segments) used in vitro Cas9 cleavage and Gibson assembly (27) to clone large regions into BACs (28). This method, which has only been applied to prokaryotic genomes, appears to rely on a substantial primary PCR screen to identify candidate colonies for closer examination and is considerably more work than ExoCET. We also tested whether Gibson assembly would be an effective alternative first step for ExoCET, using direct cloning of mouse Wnt4 as the assay, and found that Gibson assembly produced a large empty vector background with or without RecET (Supplementary Table S7). Several other direct cloning applications have been described (Supplementary Table S8), but they also appear to be inefficient or restricted to specialist applications and none has been widely employed.

Consequently, there is a need for a direct cloning method that is easy to use and broadly applicable across a wide range of fragment sizes and genome complexities. By improving upon RecET direct cloning, we suggest that ExoCET will fulfill this need because it is a simple and efficient method generically applicable to a broad range of direct cloning challenges in terms of both fragment size (up to 106 kb so far) and genome complexity (to at least 3 × 10^9 bp). Furthermore, ExoCET presents advantages over PCR for amplification of DNA because it has a much higher fidelity, is not limited in size and does not scramble haplotypes.

Homology arm position plays an important role in the design of ExoCET direct cloning exercises. To obtain the highest efficiency, both homology arms should be positioned at the very end of the target fragment. Indeed, to benefit from the in vitro T4pol exonuclease and subsequent annealing step, one homology arm must be terminal. However, the other homology arm can be internal because RecE/RecT can locate it for HR. This information can be employed when using direct cloning to establish an expression construct.
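The homology-arm design rules stated above (both arms terminal for the highest efficiency, at least one terminal arm to benefit from in vitro annealing, and an internal arm tolerated at the other end) can be written down as a small decision helper. This Python sketch is only a restatement of those qualitative rules; the function name `classify_arm_layout`, the category labels and the notes are assumptions introduced here, and no quantitative efficiencies are implied.

```python
def classify_arm_layout(left_offset_nt, right_offset_nt):
    """Classify a homology-arm layout by how many arms are terminal.

    Each offset is the distance (nt) of an arm from its end of the target
    fragment; 0 means the arm sits at the very end. Hypothetical helper.
    """
    if left_offset_nt < 0 or right_offset_nt < 0:
        raise ValueError("offsets must be non-negative")
    terminal_arms = int(left_offset_nt == 0) + int(right_offset_nt == 0)
    if terminal_arms == 2:
        return "both_terminal"
    if terminal_arms == 1:
        return "one_terminal"
    return "no_terminal"


# Qualitative notes paraphrasing the design guidance in the text.
DESIGN_NOTES = {
    "both_terminal": "highest efficiency reported",
    "one_terminal": "in vitro annealing at the terminal arm, RecE/RecT HR at the internal arm",
    "no_terminal": "no benefit expected from the in vitro annealing step",
}

for layout in [(0, 0), (0, 1000), (1000, 1000)]:
    label = classify_arm_layout(*layout)
    print(layout, label, "-", DESIGN_NOTES[label])
```

A helper of this kind could be used when planning an expression construct, where one arm is deliberately placed internally to position the gene under a promoter.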
One homology arm can be directed to the very 3′ end of the target fragment whereas the 5′ end of the target gene can be precisely positioned under a promoter and ribosome recognition sequence by the choice of an internal homology arm. Alternatively, in the absence of usefully positioned restriction sites, it is possible to use Cas9 as an in vitro endonuclease to optimize by design the release of the target fragment for direct cloning by ExoCET into an expression construct. Among other applications, direct cloning from mammalian genomes, including DNA isolated from blood or patient-specific cell lines, will facilitate haplotype phasing of SNPs (29) as well as rapid cloning of HIT constructs for nuclease assisted targeting in human stem cells. The increasing relevance of human stem cells for biomedical research, isolated either from patients, cord blood or by somatic cell reprogramming has focused attention on methods to accurately engineer stem cell genomes. Engineering human genomes presents greater challenges than engineering laboratory mouse genomes because humans are genetically diverse. For HR, isogenicity, that is sequence identity, is obviously a critical issue as recognized many years ago for gene targeting in mouse ES cells (30). However, the impact of sequence mismatches in regions involved in HR is still not well understood. How much does a single mismatch (e.g. an SNP or an indel) debilitate HR? And how does the position of the mismatch close to or far from the intended recombination site affect efficiency? How do multiple mismatches affect HR efficiencies? These questions, and more, remain largely unanswered. Nevertheless it is clearly advisable to use exactly identical sequences for gene targeting. ExoCET presents a rapid way to obtain the ideal targeting construct both in terms of isogenicity and the optimal lengths of homology arms. 
As opposed to the application of PCR to amplify homology arms from genomic DNA, ExoCET is not severely limited in size or prone to mutagenesis, and furthermore it maintains haplotypic linkages. The ends of the homology arms can be selected according to both targeting efficiency and the genotyping strategy, whether by Southern blotting, junction PCR or now ExoCET. Hence ExoCET offers several advantages for personal genome surgery, especially when combined with CRISPR/Cas9 (11).

ExoCET is also potentially the most reliable method for genotyping. Whereas both Southern blotting and junction PCR may produce false positive signals, this is highly unlikely with ExoCET. Furthermore, Southern blotting in mammalian genomes is often complicated by the laborious search for a good hybridization probe, and junction PCR is restricted to short amplification products that do not encompass high GC or secondary structural contents. ExoCET bypasses these problems.

Through the ability to selectively acquire large DNA segments from complex genomic preparations including blood, ExoCET also presents options for diagnostics and pathology tests such as directed sequence acquisition for personal medicine or the isolation of DNA viruses from patient materials. ExoCET will have broad applications in functional and comparative genomics, as well as bioprospecting with prokaryotic biosynthetic pathways.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We thank David N.
Drechsel (Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany) for providing the Cas9 protein, Longfei Wu (Laboratoire de Chimie Bactérienne, Aix-Marseille Université, Marseille, France) for providing Photobacterium phosphoreum ANT-2200, Junying Miao (College of Life Science, Shandong University, Jinan, People’s Republic of China) for providing the mouse melanoma B16 cell line and human embryonic kidney 293T cell line, and Jinan Blood Center for providing human blood from multiple anonymous donors.

Author contributions: H.W., L.X., Y.Y., R.M., J.F., A.F.S. and Y.Z. designed the experiments. H.W., Z.L., R.J., J.Y. and A.L. performed the experiments. H.W., J.F., Y.Z. and A.F.S. wrote the manuscript.

FUNDING

International S&T Cooperation Program of China [ISTCP 2015DFE32850 to J.F.]; National Natural Science Foundation of China [31670097 to Y.Z.]; Shandong Innovation and Transformation of Achievements Grant [2014ZZCX02601 to Y.Z.]; Major Project of Science and Technology of Shandong Province [2015ZDJS04001 to Y.Z.]; 111 Project [B16030 to Y.Z.]; Recruitment Program of Global Experts in Shandong University (to Y.Z.); China Postdoctoral Science Foundation [2015T80710 to H.W.]; Postdoctoral Innovation Program of Shandong Province [201303110 to H.W.]; Key Research and Development Program of Shandong Province [2015GSF12101 to A.L.]; TUD Elite University Support the Best program (to A.F.S.); Deutsche Krebshilfe [110560 to A.F.S.]. Funding for open access charge: International S&T Cooperation Program of China [ISTCP 2015DFE32850].

Conflict of interest statement. R.M., A.F.S. and Y.Z. are shareholders in Gene Bridges GmbH, which holds exclusive commercial rights to recombineering.

REFERENCES

1. Fu J., Bian X., Hu S., Wang H., Huang F., Seibert P.M., Plaza A., Xia L., Müller R., Stewart A.F. et al. Full-length RecE enhances linear-linear homologous recombination and facilitates direct cloning for bioprospecting. Nat. Biotechnol. 2012; 30:440–446.
2. Bian X., Huang F., Stewart F.A., Xia L., Zhang Y., Müller R. Direct cloning, genetic engineering, and heterologous expression of the syringolin biosynthetic gene cluster in E. coli through Red/ET recombineering. Chembiochem. 2012; 13:1946–1952.
3. Bian X., Plaza A., Zhang Y., Müller R. Luminmycins A-C, cryptic natural products from Photorhabdus luminescens identified by heterologous expression in Escherichia coli. J. Nat. Prod. 2012; 75:1652–1655.
4. Bian X., Huang F., Wang H., Klefisch T., Müller R., Zhang Y. Heterologous production of glidobactins/luminmycins in Escherichia coli Nissle containing the glidobactin biosynthetic gene cluster from Burkholderia DSM7029. Chembiochem. 2014; 15:2221–2224.
5. Tang Y., Frewert S., Harmrolfs K., Herrmann J., Karmann L., Kazmaier U., Xia L., Zhang Y., Müller R. Heterologous expression of an orphan NRPS gene cluster from Paenibacillus larvae in Escherichia coli revealed production of sevadicin. J. Biotechnol. 2015; 194:112–114.
6. Yin J., Hoffmann M., Bian X., Tu Q., Yan F., Xia L., Ding X., Stewart A.F., Müller R., Fu J. et al. Direct cloning and heterologous expression of the salinomycin biosynthetic gene cluster from Streptomyces albus DSM41398 in Streptomyces coelicolor A3(2). Sci. Rep. 2015; 5:15081.
7. Wang J., Sarov M., Rientjes J., Fu J., Hollak H., Kranz H., Xie W., Stewart A.F., Zhang Y. An improved recombineering approach by adding RecA to lambda Red recombination. Mol. Biotechnol. 2006; 32:43–53.
8. Fu J., Teucher M., Anastassiadis K., Skarnes W., Stewart A.F. A recombineering pipeline to make conditional targeting constructs. Methods Enzymol. 2010; 477:125–144.
9. Hashimoto-Gotoh T., Sekiguchi M. Mutations of temperature sensitivity in R plasmid pSC101. J. Bacteriol. 1977; 131:405–412.
10. Pospiech A., Neumann B. A versatile quick-prep of genomic DNA from Gram-positive bacteria. Trends Genet. 1995; 11:217–218.
11. Baker O., Gupta A., Obst M., Zhang Y., Anastassiadis K., Fu J., Stewart A.F. RAC-tagging: recombineering and Cas9-assisted targeting for protein tagging and conditional analyses. Sci. Rep. 2016; 6:25529.
12. Wang H., Li Z., Jia R., Hou Y., Yin J., Bian X., Li A., Müller R., Stewart A.F., Fu J. et al. RecET direct cloning and Redαβ recombineering of biosynthetic gene clusters, large operons or single genes for heterologous expression. Nat. Protoc. 2016; 11:1175–1190.
13. Hofemeister H., Ciotta G., Fu J., Seibert P.M., Schulz A., Maresca M., Sarov M., Anastassiadis K., Stewart A.F. Recombineering, transfection, Western, IP and ChIP methods for protein tagging via gene targeting or BAC transgenesis. Methods. 2011; 53:437–452.
14. Zhang S., Barbe V., Garel M., Zhang W., Chen H., Santini C.L., Murat D., Jing H., Zhao Y., Lajus A. et al. Genome sequence of luminous piezophile Photobacterium phosphoreum ANT-2200. Genome Announc. 2014; 2:e0009614.
15. Murphy K.C. Lambda Gam protein inhibits the helicase and chi-stimulated recombination activities of Escherichia coli RecBCD enzyme. J. Bacteriol. 1991; 173:5808–5821.
16. Weller S.K., Sawitzke J.A. Recombination promoted by DNA viruses: phage lambda to herpes simplex virus. Annu. Rev. Microbiol. 2014; 68:237–258.
17. Muyrers J.P., Zhang Y., Buchholz F., Stewart A.F. RecE/RecT and Redα/Redβ initiate double-stranded break repair by specifically interacting with their respective partners. Genes Dev. 2000; 14:1971–1982.
18. Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J.A., Charpentier E. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science. 2012; 337:816–821.
19. Gasiunas G., Barrangou R., Horvath P., Siksnys V. Cas9-crRNA ribonucleoprotein complex mediates specific DNA cleavage for adaptive immunity in bacteria. Proc. Natl. Acad. Sci. U.S.A. 2012; 109:E2579–E2586.
20. Skarnes W.C., Rosen B., West A.P., Koutsourakis M., Bushell W., Iyer V., Mujica A.O., Thomas M., Harrow J., Cox T. et al. A conditional knockout resource for the genome-wide study of mouse gene function. Nature. 2011; 474:337–342.
21. Torsvik V., Goksoyr J., Daae F.L. High diversity in DNA of soil bacteria. Appl. Environ. Microbiol. 1990; 56:782–787.
22. Rappe M.S., Giovannoni S.J. The uncultured microbial majority. Annu. Rev. Microbiol. 2003; 57:369–394.
23. Charlop-Powers Z., Milshteyn A., Brady S.F. Metagenomic small molecule discovery methods. Curr. Opin. Microbiol. 2014; 19:70–75.
24. Larionov V., Kouprina N., Graves J., Chen X.N., Korenberg J.R., Resnick M.A. Specific cloning of human DNA as yeast artificial chromosomes by transformation-associated recombination. Proc. Natl. Acad. Sci. U.S.A. 1996; 93:491–496.
25. Lee N.C., Larionov V., Kouprina N. Highly efficient CRISPR/Cas9-mediated TAR cloning of genes and chromosomal loci from complex genomes in yeast. Nucleic Acids Res. 2015; 43:e55.
26. Kouprina N., Larionov V. Transformation-associated recombination (TAR) cloning for genomics studies and synthetic biology. Chromosoma. 2016; 125:621–632.
27. Gibson D.G., Young L., Chuang R.Y., Venter J.C., Hutchison C.A. 3rd, Smith H.O. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods. 2009; 6:343–345.
28. Jiang W., Zhao X., Gabrieli T., Lou C., Ebenstein Y., Zhu T.F. Cas9-assisted targeting of CHromosome segments CATCH enables one-step targeted cloning of large gene clusters. Nat. Commun. 2015; 6:8101.
29. Nedelkova M., Maresca M., Fu J., Rostovskaya M., Chenna R., Thiede C., Anastassiadis K., Sarov M., Stewart A.F. Targeted isolation of cloned genomic regions by recombineering for haplotype phasing and isogenic targeting. Nucleic Acids Res. 2011; 39:e137.
30. Riele H.T., Maandag E.R., Berns A. Highly efficient gene targeting in embryonic stem cells through homologous recombination with isogenic DNA constructs. Proc. Natl. Acad. Sci. U.S.A. 1992; 89:5128–5132.

© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
THiCweed: fast, sensitive detection of sequence features by clustering big datasets
Agrawal, Ankit; Sambare, Snehal V; Narlikar, Leelavati; Siddharthan, Rahul
doi: 10.1093/nar/gkx1251; pmid: 29267972
Abstract
We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical clustering approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1–2 h on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs, it is as accurate as or more accurate than all other tested programs. THiCweed performs best with large ‘window’ sizes (≥50 bp), much longer than typical binding sites (7–15 bp). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.
INTRODUCTION
Chromatin immunoprecipitation with sequencing (ChIP-seq) (1) is a widely used assay for determining transcription factor-binding sites (TFBS) in vivo. By crosslinking the in vivo DNA–protein complexes using formaldehyde, sonicating to break the DNA, precipitating the protein of interest using a specific antibody, reversing the crosslinks, sequencing the DNA fragments and mapping them to a reference genome, a genome-wide map of TFBS with a resolution of 100–200 bp can be obtained.
Newer variants like ChIP-exo (2) and ChIP-nexus (3), which promise even higher resolution, are gaining popularity. Typically these assays yield hundreds or, in large genomes, thousands to hundreds of thousands of binding sites per factor per cell type (4,5). TFBS are generally characterized by short conserved patterns or ‘motifs’ in the DNA sequence, commonly represented by ‘position weight matrices’ (PWMs) (6,7), a probabilistic representation where each position within a binding site is described by an independent categorical distribution over the 4 nucleotides. A key bioinformatic task is to identify these motifs, but ab initio motif detection using traditional tools such as MEME (8) and Gibbs samplers such as AlignACE (9,10) and PhyloGibbs (11,12) is a challenge on such large datasets. Additionally, it is common for factors to interact with DNA via co-factors and not directly, which means a mixture of different motifs may be found in the ChIP-seq data. A previous program by one of us, MuMoD (13), was targeted at the second of these problems: it simultaneously and sensitively finds multiple motifs in a given dataset. Other programs such as Chipmunk (14–16), Meme-Chip (17) and Weeder (18,19) find successive motifs sequentially, masking previously identified sites or sequences to find the next motif. The program we describe here, THiCweed, offers both speed and accuracy in finding multiple motifs in large datasets. It does not require prior information on the number of motifs or the lengths of the motif, since its approach is based on clustering rather than traditional motif finding, and the clustering is based on stringent statistical criteria. On synthetic data, we show that it outperforms all current alternatives greatly on speed and is close to the best current alternative in terms of accuracy. 
On real genomic data, it reveals an unusual complexity in the structure of sequence motifs, in particular in internal dependencies and in flanking sequence extending far beyond the core motif.
MATERIALS AND METHODS
There are two components to our approach.
First is an efficient method of divisive hierarchical clustering. Starting with one large cluster, we split it into two clusters (or three, the third consisting of poor matches to either cluster). The scoring is described below; it is based on the likelihood ratio of a sequence belonging to one or the other cluster, and is applied iteratively starting from an initial heuristic split. We then split each new cluster into two (or three) further clusters, and proceed until no further splits are possible. For each split, we apply stringent statistical criteria to accept or reject the split. Further optimizations are described in ‘Algorithm’.
Second, during this clustering process, we include shifts and reverse complements of individual sequences to find optimal clusters. This is implemented by considering fixed-size ‘windows’ of length W, one window within each sequence, whose position and orientation are sampled. Sequences may have variable length; we permit up to half the window to lie outside the sequence, with the missing nucleotides scored as N’s, so that for each sequence of length L, 2L configurations (L window positions and two orientations) are considered and the optimal window chosen. The default choice of W is one-third the median sequence length, that is, much longer than a typical TF motif. This, it turns out, constitutes an effective and fast implementation of an ab initio motif finder on large ChIP-seq datasets, in addition to detecting the variations in motif and sequence context alluded to above.
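The window scheme described above can be illustrated with a short Python sketch that enumerates the 2L configurations for a single sequence. This is not THiCweed's code: the function names are hypothetical, and the exact range of start offsets (from −⌊W/2⌋ to L − ⌊W/2⌋ − 1) is an assumption chosen so that at most half the window overhangs either end and exactly L positions result.

```python
# Illustrative sketch only; helper names are hypothetical, not THiCweed's API.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA string; N stays N."""
    return seq.translate(COMPLEMENT)[::-1]


def window_configurations(seq, w):
    """Return all 2L (start, strand, window) choices for one sequence.

    Assumption: starts run from -(w // 2) to L - (w // 2) - 1, so at most
    half the window lies outside the sequence; the overhang is filled with N.
    """
    half = w // 2
    padded = "N" * half + seq + "N" * w  # index start + half maps to seq[start]
    configs = []
    for start in range(-half, len(seq) - half):
        window = padded[start + half:start + half + w]
        configs.append((start, "+", window))
        configs.append((start, "-", reverse_complement(window)))
    return configs


configs = window_configurations("ACGTACGTAC", 4)
print(len(configs))  # 2L = 20 for L = 10
print(configs[0])    # (-2, '+', 'NNAC')
```

In a clustering pass, each of these candidate windows would be scored against a cluster and the best-scoring one kept for that sequence.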
THiCweed can also be used on sequences that have been previously aligned by a ‘feature’ (motif) to discover additional motifs/complexities, by disabling shifts and reverse complements, similar to the program No Promoter Left Behind (20,21), but we do not discuss this use here. Our divisive clustering is in contrast to typical (agglomerative) hierarchical clustering, where individual data points are formed into clusters, requiring O(N^3) or at best O(N^2 log N) time for N data points. We call our approach ‘Top–down hierarchical clustering’; and since its purpose is to weed out ‘signals’ in ChIP-seq peaks, we call the program ‘THiCweed’. (We considered ‘THC-weed’ but it may confuse search engines.)
Algorithm
Top–down hierarchical clustering
The algorithm and a typical run through it are portrayed in Figure 1 and described below. We first take the simpler case of input data that has been pre-aligned with all sequences of the same length, where we do not consider shifts and reverse complements of sequences. The steps are as follows:
(1) Initialize with one cluster containing all sequences.
(2) Split every current cluster C (initially just one cluster) into two clusters C1 and C2, using scoring and significance criteria described below. Sequences not consistently clustering with either C1 or C2 (as described below) are concatenated into a third cluster Cp. In each round, all these unclustered sequences from each division are concatenated into one cluster.
(3) After every two iterations of step (2), if the current state has more than two clusters, reassign the poor-scoring sequences (sequences whose likelihoods in their current cluster are low) to the ‘best’ available cluster.
(4) Repeat from (2), until no new clusters are formed and no reassignments are made.
Figure 1. (A) Flowchart for the hierarchical clustering algorithm. The initialization is with all sequences in one cluster. At every pass, an attempt is made to split every current cluster. Splits are accepted or rejected based on significance. Every two passes, a reassignment of low-scoring sequences to the best available cluster is made. When a pass has ended with no splits being made, the program terminates returning the current clusters. (B) A possible run for an input of 2000 sequences. The blue boxes represent cluster sizes, green arrows from ‘Split Cluster’ boxes indicate successful splits and red arrows indicate unsuccessful splits. Each horizontal row of ‘split cluster’ boxes represents one pass.
The user may specify a maximum number of desired clusters, and if the number of clusters at the end is greater than this, a dendrogram of current clusters is constructed and closest leaves are joined until the number of clusters is sufficiently reduced.
Scoring
Only windowed portions of sequences are scored. Let the window length be W. Consider a cluster C with N sequence windows in it, S_1, S_2, …, S_N.
The probability of seeing this data if all these windows were drawn from the same PWM model is:
\begin{equation*} P(C)=\prod _{i=1}^{W}\frac{\Gamma (4c)\prod _{\alpha }\Gamma (n_{i\alpha }+c)}{\Gamma (\sum _{\alpha }n_{i\alpha }+4c)\,\Gamma (c)^{4}} \end{equation*} (1)
where |$n_{i\alpha }$| is the number of times nucleotide α appears in column i, and c is a pseudocount (0.5 by default). If the cluster contains a single sequence, this expression reduces to |$\left(\frac{1}{4}\right)^{W}$|. The likelihood that a sequence window S is sampled from the same PWM as the sequences in a cluster C that contains N sequences is:
\begin{equation*} P(S|C) = \frac{P(S.C)}{P(C)} = \prod _{i=1}^{W} \frac{n_{iS_i}+c}{N+4c} \end{equation*} (2)
where S.C denotes the cluster C with the window S added, |$S_i$| indicates the i’th nucleotide in sequence window S and |$n_{iS_i}$| is the number of occurrences of that nucleotide at position i in the cluster. When splitting a cluster, an initial split is made by ranking each sequence by its likelihood of belonging to that cluster, and moving the ‘best’ 25% to another cluster. Then sequences are selected in random order, removed from their current cluster and re-assigned to the more likely cluster, considering all possible window choices (position and orientation) within the sequence during the reassignment, until no further reassignments are made. The significance of the split is assessed using two criteria. First, we demand that the log ratio of the likelihoods of the two clusters to the likelihood of the unsplit cluster, as calculated from Equation 1, exceed a threshold, calculated from the log likelihood ratio (LLR) of two columns being cleanly separated in nucleotide composition. That is, suppose the two clusters consisted of random sequences, and were split on a single position (say, one cluster contained only A or C in that position, the other only G or T), while the nucleotides at all other positions are evenly distributed.
This is not a significant split (it is always possible to do this, or better, for any cluster). Call the log likelihood ratio in this case L1. However, if the clusters differed in this manner in two positions (one cluster contained only A or C in those two positions, the other only G or T), this would be significant. Call the log likelihood ratio of this split L2. We demand that the LLR of the split performed be at least L1 + T(L2 − L1), where T is a parameter set to 0.4 by default (Supplementary Data). L1 and L2 can be calculated quickly using Equation 1. Second, we demand that the splits be reproducible, using the following approach: we perform the split four times with four random initializations. With the resulting four pairs of clusters, we demand that at least three of the six pairwise cluster comparisons that result have an adjusted Rand index (ARI) (22) greater than a threshold r (by default 0.2). An ARI of 1.0 indicates perfect agreement while random clusterings would have ARIs close to zero. If the three pairwise comparisons between the first three splits each exceed r, the fourth split is not performed. If the split is accepted, the three pairs of clusters resulting from the three splits are identified based on majority membership, and sequences that failed to be consistently clustered by this criterion (that is, did not cluster in the same way according to this association) are put in a third cluster. Splits that fail one of these two significance criteria are rejected, that is, the split clusters are joined again and returned to the pool. Both parameters T and r are user adjustable. The default values were chosen based on benchmarks on synthetic data, as discussed in Supplementary Data. When reassigning sequences (step 3 of the algorithm), we consider the poorest 20% of the sequences (measured by their likelihoods in their current clusters).
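As a rough illustration of the scoring above, the following Python sketch evaluates Equations 1 and 2 in log space and computes the threshold L1 + T(L2 − L1). It is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, windows are assumed to contain only A, C, G and T (the treatment of N is not covered), and L1 and L2 are computed for an idealized split into two equal halves with perfectly even (possibly fractional) counts, which is only one plausible reading of the description.

```python
import math

ALPHABET = "ACGT"


def log_column(counts, c=0.5):
    """Log of one column's factor in Equation 1, given counts for A, C, G, T."""
    return (sum(math.lgamma(n + c) for n in counts) + math.lgamma(4 * c)
            - math.lgamma(sum(counts) + 4 * c) - 4 * math.lgamma(c))


def log_p_cluster(windows, c=0.5):
    """log P(C) of Equation 1 for equal-length windows over A, C, G, T."""
    return sum(log_column([column.count(a) for a in ALPHABET], c)
               for column in zip(*windows))


def log_p_window(window, windows, c=0.5):
    """log P(S|C) of Equation 2 for a window S and the windows of cluster C."""
    n = len(windows)
    return sum(math.log((column.count(s) + c) / (n + 4 * c))
               for s, column in zip(window, zip(*windows)))


def split_threshold(n, width, t=0.4, c=0.5):
    """Return (L1, L2, L1 + t * (L2 - L1)).

    Assumption: a cluster of n random windows is split into two equal halves
    that are cleanly separated (A/C versus G/T) at one column (L1) or two
    columns (L2), with all other columns evenly distributed.
    """
    half = n / 2
    unsplit = log_column([n / 4] * 4, c)               # column before the split
    even = log_column([half / 4] * 4, c)               # even column in one half
    clean = log_column([half / 2, half / 2, 0, 0], c)  # separated column

    def llr(k):
        # Both halves contribute equally by symmetry, hence the factor 2.
        return 2 * (k * clean + (width - k) * even) - width * unsplit

    l1, l2 = llr(1), llr(2)
    return l1, l2, l1 + t * (l2 - l1)


# A single-sequence cluster gives (1/4)^W, as stated after Equation 1.
print(log_p_cluster(["ACGT"]), 4 * math.log(0.25))
print(split_threshold(1000, 50))
```

Working with log-gamma values avoids overflow for clusters of thousands of windows, and Equation 2 can be checked numerically as the difference between log P(C) with and without the added window.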
For each sequence S, we first remove it from its current cluster, then calculate P(C′) for each available cluster C, where C′ = C + S, using the above formula, and add it to the best cluster. In practice, on average 4% and at most about 10% of the sequences considered in this step get reassigned.
Benchmarking: synthetic data
We generated synthetic datasets consisting of sequences of length 100 bp each, with motifs drawn from random PWMs placed within the central 40 bp of these sequences, and otherwise random (each nucleotide having probability 0.25). The PWMs had columns sampled from Dirichlet distributions with uniform hyperparameter c (i.e. each column |$\boldsymbol{v}$|, denoting the probability distribution over the four bases A, C, G and T, was independently sampled from the distribution |$P(\boldsymbol{v}) \propto \prod _{\alpha } v_\alpha ^{c-1}$|). Drawing from a Dirichlet distribution with a low value of c is more likely to result in a probability distribution that is highly skewed, i.e. different from a uniform 0.25 probability per base. This skewness decreases as c increases, a high value of c making the motif less distinguishable from background. Five datasets were generated with c = 0.1, 0.2, 0.3, 0.4 and 0.5. Each dataset consisted of 20 files, with each file having sequences containing between two and five distinct motifs (one motif per sequence), the motifs drawn from PWMs of a ‘core’ width of 5–10 bp and a tapering ‘flank’ to a full width of 10–20 bp (to reflect what is often seen in real data, as described below). The core positions were drawn from Dirichlet distributions with the hyperparameter c as described above, while the flanks tapered off rapidly from the core c to a hyperparameter of 20 (essentially a uniformly random vector). The performance of the programs and therefore the conclusions do not change when the flank is omitted (not shown). Each sequence contained one motif, and each dataset contained motifs drawn from a small number of PWMs.
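The generation scheme can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch rather than the script used for the benchmark: the function names are hypothetical, the tapering flank is omitted (its exact schedule is not given here), and each Dirichlet column is obtained by the standard construction of normalizing independent Gamma(c, 1) draws.

```python
import random

BASES = "ACGT"


def sample_pwm(width, c, rng):
    """Sample `width` columns, each from a symmetric Dirichlet with parameter c."""
    pwm = []
    for _ in range(width):
        gammas = [rng.gammavariate(c, 1.0) for _ in BASES]
        total = sum(gammas)
        pwm.append([g / total for g in gammas])
    return pwm


def embed_motif(pwm, rng, length=100, central=40):
    """Return (sequence, start): a uniformly random sequence with one motif
    instance drawn from `pwm` placed entirely inside the central region."""
    seq = [rng.choice(BASES) for _ in range(length)]
    lo = (length - central) // 2
    start = rng.randint(lo, lo + central - len(pwm))  # inclusive bounds
    for offset, column in enumerate(pwm):
        seq[start + offset] = rng.choices(BASES, weights=column)[0]
    return "".join(seq), start


rng = random.Random(0)          # fixed seed, for reproducibility only
pwm = sample_pwm(8, 0.2, rng)   # small c gives sharply skewed columns
sequence, start = embed_motif(pwm, rng)
print(start, sequence[start:start + 8])
```

A file with several motifs would repeat `embed_motif` with a different PWM for each group of sequences, recording which PWM generated each sequence as the known clustering.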
The number and lengths of PWMs were varied across datasets for each c, but the distribution of numbers and lengths was the same for different c’s. Figure 2 shows synthetic motifs for c = 0.1, 0.3 and 0.5, all with a core width of 6 bp and a full width of 20 bp.
Figure 2. (A) Examples of embedded synthetic motifs. In this case all these have core widths of 6 bp and full widths of 20 bp, which are common to corresponding files in all datasets. The PWMs are sampled from different values of c, which varies from the indicated value in the core to a large value of 20 at the periphery. This is intended to model the appearance of motifs observed in real data. (B) and (C): ARI (higher is better) of predicted clustering to known clustering of synthetic datasets, containing motifs drawn from PWMs sampled column wise from Dirichlet distributions with hyperparameter c. Error bars in black (standard error from 20 datasets). (B) In the case of 1000 seqs/file, THiCweed is competitive but somewhat inferior on this metric to MuMoD and ChipMunk, and somewhat superior to MemeChip (meme mode). (C) With 5000 seqs/file, comparing the better-performing programs from the previous panel, THiCweed is very close to MuMoD in performance.
THiCweed and five other programs (Peak-Motifs (23), MuMoD, Chipmunk, Meme-Chip and Weeder2) were run on these sets, in multiple-motif ZOOPS mode (zero or one occurrence of a motif per sequence). The ‘known’ clustering of the set was the assignment of sequences to PWMs and the ‘predicted’ clustering for each program was the assignment of sequences to predicted motifs. The known and predicted clusters were compared using the ARI, and the results plotted as a function of c. Higher ARI indicates a better match between the clusterings, with 1.0 indicating perfect agreement and 0.0 being the value expected by chance. Two such datasets are shown here, with dataset 1 containing 1000 sequences per file and dataset 2 containing 5000 sequences per file. The ARIs are averaged over all 20 files for each value of c in each dataset.
Commandline options:
THiCweed: no additional parameters.
MuMoD: default parameters were used for the curves marked ‘MuMoD’. For ‘MuMoD(i)’ the true number of motifs was specified.
ChipMunk: in all runs, the correct number of motifs was specified. The length of the motif was given as 7:20.
Weeder: default options, but with a background frequency model derived from synthetic data.
Meme-Chip (meme): dreme was disabled with ‘-dreme-m 0’, and the known number of motifs specified with ‘-meme-nmotifs’, with default parameters otherwise.
Meme-Chip (dreme): meme was disabled with ‘-meme-nmotifs 0’.
Peak-Motifs: default parameters were used.
Other notes
Despite the ‘filter’ keyword used in the command line, Chipmunk sometimes predicts multiple motifs per sequence because it searches for matches for predicted motifs in all sequences.
For computing the ARI, each sequence was classified to the best-matching motif, as per the score reported by Chipmunk. The same was done for Peak-Motifs. In addition, sequences where no motifs were reported were assigned to an additional cluster.
ENCODE data
Here we used data from the ENCODE project (4,5,24), consisting of ChIP-seq peaks. Narrowpeak files were downloaded from the ENCODE website. Seventy-five base pairs of flanking sequence were taken around each peak location, and repetitive regions (lowercase sequence in chromosome files downloaded from the UCSC Genome Browser (25), identified using RepeatMasker and Tandem Repeat Finder with period of 12 or less) were rejected for the purposes of this work. The cell types and ENCODE accession numbers for various factors portrayed in Figures 4–7 are as follows, and full output is available on the web server:
Factor                  Cell type      Accession number
BATF                    GM12878        ENCSR000BGT ENCFF002CGQ
BCL11A                  GM12878        ENCSR000BHA ENCFF002CGR
ELK1 (Figure 4:1)       HeLa-S3        ENCSR000ECI ENCFF001VIJ
ELK1 (Figure 4:2)       A549           ENCSR623KNM ENCFF818TAN
FOS                     HeLa-S3        ENCSR000EZE ENCFF001VHZ
FOXA1                   Ishikawa       ENCSR000BKW ENCFF002CGL
GATA1                   erythroblast   ENCSR000EXP ENCFF001VQR
GATA2                   K562           ENCSR000EWG ENCFF001VNE
IRF1 (Figure 5)         K562           ENCSR000EGL ENCFF002CWW
IRF1 (Figure 4)         K562           ENCSR000EGT ENCFF001VNN
IRF3                    GM12878        ENCSR408JQO ENCFF735DCQ
MAX                     HeLa-S3        ENCSR000EZF ENCFF001VIT
MEF2C                   GM12878        ENCSR000BNG ENCFF002CHD
MYC                     K562           ENCSR000EGS ENCFF002CWF
NFYA                    HeLa-S3        ENCSR000DNS ENCFF002CSU
NR2F2                   K562           ENCSR000BRS ENCFF002CME
RELA                    GM12878        ENCSR000EAG ENCFF001VET
REST                    Panc1          ENCSR000BJO ENCFF002CNA
RFX5                    HepG2          ENCSR000EEA ENCFF002CUT
SIX5                    GM12878        ENCSR000BJE ENCFF002CHU
SP1                     K562           ENCSR000BKO ENCFF002CMN
SP2 (Figure 4:1)        H1-hESC        ENCSR000BQG ENCFF002CJL
SP2 (Figure 4:2,3,4)    HepG2          ENCSR000BOU ENCFF002CLC
STAT5A                  K562           ENCSR000BRR ENCFF002CMQ
TAL1                    K562           ENCSR000EHB ENCFF002CYH
TEAD4                   K562           ENCSR000BRK ENCFF002CMT
ZNF143                  GM12878        ENCSR000DZL ENCFF002CPW
The ZNF143 clusters were compared with nucleosome positioning data in the same cell type (GM12878) from ENCODE and PhastCons (26) phylogenetic conservation data (with other primates) from the UCSC genome site (25), distances from nearest transcriptional start sites (TSS) and DNAse-seq values from ENCODE, using custom python scripts. For TSS, we used the refGene data from the hg19 release on the UCSC genome browser site.
RESULTS
Synthetic data
Results for the two datasets described in the ‘Materials and Methods’ section are plotted in Figure 2B and C for c = 0.1, 0.2, 0.3, 0.4 and 0.5 (a smaller value of c corresponds to sharper motifs).
In all cases THiCweed was run with default parameters, and in particular, a ‘window size’ of 33 bp, or one-third the median input sequence length. As noted, it is designed to be run with large window sizes on real genomic data. Also, the stringent criteria for splitting a cluster ensure that spurious clusters are unlikely, so setting the maximum number of clusters helps only marginally (not shown). Since clusters are split according to significance criteria, there is no option to set a minimum number of clusters. MuMoD was run both with default parameters (‘MuMoD’) and with the additional information of the number of motifs (‘MuMoD(i)’); the latter provides only marginal improvement. Chipmunk (in ChipHorde mode) requires the exact number of motifs to be specified, which was done in these cases, and the range of motif lengths was given. Meme-Chip with its default options runs the MEME motif finder on a random subset of the input data, with inferior results. Forcing MEME to run on the full set improved the results, at a significant cost in running time. For comparison, we also disabled MEME entirely in favor of DREME, a heuristic approach based on regular expressions rather than PWMs. Weeder2 was run with default options but a background model derived from synthetic data, as described in Methods. With 1000 seqs/set, THiCweed is competitive with MuMoD and ChipMunk on this metric. Only the best performers were tested with 5000 seqs/set. All programs show improved performance here, because the motif strength stays the same while the background ‘noise’ falls as |$N^{-\frac{1}{2}}$| with increasing number of sequences N. But THiCweed’s improvement is sharper: it catches up with MuMoD and is largely superior to ChipMunk. The reason for the poor performance of Peak-Motifs seems to be its prediction of a very large number of motifs that are minor variations of one another.
While it is hard to judge the relevance of this for real data, in the case of synthetic data these are certainly spurious and THiCweed’s statistical criteria for splitting help it avoid this problem.
Running times: synthetic data
Figure 3A shows running times of all the programs tested, except Peak-Motifs, for synthetic input data consisting of 200, 400, 600, 800 and 1000 sequences, each 1000-bp long and containing two different motifs, each of length 10 sampled with Dirichlet parameter 0.2, in 60:40 proportion. Meme-Chip in MEME mode is an outlier: though its accuracy is not very far behind other programs (Figure 2), its running time would seem to disqualify it from realistic datasets (and indeed it disables MEME by default for sequence sets larger than about 600 × 100 bp). It appears that, of the other programs, Chipmunk and Meme-Chip (Dreme mode) have runtimes increasing roughly linearly with data size; MuMoD and Weeder running times increase superlinearly; and THiCweed’s increase is somewhat sublinear. The figure also shows running times on genomic data from ENCODE, discussed below.
Figure 3. Running time of various programs as the size of the dataset varies for (A) synthetic data and (B) data from the ENCODE project. (C) THiCweed’s performance on real data varies significantly with the complexity of the sequence features. Nevertheless, it remains on average much faster than other programs (Peak-Motifs was not tested but it is the fastest in this comparison).
ChIP-seq data from the ENCODE project
Running on actual genomic data yields a variety of results depending on the factor being examined and the size of the dataset. THiCweed has no prior knowledge of the number of different motif clusters, but by default reports a maximum of 15. In some cases far fewer are reported. Because of the statistical criteria on splitting clusters that we use, described in the ‘Materials and Methods’ section, we believe that large numbers of clusters, if produced, are statistically significant, but THiCweed can recluster the output into smaller numbers of clusters for ease of visualization, and this is done in some cases here. Also, it works with window sizes much larger than the typical motif lengths that one considers; here we used 50 bp. We compare the discovered motifs to previously reported motifs from JASPAR (27,28); the THiCweed website also includes comparisons to motifs from HocoMoco (29) and FactorBook (30).
Ubiquitous ‘zinger’ motifs
Hunt and Wasserman (31) observed that certain TF motifs occur repeatedly in different ChIP-seq datasets, which they termed ‘zingers’. In particular they identified CTCF-like, JUN-like, ETS-like and THAP11-like motifs in multiple datasets. We see all of these in our analysis of ENCODE data too (for example, the THAP11-like and CTCF motifs occur in Figure 7), but several other motifs also appear across multiple experiments. Figure 4 shows examples that resemble IRF1, SP2, GATA1, NFYB, REST and a novel motif that we could not identify. Of these, SP2 and the novel motif are roughly as ubiquitous as CTCF. Both frequently co-occur with CTCF and the SP2-like motif tends to be concentrated near TSS (an example is in Figure 7). We suspect a role for these in chromatin organization, a topic to be explored in future work.
Figure 4. Motifs that occur across multiple ChIP-seq datasets, in addition to zinger motifs identified in (31). The factor for which the motif is the canonical motif according to JASPAR is indicated at the top of each column, together with the JASPAR sequence logo. Below are datasets for various other TFs where THiCweed finds the same motif.
Also noteworthy is the appearance of a secondary motif in multiple cases for the GATA-like and NFYB-like motifs, and the variable spacing of the REST-like motif. The canonical motif has two halves, TCAGCACC and GGACAG, separated by 2 nt. But we pick up variants, previously described in (32), with longer spacing (8 and 9 bp here). Such widely spaced motifs cause problems for conventional motif-finders, but are readily picked up in our approach.
Examples of THiCweed output
Figure 5 shows four examples of motif output. In some cases the output has been reclustered and filtered for compactness of viewing; complete results for these and many more factors are available on the THiCweed website.
Figure 5. Sample THiCweed output on four ChIP-seq datasets: IRF1 (5543 peaks), NFYA (4497 peaks), REST (3998 peaks) and FOXA1 (4029 peaks). Not all output clusters are shown here. The full output is available on the THiCweed website.
We make the following observations: Zinger motifs are widespread here. The SP1-like motif that we documented above occurs in IRF1 and NFYA.
The unidentified motif in the previous section appears in REST and FOXA1. CTCF occurs in NFYA and FOXA1. ETS-like occurs in IRF1. The canonical motif for IRF1 occurs in two clusters, one of which has an additional poly-T tail. Similarly, the canonical motif for NFYA appears in three clusters, one of which also exhibits a weak secondary motif to the left. The canonical REST motif occurs as a closely spaced dimer (fourth cluster), partial closely spaced dimer (fifth cluster), monomer (third cluster) and a widely spaced dimer (second cluster). All of these variants also occur in THiCweed output for SP2 (Figure 6), suggesting an interaction between SP2 and REST. The widely spaced dimer is not picked up by other motif finders.
Figure 6. Comparison of clustering of 2019 peaks for SP2 by THiCweed, with motifs found by three other programs.
A much larger collection of THiCweed output on ENCODE factors is available on the website. Features similar to those noted above are ubiquitous.
Comparison with other programs
Figure 6 compares the output of THiCweed with three other programs. All programs pick up the main motif (though with varying numbers of instances). All also pick up the REST motif, but only THiCweed picks up the widely spaced version in one piece. THiCweed also seems to reveal a larger surrounding sequence context in many cases, notably for the SP1-like motif which generally occurs in a CG-rich background. Peak-Motifs identifies a very large number of motifs, most of which appear to be minor variations of the main motif. This may explain the poor performance of Peak-Motifs on our synthetic benchmark: the ARI would penalize breaking up clusters into smaller clusters.
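The remark that the ARI penalizes fragmented clusters can be checked with a small calculation. The sketch below is a minimal, self-contained Python implementation of the adjusted Rand index (22); it is not the code used for the benchmark, and the toy labels are invented for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same items."""
    if len(truth) != len(predicted):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    # Pairs of items placed together in both, in truth, and in predicted.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Toy example (hypothetical): two true motif clusters of six sequences each.
truth = [0] * 6 + [1] * 6
exact = [0] * 6 + [1] * 6
# No sequence is assigned across a true boundary, but each true cluster
# is broken into three smaller clusters.
fragmented = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

ari_exact = adjusted_rand_index(truth, exact)
ari_fragmented = adjusted_rand_index(truth, fragmented)
print(ari_exact, round(ari_fragmented, 3))
```

In this toy case the fragmented labeling never mixes sequences from different true clusters, yet its ARI falls from 1 to 3/14 (about 0.21), which is the kind of penalty referred to above.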
Biological relevance of these clusters
We typically find several different motifs, variants of a motif and a few apparently uninformative clusters in THiCweed runs. The biological significance of these is suggested by comparison with other genomic features, such as phylogenetic conservation (via PhastCons scores (26) from the UCSC Genome Browser (25)), nucleosome occupancy and DNAse-seq data (from ENCODE (4)). Figure 7 compares each of eight clusters for ZNF143 with a plot of conservation, nucleosome occupancy (in an extended region of 1000 bp on each side), distance to the nearest TSS and DNAse-seq values. Cluster 6 (SP2-like motif) tends to be concentrated close to TSSs (mostly within 1000 bp, a pattern we see consistently), shows little phylogenetic conservation and no sign of nucleosome positioning. Cluster 1 (a motif resembling THAP11, identified in (31) as a zinger motif), too, is concentrated near TSSs; it too shows little effect on nucleosome positioning, but is strongly conserved. Cluster 7, resembling the REST motif, is spread away from TSSs, is phylogenetically conserved and has an effect on nucleosome positioning (which we observe in other datasets where this motif occurs).
Figure 7. Comparison of sequence clusters of 14 937 ZNF143 ChIP-seq peaks with DNAse-seq values (color scale: blue = open, red = closed), nucleosome occupancy (color scale: white = 0, brown = 5+), PhastCons conservation score (color scale: white = 0, dark blue = 1) and distance from nearest TSS, suggesting connections between the motif structure in different sequence clusters, biological function, evolutionary conservation pressure, nucleosome positioning and open/closed chromatin.
Cluster 8 seems uninformative, but it appears concentrated near the TSSs (within about 5000 bp), which would likely not happen if it consisted only of random unclusterable sequences left over from the other clusters. The remaining clusters are variants of the CTCF motif; cluster 5 includes the previously documented ‘M2’ motif. Cluster 3 appears different from other CTCF clusters in that it occurs in a GC-rich background, is more concentrated near TSS (mostly within about 10 000 bp), appears a little less conserved and a little less effective at nucleosome positioning, with more open chromatin as shown by DNAse.
Running times: ENCODE data
Figure 3B shows the results of THiCweed, MuMoD, ChipMunk and Meme-Chip (MEME mode) on real ENCODE data, consisting of 400–2000 random samples from a set of CTCF ChIP-seq peaks (dataset ENCFF001USS). The results are similar to those on the synthetic data, except that, somewhat surprisingly, Meme-Chip is faster than MuMoD and ChipMunk on larger datasets. Figure 3C shows the running time of THiCweed as a function of the number of clusters found, on 92 ChIP-seq datasets each consisting of 27 000–33 000 peaks, across multiple TFs and cell lines. The running time increases with the number of clusters, but somewhat sublinearly. On such realistic ChIP-seq datasets, THiCweed’s running time is about two orders of magnitude less than that of MuMoD, which can take days, and THiCweed is also much faster than all other programs tested.
Meme-Chip uses the MEME step on only a small fraction of the input sequences; and Weeder2 learns motifs from a small fraction of the sequences and uses those to analyze the rest (19). THiCweed processes the majority of files of this size in under 2 h, with interesting and biologically relevant results.
DISCUSSION
Motif finding in large datasets produced by ChIP-seq and similar experiments is a qualitatively different problem in complexity from what traditional motif finders are designed to handle. Additionally, one could liken the problem of finding rarely occurring motifs to finding a needle in a haystack. We view THiCweed’s approach as ‘sequence feature analysis’ (over large windows) rather than ‘motif finding’ (detection of short patterns). Our novel clustering algorithm can comfortably handle tens of thousands of sequences at a time, even with significant heterogeneity in motif content. It successfully picks up biologically relevant motifs even when they occur in fewer than 5% of the input sequences, such as the REST-like motif in ZNF143 (cluster 7 in Figure 7). Its large window size enables it to also pick up secondary motifs like the M2 CTCF motif in the ZNF143 data (Figure 7), the widely spaced dimer in SP2 and REST (Figures 5 and 6), and peripheral features such as an overall CG-richness in some motifs (e.g. CTCF-like cluster 3 in Figure 7). The significance criterion used for splitting, and the differences in biological parameters in Figure 7, suggest that these differences are important and are not artifacts. Uniquely among the programs we have tested, THiCweed achieves its combination of speed and accuracy without resorting to heuristics in scoring (as DREME and Chipmunk do, using regular expressions and ‘seeding’ respectively) and without resorting to training on a small subset of the sequences (as Weeder does).
THiCweed’s clustering algorithm is stochastic, but is essentially similar to an iterated K-means clustering with K = 2, with significance criteria to avoid spurious splits. Instead of invoking pairwise distances and calculating a centroid, however, we calculate multinomial likelihoods correctly within the limitations of the PWM assumption. The clustering algorithm and wide-window approach ensure that little or no prior information is required to run the program: significant short motifs can be found inside longer windows by eyeballing, but other relevant sequence features can be picked up too. A possible shortcoming is that within THiCweed’s framework, only one motif occurrence per sequence will be detected (unless two motifs co-occur with a restricted spacing, as in the extended REST motif and the secondary M2 CTCF motif). Sequences that match no dominant motif may end up in a relatively uninformative cluster such as cluster 8 in Figure 7. One may ask whether, in clusters that do not match the canonical motif, the motif nevertheless occurs elsewhere in some of the peaks in addition to the non-canonical motif in the cluster. We checked for this possibility exhaustively using FIMO (33), with a q-value threshold of 10⁻³, and found that, out of 93 TFs that have a PWM in JASPAR, only 25 reported any sequences at all that were not clustered with a canonical motif match but that nevertheless showed hits for the JASPAR motif. These too showed matches in very few cases; the exceptions were SP2 and EGR1, both of which have GC-rich canonical motifs, which reported matches in about 15 and 14%, respectively, of such sequences. It would therefore seem that the ‘missing’ of canonical motifs because of the occurrence of other strong motifs within ChIP-seq peaks is not a common concern in practice. At the moment, THiCweed runs on a single CPU core, but significant speedups are possible by parallelization.
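The idea of scoring a two-way split with multinomial likelihoods under the PWM assumption, rather than with pairwise distances, can be sketched in Python as follows. This is an illustrative sketch only and not THiCweed's implementation: it assumes a symmetric Dirichlet prior on each PWM column (the pseudocount `alpha` is an arbitrary choice), it treats columns as independent, and all function names and toy sequences are hypothetical. The actual significance criterion is the one described in ‘Materials and Methods’.

```python
from math import lgamma

ALPHABET = "ACGT"


def column_log_likelihood(counts, alpha=0.5):
    """Log marginal likelihood of one column's base counts under a
    multinomial model with a symmetric Dirichlet(alpha) prior (assumed)."""
    k = len(counts)
    total = sum(counts)
    result = lgamma(k * alpha) - lgamma(total + k * alpha)
    for c in counts:
        result += lgamma(c + alpha) - lgamma(alpha)
    return result


def cluster_log_likelihood(seqs, alpha=0.5):
    """Sum of column log-likelihoods for equal-length aligned sequences
    (the PWM assumption: positions are independent)."""
    if not seqs:
        return 0.0
    score = 0.0
    for pos in range(len(seqs[0])):
        counts = [sum(1 for s in seqs if s[pos] == base) for base in ALPHABET]
        score += column_log_likelihood(counts, alpha)
    return score


def split_gain(left, right, alpha=0.5):
    """Log-likelihood gained by modeling two groups with separate PWMs
    instead of one shared PWM; positive values favor the split."""
    merged = cluster_log_likelihood(left + right, alpha)
    separate = (cluster_log_likelihood(left, alpha)
                + cluster_log_likelihood(right, alpha))
    return separate - merged


# Toy data (hypothetical): two groups carrying different 6-mers.
motif_a = ["ACGTAC"] * 8
motif_b = ["TTGACA"] * 8

gain_distinct = split_gain(motif_a, motif_b)        # genuinely different groups
gain_same = split_gain(motif_a[:4], motif_a[4:])    # one group cut arbitrarily
print(round(gain_distinct, 2), round(gain_same, 2))
```

In this toy case, splitting the two distinct motifs raises the likelihood, whereas splitting a homogeneous group lowers it, which is the behavior a significance criterion against spurious splits relies on.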
In cases where there is a profusion of similar but slightly different motif patterns as well as an occurrence of many different motifs (as in the ZNF143/CTCF case), it appears that the differences may have biological significance, as reflected by nucleosome positioning and phylogenetic conservation. We plan to explore this, and the significance of some of the novel zinger motifs, further in future work.
AVAILABILITY
The software is open source and available for download at http://www.imsc.res.in/~rsidd/thicweed/ under the two-clause BSD license. An online web server is also available, linked on the above page, and can be used for modest-sized jobs.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
We thank Vishaka Datta and Arvind Shankar for useful discussions. The plots in Figure 7 were generated with a script written by Arvind Shankar for a different project which will be reported elsewhere.
Author contributions: A.A., L.N. and R.S. conceived the basic algorithm. A.A. implemented an early prototype. R.S. implemented the current version in the Julia language. A.A., L.N. and R.S. contributed to various refinements in development. S.V.S. and R.S. implemented the web server. All authors contributed to the benchmarks described here and to the analysis of ENCODE ChIP-seq data. All authors read and approved the final manuscript.
FUNDING
Wellcome Trust-DBT India Alliance Early Career Fellowship 500188/Z/11/Z (to L.N.); Department of Atomic Energy, Government of India (PRISM 12th plan project) (to R.S., S.V.S.). Funding for open access charge: Institute of Mathematical Sciences, Chennai.
Conflict of interest statement. None declared.
REFERENCES
1. Johnson D.S., Mortazavi A., Myers R.M., Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007; 316:1497–1502.
2. Rhee H.S., Pugh B.F. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 2011; 147:1408–1419.
3. He Q., Johnston J., Zeitlinger J. ChIP-nexus enables improved detection of in vivo transcription factor binding footprints. Nat. Biotechnol. 2015; 33:395–401.
4. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74.
5. Sloan C.A., Chan E.T., Davidson J.M., Malladi V.S., Strattan J.S., Hitz B.C., Gabdank I., Narayanan A.K., Ho M., Lee B.T. et al. ENCODE data at the ENCODE portal. Nucleic Acids Res. 2016; 44:D726–D732.
6. Stormo G.D., Hartzell G.W. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. U.S.A. 1989; 86:1183–1187.
7. Hertz G.Z., Hartzell G.W., Stormo G.D. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 1990; 6:81–92.
8. Bailey T.L., Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1994; 2:28–36.
9. Roth F.P., Hughes J.D., Estep P.W., Church G.M. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 1998; 16:939–945.
10. Hughes J.D., Estep P.W., Tavazoie S., Church G.M. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 2000; 296:1205–1214.
11. Siddharthan R., Siggia E.D., van Nimwegen E. PhyloGibbs: A Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput. Biol. 2005; 1:e67.
12. Siddharthan R. PhyloGibbs-MP: module prediction and discriminative motif-finding by Gibbs sampling. PLoS Comput. Biol. 2008; 4:e1000156.
13. Narlikar L. MuMoD: a Bayesian approach to detect multiple modes of protein-DNA binding from genome-wide ChIP data. Nucleic Acids Res. 2013; 41:21–32.
14. Kulakovskiy I.V., Boeva V., Favorov A.V., Makeev V.J. Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics. 2010; 26:2622–2623.
15. Kulakovskiy I., Levitsky V., Oshchepkov D., Bryzgalov L., Vorontsov I., Makeev V. From binding motifs in ChIP-Seq data to improved models of transcription factor binding sites. J. Bioinform. Comput. Biol. 2013; 11:1340004.
16. Levitsky V.G., Kulakovskiy I.V., Ershov N.I., Oshchepkov D.Y., Makeev V.J., Hodgman T., Merkulova T.I. Application of experimentally verified transcription factor binding sites models for computational analysis of ChIP-Seq data. BMC Genomics. 2014; 15:80.
17. Machanick P., Bailey T.L. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics. 2011; 27:1696–1697.
18. Pavesi G., Mereghetti P., Mauri G., Pesole G. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 2004; 32(Suppl. 2):W199–W203.
19. Zambelli F., Pesole G., Pavesi G. Using Weeder, Pscan, and PscanChIP for the discovery of enriched transcription factor binding site motifs in nucleotide sequences. Curr. Protoc. Bioinformatics. 2014; 47:2–11.
20. Narlikar L. Multiple novel promoter-architectures revealed by decoding the hidden heterogeneity within the genome. Nucleic Acids Res. 2014; 42:12388–12403.
21. Mitra S., Narlikar L. No Promoter Left Behind (NPLB): learn de novo promoter architectures from genome-wide transcription start sites. Bioinformatics. 2016; 32:779–781.
22. Hubert L., Arabie P. Comparing partitions. J. Classification. 1985; 2:193–218.
23. Thomas-Chollier M., Herrmann C., Defrance M., Sand O., Thieffry D., van Helden J. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res. 2012; 40:e31.
24. Landt S.G., Marinov G.K., Kundaje A., Kheradpour P., Pauli F., Batzoglou S., Bernstein B.E., Bickel P., Brown J.B., Cayting P. et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012; 22:1813–1831.
25. Karolchik D., Baertsch R., Diekhans M., Furey T.S., Hinrichs A., Lu Y., Roskin K.M., Schwartz M., Sugnet C.W., Thomas D.J. et al. The UCSC genome browser database. Nucleic Acids Res. 2003; 31:51–54.
26. Hubisz M.J., Pollard K.S., Siepel A. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief. Bioinformatics. 2010; 12:41–51.
27. Sandelin A., Alkema W., Engström P., Wasserman W.W., Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004; 32(Suppl. 1):D91–D94.
28. Mathelier A., Fornes O., Arenillas D.J., Chen C.-y., Denay G., Lee J., Shi W., Shyr C., Tan G., Worsley-Hunt R. et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2016; 44:D110–D115.
29. Kulakovskiy I.V., Medvedeva Y.A., Schaefer U., Kasianov A.S., Vorontsov I.E., Bajic V.B., Makeev V.J. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2012; 41:D195–D202.
30. Wang J., Zhuang J., Iyer S., Lin X.-Y., Greven M.C., Kim B.-H., Moore J., Pierce B.G., Dong X., Virgil D. et al. Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res. 2013; 41:D171–D176.
31. Hunt R.W., Wasserman W.W. Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets. Genome Biol. 2014; 15:412.
32. Otto S.J., McCorkle S.R., Hover J., Conaco C., Han J.-J., Impey S., Yochum G.S., Dunn J.J., Goodman R.H., Mandel G. A new binding motif for the transcriptional repressor REST uncovers large gene networks devoted to neuronal functions. J. Neurosci. 2007; 27:6729–6739.
33. Grant C.E., Bailey T.L., Noble W.S. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011; 27:1017–1018.
Author notes
Present address: Snehal V. Sambare. Bioinformatics and Computational Biology program, George Mason University, Manassas, VA 20110, USA.
© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
CRISPR/Cas9-mediated modulation of splicing efficiency reveals short splicing isoform of Xist RNA is sufficient to induce X-chromosome inactivation
Yue, Minghui; Ogawa, Yuya
doi: 10.1093/nar/gkx1227 pmid: 29237010
Abstract
Alternative splicing of mRNA precursors results in multiple protein variants from a single gene and is critical for diverse cellular processes and development. Xist encodes a long noncoding RNA that plays a central role in inducing X-chromosome inactivation in female mammals and has two major splicing variants: long and short isoforms of Xist RNA. Although differentiation-specific and female-specific expression of Xist isoforms has been reported, the functional role of each Xist RNA isoform is largely unexplored. Using CRISPR/Cas9-mediated targeted modification of the 5′ splice site in Xist intron 7, we create mutant female ES cell lines which dominantly express the long- or short-splicing isoform of Xist RNA from the inactive X-chromosome (Xi) upon differentiation. Successful execution of CRISPR/Cas-based splicing modulation indicates that our CRISPR/Cas-based targeted modification of splicing sites is a useful approach to study specific isoforms of a transcript generated by alternative splicing. Upon differentiation of splicing-mutant Xist female ES cells, we find that both long and short Xist isoforms can induce X-chromosome inactivation normally during ES cell differentiation, suggesting that the short splicing isoform of Xist RNA is sufficient to induce X-chromosome inactivation.
INTRODUCTION
Alternative splicing of mRNA precursors is widespread in multicellular eukaryotes, especially in higher vertebrates (1,2). In multicellular eukaryotes, alternative splicing is more common than in unicellular eukaryotes, in which most genes are intron-less or have very short introns and alternative splicing is rarely found. The total number of genes is not radically different between vertebrates and invertebrates, but the number of alternatively spliced genes and the number of variants are higher in vertebrates, suggesting that alternative splicing could be related to the complexity of species.
For example, in humans, ∼98% of multi-exon genes undergo alternative splicing (3). Significant expansion of the proteome generated through alternative splicing from limited numbers of genes provides diverse regulatory functions for proteins, such as tissue-specific and developmental stage-specific functions (4). Xist encodes a long noncoding RNA and is required for X chromosome inactivation (XCI), by which one of the two X-chromosomes is transcriptionally silenced in female mammals (5–8). During XCI, Xist RNA highly expressed from the inactive X-chromosome (Xi) recruits various chromatin modifying enzymes to the Xi and induces chromosome-wide epigenetic modifications (9,10). Disruption of Xist expression results in failure of female embryonic development or induction of cancer in females (11,12), indicating the critical role for Xist throughout the female life cycle. Xist is transcribed into a variety of different isoform transcripts through differentiation-specific transcription start sites (13), alternative polyadenylation sites (14,15), and alternative splicing (16). Although there are various isoforms of Xist RNA, the specific functions of each remain unexplored. A previous report using tetracycline-inducible Xist mutant transgenes integrated into the X-linked Hprt locus in male ES cells demonstrated that repeat A, located at the 5′-end of Xist RNA, is essential for X-linked gene silencing, and that functionally redundant elements for Xist RNA localization are dispersed across the rest of the Xist region (17). In this assay, an Xist mutant transgene lacking the 3′-half of Xist RNA including exon 7 still exhibits normal Xist RNA localization and induction of X-linked gene silencing. Using the Xist transgene assay, however, the role of Xist elements for XCI can be addressed only at the early stage of XCI, since inactivation of the single male X-chromosome leads to cell death.
Thus, the role of the 3′-half region of Xist RNA including exon 7 in XCI has been overlooked until recently. Several papers using Xist mutant female cells have shown that the critical Xist elements/regions for XCI reside across Xist RNA (18–22). Our recent study demonstrated that exon 7 of the long splicing isoform of Xist RNA is essential for stable Xist RNA localization on the Xi and harbors one of the two major binding regions for heterogeneous nuclear ribonucleoprotein U (hnRNP U) protein, which is required for anchoring Xist RNA to the Xi (20,23). The short splicing isoform of Xist RNA, which loses a large part of exon 7 present in the long splicing isoform of Xist RNA, is reported as a female-specific isoform of Xist RNA (16). Since the short splicing isoform of Xist RNA loses one of the two major hnRNP U binding regions present in exon 7 of the long splicing Xist RNA isoform, we sought to address whether the short splicing Xist isoform is capable of inducing XCI. To investigate the function of specific splicing isoforms of Xist RNA, modification of the 5′ and 3′ splice sites or deletion of the intron is one potential approach. Modulation of splicing efficiency by altering the consensus sequence for splicing at the 5′ and 3′ splice sites may result in dominant expression of a specific isoform of transcripts without disturbing the majority of the genomic sequence. Instead of traditional gene targeting, the CRISPR (clustered regularly interspaced short palindromic repeat)/Cas (CRISPR-associated) system derived from microbes has been widely used as a tool to modify genomic DNA sequence, because it is easy to design and efficient for genome editing in a variety of systems (24,25). Here we show an efficient strategy to modulate Xist intron 7 splicing using the CRISPR/Cas9 system. We targeted Xist intron 7 in mouse female ES cells to generate cell lines dominantly expressing the short- or long-splicing isoform of Xist RNA.
To our surprise, the dominant expression of the short splicing Xist isoform induced by mutations at the 5′ splice site affected neither Xist RNA localization nor X-linked gene silencing. We discuss the implications of these results for understanding the potential role of the short splicing isoform of Xist RNA in XCI.
MATERIALS AND METHODS
Cell culture
Wildtype J1 male and 16.7 female ES cells and their derivatives, Tsix-truncated J1 male and 16.7 female ES cells, have been described previously (20,26–28). Tsix-truncation mutant female ES cells expressing FLAG-HA-hnRNP U from the endogenous hnRNP U loci have been described previously (20). ES cells were cultured in ES media with 2i inhibitors, and differentiation was induced by the embryoid body procedure (20). To generate MEF, a female Mus musculus 129S1/SvlmJ mouse (Jackson Laboratory) was crossed with a male M. musculus castaneus mouse (CAST/EiJ, Jackson Laboratory). MEF cells were established from E12.5 mouse embryos of the resulting progeny.
ES cell targeting with CRISPR/Cas9 system
For Xist intron 7 targeting, synthetic forward and reverse primers were annealed and inserted into the modified pX459ver2 plasmid with the sgRNA(F+E) mutation at BbsI (21,29,30). ssODNs for Xist splicing-enhanced or -repressed were used for the homology directed repair (HDR)-mediated modification for SXT or LXT Xist splicing mutations, respectively. Primer information is listed in Supplementary Table S1. For Tsix intron 2 targeting, SpCas9 in the modified pX459ver2 plasmid was replaced by high specificity eSpCas9(1.1) (31), yielding pX459Me. In addition, we introduced VQR mutations to eSpCas9(1.1) for NGA PAM recognition (32). A synthetic tRNA promoter primer pair for sgRNA without 5′-G to replace the U6 promoter (33) and a primer pair for sgRNA were annealed and inserted into the pX459Me(+VQR) plasmid at the PciI-BbsI site. ssODNs STsT were used for HDR-mediated modification to enhance incorporation of Tsix exon 3.
Transfection and selection for CRISPR/Cas9-mediated genome editing using ssODN were described previously (20). Genomic PCR primers (X7) used to identify mutant clones in Figure 2C and Supplementary Figure S3C are: Xist, Xist-AI7-SD-F and Xist-AI7-SD-R; and Tsix, Tex2-F1 and Tint3-R1, respectively.
Generation of female ES cells expressing long or short splicing Xist isoform by gene targeting
A bacterial recombineering system was used to construct the vector for targeting the 5′ and 3′ splice sites of Xist intron 7 (34). For targeting the 5′ splice site of Xist intron 7, the left- and right-arm adaptors for bacterial homologous recombination were annealed and inserted into pBS-2xLoxP-Zeo at XhoI/HindIII and EcoRI/NotI sites (20), respectively. For targeting the 3′ splice site of Xist intron 7, the left- and right-arm adaptors were inserted into pBS-2xLoxP-Zeo, at XhoI/HindIII and EcoRI/NotI sites, respectively. BstZ17I/SpeI splice acceptor (SA)-internal ribosome entry site (Ires)-puromycin resistance gene (Puro)-truncated thymidine kinase (ΔTK)-tandem poly(A) signals (tpA) cassette from pGEM-SAIres-PuroΔTK-tpA or SA-Ires-hygromycin resistance gene (Hyg)-ΔTK-tpA cassette from pGEM-SAIres-HygΔTK-tpA was inserted at PmlI/NheI between two LoxP sites, yielding the targeting vector for bacterial recombination for the 5′ or 3′ splice sites of Xist intron 7, respectively. pBS-sx16delL containing a 9.8 kb fragment from Xist exon 1 to exon 7 (chrX: 103 464 270–103 474 034 in GRCm38/mm10, UCSC genome browser) was transfected into the SW106 strain and targeted by the XhoI-NotI fragment of the bacterial targeting vector, generating the 5′ splice site targeting vector for mouse ES cells. pBS-sx16delL2 containing an 8.4 kb fragment from Xist exon 7 to the downstream region (chrX: 103 457 842–103 466 223 in GRCm38/mm10, UCSC genome browser) and the XhoI-NotI fragment of the bacterial targeting vector were used for generating the 3′ splice site mouse targeting vector.
The Xist intron 7 targeting vectors linearized by SalI were used for transfection of mouse ES cells by electroporation. For transfection of the targeting construct into mouse ES cells, 2 × 10⁷ ES cells and 40 μg of linearized mouse targeting vector were used for electroporation using the Bio-Rad Gene Pulser Xcell and a 0.4 cm gap cuvette at the 240 V, 500 μF setting. The transfected mouse ES cells were cultured without selection drug for 24 hours and then cultured with 250 μg/ml Hygromycin or 2 μg/ml Puromycin for 7–8 days. To remove the selection marker by Cre-loxP recombination, cells were transfected with Cre-expression vectors and were selected with 0.2 μM Fialuridine (FIAU, Sigma-Aldrich) for 7–8 days to isolate TK-negative colonies. Isolated individual ES colonies were screened by genomic PCR using primer sets (the 5′ splice site targeting, 7.2TST5-F and UptpA-F for 5′-end, SA-R, and 7.2TSTIn-R for 3′-end; 3′ splice site targeting, 7.2TSTIn-F and UptpA-F for 5′-end, SA-R and 7.2TST3-R for 3′-end; Tsix targeting, TST-F and SA-R for 5′-end, TST-R and UptpA-F for 3′-end) as described in Supplementary Figure S2.
Reverse transcription PCR (RT-PCR)
RT-PCR was performed according to Yamada et al. (20). For reverse transcription and quantitative real time PCR (qRT-PCR) analysis of each Xist splicing mutant, two independent mutant cell lines were used. The qRT-PCR data are averages of three independent differentiation experiments for each mutant cell line. Briefly, total RNA was extracted using RNAzol RT (Molecular Research Center) as per the manufacturer's instructions. DNA contamination was eliminated from 2 μg of total RNA using 0.5 units TURBO DNase (Invitrogen) at 37°C for 1 hour. 0.4 μg RNA was used for cDNA synthesis with gene-specific reverse primers using Maxima H Minus Reverse Transcriptase (Thermo Scientific) at 50°C for 30 min according to the manufacturer's instructions.
As a minus RT control, a reaction without Maxima H Minus Reverse Transcriptase was performed in parallel to ensure the specificity of the qRT-PCR. cDNA was diluted 5-fold with H2O and stored at –20°C. Real-time PCR reactions were performed by adding 0.5 μl cDNA and 0.5 μM forward and reverse primers to Fast SYBR Green Master Mix in a StepOnePlus Real-Time PCR System (Applied Biosystems). All real-time PCR reactions were run in triplicate under the following conditions: 1 min at 95°C (1×), followed by 15 s at 95°C and 30 s at 58–62°C (see Supplementary Table S1) (40×). A melt curve from 60 to 95°C was run to ensure that only one specific product was amplified. Primer pairs used in qRT-PCR were previously described (20), except for the forward primer used to detect the short Xist isoform (Supplementary Table S1): 129 allele-specific Xist (X1–3), Xist129-E1–3-F and Xist129-E1–3-R; 129 allele-specific Pgk1, Pgk1129-F and Pgk1129-R; 129 allele-specific Mecp2, Mecp2129-F and Mecp2129-R; Gapdh, Gapdh-F and Gapdh-R; long Xist isoform (XL), XiI7LRT-F and XiI7LRT-R; short Xist isoform (XS), XiI7SRT-F2 and XiI7SRT-R. For RT-PCR analysis of the Tsix splicing mutant, PCR reactions were performed using 1 μl cDNA and 0.5 μM forward and reverse primers in a total of 20 μl Maxima Hot Start PCR Master Mix (Thermo Scientific). PCR was performed under the following conditions: 4 min at 95°C (1×), followed by 30 s at 95°C, 30 s at 58°C, and either 20 s at 72°C (28×) for Tsix or 10 s at 72°C (24×) for Gapdh. Primer pairs used in PCR: Tsix exons 2–4, Tex2-F1 and Tex4-R1; Gapdh, Gapdh-F and Gapdh-R.

Immuno-FISH and immunofluorescence

Immuno-FISH was performed as described (21,35). Antibodies used in this study are: anti-H3K27me3 (MABI #0323, 1:1000 dilution and Cell Signaling #9733, 1:1000), anti-H2AK119Ub (Cell Signaling #8240, 1:1000), anti-H4K20me1 (Active motif #39727, 1:5000), anti-ASH2L (Bethyl Laboratories A300-107A, 1:400).
UV-crosslinking RNA immunoprecipitation (RIP) analysis

UV-crosslinking RIP was performed following a previously described procedure (20). qRT-PCR was performed using the primers listed in Supplementary Table S1.

RESULTS

Short splicing isoform of Xist RNA is expressed in both male and female ES cells

Compared with the long splicing isoform of Xist RNA, the short splicing isoform lacks a 5668-nt fragment located downstream of repeat E at the 5′-end of Xist exon 7 (Figure 1A). Although a previous report showed that the short splicing isoform of Xist RNA is transcribed in female cells (16), the ratio of short to long splicing Xist RNA isoforms in undifferentiated embryonic stem (ES) cells and differentiated female cells remains unclear. To determine the short/long Xist isoform ratio, we designed short and long splicing isoform-specific primer pairs for quantitative RT-PCR (qRT-PCR) (XS and XL in Figure 1A). The male and female ES cells used in this work contain a mutation in Tsix, a Xist antagonist, which results in higher Xist expression in undifferentiated male and female ES cells and leads to non-random X-inactivation in female cells upon induction of X-inactivation (20,26). As previously reported, Xist exhibited sex-specific dynamic expression patterns during X-chromosome inactivation (Figure 1B). In undifferentiated ES cells, a low level of the long Xist RNA isoform was transcribed in both male and female cells. Upon embryoid body (EB) differentiation, robust Xist upregulation was induced in female cells, whereas lower Xist expression was maintained in male cells. Finally, whereas Xist expression was extinguished in male mouse embryonic fibroblast (MEF) cells, robust Xist expression was maintained in female MEF cells. Surprisingly, the short splicing isoform of Xist RNA was detected not only in female ES and EB cells but also in male ES and EB cells (Figure 1B).

Figure 1.
Expression of short and long splicing isoforms of Xist RNA in male and female cells. (A) Schematics of the short and long splicing isoforms of Xist. White and black boxes indicate Xist exons and repeat E, respectively. The positions of primer pairs are shown as asterisks: X1–3, total Xist expression; XL and XS, long and short splicing isoform of Xist expression, respectively. Two major hnRNP U binding regions in Xist RNA are shown by orange gradient lines. (B) RT-PCR of short and long splicing isoform Xist expression in Tsix mutant ES cells and embryoid bodies (EB), and wild-type mouse embryonic fibroblast (MEF) cells derived from Mus musculus 129S1/SvlmJ and Mus musculus castaneus (CAST/EiJ) mating. M, male; F, female. (C) Standard curve for short and long Xist splicing isoform-specific primer pairs, XL and XS shown in (A). X-axis, relative amount of serial dilutions of input genomic DNA (100 ng to 100 pg) extracted from heterozygous Xist intron 7 deletion mutant female ES cells; y-axis, the Ct value measured by real-time PCR. (D) Ratio of short and long Xist splicing isoforms expressed in male and female ES, EB and MEF cells. The mean ± SD from three independent experiments is shown.

To determine the relative expression level of short and long splicing isoforms of Xist, we first determined the amplification efficiencies of the short and long splicing isoform-specific primer pairs (Figure 1C). Ten-fold serial dilutions of genomic DNA extracted from the heterozygous Xist intron 7-deletion mutant female ES cells were used as templates for quantitative real-time PCR analysis (see Supplementary Figure S2). This Xist mutant female ES cell line has one wild-type X chromosome with full-length Xist, while the other, mutant X chromosome carries a deletion of the 5.7-kb region corresponding to Xist intron 7 of the short splicing isoform. The standard curves generated from the real-time PCR data showed that both the short and long splicing isoform-specific primer pairs amplified the target region with high efficiency (91% and 93%, respectively). We then examined the ratio of short and long splicing isoforms of Xist RNA expressed in different cell types (Figure 1D). Compared with the long splicing Xist isoform, the short splicing Xist RNA isoform was expressed at a lower level in all cell types tested except for male MEF cells, which do not express Xist. Interestingly, male cells showed a higher short Xist isoform ratio than female cells. The ratio of the short to the long splicing Xist isoform was approximately 0.30 in undifferentiated male ES cells and 0.20 in differentiating male EB cells.
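The amplification efficiencies and isoform ratios reported in this section can be related to raw Ct values by a short calculation. The sketch below is an illustration only and is not the analysis code used in this study: it assumes that efficiency is derived from the standard-curve slope as E = 10^(-1/slope) - 1, and that the short/long ratio is corrected for primer efficiency as (1 + E_long)^Ct_long / (1 + E_short)^Ct_short, which further assumes a common detection threshold for both primer pairs. The function names and all Ct values are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept) of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def amplification_efficiency(template_amounts, ct_values):
    """Efficiency from a standard curve of Ct against log10(template amount).

    A slope of about -3.32 cycles per ten-fold dilution corresponds to
    perfect doubling per cycle (efficiency 1.0, i.e. 100%).
    """
    log_amounts = [math.log10(a) for a in template_amounts]
    slope, _ = fit_line(log_amounts, ct_values)
    return 10 ** (-1.0 / slope) - 1.0


def isoform_ratio(ct_short, ct_long, eff_short, eff_long):
    """Efficiency-corrected short/long ratio (assumes a shared threshold)."""
    return (1.0 + eff_long) ** ct_long / (1.0 + eff_short) ** ct_short


# Hypothetical ten-fold dilution series (ng of genomic DNA) and Ct values.
amounts_ng = [100.0, 10.0, 1.0, 0.1]
cts = [20.0, 23.5, 27.0, 30.5]
print(f"efficiency: {amplification_efficiency(amounts_ng, cts):.3f}")
print(f"short/long ratio: {isoform_ratio(27.0, 25.0, 1.0, 1.0):.2f}")
```

In this made-up standard curve the slope is -3.5 cycles per ten-fold dilution, which corresponds to an efficiency of approximately 0.93. With equal efficiencies, a short-isoform Ct two cycles later than the long-isoform Ct gives a short/long ratio of 0.25.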
On the other hand, in all types of female cells, the ratio of the short splicing Xist RNA isoform was consistent at ∼0.05.

CRISPR/Cas9-based modulation of splicing efficiency for Xist intron 7

Given that the short isoform of Xist RNA lacks the hnRNP U binding region that is present in exon 7 of the long splicing Xist isoform (Figure 1A) (20), we next sought to address whether the short isoform of Xist RNA can induce XCI, that is, stable localization of Xist RNA and induction of X-linked gene silencing on the Xi. Although it remains difficult to predict alternative splicing efficiency from the DNA sequences at the 5′ and 3′ splice sites, it is worth noting that the 5′ and 3′ splice sites of alternatively spliced introns often differ from the consensus sequences of constitutive splicing (2). Thus, the 5′ and 3′ splice sites could be primary targets for modifications that enhance or repress the efficiency of alternative splicing. We compared the nucleotide sequences of the 5′ and 3′ splice sites of Xist intron 7 with the consensus sequences of constitutive splice sites (36,37), as well as with those of the other Xist introns, to determine whether modification of the 5′ and 3′ splice sites could alter the splicing efficiency of Xist intron 7 (Figure 2A) (38). The invariant sequences at the 5′- and 3′-ends of the intron, GT and AG, respectively, are conserved in all Xist introns (Figure 2A). Three nucleotides in the Xist intron 7 splice sites differ from the constitutive splicing consensus sequences: positions –1 and +4 of the 5′ splice site, and position +1 of the 3′ splice site. Considering that the presence of T at position +4 of the 5′ splice site of Xist intron 2 and of T at position +1 of the 3′ splice site of introns 5 and 6 does not impair their splicing efficiency, the C at position –1 of the 5′ splice site of Xist intron 7 could be a major reason for its weaker splicing efficiency.
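The comparison with the consensus can be made explicit with a small position-by-position check. The sketch below is illustrative only: it assumes the commonly cited U2-type 5′ splice site consensus MAG|GTRAGT (M = A or C, R = A or G), and the example site is a hypothetical placeholder chosen to carry the two 5′ splice site deviations described above (C at –1 and T at +4). The actual flanking sequence of Xist intron 7 is shown in Figure 2A and is not reproduced here; all function names are likewise hypothetical.

```python
# IUPAC nucleotide codes needed for the consensus used in this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "N": "ACGT",
}


def splice_positions(n_exon, n_intron):
    """Position labels around a 5' splice site.

    Exon bases are numbered -n_exon..-1 and intron bases +1..+n_intron;
    by convention there is no position 0.
    """
    return list(range(-n_exon, 0)) + list(range(1, n_intron + 1))


def deviations(site_exon, site_intron, cons_exon, cons_intron):
    """List (position, observed base, consensus code) for every mismatch."""
    if len(site_exon) != len(cons_exon) or len(site_intron) != len(cons_intron):
        raise ValueError("site and consensus must have the same lengths")
    positions = splice_positions(len(site_exon), len(site_intron))
    site = (site_exon + site_intron).upper()
    cons = (cons_exon + cons_intron).upper()
    return [
        (pos, base, code)
        for pos, base, code in zip(positions, site, cons)
        if base not in IUPAC[code]
    ]


# Assumed consensus: exon MAG, intron GTRAGT.
CONS_EXON, CONS_INTRON = "MAG", "GTRAGT"

# Hypothetical placeholder site (exon upper case, intron lower case).
print(deviations("CAC", "gtatgt", CONS_EXON, CONS_INTRON))

# The same placeholder after changing -1 to G and +4 to A matches the consensus.
print(deviations("CAG", "gtaagt", CONS_EXON, CONS_INTRON))
```

Numbering the positions explicitly avoids off-by-one errors around the exon-intron junction, where the coordinate jumps from –1 directly to +1.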
Thus, we chose the C at position –1 of the 5′ splice site of Xist intron 7 as the primary target for modifications to modulate splicing efficiency. Optimization of the 5′ splice site of Xist intron 7 could improve the splicing efficiency of Xist intron 7 and induce higher expression of the short Xist RNA isoform. In contrast, disruption of the invariant sequences at the 5′-end of Xist intron 7 is likely to abolish splicing of Xist intron 7 and result in exclusive expression of the long splicing Xist RNA isoform.

Figure 2. Short and long Xist isoform-specific targeting with CRISPR/Cas9. (A) Alignment of all 5′ and 3′ splice sites in Xist with a consensus splicing sequence of a major U2 class intron (36,37). r, adenine (a) or guanine (g); y, cytosine (c) or thymine (t). The exon and intron nucleotide sequences are shown in upper and lower case, respectively. Arrowheads indicate cleavage sites by splicing. (B) Map of the Xist/Tsix locus showing the strategy of isoform-specific targeting and the Tsix truncation. The asterisk indicates the position of the primer pair (X7) used in (C). The sgRNA sequence and adjacent protospacer-adjacent motif (PAM) for CRISPR/Cas9 genome editing are shown as green and orange boxes, respectively. The mutations introduced by CRISPR/Cas9 are labeled in red. The arrow indicates the double-strand break site. The HindIII site used to confirm CRISPR/Cas9-based targeted HDR is highlighted in blue. SA, splicing acceptor; IRES, internal ribosome entry site; Hyg, hygromycin resistance gene; tpA, tandem polyadenylation signal. (C) Genomic PCR followed by HindIII digestion to confirm the CRISPR/Cas9 targeting. (D) qRT-PCR of short and long Xist expression in ES cells to confirm the alteration of splicing efficiency in isoform-specific targeting cell lines. Gapdh was used as an internal control for normalization. The mean ± SD from three independent experiments is shown.
P-values were calculated relative to the TST control by an unpaired t-test (*P < 0.05, **P < 0.01, ***P < 0.001).

To modify the 5′ splice site of Xist intron 7 and thereby modulate its splicing efficiency, we used the CRISPR/Cas9-mediated homology-directed repair (HDR) system to introduce mutations using single-stranded oligodeoxynucleotides (ssODNs) as repair templates (Figure 2B) (39).
For CRISPR/Cas9-mediated modification at the 5′ splice site of Xist intron 7, we chose a 20-bp sequence spanning positions +5 to –15 of the 5′ splice site of Xist intron 7 as the sgRNA sequence. To improve the splicing efficiency of Xist intron 7 and thereby obtain dominant expression of the short splicing Xist isoform (termed SXT, short Xist transcript), a C-to-G transversion at position –1 of the 5′ splice site of Xist intron 7 was introduced together with a T-to-A change at position +4. In contrast, to abolish Xist intron 7 splicing for exclusive expression of the long splicing Xist isoform (termed LXT, long Xist transcript), we used an ssODN with mutations in the invariant sequences at positions +1 and +2 of the 5′ splice site. In addition, we introduced a HindIII site (positions –16 to –11 of the 5′ splice site) to easily identify the Xist splicing mutant ES clones. A Tsix-truncated 16.7 female ES cell line (TST) was used as the parental cell line to target Xist intron 7 in this study. The 16.7 ES cell line has one X chromosome from Mus musculus 129Sv/J (129) and the other from M. musculus castaneus (Cast). In the TST cell line, the 129 X with the Tsix mutation always becomes the Xi upon differentiation due to disruption of Tsix, a negative regulator of Xist (27,40,41), which allows allele-specific analysis of X-linked gene expression. By screening using genomic PCR followed by HindIII digestion, several homozygous Xist splicing mutant cell lines were isolated (Figure 2C). We also confirmed the alteration of the sequences around the 5′ splice site of Xist intron 7 in each cell line by Sanger sequencing (Supplementary Figure S1). To confirm whether the mutations at the 5′ splice site of Xist intron 7 alter the splicing efficiency in SXT and LXT Xist splicing mutant female ES cells, we analyzed Xist expression by qRT-PCR using short and long isoform-specific primer pairs (Figure 2D).
As we predicted, while the expression level of the long Xist RNA isoform in LXT mutant cell lines was comparable with that in control TST cells, expression of the short Xist isoform was further repressed in LXT mutant cell lines as compared to control TST ES cells. By contrast, in SXT female ES cell lines, the expression level of the long Xist isoform was significantly reduced to approximately 10% of that in control TST cells, whereas the expression level of the short Xist RNA isoform was more than 10 times higher than in TST cells. These data indicate that SXT and LXT mutations introduced at the 5′ splice site of Xist intron 7 by CRISPR/Cas9-mediated HDR can efficiently modulate the splicing efficiency of Xist intron 7. We also created female ES cell lines dominantly expressing the long or short splicing isoform of Xist from the 129 X chromosome by traditional gene targeting (Supplementary Figure S2). Using successive gene targeting followed by Cre-loxP recombination, we replaced the 5′ and 3′ splice sites of Xist intron 7 with loxP, yielding female ES cells expressing the long splicing Xist isoform with 2 loxP sites. In addition, we obtained female ES cell lines expressing the short splicing Xist isoform with 1 loxP site. Finally, we disrupted Tsix to induce non-random XCI from the 129 mutant Xist allele. Compared with the CRISPR/Cas9-based approach, the strategy using traditional gene targeting to generate short splicing Xist isoform-expressing ES cells is associated with a large deletion of genomic sequence from Xist. Another disadvantage of traditional gene targeting for our Xist mutant ES cells is that the series of gene targeting and Cre-loxP recombination steps took more than 4 months, compared with approximately one month for the CRISPR/Cas9-based approach.
Since our Xist mutant female ES cell lines created by gene targeting exhibited a phenotype similar to those created by the CRISPR/Cas9-based approach in Xist expression and X-linked silencing upon differentiation, results obtained from Xist mutant ES cell lines generated by traditional gene targeting are shown in the Supplementary data (Supplementary Figures S2 and S4). This CRISPR/Cas-mediated modulation can also be used to modulate other kinds of alternative splicing events. As an example, we modified the efficiency of Tsix exon 3 skipping (Supplementary Figure S3). Optimization of the 3′ splice site in Tsix intron 2 significantly enhanced the efficiency of exon 3 incorporation in the Tsix transcript.

Xist upregulation occurs normally in SXT and LXT female cells

We next examined the effects of SXT and LXT mutations on Xist upregulation when we differentiated Xist splicing mutant female cells to induce non-random XCI due to the Tsix mutation. To measure the total Xist expression level from the 129 Xi, we performed 129 mutant allele-specific qRT-PCR analysis using the primer pair (X1–3) that spans introns (Figures 1A and 3A and Supplementary Figure S4A). The expression levels of total Xist RNA from the 129 Xi gradually increased during ES cell differentiation in all cell lines. No significant difference in Xist upregulation was observed upon differentiation between control TST cells and SXT or LXT mutant cells, although the Xist RNA level in SXT mutant cells was slightly lower than that of control TST and LXT mutant cells at day 12 of differentiation. These data suggest that SXT and LXT mutant female ES cell lines can induce Xist upregulation even though several nucleotides are replaced at the 5′ splice site of Xist intron 7 by CRISPR/Cas9-mediated HDR.

Figure 3. Xist upregulation in isoform-specific targeting cells during ES cell differentiation.
(A) 129 Xi allele-specific qRT-PCR of Xist expression across exons 1–3. (B) Ratio of the short splicing isoform of Xist RNA to the long Xist isoform in control TST and Xist splicing mutant cells upon differentiation. Xist splicing isoform-specific primer pairs (XL and XS in Figure 1A) were used for qRT-PCR. (C, D) Expression of the long (C) and short (D) splicing isoforms of Xist RNA measured by qRT-PCR with Xist splicing isoform-specific primer pairs (XL and XS). The expression values in (A), (C), and (D) were normalized to Gapdh and to the value in undifferentiated control TST cells, which was set to 1. The mean ± SD from three independent experiments is shown. P-values were calculated relative to the TST control at the same day of differentiation by an unpaired t-test (*P < 0.05, **P < 0.01, ***P < 0.001).

Next, we used qRT-PCR with splicing isoform-specific primer pairs (XL and XS in Figure 1) to examine whether the dominant expression of the short and long splicing isoforms of Xist RNA in SXT and LXT mutant ES cells (Figure 2D), respectively, is maintained during ES cell differentiation.
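The normalization described in the legend of Figure 3 (Gapdh as internal control, undifferentiated control TST cells set to 1) is commonly implemented as a comparative Ct calculation. The text does not give the exact formula used, so the following is only a minimal sketch that assumes the standard 2^(-ΔΔCt) approach with the same amplification efficiency for target and reference; the function names and all Ct values are hypothetical.

```python
import statistics


def relative_expression(ct_target, ct_ref, ct_target_cal, ct_ref_cal, efficiency=1.0):
    """Comparative Ct estimate of expression relative to a calibrator sample.

    ct_target / ct_ref:         Ct of the target and reference gene in the sample
    ct_target_cal / ct_ref_cal: the same two Ct values in the calibrator
    efficiency:                 assumed amplification efficiency (1.0 = doubling)
    """
    delta_ct_sample = ct_target - ct_ref
    delta_ct_calibrator = ct_target_cal - ct_ref_cal
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return (1.0 + efficiency) ** (-delta_delta_ct)


def mean_sd(values):
    """Mean and sample standard deviation across independent experiments."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical Ct values: (target, reference) in the calibrator, then in
# three independent differentiation experiments.
calibrator = (25.0, 18.0)
experiments = [(22.1, 18.0), (21.9, 18.1), (22.0, 17.9)]

fold_changes = [
    relative_expression(ct_t, ct_r, calibrator[0], calibrator[1])
    for ct_t, ct_r in experiments
]
mean, sd = mean_sd(fold_changes)
print(f"relative expression: {mean:.2f} ± {sd:.2f}")
```

By construction the calibrator itself evaluates to 1, which matches the convention of setting the undifferentiated control to 1. If the primer pairs differ noticeably in efficiency, an efficiency-corrected formula would be more appropriate than this equal-efficiency assumption.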
We determined the ratio of the short splicing isoform to the long splicing isoform of Xist RNA based on qRT-PCR data using Xist splicing isoform-specific primer pairs (Figure 3B and Supplementary Figure S4B). The proportion of the short Xist RNA isoform in control TST cells was constantly low (a short/long splicing Xist isoform ratio of approximately 0.05) during differentiation, consistent with the data in Figure 1D. Dominant expression of the short splicing isoform of Xist RNA in SXT Xist splicing mutant cells became more evident as differentiation progressed. At day 12 of differentiation, the short splicing isoform of Xist RNA was approximately 20-fold more abundant than the long Xist isoform (Figure 3B). Throughout ES cell differentiation, the relative expression level of the long Xist splicing isoform in SXT mutant cell lines was much lower than that in undifferentiated control TST cells (Figure 3C). In contrast, in LXT Xist splicing mutant cells, the short/long Xist isoform ratio was significantly lower (approximately 0.01 or less) than that in TST cells, which constantly exhibited a ratio of approximately 0.05 during ES cell differentiation (Figure 3B). During ES cell differentiation, LXT mutant cell lines consistently exhibited weaker expression of the short Xist isoform than undifferentiated control TST cells (Figure 3D). These data suggest that the CRISPR/Cas9-mediated SXT and LXT mutations at the 5′ splice site of Xist intron 7 result in dominant expression of the short and long splicing isoforms of Xist RNA, respectively.

Short Xist splicing isoform coats Xi and recruits repressive histone modifications to the Xi

One of the characteristic features of Xist RNA is its unique localization along the entire Xi to induce X-linked gene silencing (42). The nuclear scaffold protein hnRNP U is a critical factor for Xist RNA to associate with the Xi (23,43).
Since the short splicing isoform of Xist RNA lacks one of the two major hnRNP U binding regions, which lies in exon 7 of the long Xist splicing isoform (20), we examined whether the short isoform of Xist RNA can localize on the Xi and recruit histone-modifying enzymes to the Xi. We differentiated Xist splicing mutant cell lines, as well as control TST cells, and performed immunofluorescence against histone H3 trimethyl lysine 27 (H3K27me3) combined with fluorescence in situ hybridization (FISH) for Xist RNA (Figure 4). At the undifferentiated stage (day 0), neither Xist RNA clouds nor focal H3K27me3 signals were observed in the nuclei of SXT and LXT Xist mutant cells or control TST cells (Figure 4A and C). Upon differentiation, the percentage of cells with Xist RNA clouds and focal H3K27me3 enrichment on the Xi increased in all cell lines (Figure 4B and C). The percentage of Xist-positive and H3K27me3-positive cells was comparable between TST cells and Xist splicing mutant cells (SXT and LXT) at each stage during ES cell differentiation. Consistent with a previous report that Xist exon 7 is not required for PRC2 recruitment (20), these results demonstrate that Xist intron 7 is dispensable for PRC2 recruitment to the Xi. These data also indicate that the short splicing isoform of Xist RNA, like the long isoform, is able to localize on the Xi.

Figure 4. Accumulation of Xist RNA and H3K27me3 on the Xi by short and long splicing mutant Xist RNA. (A and B) Representative images of Immuno-FISH for Xist RNA (green) and H3K27me3 (red) in undifferentiated ES cells (A, day 0) and differentiating EB cells (B, day 8). Nuclei were counterstained with DAPI. Scale bar is 10 μm. (C) Frequency of Xist RNA cloud- and H3K27me3-positive cells upon differentiation. More than 200 nuclei were counted at each time point for each cell line in three independent experiments.

Since Xist RNA recruits various chromatin-modifying enzymes to the Xi to induce chromosome-wide gene silencing (9,10), we examined the deposition of other histone modifications on the Xi in SXT and LXT mutant female cells upon differentiation. We performed immunofluorescence for two epigenetic hallmarks of the Xi, ubiquityl-histone H2A (H2AK119Ub) and histone H4 monomethyl lysine 20 (H4K20me1) (Supplementary Figure S5). Similar to the staining pattern of H3K27me3, no focal signal of H2AK119Ub or H4K20me1 was evident in undifferentiated female ES cells. However, focal staining of these histone marks colocalized with H3K27me3 on the Xi upon differentiation. The staining pattern of these histone marks showed no significant difference between control cells and Xist splicing mutant cells during X-inactivation. Consistent with our recent study showing that Xist repeat E is required for ASH2L localization to the Xi (21), ASH2L recruitment to the Xi was normal in both SXT and LXT Xist splicing mutant cell lines, since both SXT and LXT mutant Xist RNAs retain Xist repeat E (Supplementary Figure S6). No significant difference was found between control and Xist splicing mutant cells at any stage. These data suggest that the short isoform of Xist RNA can recruit various histone-modifying enzymes and chromatin factors to establish unique heterochromatic features on the Xi.
Since the Xist intron 7 region contains one of the two hnRNP U binding regions present in the long splicing Xist isoform (Figure 5A) (20), we next addressed whether hnRNP U binding to Xist RNA is maintained in the short splicing Xist isoform by UV-crosslinking RNA immunoprecipitation (RIP). Using the FLAG-HA-hnRNP U TsixTST6 female ES cell line, which expresses FLAG-hnRNP U from endogenous loci and carries the Tsix truncation mutation on the 129 allele (20), we established SXT and LXT Xist mutant cell lines by the CRISPR/Cas9-based approach (Supplementary Figure S7). UV-crosslinking RIP using anti-FLAG antibody revealed that hnRNP U binding to Xist exon 1 in SXT Xist mutant cell lines, which dominantly express the short splicing Xist isoform, is comparable to that in control TST and LXT mutant cell lines (Figure 5B). This result indicates that the short splicing Xist isoform is able to bind hnRNP U through exon 1 as efficiently as the long splicing Xist isoform.

Figure 5. Interaction of short splicing Xist isoform with hnRNP U through exon 1. (A) Map of primer pairs across Xist for the RIP analysis. White boxes indicate Xist exons. The Xist repeats A-F are shown by black boxes. The positions of the primer pairs across Xist are shown as arrowheads. (B) UV-crosslinking RIP analysis using Xist splicing mutant female ES cell lines expressing FLAG-HA-tagged hnRNP U upon differentiation at day 9. The mean ± SD from two independent experiments is shown with unpaired t-test P-values (*P < 0.05).

Short Xist isoform is sufficient to induce X-linked gene silencing on the Xi

Since differentiated SXT mutant female cells robustly expressing the short isoform of Xist RNA exhibited normal Xist localization to the Xi and accumulation of various histone modifications on the Xi similar to control TST cells (Figures 3 and 4), we next examined whether the short splicing isoform of Xist RNA can normally induce X-linked gene silencing during XCI. We performed qRT-PCR analyses using 129 Xi allele-specific primer pairs for two X-linked genes, Pgk1 and Mecp2 (Figure 6 and Supplementary Figure S4C and S4D). Upon ES cell differentiation, the levels of the X-linked genes decreased significantly from day 0 to day 12 in control TST cells, as expected. In both SXT and LXT cell lines, expression levels of Pgk1 and Mecp2 from the 129 Xi allele gradually decreased, with no significant difference from control TST cells (Figure 6). Although silencing of Mecp2 slightly preceded Pgk1 silencing, both X-linked genes were efficiently repressed at day 12 to less than 8% of their levels in undifferentiated ES cells (day 0). These qRT-PCR data for X-linked genes indicate that both short and long Xist isoforms can induce X-linked gene silencing during XCI to an extent comparable to the control TST cell line. In sum, despite missing a large region of the long splicing Xist isoform, the short splicing isoform of Xist RNA functions as efficiently as the long splicing isoform in XCI.

Figure 6. X-linked gene silencing induced by Xist short and long isoform. qRT-PCR using 129 Xi allele-specific primer sets for two Xi-linked genes, Pgk1 and Mecp2. Gapdh was used as an internal control for normalization. Each value was also normalized to that of each undifferentiated cell line. The mean ± SD from three independent experiments is shown.
P-values were calculated by an unpaired t-test (*P<0.05).

DISCUSSION

In this study, we provide a CRISPR/Cas-based strategy to create cell lines expressing specific splicing isoforms of Xist RNA. The alternative splicing of Xist intron 7 belongs to the intron retention class. Although intron retention has been described as a consequence of mis-splicing, recent studies show that intron retention is involved in the regulation of protein-coding gene expression through nuclear retention, exosome degradation or nonsense-mediated mRNA decay (44,45). Due to the complexity of alternative splicing, in which many DNA elements (multiple 5′ and 3′ splice sites, splicing enhancers, splicing silencers etc.) are involved (46), it is still difficult to predict the efficiency of alternative splicing and thus to decide which targeted modification will affect it. However, in terms of the number of 5′ and 3′ splice sites involved, intron retention is the simplest alternative splicing class, as only one 5′ and one 3′ splice site are involved. Although the invariant sequences at the 5′ splice site (positions +1 and +2) of Xist intron 7 are conserved, we found that the G that is highly conserved at position –1 of the 5′ splice site in constitutive introns is replaced by C in Xist intron 7. We suspected that this nucleotide change causes less efficient splicing of Xist intron 7 and chose the 5′ splice site to enhance or repress the retention of Xist intron 7 by modulating splicing efficiency using the CRISPR/Cas-based strategy.
This approach provides a platform of genome editing to study functions of specific splicing isoforms of a transcript generated by intron retention, which is found in transcripts from approximately three quarters of multi-exonic genes in mice and humans (47,48). Untranslated regions and noncoding RNAs have retained introns more frequently than protein-coding regions. In addition, we also show that we can modulate the efficiency of exon-skipping type alternative splicing in Tsix by CRISPR/Cas-mediated intron targeting (Supplementary Figure S3). Our described method does not require complex construction of targeting vectors; as such, our method provides a useful approach to study functions of various alternatively spliced transcripts. Our RT-PCR analysis of the short splicing isoform of Xist RNA differs from a previous report (16). The previous report showed that the short splicing isoform of Xist RNA is exclusively expressed in female cells but not in male (16). In our qRT-PCR experiments (Figure 1), we detected short Xist isoform expression in both male and female cells, and the ratio of the short Xist isoform in male undifferentiated ES and differentiating EB cells to the long isoform is even higher than that in female cells, although the level of the short splicing isoform was lower than the long Xist splicing isoform. It is possible that culture conditions affect the alternative splicing pattern of Xist in male ES and EB cells. Alternatively, differences in relation to expression of the short splicing isoform of Xist RNA could result from the different male ES cell lines used in the previous report and our work. Although the short splicing Xist isoform is expressed in male ES and EB cells, Xist expression level is constantly low and is finally extinguished in fully differentiated MEFs. Thus, it is not likely that short splicing isoform of Xist RNA plays a crucial function in Xist RNA-mediated X-linked gene regulation in male cells. 
Despite its low expression level, the short Xist isoform is consistently expressed in all types of female cells (Figure 1D). Whereas exon 7 in the long Xist isoform contains one of the two major hnRNP U binding regions (20), the short splicing isoform of Xist RNA loses the hnRNP U binding region present in exon 7 of the long Xist isoform through intron 7 splicing (Figure 7). Since hnRNP U is a critical protein factor for Xist RNA localization on the Xi (23), we sought to examine whether the short Xist isoform can induce XCI using Xist splicing SXT mutant female ES cells expressing the short isoform of Xist. To our surprise, SXT mutant cell lines exhibit normal Xist RNA localization on the Xi and X-linked gene silencing (Figures 4 and 6). Since Xist exon 7 truncation in female ES cells results in unstable localization and compromised X-linked gene silencing upon differentiation (20), the 1,658-bp exon 7 (including the 1.3-kb Xist repeat E) and the 205-bp exon 8 in the short splicing isoform of Xist RNA are sufficient to maintain stable localization of Xist RNA on the Xi as efficiently as the full-length exon 7 in the long Xist isoform. Although Xist repeat E deletion does not significantly affect Xist RNA localization, dispersed Xist RNA localization was observed in a subset of Xist repeat E mutant female ES cells upon differentiation (21). Furthermore, a nuclear matrix binding protein CIZ1 binds to Xist RNA repeat E and is required for Xist RNA localization on the Xi in a tissue-specific manner (22,49). Thus, various protein factors such as hnRNP U and CIZ1 might have a redundant function through the Xist intron 7 region and repeat E to support Xist RNA association with the Xi. Figure 7. Summary of phenotypes in Xist exon 7 mutant female ES cell lines. White and black rectangles indicate Xist exon and repeat E elements, respectively. Red line shows polyadenylation signal inserted to truncate Xist exon 7 (20). 
Dotted rectangle indicates Xist repeat E deletion (21). Blue and red arrowheads show mutations at the 5′ splice site in Xist intron 7 to disrupt and enhance splicing of intron 7, respectively. hnRNP U binding regions are shown below Xist mutant maps. The short splicing isoform of Xist RNA functionally induces XCI in mouse female ES cells upon differentiation (Figures 4 and 6). However, the fact that the short isoform of Xist is maintained at a low expression level in all types of female cells might indicate that the short splicing isoform of Xist is not fully comparable to the long splicing isoform of Xist in a cell-type-dependent or tissue-specific manner. In addition to hnRNP U, variable factors are involved in Xist RNA targeting to the Xi as indicated by recent studies (22,43,49). With low expression of some redundantly acting factors for Xist RNA targeting to the Xi, the hnRNP U binding region present within exon 7 of the long Xist isoform might be more critical than in the differentiating female EB cells used in this study. Further studies are required to fully understand the role of short and long splicing isoforms of Xist during XCI using an in vivo mouse system. SUPPLEMENTARY DATA Supplementary Data are available at NAR Online. ACKNOWLEDGEMENTS We thank Dr Feng Zhang for eSpCas9(1.1) (Addgene plasmid #71814); Dr Keith Joung for MSP469 (Addgene plasmid #65771); Katie Gerhardt for editing the manuscript. Author contributions: M.Y. and Y.O. 
conceived the concept of this work, designed and performed experiments, analyzed the data and wrote the manuscript. FUNDING National Institutes of Health [RO1-GM102184 to Y.O.]. Funding for open access charge: CCHMC internal grant. Conflict of interest statement. None declared. REFERENCES
1. Kornblihtt A.R., Schor I.E., Alló M., Dujardin G., Petrillo E., Muñoz M.J. Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat. Rev. Mol. Cell Biol. 2013; 14:153–165.
2. Keren H., Lev-Maor G., Ast G. Alternative splicing and evolution: diversification, exon definition and function. Nat. Rev. Genet. 2010; 11:345–355.
3. Pan Q., Shai O., Lee L.J., Frey B.J., Blencowe B.J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 2008; 40:1413–1415.
4. Kelemen O., Convertini P., Zhang Z., Wen Y., Shen M., Falaleeva M., Stamm S. Function of alternative splicing. Gene. 2013; 514:1–30.
5. Brockdorff N., Ashworth A., Kay G.F., McCabe V.M., Norris D.P., Cooper P.J., Swift S., Rastan S. The product of the mouse Xist gene is a 15 kb inactive X-specific transcript containing no conserved ORF and located in the nucleus. Cell. 1992; 71:515–526.
6. Willard H.F., Brown C.J., Hendrich B., Rupert J., Lafreniere R., Xing Y., Lawrence J.B. The human XIST gene: analysis of a 17 kb inactive X-specific RNA that contains conserved repeats and is highly localized within the nucleus. Cell. 1992; 71:527–542.
7. Penny G.D., Kay G.F., Sheardown S.A., Rastan S., Brockdorff N. Requirement for Xist in X chromosome inactivation. Nature. 1996; 379:131–137.
8. Sado T., Brockdorff N. Advances in understanding chromosome silencing by the long non-coding RNA Xist. Philos. Trans. R. Soc. Lond. B, Biol. Sci. 2013; 368:20110325.
9. Cerase A., Pintacuda G., Tattermusch A., Avner P. Xist localization and function: new insights from multiple levels. Genome Biol. 2015; 16:166.
10. Pinter S.F. A Tale of Two Cities: How Xist and its partners localize to and silence the bicompartmental X. Semin. Cell Dev. Biol. 2016; 56:19–34.
11. Marahrens Y., Panning B., Dausman J., Strauss W.M., Jaenisch R. Xist-deficient mice are defective in dosage compensation but not spermatogenesis. Genes Dev. 1997; 11:156–166.
12. Yildirim E., Kirby J.E., Brown D.E., Mercier F.E., Sadreyev R.I., Scadden D.T., Lee J.T. Xist RNA is a potent suppressor of hematologic cancer in mice. Cell. 2013; 152:727–742.
13. Johnston C.M., Nesterova T.B., Formstone E., Newall A., Duthie S., Sheardown S.A., Brockdorff N. Developmentally regulated Xist promoter switch mediates initiation of X inactivation. Cell. 1998; 94:809–817.
14. Ma M., Strauss W.M. Analysis of the Xist RNA isoforms suggests two distinctly different forms of regulation. Mamm. Genome. 2005; 16:391–404.
15. Memili E., Hong Y.K., Kim D.H., Ontiveros S.D., Strauss W.M. Murine Xist RNA isoforms are different at their 3′ ends: a role for differential polyadenylation. Gene. 2001; 266:131–137.
16. Kim J.S., Choi H.W., Araúzo-Bravo M.J., Schöler H.R., Do J.T. Reactivation of the inactive X chromosome and post-transcriptional reprogramming of Xist in iPSCs. J. Cell Sci. 2015; 128:81–87.
17. Wutz A., Rasmussen T.P., Jaenisch R. Chromosomal silencing and localization are mediated by different domains of Xist RNA. Nat. Genet. 2002; 30:167–174.
18. Hoki Y., Kimura N., Kanbayashi M., Amakawa Y., Ohhata T., Sasaki H., Sado T. A proximal conserved repeat in the Xist gene is essential as a genomic element for X-inactivation in mouse. Development. 2009; 136:139–146.
19. Senner C.E., Nesterova T.B., Norton S., Dewchand H., Godwin J., Mak W., Brockdorff N. Disruption of a conserved region of Xist exon 1 impairs Xist RNA localisation and X-linked gene silencing during random and imprinted X chromosome inactivation. Development. 2011; 138:1541–1550.
20. Yamada N., Hasegawa Y., Yue M., Hamada T., Nakagawa S., Ogawa Y. Xist Exon 7 contributes to the stable localization of Xist RNA on the inactive X-chromosome. PLoS Genet. 2015; 11:e1005430.
21. Yue M., Ogawa A., Yamada N., Charles Richard J.L., Barski A., Ogawa Y. Xist RNA repeat E is essential for ASH2L recruitment to the inactive X and regulates histone modifications and escape gene expression. PLoS Genet. 2017; 13:e1006890.
22. Sunwoo H., Colognori D., Froberg J.E., Jeon Y., Lee J.T. Repeat E anchors Xist RNA to the inactive X chromosomal compartment through CDKN1A-interacting protein (CIZ1). Proc. Natl. Acad. Sci. U.S.A. 2017; 114:10654–10659.
23. Hasegawa Y., Brockdorff N., Kawano S., Tsutui K., Tsutui K., Nakagawa S. The matrix protein hnRNP U is required for chromosomal localization of Xist RNA. Dev. Cell. 2010; 19:469–476.
24. Hsu P.D., Lander E.S., Zhang F. Development and applications of CRISPR-Cas9 for genome engineering. Cell. 2014; 157:1262–1278.
25. Doudna J.A., Charpentier E. Genome editing. The new frontier of genome engineering with CRISPR-Cas9. Science. 2014; 346:1258096.
26. Sun S., Del Rosario B.C., Szanto A., Ogawa Y., Jeon Y., Lee J.T. Jpx RNA activates Xist by evicting CTCF. Cell. 2013; 153:1537–1551.
27. Lee J.T., Lu N. Targeted mutagenesis of Tsix leads to nonrandom X inactivation. Cell. 1999; 99:47–57.
28. Li E., Bestor T.H., Jaenisch R. Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell. 1992; 69:915–926.
29. Chen B., Gilbert L.A., Cimini B.A., Schnitzbauer J., Zhang W., Li G.-W., Park J., Blackburn E.H., Weissman J.S., Qi L.S. et al. Dynamic imaging of genomic loci in living human cells by an optimized CRISPR/Cas system. Cell. 2013; 155:1479–1491.
30. Ran F.A., Hsu P.D., Wright J., Agarwala V., Scott D.A., Zhang F. Genome engineering using the CRISPR-Cas9 system. Nat. Protoc. 2013; 8:2281–2308.
31. Slaymaker I.M., Gao L., Zetsche B., Scott D.A., Yan W.X., Zhang F. Rationally engineered Cas9 nucleases with improved specificity. Science. 2016; 351:84–88.
32. Kleinstiver B.P., Prew M.S., Tsai S.Q., Topkar V.V., Nguyen N.T., Zheng Z., Gonzales A.P.W., Li Z., Peterson R.T., Yeh J.-R.J. et al. Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature. 2015; 523:481–485.
33. Mefferd A.L., Kornepati A.V.R., Bogerd H.P., Kennedy E.M., Cullen B.R. Expression of CRISPR/Cas single guide RNAs using small tRNA promoters. RNA. 2015; 21:1683–1689.
34. Lee E.C., Yu D., Martinez de Velasco J., Tessarollo L., Swing D.A., Court D.L., Jenkins N.A., Copeland N.G. A highly efficient Escherichia coli-based chromosome engineering system adapted for recombinogenic targeting and subcloning of BAC DNA. Genomics. 2001; 73:56–65.
35. Yue M., Charles Richard J.L., Yamada N., Ogawa A., Ogawa Y. Quick fluorescent in situ hybridization protocol for Xist RNA combined with immunofluorescence of histone modification in X-chromosome inactivation. J. Vis. Exp. 2014; e52053.
36. Shapiro M.B., Senapathy P. RNA splice junctions of different classes of eukaryotes: sequence statistics and functional implications in gene expression. Nucleic Acids Res. 1987; 15:7155–7174.
37. Abril J.F., Castelo R., Guigó R. Comparison of splice sites in mammals and chicken. Genome Res. 2005; 15:111–119.
38. Zheng C.L., Fu X.-D., Gribskov M. Characteristics and regulatory elements defining constitutive splicing and different modes of alternative splicing in human and mouse. RNA. 2005; 11:1777–1787.
39. Chen F., Pruett-Miller S.M., Huang Y., Gjoka M., Duda K., Taunton J., Collingwood T.N., Frodin M., Davis G.D. High-frequency genome editing using ssDNA oligonucleotides with zinc-finger nucleases. Nat. Methods. 2011; 8:753–755.
40. Sado T., Wang Z., Sasaki H., Li E. Regulation of imprinted X-chromosome inactivation in mice by Tsix. Development. 2001; 128:1275–1286.
41. Stavropoulos N., Lu N., Lee J.T. A functional role for Tsix transcription in blocking Xist RNA accumulation but not in X-chromosome choice. Proc. Natl. Acad. Sci. U.S.A. 2001; 98:10232–10237.
42. Clemson C.M., Chow J.C., Brown C.J., Lawrence J.B. Stabilization and localization of Xist RNA are controlled by separate mechanisms and are not sufficient for X inactivation. J. Cell Biol. 1998; 142:13–23.
43. Sakaguchi T., Hasegawa Y., Brockdorff N., Tsutsui K., Tsutsui K.M., Sado T., Nakagawa S. Control of chromosomal localization of Xist by hnRNP U family molecules. Dev. Cell. 2016; 39:11–12.
44. Wong J.J.-L., Au A.Y.M., Ritchie W., Rasko J.E.J. Intron retention in mRNA: No longer nonsense: known and putative roles of intron retention in normal and disease biology. Bioessays. 2016; 38:41–49.
45. Jacob A.G., Smith C.W.J. Intron retention as a component of regulated gene expression programs. Hum. Genet. 2017; 136:1043–1057.
46. Matlin A.J., Clark F., Smith C.W.J. Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell. Biol. 2005; 6:386–398.
47. Braunschweig U., Barbosa-Morais N.L., Pan Q., Nachman E.N., Alipanahi B., Gonatopoulos-Pournatzis T., Frey B., Irimia M., Blencowe B.J. Widespread intron retention in mammals functionally tunes transcriptomes. Genome Res. 2014; 24:1774–1786.
48. Middleton R., Gao D., Thomas A., Singh B., Au A., Wong J.J.-L., Bomane A., Cosson B., Eyras E., Rasko J.E.J. et al. IRFinder: assessing the impact of intron retention on mammalian gene expression. Genome Biol. 2017; 18:51.
49. Ridings-Figueroa R., Stewart E.R., Nesterova T.B., Coker H., Pintacuda G., Godwin J., Wilson R., Haslam A., Lilley F., Ruigrok R. et al. The nuclear matrix protein CIZ1 facilitates localization of Xist RNA to the inactive X-chromosome territory. Genes Dev. 2017; 31:876–888.
© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Multimode drug inducible CRISPR/Cas9 devices for transcriptional activation and genome editing
Lu, Jia; Zhao, Chen; Zhao, Yingze; Zhang, Jingfang; Zhang, Yue; Chen, Li; Han, Qiyuan; Ying, Yue; Peng, Shuai; Ai, Runna; Wang, Yu
doi: 10.1093/nar/gkx1222 pmid: 29237052
Abstract Precise investigation and manipulation of dynamic biological processes often requires molecular modulation in a controlled inducible manner. The clustered, regularly interspaced, short palindromic repeats (CRISPR)/CRISPR associated protein 9 (Cas9) has emerged as a versatile tool for targeted gene editing and transcriptional programming. Here, we designed and vigorously optimized a series of Hybrid drug Inducible CRISPR/Cas9 Technologies (HIT) for transcriptional activation by grafting a mutated human estrogen receptor (ERT2) to multiple CRISPR/Cas9 systems, which renders them 4-hydroxytamoxifen (4-OHT) inducible for the access of genome. Further, extra functionality of simultaneous genome editing was achieved with one device we named HIT2. Optimized terminal devices herein delivered advantageous performances in comparison with several existing designs. They exerted selective, titratable, rapid and reversible response to drug induction. In addition, these designs were successfully adapted to an orthogonal Cas9. HIT systems developed in this study can be applied for controlled modulation of potentially any genomic loci in multiple modes. INTRODUCTION Taking advantage of their RNA-guided DNA binding and endonuclease activity, CRISPR/Cas9 systems have been adapted to sequence dependent modulations in the genomic contexts (1–3). Cas9 protein binds and cleaves DNA in a sequence specific manner based on its complementarity with the associated guide RNA (gRNA) and the adjacent presence of a protospacer-adjacent motif (PAM). DNA cleavage mediated by the Cas9–gRNA complex results in targeted gene editing, either through error prone non-homologous end-joining (NHEJ) or precise homology directed repair (HDR) (1–3). Further, utilities of the CRISPR/Cas9 systems have been expanded to other molecular functions, including transcription activation, repression, genomic DNA labeling, and epigenetic programming, by coupling with various effectors (4,5). 
For these purposes, DNA cleavage is spared by using nuclease-null, or ‘dead’, Cas9 (dCas9) variants bearing key mutations in nuclease domains, while the DNA binding activity that recruits the effectors to target loci is retained (2). CRISPR-Cas9 dependent gene transcriptional activation system can be achieved by recruiting activation domains (ADs) to the dCas9–gRNA complex. Around 20 ADs and their combinations have been screened for activation potential (6,7). Commonly used ADs include VP64 (an engineered tetramer of the herpes simplex virus transcriptional activator domain VP16), p65 (NF-κB trans-activating subunit), Rta (immediate-early nuclear transcription factor of Kaposi's sarcoma-associated herpesvirus), HSF1 (human heat-shock factor 1), and their combinations VPR (VP64, p65 and Rta) and PH (p65 and HSF1) (3,5–16). To date, there are multiple strategies to generate dCas9–gRNA directed transcriptional activation systems, including direct fusion of ADs to dCas9 and recruitment of multiple AD modules to a peptide array fused with dCas9 or aptamers appended to gRNAs (3,5–16). Precise dissection of dynamic biological processes often requires perturbation of the underlying molecular events in a controlled inducible manner. Greater precision is also beneficial towards the translational ends for development of gene therapies. Nuclear hormone receptors, estrogen receptor (ER) as a representative, have been adapted to drug inducible genome modulation, which utilizes their ligand dependent translocation from the cytoplasm to the nucleus (17). Without its hormone ligand, ER is sequestered by heat shock protein Hsp90 in the cytoplasm. Ligand binding disrupts their interaction, thus leading to ER translocation to the nucleus. ER bearing three mutations G400V/M543A/L544A, also known as ERT2, displays selective affinity to the synthetic estrogen antagonist 4-OHT over β-estradiol, the endogenous ER ligand, a property crucial for low background activity (18). 
In its best known application, ERT2 fusion to the Cre recombinase has been widely used in biomedical research for conditional and inducible genome engineering (19,20). Here we established and comprehensively optimized a series of novel drug inducible transcriptional activation systems based on coupling CRISPR/Cas9 and ERT2, which we named Hybrid Inducible CRISPR/Cas9 Technologies (HIT). 4-OHT inducible transcriptional activation was first accomplished using various CRISPR/Cas9 systems grafted with ERT2 domains. Simultaneous gene activation and editing was then delivered in a drug inducible manner using one device termed HIT2. In head-to-head comparisons, these systems performed better than several previously published designs (21–24). Consistent with these findings, several recent reports described Cas9 editing activity under the tight control of the ER domain (25–27). We also demonstrated that drug induction is titratable, selective, rapid, and reversible. Further, architectures developed in this study can be applied directly to orthogonal Cas9 devices. HIT-Cas9 systems developed herein provide powerful tools for controlled modulation of potentially any genomic loci in multiple modes. MATERIALS AND METHODS Plasmid construction Cas9, dCas9 and ERT2 were cloned from the pX330-U6-Chimeric_BB-CBh-hSpCas9 plasmid (a gift from Feng Zhang, Addgene plasmid # 42230 (1)), the pMSCV-LTR-dCas9-VP64-BFP plasmid (a gift from Stanley Qi & Jonathan Weissman, Addgene plasmid # 46912 (15)) and the pAd-CreER plasmid (a gift from T.C. He's lab, University of Chicago), respectively. 
Multiple activation domains (ADs), including VP64 (V), P65 (P), Rta (R) and HSF1 (H), were cloned from Addgene plasmids (VP64 was PCR amplified from pLenti-EF1a-SOX2, a gift from Feng Zhang, Addgene plasmid #35388 (28); P65 and Rta were amplified from SP-dCas9-VPR, a gift from George Church, Addgene plasmid # 63798 (7); HSF1 was amplified from lenti MS2-P65-HSF1_Hygro, a gift from Feng Zhang, Addgene plasmid # 61426 (6)). scFv-sfGFP-GB1 and 10xGCN4 were synthesized (Genewiz) according to the sequences reported by Tanenbaum et al. (11). MCP plasmids were cloned by replacing scFv with MCP, which was cloned from MCP-P65-HSF1 (a gift from Feng Zhang, Addgene plasmid # 61426 (6)). NES sequences were synthesized (Sangon Biotech) according to the sequence reported by Ding et al. (29) and inserted into various Cas9 constructs. For the scFv constructs used in simultaneous gene editing and gene activation, sfGFP was removed so that it did not interfere with the GFP fluorescence produced by BFP editing. The TRE3G promoter was cloned from the Tet-On 3G inducible expression system from Clontech. Intein-S219-G521R was a gift from David Liu (Addgene plasmid # 64192 (24)). For intein-S219-G521R-VPR construction, intein-G521R was inserted in dCas9 at S219 and VPR was fused to the C-terminus of dCas9. Split-Cas9, split-dCas9(N) and split-dCas9-VP64(C) were gifts from Feng Zhang (Addgene plasmid # 62889 (23); Addgene plasmid # 62887 (23); Addgene plasmid # 62888 (23)). SaCas9-2NES-2ERT2-GCN4 plasmids were cloned by replacing SpCas9 with SaCas9 (a gift from Feng Zhang, Addgene plasmid # 61591 (30)). NLS-dSaCas9–GCN4 and NLS-dSaCas9-VPH plasmids were cloned by replacing dCas9 with dSaCas9 (a gift from Feng Zhang, Addgene plasmid # 61594 (30)). Key plasmids of HIT systems reported in this study will be available through Addgene. 
sgRNA candidates for both gene activation and editing were chosen for predicted high efficiency and low off-target effects, based on computational analyses using the GPP web portal (http://www.broadinstitute.org/rnai/public/analysis-tools/sgrna-design) (31,32) and the CRISPR DESIGN web portal (http://crispr.mit.edu/) (33). They were cloned into an optimized single chimeric guide RNA scaffold (A-U flip extension) (34) and experimentally examined for their potency using the pSSA assay for editing and endogenous gene activation assays for activation (data not shown). The sgRNA displaying the highest activity was chosen for each gene (Supplementary Table S1). sgRNA2.0s for the SAM system were cloned using a backbone plasmid from Feng Zhang (Addgene plasmid # 61424) (6). sgRNA candidates of SaCas9 for both gene editing and activation were chosen by PAM sequence requirement (NNGRRT) (Supplementary Table S3) (30). They were cloned into an optimized single chimeric guide RNA scaffold (A-U flip) (35) and experimentally examined in the pSSA assay and transcriptional activation assays for activity (data not shown). The pSSA reporter plasmids used in this study to examine Cas9 DNA cleavage activity were constructed by insertion of sgRNA target sequences in between the homology domains (36). The TLR plasmid was constructed by replacing the Sce target with the sgRNA targeting sequence in the plasmid pCVL Traffic Light Reporter 1.1 (Sce target) Ef1a Puro (a gift from Andrew Scharenberg, Addgene plasmid #31482) (37). The GFP donor plasmid used in conjunction was a gift from Andrew Scharenberg (Addgene plasmid #31475) (37). 
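As a rough illustration of the PAM-based selection of SaCas9 sgRNA candidates described above, the following Python sketch scans both strands of a DNA sequence for protospacers followed by an NNGRRT PAM. The 21-nt spacer length, the function names and the toy sequence are assumptions made for illustration and are not taken from this study; the candidates reported here were additionally tested experimentally.

```python
import re

# Assumption: a 21-nt spacer; the text only specifies the PAM (NNGRRT).
SPACER_LEN = 21


def revcomp(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sacas9_sites(seq: str, spacer_len: int = SPACER_LEN):
    """List (strand, start, spacer, pam) for every spacer followed by NNGRRT.

    `start` is the 0-based offset on the scanned strand. A lookahead is used
    so that overlapping candidate sites are all reported.
    """
    seq = seq.upper()
    pattern = re.compile(
        rf"(?=([ACGT]{{{spacer_len}}})([ACGT]{{2}}G[AG][AG]T))"
    )
    sites = []
    for strand, strand_seq in (("+", seq), ("-", revcomp(seq))):
        for match in pattern.finditer(strand_seq):
            sites.append((strand, match.start(), match.group(1), match.group(2)))
    return sites


if __name__ == "__main__":
    # Hypothetical toy target: 21 A's followed by a CCGAGT PAM (matches NNGRRT).
    toy = "A" * 21 + "CCGAGT"
    for site in find_sacas9_sites(toy):
        print(site)
```

A scan like this only enumerates sequences that satisfy the PAM requirement; ranking by predicted efficiency or off-target risk is outside the scope of the sketch.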
The gLuc luciferase reporter plasmid used in this study to examine gene activation activity was constructed by inserting the gLuc sgRNA target sequence (Supplementary Table S1) upstream of a minimal CMV promoter (pAAV-minCMV-mCherry was a gift from Feng Zhang, Addgene plasmid # 27970 (28)), and replacing mCherry with firefly luciferase or gaussia luciferase. The gaussia luciferase reporter plasmid used in this study to examine gene activation activity via SaCas9 was constructed by replacing the gLuc sgRNA target sequence with the target sequence of a telomere sgRNA in the gaussia luciferase reporter (Supplementary Table S3). The BFP reporter plasmid used in the FCR assay to examine gene editing activity was constructed by mutagenesis of the 198th–200th nucleotides TAC in the GFP ORF to CAT using a plasmid containing an EF1a-GFP cassette (38). Cell culture HEK293T cells (ATCC) were maintained in Dulbecco's modified Eagle's Medium supplemented with 10% FBS, 2 mM GlutaMAX (Thermo Fisher), 100 U/ml penicillin and 100 μg/ml streptomycin at 37°C and 5% CO2. Stable monoclonal cell lines of TLR and FCR reporters were generated following standard protocols of lentivirus packaging and infection. Transfections were done using Biotool DNA transfection Reagent (Biotool) according to the manufacturer's recommended protocol. Within each experiment, the molar amount of each AD and the total weight of transfected DNA were matched for each well. Unless stated otherwise, culture medium was changed 5 h after transfection to medium containing 125 nM of 4-OHT or a matched volume of ethanol. Cells were cultured for an additional 48 h before examination of genome editing or transcription activation. Transcription activation assays For the luciferase reporter assay, HEK293T cells were plated into 96-well plates. A total plasmid mass of 183 ng was transfected using 0.45 μl of Biotool DNA transfection Reagent (Biotool) for each well according to the manufacturer's instructions. 
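The TAC-to-CAT change used to build the BFP reporter described above can be checked at the codon level with a short Python sketch. The 18-nt fragment is copied from the ssDNA donor template sequence given for the FCR assay; the reading frame chosen for it, and all names in the code, are assumptions made for illustration rather than details stated in this study.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string (length divisible by 3) to protein."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


# Fragment of the donor template around the edited codon (assumed in frame).
GFP_FRAGMENT = "CTGACGTACGGCGTGCAG"

# Swap the third codon of the fragment (TAC) for CAT, as in the BFP reporter.
assert GFP_FRAGMENT[6:9] == "TAC"
BFP_FRAGMENT = GFP_FRAGMENT[:6] + "CAT" + GFP_FRAGMENT[9:]

if __name__ == "__main__":
    print(translate(GFP_FRAGMENT))  # LTYGVQ
    print(translate(BFP_FRAGMENT))  # LTHGVQ
```

Under the assumed frame, the edit replaces a tyrosine with a histidine, and a donor carrying TAC restores the tyrosine, which is why GFP-positive cells can serve as the HDR readout in the FCR assay.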
The molar amounts of sgRNA, dCas9, and ADs were matched across all wells. Constitutive expression plasmids of renilla luciferase and firefly luciferase were co-transfected to serve as internal controls for firefly luciferase and gaussia luciferase, respectively. Culture medium was changed to medium containing 125 nM 4-OHT 5 h after transfection. Forty eight hours after 4-OHT was added, samples were incubated with luciferase substrates (Promega for firefly luciferase; New England Biolabs for gaussia and renilla luciferase) and then analyzed using VictorX3 Multilabel Plate Reader (PerkinElmer) according to the manufacturer's recommended protocol. For endogenous gene activation, HEK293T cells were plated into 24-well plates. Cells were cultured and transfected using standard protocols. The total amount of transfected DNA and the molar amounts of sgRNA, dCas9 and ADs were all matched across all wells. Forty eight hours after 4-OHT was added, cells were harvested from each transfection and subsequently processed for total RNA extraction using the Direct-zol™ RNA MiniPrep Kit (Zymo Research). cDNA was generated using the GoScript cDNA Synthesis Kit (Promega) according to the manufacturer's recommended protocol. mRNA expression levels were quantitated using SYBR Green Gene Expression Assays (Toyobo). The sequences of qPCR primers are listed in Supplementary Table S2. For the CD43 activation assay, 48 h after 4-OHT was added, live cells were collected and incubated with a CD43 antibody conjugated with APC (Miltenyi Biotec) according to the manufacturer's recommended protocol. Cells were then analyzed by the CytoFLEX (Beckman Coulter). Subcellular localization analyses HEK293T cells were cultured on 24-well plates. 500 ng of scFv-2E-V or scFv-2E-PH was transfected into each well. To examine C2N2E-GCN4 localization, 400 ng of scFv-GFP were transfected alone or co-transfected with 400 ng of C2N2E-GCN4 into each well. 
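For the luciferase reporter assays described above, one common way to use a co-transfected internal control is to divide each reporter reading by the control reading from the same well and then compare induced with uninduced wells. The Python sketch below illustrates that arithmetic under this assumption; the function names and all readings are hypothetical and are not data or analysis code from this study.

```python
from statistics import mean


def normalized_signal(reporter, control):
    """Divide each reporter reading by the internal-control reading of its well."""
    if len(reporter) != len(control):
        raise ValueError("reporter and control must have one reading per well")
    return [r / c for r, c in zip(reporter, control)]


def fold_activation(induced, uninduced):
    """Ratio of mean normalized signal with 4-OHT to mean signal with vehicle."""
    return mean(induced) / mean(uninduced)


if __name__ == "__main__":
    # Hypothetical plate-reader values for two wells per condition.
    ohT_reporter, ohT_control = [1200.0, 1000.0], [100.0, 100.0]
    veh_reporter, veh_control = [200.0, 240.0], [100.0, 120.0]

    induced = normalized_signal(ohT_reporter, ohT_control)
    uninduced = normalized_signal(veh_reporter, veh_control)
    print(fold_activation(induced, uninduced))  # 5.5
```

Normalizing per well before averaging keeps differences in transfection efficiency between wells from being mistaken for differences in activation.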
Cells were then transferred to 96-well plates for image collection after 48 h of 4-OHT induction. Images were collected and quantitatively analyzed using the Operetta High Content Screening system (PerkinElmer) after cells were fixed with 4% (w/v) paraformaldehyde and stained with Hoechst 33342 (Thermo Fisher). Gene editing assays As for the pSSA reporter assay, HEK293T cells were cultured and transfected using standard protocols. Forty eight hours after transfection, 50 μl of medium was collected and then gaussia luciferase substrate (New England Biolabs) was added. A constitutive expression plasmid of firefly luciferase was co-transfected to serve as a normalization control. Luciferase activity was measured by chemiluminescence using the VICTOR X3 Multi-label Plate Reader (PerkinElmer) according to the manufacturer's recommended protocol. As for the TLR assays, 250 ng of various Cas9 constructs and 150 ng of sgRNA with or without 400 ng of GFP donor template (Addgene plasmid #31475 (37)) were co-transfected into each well of TLR reporter cells pre-seeded on 24-well plates. HDR assessment was done with flow cytometry analysis using the CytoFLEX cell analyzer (Beckman Coulter). At least 50,000 cells from each well were analyzed. The HDR rates were determined by the percentages of GFP positive cells. Cells were transferred to 96-well plates for NHEJ measurements. Images were collected using the Operetta High Content Screening system (PerkinElmer) after cells were fixed with 4% (w/v) paraformaldehyde and stained with Hoechst 33342 (Thermo Fisher). The mCherry fluorescence was assessed by Harmony 3.5 (PerkinElmer). As for the CD201 genomic knockout assay, HEK293T cells were cultured and transfected using standard protocols. Twenty four hours after transfection, transfected cells were enriched by antibiotic selection using 125 μg/ml Zeocin and 100 μg/ml G418. Twenty four hours after selection, 4-OHT was added.
Cells were cultured for several days until they reached a sufficient number for live staining with a CD201 antibody conjugated with PE-Vio770 (Miltenyi Biotec) and flow cytometry analysis using the CytoFLEX (Beckman Coulter). As for the FCR assay, the stable cell line was cultured on 24-well plates. 300 ng of various Cas9 constructs and 300 ng of BFP sgRNA with 10 pmol of ssDNA donor template were co-transfected into each well. The sequence of the donor template was 5′-GCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACGTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGA-3′ (38) (synthesized by Sangon Biotech). HDR assessment was performed with flow cytometry analysis using the CytoFLEX cell analyzer (Beckman Coulter). At least 30,000 cells from each well were analyzed. The HDR rates were determined by the percentages of GFP positive cells. Assay of simultaneous BFP editing and CD43 activation The BFP stable cell line was used for simultaneous BFP editing and CD43 activation. Cells were cultured in 24-well plates and transfected using standard protocols. The total amount of transfected DNA and the molar amount of CD43 and BFP sgRNAs, Cas9 fusions, and activators were all matched across all wells. 10 pmol of ssDNA donor was co-transfected into each well. Forty eight hours after 4-OHT introduction, live cells were collected and stained with a CD43 antibody conjugated with APC (Miltenyi Biotec) according to the manufacturer's recommended protocol. Then cells were examined using the CytoFLEX (Beckman Coulter). The efficiencies of BFP editing and CD43 activation were determined by the percentages of GFP and CD43 positive cells respectively. RESULTS Drug inducible transcriptional activation In order to efficiently evaluate various designs for gene activation, we constructed a luciferase reporter under control of a sgRNA target sequence (gLuc sgRNA) (Supplementary Figure S1A) (39).
Transcription machineries are exclusively accessible in the nucleus, thus drug inducible performance of our HIT-dCas9 activation systems can be readily assessed upon co-transfection with this reporter construct. Using it as a surrogate, we first optimized designs within each system. Next the designs delivering the most tight and efficient drug inducible performance from each system were compared side by side for the activation of the luciferase reporter and a panel of endogenous genes. The most straightforward design for CRISPR/Cas9 based transcription activation utilized direct AD fusion to dCas9 (3,7–9,14–16). dCas9-VP64 represents the first generation of this kind (3,8,15,16). Tandem fusion of multiple ADs, including VP64 (V), P65 (P), and Rta (R), to dCas9 (dCas9-VPR) represents the second generation with much higher potency due to synergy among distinct ADs (7). First, we inserted one or two ERT2 domains in between dCas9 and VPR (dCas9-E-VPR and dCas9-2E-VPR) or fused ERT2(s) to the C-terminus of the entire construct (dCas9-VPR-E and dCas9-VPR-2E) (Supplementary Figure S1B). We observed robust activation of the luciferase reporter across all four constructs in a gRNA specific manner upon treatment with 4-OHT (Supplementary Figure S1C). In the absence of 4-OHT, no significant background activity was introduced by the dCas9-2ERT2-VPR (dCas9-2E-VPR) construct, in contrast to the other three (Supplementary Figure S1C). In addition, further improvement of activation potency with undetectable background activity was observed when VPR was replaced with VPH (VP64, p65 and HSF1) in the same architecture (dCas9-2E-VPH) (Supplementary Figure S1D). Taken together, HIT-dCas9 activation system based on direct fusion to ERT2 domains concluded at dCas9-2E-VPH, which delivered the highest potency upon drug induction without obvious background activity (Figure 1A). Figure 1. 
Design, optimization and cross-comparison of HIT transcription activation systems. (A–C) Cartoons illuminating the mechanisms of optimized 4-OHT induced transcription activation using a direct fusion HIT construct (A), the HIT-SAM system (B), and the HIT-SunTag system (C). (D–I) Cross-comparison among three optimized HIT transcription activation systems. Transcription activation induced by HIT constructs was examined in the luciferase reporter assay (D), by quantification of relative mRNA level of endogenous expression for Klf4 (E) and Oct4 (F), and by flow-cytometry analyses of CD43 protein level on the cell surface (G–I). Representative plots (G) and quantitative analyses of median CD43 fluorescent intensities (H) and overall CD43 fluorescent intensities (I) of CD43+GFP+ positive population were shown. GFP fluorescence indicates successful transfection. Cells transfected with the same amount of reporter construct or sgRNA only while keeping the total amount of transfection constant were used as negative controls (NC) in the luciferase reporter assay. Cells transfected with an ERT2 tagged GFP construct while keeping the total amount of transfection constant were used as negative control (NC) in qRT-PCR assays. Cells transfected with an unrelated sgRNA (ctl sgRNA) and GFP were used as negative control (NC) in flow-cytometry analyses. Data showed mean ± SD. n = 3 biological replicates. ns: non-significant; *P < 0.05; **P < 0.01; ***P < 0.001; two tailed t-tests. Three biological replicates means three independently transfected samples throughout this study. The readouts without drug induction were compared in t-tests against the negative controls (NCs) for background detection. Next we worked on the SAM system, in which minimal MS2 aptamers are appended to the tetraloop and stem loop 2 in the sgRNA scaffold (named sgRNA2.0), which recruit ADs fused to a MS2 binding protein (MCP) (6). The original design involves two hybrid proteins: dCas9-VP64 and MCP-PH.
Data showed that AD recruitment to the MS2 sites contributed much more to the potency than AD directly fused to dCas9, possibly because four AD molecules can be accommodated on the MS2 aptamers (6). Accordingly, we first used a dCas9-NLS construct without VP64 in conjunction with fusion constructs of various ADs to MCP-ERT2 (MCP-E) (Supplementary Figure S2A). An additive drug inducible effect was observed upon addition of VP64 to PH or PR (MCP-E-VPH or MCP-E-VPR) (Supplementary Figure S2B). Further addition of another R to VPH (MCP-E-VPHR) instead lowered its potency (Supplementary Figure S2B). Moreover, splitting VPH into VP64 and PH (MCP-E-VP64+PH) displayed similar activation efficiency upon drug induction, indicating that intermolecular synergy is not more effective in this arrangement (Supplementary Figure S2B). Taking these data together, we chose VPH as the most efficient AD combination for further optimization. Notably, higher background activity in the absence of 4-OHT accompanied higher potency when using the trimer and tetramer AD constructs (Supplementary Figure S2B). To address this, we fused two ERT2 domains to the MCP-VPH construct (MCP-2E-VPH) and found that this significantly lowered the background without compromising the drug induced activity (Supplementary Figure S2C). In addition, further subjecting dCas9 to 4-OHT regulation (dCas9-2ERT2, dCas9-2E) appeared unnecessary, as this dramatically reduced the potency upon drug induction (Supplementary Figure S2C). Therefore, the best design for our optimized HIT-SAM system was the combined use of dCas9-NLS and MCP-2E-VPH, which takes advantage of tight regulation of MCP-VPH via fusion with two tandem ERT2 domains while increasing the access of dCas9 to its target DNA via NLS tagging (Figure 1B). Another major transactivation system named SUperNova TAGging (SunTag) utilizes specific binding of soluble single chain variable fragment (scFv) antibodies to multiple GCN4 peptides.
In this system, VP64 is fused with scFv and recruited to a tandem array of GCN4 peptides fused with dCas9 (10,11). Amplified potency was observed in comparison to the original dCas9-VP64 as multiple VP64 modules were recruited to a single dCas9–gRNA complex. To establish a SunTag based drug inducible transactivation system, we cloned one or two ERT2 domains between scFv and ADs (scFv-E-ADs or scFv-2E-ADs), which were then used in conjunction with dCas9-NLS fused with 10 × GCN4 peptides to its C-terminus (dCas9–NLS–GCN4) (Supplementary Figure S3A). Consistent with our previous observations, VPH, VPR and VPHR demonstrated additive domain synergy (Supplementary Figure S3B). As observed in other systems, VPH consistently elicited highest level of activation upon drug induction. Importantly, 2ERT2 fusion controlled the background activity in the absence of 4-OHT to a non-significant level in comparison to the negative control across all AD combinations. Furthermore, in contrast to our finding in SAM system, separating VPH into two constructs, V and PH, delivered a more robust synergetic effect, possibly due to higher numbers of AD modules that the SunTag system can accommodate (Supplementary Figure S3C and D). Moreover, similar to the inducible SAM system, replacing NLS tagged dCas9 with one fused with 2ERT2 (dCas9-2E-GCN4) significantly reduced the activation potency (Supplementary Figure S3D). Taken together, we concluded our optimization of HIT-SunTag systems at the co-delivery of dCas9–NLS–GCN4, scFv-2E-VP64 and scFv-2E-PH (Figure 1C). We have accomplished tight and effective drug induction across all three systems. A pending question is which system delivered the most effective drug inducible performance. To this end, we compared the optimized designs of each system in a carefully controlled manner side-by-side in the same experiment (Figure 1D). 
We also expanded our analyses to multiple endogenous genes, including Klf4 and Oct4, examined by RT-PCR (Figure 1E and F), and CD43, examined by flow-cytometry (Figure 1G–I). The optimized terminal HIT-SunTag design displayed the highest, or jointly highest, transcriptional efficiency upon drug treatment across all these assays (Figure 1D–F). Notably, the direct fusion construct dCas9-2E-VPH elicited the lowest level of activation across all assays, and its low potency could be clearly observed at single cell resolution by flow cytometry, a distinction possibly due to only one VPH module being recruited to the promoters (Figure 1D–I). Importantly, background activities in the absence of 4-OHT remained undetectable across all these optimized designs (Figure 1D–I). Next, utilizing these assays across these endogenous genes, we conducted collateral assessments for selectivity (Supplementary Figure S4). In doing so, we included another cell surface protein, CD31, to cross examine it with CD43 for gene activation at the protein level (Supplementary Figure S4A and B). Our results indicate no collateral effect among these sgRNAs in activation of their targets (Supplementary Figure S4), thus implying good specificity in agreement with previous reports examining the original SunTag system (10,12). HIT2: a drug inducible CRISPR/Cas9 device capable of both transcriptional activation and genome editing It was reported previously that a shortened sgRNA with fewer than 16 nucleotides (nt) complementary to its target DNA allows wildtype Cas9 to bind its target without cleaving it (40,41). Taking advantage of this differential effect, we envisioned using one CRISPR/Cas9 device with sgRNAs of different lengths to simultaneously achieve genome editing and transcriptional activation in a drug inducible manner.
Considering the superiority of two tandem ERT2 for the control of background activity, we replaced dCas9 with Cas9-2ERT2 fusion modules (C2E) in all optimized activation constructs (Supplementary Figure S5A), thus rendering genome editing activity also drug inducible. Using a shortened sgRNA (14 nt), we observed significant 4-OHT induction of luciferase reporter signal and no background activity without the drug (Supplementary Figure S5B). Notably, the activation efficiency of SAM system (C2E + MCP-2E-VPH) was approximately a magnitude lower than direct fusion (C2E-VPH) and SunTag (C2E-GCN4 + scFv-2E-VP64 + scFv-2E-PH), in strong contrast to results using dCas9 and full length sgRNAs (Figure 1D), suggesting that a shortened sgRNA might be less tolerant to MS2 aptamer appendages. We therefore focused on direct fusion and SunTag systems for further interrogations. We also examined genome editing activity of these Cas9-2ERT2 (C2E) constructs using a fluorescence conversion reporter (FCR) assay (38), in which BFP is converted to GFP upon HDR mediated substitution of a key amino acid (Supplementary Figure S5C). Surprisingly, GCN4 fusion to C2E in the SunTag system renders higher editing activity upon drug induction in comparison with fusion with VPH and without (Supplementary Figure S5D). And notably, all three constructs introduce significant background activity in the absence of drug treatment (Supplementary Figure S5D). Therefore, in order to attenuate background activity in genome editing, we introduced one or two nuclear export signal (NES) peptides to the C2E-VPH and C2E-GCN4 constructs (Supplementary Figure S6A) (23,42). Drug inducible transcriptional activation was examined using the luciferase reporter assay (Supplementary Figure S6B) and the endogenous CD43 activation assay (Supplementary Figure S6C). 
A pSSA assay was also employed to examine DNA cleavage activity of these Cas9 constructs, in which a functional luciferase reading frame was restored via single strand annealing (SSA) (Supplementary Figure S7A) (36). Among all tested designs, GCN4 constructs appeared to be highly efficient in both activation with 14nt sgRNAs and editing with 20nt sgRNAs (Supplementary Figures S6 and S7).Our results also validated minimal editing and retained binding activities using wildtype Cas9 and shortened 14nt sgRNAs (40,41). Full length sgRNAs were also capable, but dramatically less efficient, of inducing luciferase signal in a 4-OHT dependent manner, probably due to complication from editing events that impair transcription (Supplementary Figure S6B). Further focused examinations of Cas9-2ERT2-GCN4 (C2E-GCN4) constructs harboring one and two NES peptides respectively (CN2E-GCN4 and C2N2E-GCN4) were conducted across a panel of gene editing assays, including the FCR assay (Supplementary Figure S8), an endogenous CD201 knockout assay (Supplementary Figure S9) and a traffic-light reporter (TLR) assay (37) (Supplementary Figure S10). In the endogenous CD201 knockout assay, we used a sgRNA that effectively targets the 5′ region of the coding sequence of CD201, a cell surface protein highly expressed in human embryonic kidney 293T (HEK 293T) cells, and examined NHEJ induced gene knockout using flow cytometry. The TLR assay can probe both NHEJ and HDR events using a reporter construct in which the restoration of mCherry and GFP signals measure NHEJ and HDR events respectively (Supplementary Figure S10A) (37). Results across all these assays demonstrated a consistent reduction of background activity to minimal level by having two tandem repeats of NES peptides. Therefore, we used C2N2E-GCN4 in conjunction with scFv-2E-VPH, a device we termed ‘HIT2’, for simultaneous drug inducible genome editing and transcriptional activation (Figure 2A). 
When co-transfected with a full length sgRNA targeting the BFP coding region and a shortened sgRNA targeting the CD43 promoter, simultaneous editing of BFP to GFP and CD43 activation was accomplished in a drug inducible manner (Figure 2B). When a NLS tagged Cas9–GCN4 construct (Cas9–NLS–GCN4) was used, as expected, editing events were not subject to drug control (Figure 2B). Figure 2. HIT2: one CRISPR/Cas9 device for simultaneous genome editing and transcriptional activation in a drug inducible manner. (A) Cartoon illuminating the mechanism of the optimized drug inducible HIT2 system for simultaneous genome editing and transcriptional activation. (B) Simultaneous editing and activation by HIT2 and Cas9–NLS–GCN4 were examined using flow-cytometry. The percentage of GFP positive cells indicated HDR efficiency, while CD43 protein level on the cell surface represented transcription activation. Representative plots (upper panels) and quantitative analyses (lower panels) were shown. Data showed mean ± SD. n = 3 biological replicates. ns: non-significant; *P < 0.05; ***P < 0.001; ****P < 0.0001; two-tailed t-tests.
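As a minimal sketch of the summary statistics named in the figure legends (mean ± SD over n = 3 biological replicates, two-tailed t-tests), the Python below computes them for two groups. The legends do not say which t-test variant was used, so an unpaired, equal-variance Student's t-test is assumed here; the function names are hypothetical and the replicate values in the usage lines are invented placeholders, not data from this study. Significance labels are looked up from standard two-tailed critical values for 4 degrees of freedom, so the lookup is valid only for 3 versus 3 replicates, and the P < 0.0001 tier is not covered.

```python
import math
import statistics

# Two-tailed critical values of Student's t for df = 4 (3 + 3 replicates),
# paired with the legend's labels for P < 0.001, P < 0.01 and P < 0.05.
T_CRIT_DF4 = [(8.610, "***"), (4.604, "**"), (2.776, "*")]


def summarize(values):
    """Return (mean, sample SD) for one group of replicates."""
    return statistics.mean(values), statistics.stdev(values)


def student_t(a, b):
    """Unpaired, equal-variance t statistic and degrees of freedom (assumed variant)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2


def stars_df4(t):
    """Map |t| to a significance label; only meaningful when df = 4."""
    for crit, label in T_CRIT_DF4:
        if abs(t) > crit:
            return label
    return "ns"


if __name__ == "__main__":
    induced = [10.0, 12.0, 11.0]   # placeholder readings with 4-OHT
    control = [1.0, 1.2, 0.8]      # placeholder negative-control readings
    t, df = student_t(induced, control)
    print(summarize(induced), summarize(control), df, stars_df4(t))
```

Hard-coding the critical values keeps the sketch free of third-party dependencies; a statistics package would be the natural choice for exact P values or other replicate counts.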
Upon completion of optimization of HIT and HIT2 systems, we next examined 4-OHT induced subcellular trafficking of the ERT2 constructs in the terminal designs. A super folding GFP (sfGFP) is fused with scFv constructs to facilitate protein folding (11), which allows us to examine its subcellular distribution. As expected, both scFv-2E-VP64 and scFv-2E-PH proteins showed drug inducible translocation from the cytoplasm to the nucleus (Supplementary Figure S11). We also cloned a fusion construct of GFP to scFv without any localization signal to objectively probe subcellular localization of the C2N2E-GCN4 construct. scFv-GFP was evenly distributed in cells when transfected alone (Supplementary Figure S12). When co-transfected with C2N2E-GCN4, an obvious cytoplasmic retention was observed. An obvious nuclear accumulation of scFv-GFP was observed upon adding 4-OHT to the culture medium (Supplementary Figure S12). These data support a working mechanism of drug inducible nuclear translocation of our HIT constructs. Comparison with existing designs Previously multiple designs of drug inducible CRISPR/Cas9 systems have been reported (21–24). Insertion of an evolved intein within Cas9 at Serine 219 (intein-S219) disrupts its conformation, which is restored upon 4-OHT dependent intein splicing (24). In a split-Cas9 architecture, Cas9 and dCas9 were split into two halves and separately fused with two binding partners of mammalian target of rapamycin (mTOR), FK506 binding protein 12 (FKBP) and FKBP rapamycin binding (FRB) domains. Rapamycin induces binding of the pairs, thus reconstitute Cas9 or dCas9 protein (23). Doxycycline inducible Cas9 systems were also characterized (21,22). We next compared the performance of our HIT-SunTag and HIT2 systems with these published designs head to head in drug inducible transcriptional activation. 
Since only gene editing intein system was reported in the original paper (24), we cloned its gene construct in which intein (G521R) was inserted into dCas9 at the same site as Cas9. The G521R mutation renders intein refractory to endogenous β-estradiol ligand, a property crucial for selective control by the exogenous 4-OHT (24). VPR, the AD array reported to improve potency (7), was fused to the C-terminus of dCas9. A dCas9-VPR driven by a TRE3G promoter was also constructed. In the luciferase reporter assay, consistent with previous results, we observed no significant background activity and efficient reporter activation using HIT-SunTag system employing dCas9 with a full length sgRNA or HIT2 system (C2N2E-GCN4) with a shortened sgRNA (Figure 3A). In contrast, high background signals for both intein and Tet-on constructs were seen in the absence of chemical induction. The leaky background of the Tet-on system is consistent with its use without doxycycline in independent studies (34,43). All three previously reported designs were significantly less potent than our systems upon drug treatment, partly due to their less optimized AD combinations. Activation of endogenous CD43 expression observed similar differences in drug inducible efficiency between our constructs and published designs (Figure 3B–D). Only Tet-on system showed high background activity in this assay, possibly due to lower sensitivity compared with the one using luciferase reporter. Notably, HIT2, although enabling both activation and editing, did compromise activation potency, possibly due to the use of shortened sgRNAs (Figure 3A–D). Therefore, it is recommended to use HIT-SunTag when simultaneous editing is not desired. Figure 3. Comparisons of HIT systems with existing drug inducible designs.
(A–D) Drug inducible efficiency and background activity of transcription activation was examined head-to-head between HIT systems and existing designs using the luciferase reporter assay (A), in which the expression of luciferase was controlled by a sgRNA target sequence (gLuc sgRNA), and the CD43 activation assay (B–D). Representative plots (B), quantitative analyses of the percentage of CD43 positive cells (C), and median CD43 fluorescent intensities (D) were shown. (E) Drug inducible efficiency and background activity of genome editing was examined using the FCR assay in comparison with existing designs. Representative plots (top) and quantifications (right bottom) were shown. NC, cells transfected with an unrelated sgRNA; PC, cells transfected with Cas9-NLS and BFP sgRNA. Data showed mean ± SD. n = 3 biological replicates. ns: non-significant; *P < 0.05; **P < 0.01; ****P < 0.0001; two-tailed t-tests.
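The luciferase comparisons rest on two ratios: each reporter reading is related to its co-transfected internal control (renilla for firefly, firefly for gaussia, per the Methods), and drug induced signal is compared with signal in the absence of drug. The Python sketch below assumes a simple reporter-over-control ratio, which the text does not spell out; the function names and all readings in the usage lines are hypothetical.

```python
from statistics import mean


def normalize(reporter, control):
    """Divide a reporter luminescence reading by its internal-control reading."""
    if control <= 0:
        raise ValueError("internal control reading must be positive")
    return reporter / control


def fold_induction(induced_wells, uninduced_wells):
    """Mean normalized signal with drug divided by mean normalized signal without.

    Each well is a (reporter, control) pair of raw readings.
    """
    induced = mean([normalize(r, c) for r, c in induced_wells])
    uninduced = mean([normalize(r, c) for r, c in uninduced_wells])
    return induced / uninduced


if __name__ == "__main__":
    plus_4oht = [(1000.0, 100.0), (1200.0, 100.0)]   # placeholder wells
    minus_4oht = [(100.0, 100.0), (120.0, 100.0)]    # placeholder wells
    print(fold_induction(plus_4oht, minus_4oht))
```

Normalizing per well before averaging means a well with poor transfection is corrected individually rather than diluting the group mean.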
Given that our HIT2 system is armed with the extra functionality of drug inducible genome editing, we next benchmarked this activity too. C2N2E-GCN4 used in the HIT2 system introduced lower background activity than intein-S219-Cas9, Split-Cas9 and TRE3G-Cas9 in the FCR assay (Figure 3E). Consistent with previous observations for activation (Figure 3A–D), TRE3G-Cas9 elicited the most pronounced background activity. Intein-S219-G521R-Cas9 displayed non-significant background activity, but its drug induced activity was significantly lower than that of HIT2. These differences were also observed in the less sensitive TLR assay (Supplementary Figure S13). Taken together, these data demonstrated advantageous performance of our HIT systems for activation and HIT2 for editing over multiple existing designs including intein, split and Tet-on. Selectivity, titratability, speed and reversibility of drug induction of HIT systems In addition to tightness and efficiency, we next characterized our HIT systems for other important criteria of drug inducible modulation. These include selectivity over the endogenous β-estradiol ligand, whether their activities are titratable with different concentrations of 4-OHT, the speed of the drug induced effect, and whether it is reversible. We first examined the dose dependent response of the optimized terminal HIT-SunTag and HIT2 systems to 4-OHT and β-estradiol respectively. Transcriptional activation was examined in the luciferase reporter assay (Figure 4A and B) and the endogenous CD43 activation assay (Figure 4C and D, Supplementary Figure S14). Selective response to 4-OHT was observed across all these assays for all HIT systems. Dose dependent activities were achieved, demonstrating a titratable response to varying drug concentration. As expected, selective response to 4-OHT over β-estradiol for transcriptional activation was achieved when using the intein dCas9 construct harboring the G521R mutation (Supplementary Figure S15).
Nonetheless, our HIT systems displayed higher sensitivity and selectivity than the intein (G521R) construct. A much lower concentration of 4-OHT was required to reach maximum level of activation activity for our HIT constructs than that for the intein (G521R) construct. Fold difference between 4-OHT and β-estradiol was also higher for HIT systems across the dose range of selective response. Consistent with previous results, the maximum activities of HIT systems were also higher than the intein (G521R) construct. In fact, in contrast to HIT systems, selective activity of the intein (G521R) activation construct in response to 4-OHT could not be detected in a less sensitive CD43 activation assay. Examination of editing activity from HIT2 also demonstrated better sensitivity and selectivity than the intein (G521R) system (Figure 4E and Supplementary Figures S16 and S17). Figure 4. Selective and titratable drug induction of HIT systems. (A and B) Dose dependent transcription activation induced by HIT-SunTag (A) and HIT2 (B) with different concentration of β-estradiol or 4-OHT was examined using the luciferase reporter assay. (C and D) Activation of endogenous gene CD43 was examined using flow cytometry. Quantitative analyses of the percentage of CD43 positive cells (C) and median CD43 fluorescent intensities (D) were shown. (E) Dose dependent genome editing activities of HIT2 were examined using the FCR assay upon treatment with different concentration of β-estradiol or 4-OHT. Data showed mean ± SD. n = 3 biological replicates. ns: non-significant; *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001; two-tailed t-tests. Fold of activation by 4-OHT over the same concentration of β-estradiol was displayed.
Speed of response and reversibility are essential to achieve precise and dynamic gene regulation using drug inducible transcription modulation systems. Using either HIT-SunTag or HIT2 system, we observed a significant activation as short as 1 h and increase of signal over time in the luciferase assay, an indication of a very rapid response that is also titratable by altering the exposure time (Figure 5A and B). We also examined whether the drug induced effect could be reversed upon drug withdrawal using the HIT-SunTag system. We toggled 4-OHT treatment of a cell line in which HIT-SunTag constructs, the luciferase reporter, and the sgRNA construct targeting its promoter were stably integrated into the genome through lentiviral delivery. A decrease in luciferase signal was observed upon 4-OHT withdrawal and such signal can be reactivated upon re-supply of the drug (Figure 5C). Distinctions in responses could be clearly observed in comparison with continuous treatment, continuous withdrawal, and no treatment groups. Treatments with a different concentration of 4-OHT led to consistent results (Supplementary Figure S18).
These results indicate that the response of our HIT systems to drug induction is rapid and reversible, which are crucial properties for sensitive and precise control of functional modulation.
Figure 5. Rapid and reversible drug induction of HIT systems. (A and B) Luciferase signal was examined upon drug treatment for distinct lengths of time using (A) the HIT-SunTag system and (B) the HIT2 system. (C) Reversibility of HIT-SunTag was examined in a stable cell line expressing the luciferase reporter and HIT-SunTag constructs. Cells were either continuously treated with 250 nM 4-OHT for 9 days (ON9), cultured without 4-OHT for 9 days (OFF9), treated for 3 days and then cultured without 4-OHT for 6 days (ON3OFF6), or treated for 3 days, cultured without 4-OHT for 3 days and then treated with 4-OHT again for 3 days (ON3 OFF3 ON3). Data are shown as mean ± SD; n = 3 biological replicates. ns: non-significant; *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001; two-tailed t-tests. The signals activated by HIT systems were compared against the negative controls (NCs).
Adaptation of HIT designs to orthogonal Staphylococcus aureus Cas9
Orthogonal Cas9 species other than the prevalent Streptococcus pyogenes Cas9 (SpCas9), on which the current study has been based so far, require distinct PAM sequences, thus expanding the coverage of target loci in the genome (5). They can be used in conjunction to deliver functional perturbations in multiple modes simultaneously. Among them, Staphylococcus aureus Cas9 (SaCas9) has attracted much interest because of the high incidence of its PAM sequence (NNGRR), its high activity in mammalian cells and its smaller size compared with SpCas9, which is critical for fitting delivery vehicles for gene therapy that have a restrictive cargo size, e.g. adeno-associated virus (AAV) (30). Therefore, we next explored the feasibility of adapting our optimized HIT designs to SaCas9. First, we generated an NLS–dSaCas9–GCN4 construct for drug-inducible transcriptional activation and used it in conjunction with scFv-2ERT2-AD constructs (Figure 6A). The combination of V and PH constructs produced a more pronounced activation than V or PH alone. The drug-inducible activation was sgRNA dependent (Figure 6A) and ERT2 dependent (Supplementary Figure S19). In an assay to activate endogenous CD43 expression, robust drug-inducible activation was achieved with minimal background activity (Figure 6B). Combined use of such a dSaCas9 HIT-SunTag device with a split dSpCas9-VPH device under the control of a different drug, rapamycin, selectively activated both luciferase reporters and endogenous genes under the respective drug inductions (23) (Supplementary Figure S20). Notably, the split architecture displayed a much lower efficiency upon drug induction, even when the AD was changed from V to the more potent VPH combination, and its activity was difficult to detect in the less sensitive endogenous gene activation assay.
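To make the point about PAM requirements and target coverage concrete, the following is a minimal sketch, not the authors' method, of how candidate PAM positions could be counted in a sequence. The SaCas9 pattern NNGRR is taken from the text above; the SpCas9 pattern NGG, the helper names and the toy sequence are assumptions introduced only for illustration, and only the given strand is scanned.

```python
import re

# IUPAC codes needed for the PAM patterns used here (N = any base, R = A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]", "R": "[AG]"}


def pam_regex(pam):
    """Translate an IUPAC PAM string (e.g. 'NNGRR') into a compiled regex.

    The pattern is wrapped in a lookahead so that overlapping sites are reported.
    """
    body = "".join(IUPAC[base] for base in pam.upper())
    return re.compile("(?=(" + body + "))")


def find_pam_sites(seq, pam):
    """Return 0-based start positions of every PAM match on the given strand."""
    return [m.start() for m in pam_regex(pam).finditer(seq.upper())]


# Hypothetical toy sequence, not taken from the study.
TOY_SEQ = "ATGGAGTCCGGAATTGAGGCT"

if __name__ == "__main__":
    print("NNGRR (SaCas9, from the text):", find_pam_sites(TOY_SEQ, "NNGRR"))
    print("NGG (SpCas9, assumed here):", find_pam_sites(TOY_SEQ, "NGG"))
```

Because the two patterns match different positions, nucleases with distinct PAMs can in principle reach different sets of target loci, which is the coverage argument made in the paragraph above. A fuller scan would also cover the reverse-complement strand.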
Next, to deliver both editing and activation using a HIT2 device, we first explored whether sgRNA length variation plays a similar role in the control of binding and editing as it does for SpCas9. We cotransfected wild-type SaCas9 with sgRNAs of varying lengths and examined them in both the pSSA assay for editing activity (Supplementary Figure S21A) and the luciferase activation assay (Supplementary Figure S21B). Shortening the sgRNA target-complementary sequence by 1 nt almost completely abolished its editing activity, while a target-complementary sequence of 15 nt to 18 nt was most efficient in activation. Accordingly, we constructed SaCas9-2NES-2ERT2-GCN4 and co-transfected it with a full-length sgRNA for BFP editing and an sgRNA with a length of either 15 nt or 18 nt to activate CD43. As a result, simultaneous editing and activation in a drug-inducible manner was delivered using HIT2-SaCas9 constructs with shortened sgRNAs of either length (Figure 6C). Taken together, using SaCas9 as an example, we demonstrated that our HIT designs may be adapted to orthogonal species, thus further expanding their uses.
Figure 6. Adaptation of HIT designs to SaCas9. (A and B) Drug-inducible gene activation by NLS-dSaCas9–GCN4 in conjunction with scFv-2ERT2-AD was examined in the luciferase reporter assay (A) and the endogenous CD43 activation assay (B). (C) Simultaneous genome editing and transcriptional activation by HIT2-SaCas9 were examined by FCR activity and CD43 activation, respectively. Cells transfected with the same amount of reporter construct while keeping the total amount of transfection constant were used as negative controls (NC). ISO represents cells stained with antibody isotype control. Data are shown as mean ± SD; n = 3 biological replicates. ns: non-significant; *P < 0.05; **P < 0.01; ***P < 0.001; two-tailed t-tests.
DISCUSSION
In summary, upon vigorous optimization of a number of designs, we have established multiple advantageous drug-inducible devices for transcription activation by coupling ERT2 with various CRISPR/Cas9 systems, and further empowered the most effective system with extra editing functionality. Differences between distinct systems can be difficult to uncover because of confounding variables, such as different batches of cells and different transfection efficiencies. To eliminate these complications as much as we could, we always examined different designs side by side in the same experiment and conducted multiple assays to make definitive comparisons. We also conducted these experiments in a well-controlled manner, for example by matching the molar amounts of constructs that contain dCas9 or Cas9, sgRNAs and ADs, respectively, across different designs. Tight and efficient drug induction of transcription activation was achieved by combined delivery of NLS-dCas9–GCN4, scFv-2E-VP64 and scFv-2E-PH, a HIT-SunTag system with improved potency (Figure 1D–I; Supplementary Figure S3). Further, by utilizing the differential activities of DNA cutting and binding with sgRNAs of varying lengths, simultaneous editing and activation was accomplished with one device, which we named HIT2, that consists of C2N2E-GCN4 and scFv-2E-VPH (Figure 2).
Cross-comparisons with other designs demonstrated better performance of the HIT systems (Figures 3 and 4; Supplementary Figures S13–S17). We did not focus on the split design because of its interference with mTOR, a central pathway in many biological processes (44). Nonetheless, a recent report deployed the ERT2 domain on top of this design to reduce its background activity, consistent with our observations of ERT2 as a tight regulator of drug induction (26). The current study focused on transcription activation and further harnessed the power of our optimized system by enabling simultaneous editing in a drug-inducible manner. Two recent reports, which described drug-inducible editing devices made by fusion of four ERT2 domains to Cas9 and by insertion of an ERT2 domain into Cas9, lend more support to the utility of ERT2 as a tight and effective tool for drug-inducible control (25,27). Last but not least, successful adaptation of HIT designs to SaCas9 in the current study suggests further expansion of their applications (Figure 6). Better understanding of ERT2-based drug induction and of the various designs of Cas9 systems from this study would also benefit future development of similar molecular devices. First, tandem fusion of 2-ERT2 may confer tighter control of drug induction (Supplementary Figures S1–S3), and NES may be employed to reduce the background level of ERT2 constructs, possibly owing to reduced nuclear retention of hybrid proteins (Supplementary Figures S8–S10). Second, HIT constructs are responsive to the endogenous ER ligand β-estradiol, albeit at a much lower level than to the synthetic 4-OHT (Figure 4 and Supplementary Figures S14 and S16). This suggests opportunities for further improvement of selectivity, which might be accomplished by development of new ER variants, as the current one was developed two decades ago (18).
Moreover, in the HIT-SAM and HIT-SunTag systems, tandem ERT2 fusions to scFv or MCP hybrid proteins were sufficient for tight control, while maximizing dCas9’s access to its genomic target DNA enhanced its efficiency (Supplementary Figures S2C and S3D). In addition, the VPH combination showed the highest efficiency across multiple scenarios (Supplementary Figures S1D, S2B and S3B), and splitting VPH into V+PH exhibited higher synergy in the HIT-SunTag system (Supplementary Figure S3C and S3D). The HIT designs can potentially be applied to develop next-generation CRISPR/Cas9 systems for additional functional perturbations such as transcriptional repression and epigenetic modulation (4,5). Further, the current study used SaCas9 to demonstrate the applicability of our designs to orthogonal species. Future development of additional systems using engineered SpCas9 with distinct PAM requirements (45), or programmable nucleases from other species (5), will further expand the repertoire of genomic loci and the complexity of functional perturbations. In doing so, the HIT-SunTag architecture might be more straightforward, as only protein engineering of ‘dead’ programmable nucleases is needed, whereas shortening the gRNA to separate binding and editing capabilities might impose extra challenges for HIT2 adaptation. Fortunately, preliminary tests paved the way for such adaptation to SaCas9 in the current study (Supplementary Figure S21); similar tests will need to be conducted when working on new species in the future.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
We thank Dr Andrew P. McMahon (University of Southern California, CA, USA), Dr Christopher T. Walsh (Harvard Medical School, MA, USA) and Dr Qi Zhou (Institute of Zoology, Chinese Academy of Sciences, Beijing, China) for discussion and reading of this manuscript. We are grateful to all the members of the Wang lab for helpful discussions and technical assistance.
Authors Contributions: Y.W.
conceived and supervised the study. J.L., C.Z., Y.Z., J.Z., Y.Z., L.C., Q.H., Y.Y., S.P., R.A. and Y.W. designed experiments. J.L., C.Z., Y.Z., J.Z., Y.Z., L.C., Q.H., Y.Y., S.P. and R.A. performed experiments. J.L., C.Z., Y.Z., J.Z., Y.Z., L.C., Q.H., Y.Y., S.P., R.A. and Y.W. analyzed data. J.L., C.Z., Y.Z., J.Z., Y.Z., L.C. and Y.W. wrote the manuscript.
FUNDING
National Basic Research Program of China [2015CB964800, 2014CB964900]; National Natural Science Foundation of China [31571514, 21402195, 31401270]; Hundred Talents Program of Chinese Academy of Sciences; State Key Laboratory of Stem Cell and Reproductive Biology. Funding for open access charge: State Key Laboratory of Stem Cell and Reproductive Biology.
Conflict of interest statement. Patents covering the novel designs in this work have been filed.
REFERENCES
1. Cong L., Ran F.A., Cox D., Lin S., Barretto R., Habib N., Hsu P.D., Wu X., Jiang W., Marraffini L.A. et al. Multiplex genome engineering using CRISPR/Cas systems. Science. 2013; 339:819–823.
2. Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J.A., Charpentier E. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science. 2012; 337:816–821.
3. Mali P., Yang L., Esvelt K.M., Aach J., Guell M., Dicarlo J.E., Norville J.E., Church G.M. RNA-guided human genome engineering via Cas9. Science. 2013; 339:823–826.
4. Dominguez A.A., Lim W.A., Qi L.S. Beyond editing: repurposing CRISPR-Cas9 for precision genome regulation and interrogation. Nat. Rev. Mol. Cell Biol. 2016; 17:5–15.
5. Mali P., Esvelt K.M., Church G.M. Cas9 as a versatile tool for engineering biology. Nat. Methods. 2013; 10:957–963.
6. Konermann S., Brigham M.D., Trevino A.E., Joung J., Abudayyeh O.O., Barcena C., Hsu P.D., Habib N., Gootenberg J.S., Nishimasu H. et al. Genome-scale transcriptional activation by an engineered CRISPR-Cas9 complex. Nature. 2015; 517:583–588.
7. Chavez A., Scheiman J., Vora S., Pruitt B.W., Tuttle M., E P.R.I., Lin S., Kiani S., Guzman C.D., Wiegand D.J. et al. Highly efficient Cas9-mediated transcriptional programming. Nat. Methods. 2015; 12:326–328.
8. Perez-Pinera P., Kocak D.D., Vockley C.M., Adler A.F., Kabadi A.M., Polstein L.R., Thakore P.I., Glass K.A., Ousterout D.G., Leong K.W. et al. RNA-guided gene activation by CRISPR-Cas9-based transcription factors. Nat. Methods. 2013; 10:973–976.
9. Chakraborty S., Ji H., Kabadi A.M., Gersbach C.A., Christoforou N., Leong K.W. A CRISPR/Cas9-based system for reprogramming cell lineage specification. Stem Cell Rep. 2014; 3:940–947.
10. Gilbert L.A., Horlbeck M.A., Adamson B., Villalta J.E., Chen Y., Whitehead E.H., Guimaraes C., Panning B., Ploegh H.L., Bassik M.C. et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell. 2014; 159:647–661.
11. Tanenbaum M.E., Gilbert L.A., Qi L.S., Weissman J.S., Vale R.D. A protein-tagging system for signal amplification in gene expression and fluorescence imaging. Cell. 2014; 159:635–646.
12. Chavez A., Tuttle M., Pruitt B.W., Ewen-Campen B., Chari R., Ter-Ovanesyan D., Haque S.J., Cecchi R.J., Kowal E.J., Buchthal J. et al. Comparison of Cas9 activators in multiple species. Nat. Methods. 2016; 13:563–567.
13. Cheng A.W., Jillette N., Lee P., Plaskon D., Fujiwara Y., Wang W., Taghbalout A., Wang H. Casilio: a versatile CRISPR-Cas9-Pumilio hybrid for gene regulation and genomic labeling. Cell Res. 2016; 26:254–257.
14. Cheng A.W., Wang H., Yang H., Shi L., Katz Y., Theunissen T.W., Rangarajan S., Shivalila C.S., Dadon D.B., Jaenisch R. Multiplexed activation of endogenous genes by CRISPR-on, an RNA-guided transcriptional activator system. Cell Res. 2013; 23:1163–1171.
15. Gilbert L.A., Larson M.H., Morsut L., Liu Z., Brar G.A., Torres S.E., Stern-Ginossar N., Brandman O., Whitehead E.H., Doudna J.A. et al. CRISPR-mediated modular RNA-guided regulation of transcription in eukaryotes. Cell. 2013; 154:442–451.
16. Maeder M.L., Linder S.J., Cascio V.M., Fu Y., Ho Q.H., Joung J.K. CRISPR RNA-guided activation of endogenous human genes. Nat. Methods. 2013; 10:977–979.
17. Metzger D., Clifford J., Chiba H., Chambon P. Conditional site-specific recombination in mammalian cells using a ligand-dependent chimeric Cre recombinase. Proc. Natl. Acad. Sci. U.S.A. 1995; 92:6991–6995.
18. Feil R., Wagner J., Metzger D., Chambon P. Regulation of Cre recombinase activity by mutated estrogen receptor ligand-binding domains. Biochem. Biophys. Res. Commun. 1997; 237:752–757.
19. Indra A.K., Warot X., Brocard J., Bornert J.M., Xiao J.H., Chambon P., Metzger D. Temporally-controlled site-specific mutagenesis in the basal layer of the epidermis: comparison of the recombinase activity of the tamoxifen-inducible Cre-ER(T) and Cre-ER(T2) recombinases. Nucleic Acids Res. 1999; 27:4324–4327.
20. Branda C.S., Dymecki S.M. Talking about a revolution: The impact of site-specific recombinases on genetic analyses in mice. Dev. Cell. 2004; 6:7–28.
21. Gonzalez F., Zhu Z., Shi Z.D., Lelli K., Verma N., Li Q.V., Huangfu D. An iCRISPR platform for rapid, multiplexable, and inducible genome editing in human pluripotent stem cells. Cell Stem Cell. 2014; 15:215–226.
22. Dow L.E., Fisher J., O’Rourke K.P., Muley A., Kastenhuber E.R., Livshits G., Tschaharganeh D.F., Socci N.D., Lowe S.W. Inducible in vivo genome editing with CRISPR-Cas9. Nat. Biotechnol. 2015; 33:390–394.
23. Zetsche B., Volz S.E., Zhang F. A split-Cas9 architecture for inducible genome editing and transcription modulation. Nat. Biotechnol. 2015; 33:139–142.
24. Davis K.M., Pattanayak V., Thompson D.B., Zuris J.A., Liu D.R. Small molecule-triggered Cas9 protein with improved genome-editing specificity. Nat. Chem. Biol. 2015; 11:316–318.
25. Liu K.I., Ramli M.N., Woo C.W., Wang Y., Zhao T., Zhang X., Yim G.R., Chong B.Y., Gowher A., Chua M.Z. et al. A chemical-inducible CRISPR-Cas9 system for rapid control of genome editing. Nat. Chem. Biol. 2016; 12:980–987.
26. Nguyen D.P., Miyaoka Y., Gilbert L.A., Mayerl S.J., Lee B.H., Weissman J.S., Conklin B.R., Wells J.A. Ligand-binding domains of nuclear receptors facilitate tight control of split CRISPR activity. Nat. Commun. 2016; 7:12009.
27. Oakes B.L., Nadler D.C., Flamholz A., Fellmann C., Staahl B.T., Doudna J.A., Savage D.F. Profiling of engineering hotspots identifies an allosteric CRISPR-Cas9 switch. Nat. Biotechnol. 2016; 34:646–651.
28. Zhang F., Cong L., Lodato S., Kosuri S., Church G.M., Arlotta P. Efficient construction of sequence-specific TAL effectors for modulating mammalian transcription. Nat. Biotechnol. 2011; 29:149–153.
29. Ding Y., Ai H.W., Hoi H., Campbell R.E. Forster resonance energy transfer-based biosensors for multiparameter ratiometric imaging of Ca2+ dynamics and caspase-3 activity in single cells. Anal. Chem. 2011; 83:9687–9693.
30. Ran F.A., Cong L., Yan W.X., Scott D.A., Gootenberg J.S., Kriz A.J., Zetsche B., Shalem O., Wu X., Makarova K.S. et al. In vivo genome editing using Staphylococcus aureus Cas9. Nature. 2015; 520:186–191.
31. Doench J.G., Fusi N., Sullender M., Hegde M., Vaimberg E.W., Donovan K.F., Smith I., Tothova Z., Wilen C., Orchard R. et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 2016; 34:184–191.
32. Doench J.G., Hartenian E., Graham D.B., Tothova Z., Hegde M., Smith I., Sullender M., Ebert B.L., Xavier R.J., Root D.E. Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol. 2014; 32:1262–1267.
33. Hsu P.D., Scott D.A., Weinstein J.A., Ran F.A., Konermann S., Agarwala V., Li Y., Fine E.J., Wu X., Shalem O. et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 2013; 31:827–832.
34. Chen B., Gilbert L.A., Cimini B.A., Schnitzbauer J., Zhang W., Li G.W., Park J., Blackburn E.H., Weissman J.S., Qi L.S. et al. Dynamic imaging of genomic loci in living human cells by an optimized CRISPR/Cas system. Cell. 2013; 155:1479–1491.
35. Chen B., Hu J., Almeida R., Liu H., Balakrishnan S., Covill-Cooke C., Lim W.A., Huang B. Expanding the CRISPR imaging toolset with Staphylococcus aureus Cas9 for simultaneous imaging of multiple genomic loci. Nucleic Acids Res. 2016; 44:e75.
36. Bhakta M.S., Segal D.J. The generation of zinc finger proteins by modular assembly. Methods Mol. Biol. 2010; 649:3–30.
37. Certo M.T., Ryu B.Y., Annis J.E., Garibov M., Jarjour J., Rawlings D.J., Scharenberg A.M. Tracking genome engineering outcome at individual DNA breakpoints. Nat. Methods. 2011; 8:671–676.
38. Richardson C.D., Ray G.J., DeWitt M.A., Curie G.L., Corn J.E. Enhancing homology-directed genome editing by catalytically active and inactive CRISPR-Cas9 using asymmetric donor DNA. Nat. Biotechnol. 2016; 34:339–344.
39. Shechner D.M., Hacisuleyman E., Younger S.T., Rinn J.L. Multiplexable, locus-specific targeting of long RNAs with CRISPR-Display. Nat. Methods. 2015; 12:664–670.
40. Kiani S., Chavez A., Tuttle M., Hall R.N., Chari R., Ter-Ovanesyan D., Qian J., Pruitt B.W., Beal J., Vora S. et al. Cas9 gRNA engineering for genome editing, activation and repression. Nat. Methods. 2015; 12:1051–1054.
41. Dahlman J.E., Abudayyeh O.O., Joung J., Gootenberg J.S., Zhang F., Konermann S. Orthogonal gene knockout and activation with a catalytically active Cas9 nuclease. Nat. Biotechnol. 2015; 33:1159–1161.
42. Ortiz O., Wurst W., Kuhn R. Reversible and tissue-specific activation of MAP kinase signaling by tamoxifen in Braf(V637)ER(T2) mice. Genesis. 2013; 51:448–455.
43. Knight S.C., Xie L., Deng W., Guglielmi B., Witkowsky L.B., Bosanac L., Zhang E.T., El Beheiry M., Masson J.B., Dahan M. et al. Dynamics of CRISPR-Cas9 genome interrogation in living cells. Science. 2015; 350:823–826.
44. Shimobayashi M., Hall M.N. Making new contacts: the mTOR network in metabolism and signalling crosstalk. Nat. Rev. Mol. Cell Biol. 2014; 15:155–162.
45. Kleinstiver B.P., Prew M.S., Tsai S.Q., Topkar V.V., Nguyen N.T., Zheng Z., Gonzales A.P., Li Z., Peterson R.T., Yeh J.J. et al. Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature. 2015; 523:481–485.
Author notes
These authors contributed equally to this work as first authors. Present address: Yingze Zhao, National Institute for Viral Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing 102206, China.
© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
The mechanisms of a mammalian splicing enhancer
Jobbins, Andrew M; Reichenbach, Linus F; Lucas, Christian M; Hudson, Andrew J; Burley, Glenn A; Eperon, Ian C
doi: 10.1093/nar/gky056; pmid: 29394380
Abstract
Exonic splicing enhancer (ESE) sequences are bound by serine and arginine-rich (SR) proteins, which in turn enhance the recruitment of splicing factors. It was inferred from measurements of splicing around twenty years ago that Drosophila doublesex ESEs are bound stably by SR proteins, and that the bound proteins interact directly but with low probability with their targets. However, it has not been possible with conventional methods to demonstrate whether mammalian ESEs behave likewise. Using single molecule multi-colour colocalization methods to study SRSF1-dependent ESEs, we have found that the proportion of RNA molecules bound by SRSF1 increases with the number of ESE repeats, but only a single molecule of SRSF1 is bound. We conclude that initial interactions between SRSF1 and an ESE are weak and transient, and that these limit the activity of a mammalian ESE. We tested whether the activation step involves the propagation of proteins along the RNA or direct interactions with 3′ splice site components by inserting hexaethylene glycol or abasic RNA between the ESE and the target 3′ splice site. These insertions did not block activation, and we conclude that the activation step involves direct interactions. These results support a model in which regulatory proteins bind transiently and in dynamic competition, with the result that each ESE in an exon contributes independently to the probability that an activator protein is bound and in close proximity to a splice site.
INTRODUCTION
The high point in studies on the mechanisms of action of exonic splicing enhancer sequences was reached in 1998. It had been shown previously that sequences in alternative exons were required for their inclusion (1–5), and in mammals these sequences had been mapped to relatively short, purine-rich motifs (6–8). These motifs were found to be bound by SR proteins (8–10).
SR proteins have one or two RNA recognition motif (RRM)-type RNA-binding domains and a C-terminal RS domain, rich in arginine-serine dipeptides, that is subject to extensive phosphorylation (11–13). The binding of these proteins to ESEs was shown to stimulate splicing, splicing complex assembly and binding of U2 snRNPs and U2-associated proteins U2AF65 and U2AF35 (8,14–16). Moreover, SR proteins were shown also to interact directly with U2AF35 and this interaction was shown to mediate the effects of ESEs on the splicing of an upstream intron (17,18). These observations led to two models for the mechanisms of action of ESEs. ESEs that are within 100 nts or so of the 3’SS were proposed to work by binding of an SR protein to the ESE and propagation of the complex by cooperative interactions with further SR proteins until interactions with U2AF were possible; ESEs that are regulated, such as the repeated elements in Drosophila dsx exon 4 (dsxREs) that depend on the proteins Tra and Tra2, might form stable multi-protein complexes and interact directly over greater distances by 3D-diffusion (looping) (18). These models resolve the mechanisms of action of ESEs into two steps, either of which might be limiting or regulated. The first is binding to the ESE (Figure 1A–D); the second is the step in which the ESE-bound SR protein activates its target splice site (Figure 1E and F). The models were tested indirectly in two important papers in 1998 by following the rates of splicing dependent on the dsxREs (19,20). The proportion of dsx pre-mRNA that had spliced in vitro in a given time was shown to increase with the addition of Tra/Tra2, but the level of splicing asymptotically approached a maximum with increasing concentrations of protein. 
Assuming that the maximum level of splicing was a direct function of dsxRE occupancy and not, for example, the result of limiting solubility or aggregation at higher protein concentrations, it was inferred that the dsxREs had become saturated (20). Since the concentration of Tra/Tra2 required for half-maximal splicing did not depend on the number of dsxREs, it was concluded that each dsxRE was independent. The level of splicing reached under saturating conditions was linearly dependent on the number of dsxREs present (1–6, distributed along the exon 300–600 nts from the 3’SS), from which it was inferred that each dsxRE element was occupied and that the more that were occupied, the higher the chance of activation of splicing, as if the bound SR proteins each had a low probability of contacting a common target (20) (Figure 1A). This interpretation was substantiated by the use of tethered RS domains placed at various distances downstream of the 3’SS. The rates of splicing depended on the length of the RS domain, consistent with the existence of a common target, and declined with distance from the 3’SS as expected if the target were being found by looping (19). These results supported the models shown in Figure 1A and E.
Figure 1. Elementary processes representing possible SR protein binding (A–D) and activation (E and F) steps by which ESEs stimulate 3′ splice site usage. (A) SR proteins bind stably, with multiple occupancy of tandemly-repeated ESEs, and the probability of subsequent interactions by mechanisms E or F is limiting. Solid blue line, pre-mRNA; red oval, SRSF1; other ovals, components at the 3’SS (U2AF, U2 snRNP, etc.); dashed lines, low probability interactions. (B) SR proteins bind stably, with multiple occupancy of tandemly-repeated ESEs, and are able to interact with multiple targets, each of which has an approximately equal effect on the rate. (C) SR protein binding is cooperative. (D) The probability of binding by an SR protein is very low and each site contributes independently. (E) Activation involves direct contact by three-dimensional diffusion between an ESE-bound SR protein and a 3’SS component; the intervening RNA is looped out. (F) Activation involves processes that maintain contact with the RNA between the ESE and the 3’SS, such as propagation of SR protein complexes or scanning along the RNA (for example, in conjunction with a helicase).
Since then, the variety of sequences known to act in some contexts as ESEs or silencers has expanded considerably, following computational analyses (21,22), systematic screens of large sequence libraries (23,24) and the analysis of mutations (25–29).
As a result, there is a much better understanding of the sequence preferences of the SR proteins, such as the prototypical SR protein, SRSF1 (30–34). Mapping of the transcriptome-wide binding sites by cross-linking in vivo confirms these preferences (34–37), and some analyses show binding sites enriched near splice sites. Despite this plethora of information, the limitations of all the widely used methods for studying splicing mechanisms have meant that there has been little progress in understanding the interactions of ESEs with trans-acting factors and the subsequent mechanisms of activation. The most informative experiments have come from the use of RS domains that were stably tethered to a 3′ ESE. These experiments showed that the RS domains contacted the branchpoint of the preceding intron (38,39) and stabilized U2 base-pairing (40). While these results support the idea of direct interactions or looping (Figure 1E), the use of tethering meant that the contribution of binding to the rate-determining step could not be assessed, the numbers of SR proteins bound were not known so propagation (Figure 1F) could not be excluded, and the contacts made were not, as had been expected, with U2AF proteins. Thus, the models inferred from the rates of Drosophila Tra/Tra2-dependent splicing are still generally used to represent or interpret mammalian ESE activity. Doubts about the applicability of the dsxRE model to mammalian enhancers in general have arisen from analyses of complexes forming on ESEs. Using targeted oligonucleotide enhancers of splicing (TOES), we showed that splicing efficiency increased with the number of GGA repeat motifs in the enhancer domain of the oligonucleotide, consistent with the dsxRE results (41). 
However, extensive investigations using conventional methods showed that the enhancer sequence was able to form a number of different complexes, including a G-quadruplex, and that SR protein complexes under normal conditions were only minor components (42). This is consistent with other examples in which the proportion of cross-linking or affinity-purified proteins attributable to SR proteins in enhancer complexes has likewise been found to be very low (14,43), and suggested to us that SR proteins might interact only transiently with ESEs (42), as in the model in Figure 1D. The best way to identify whether the limiting step in activation by an ESE is the binding of the SR protein or its subsequent interactions is to use multiple ESEs and test for multiple occupancy. Concurrent occupancy would indicate functionally stable binding and that, as with the dsxRE, subsequent interactions are limiting (Figure 1A–C). If multiple ESEs increased the rate of splicing but this was linked only to increased levels of single occupancy, then the probability of binding would be limiting (Figure 1D). The use of indirect methods, such as those used for the dsxRE, is inappropriate because the splicing rates do not saturate and it is difficult to infer much from a negative result (20). Direct measurement of the numbers of ESEs occupied would be a much more definitive approach. However, such measurements cannot be made in functional splicing assays for ESEs embedded in a pre-mRNA using conventional methods such as cross-linking, including CLIP, or affinity purification, because of the high levels of apparently non-specific association of SR proteins with RNA mediated by their RS domains, because SR proteins are recruited in other ways and because the distribution of pre-mRNA molecules among complexes with different numbers of molecules of protein bound is unknowable. 
Single molecule multi-colour colocalization methods have been used extensively by this laboratory to determine directly the numbers of protein molecules bound to a molecule of pre-mRNA in mammalian nuclear extracts (44–47) (Figure 2) and by the laboratory of M. Moore to follow the dynamics of association and dissociation of splicing factors in yeast extracts (48,49), providing mechanistic evidence that would otherwise be unattainable. By combining single molecule methods with the use of non-RNA linkers, we describe a new paradigm for mammalian enhancers.
Figure 2. An outline of single molecule multicolour colocalization. (A) Representative views of a field of view showing the accumulated emission upon excitation at 640 nm (for RNA labelled with Cy5, red spots) and 488 nm (for mEGFP-SRSF1, green spots). Data were collected by emCCD from excitation at 640 nm for 50 frames of 50 ms and at 488 nm for 250 frames of 50 ms. Cy5-labelled RNA had been incubated in nuclear extract expressing mEGFP-SRSF1 under conditions allowing complex A formation before dilution and injection onto the slide surface for detection by total internal reflection fluorescence microscopy. Colocalized RNA and protein molecules are shown by circles. (B) Time courses of mEGFP fluorescence from representative spots. Irreversible drops in the emission intensity are caused by bleaching of a single molecule of mEGFP. These are direct screen images and the ordinate (intensity per 50 ms frame) is re-scaled for each spot. (C) Diagrams showing for each molecule the number of mEGFP-SRSF1 molecules associated with the RNA, deduced from the bleaching profile. (D) The spots are classified and each contributes one RNA molecule to the classes represented in the histogram.
MATERIALS AND METHODS
Sequences
The transcripts comprised 225 nts of rabbit β-globin exon 2 and 67 nts of intron 2, fused to human SMN2 exon 7 and 157 nts of the preceding intron as described previously (42,50). In the control transcript, transcription terminated at nucleotide 48 in SMN2 exon 7, excluding the 6 nts preceding the 5′ splice site. The ESE sequences appended to nt 48 were as follows: ESE-A, CAAGGCGGAGGAAG; ESE-B, CACACAGGACCACACAGGAC; ESE-C, AAAAAGAAAGAAAAAAAGAAAGAA; ESE-D, UCAGAGGAUCAGAGGA. Tandem repeats of ESE-A were used as described. In all the transcripts used for the assays with non-RNA linkers, the transcript lacked the Tra2β enhancer region (SMN2 exon 7 nts 17–28) and ended at nt 38.
The removal of nts 17–28 shortened the exon by 12 nts, so that the addition of 12 abasic nts at the 3′ end followed by the ESE put the ESE at the same distance relative to the 3′ splice site as when the ESE was appended to the 3′ end as described above. The transcripts used for splicing and cross-linking were prepared using T7 RNA polymerase with the cap analogue GpppG for initiation and [α-32P]GTP for elongation. The transcripts used for cross-linking were as follows, omitting the 5′ cap:
1 ESE, GCAAGGCGGAGGAAG;
2 ESEs, GCAAGGCGGAGGAAGCAAGGCGGAGGAAG;
3 ESEs, GCAAGGCGGAGGAAGCAAGGCGGAGGAAGCAAGGCGGAGGAAG;
4 ESEs, GCAAGGCGGAGGAAGCAAGGCGGAGGAAGCAAGGCGGAGGAAGCAAGGCGGAGGAAG.
These were purified by denaturing gel electrophoresis and quantified by scintillation counting.
Synthesis and ligation of non-RNA linkers
The sequences ligated to nt 38 of SMN2 exon 7 were as follows:
UUCCUUAAAUCAAGGCGGAGGAAGCAAGGCGGAGGAAG (exon 7 nts 39–48 + ESE-Ax2)
UUCCUUAAAU – 2HEG – CAAGGCGGAGGAAGCAAGGCGGAGGAAG
UUCCUUAAAU – 12 abasic pd – CAAGGCGGAGGAAGCAAGGCGGAGGAAG
UUCCUUAAAU – 12 abasic pr – CAAGGCGGAGGAAGCAAGGCGGAGGAAG
The synthesis of the linkers is described in detail elsewhere (Reichenbach et al., ms in preparation). In brief, TOM-protected RNA phosphoramidites, HEG spacer phosphoramidite and CPG supports loaded with standard nucleosides were purchased from LINK Technologies Ltd (Bellshill, UK) and Cambio Ltd. (Cambridge, UK). Abasic phosphoramidites were synthesized based on methods reported in the literature (51,52). RNA oligonucleotides were synthesized using standard solid phase oligonucleotide synthesis protocols on an ABI 394 synthesizer. The reaction time for each individual coupling was 10 min. The crude oligonucleotides were purified by gel electrophoresis before ligation to the transcripts with 20 nt DNA oligonucleotide splints and T4 RNA ligase 2 (53).
Labelling RNA at the 5′ end
Substrate RNAs for single molecule detection were transcribed in the presence of 10 mM guanosine-5′-O-monophosphorothioate (BioLog), and the 5′ end was labelled using 5′ Cy5-maleimide (54), as described (47).
Splicing and analysis of complexes
The HeLa cell nuclear extract containing mEGFP-SRSF1 and mCherry-U1A was prepared (55) and the relative levels of exogenous and endogenous proteins were characterized by western blotting and detection with fluorescent secondary antibodies (IRDye 680LT, LI-COR 926–68020; IRDye 800CW, LI-COR 926–32211). Splicing assays and native gel electrophoresis of splicing complexes were done as reported previously (45). Splicing levels were taken as the molecular ratio of mRNA/(mRNA + pre-mRNA), after allowing for the number of radioactive nucleotides in each species. For cross-linking reactions, the ESE transcripts were incubated in 10 μl in splicing conditions for 15 min, then irradiated with a short-wave SpotCure (UVP) for 2 min and incubated with 2 μg RNase A and 5 U RNase T1 before electrophoresis and transfer onto nitrocellulose.
Sample preparation
Cover slips from Menzel-Gläser (22 mm × 50 mm, #1) were soaked in 1 M KOH for four hours before being washed with water, sonicated, dried under a nitrogen stream and cleaned in an argon plasma (MiniFlecto-PC-MFC, Gala Instruments) five times for 5 min. Sample chambers were formed with double-sided tape and a second cover slip. Splicing reactions were prepared with 50% nuclear extract, 3.2 mM MgCl2, 50 mM monopotassium glutamate, 20 mM phosphocreatine, 1.5 mM ATP, 20 mM Hepes pH 7.5 and 3 units RNase OUT (Invitrogen). A 2′-O-methyl oligonucleotide complementary to U6 snRNA was added at 1 μM to block splicing at complex A (45,56). The reaction mixtures were pre-incubated for 15 min at 30°C before pre-mRNA was added at a final concentration of 62.5 nM and incubated for a further 15 min at 30°C.
Samples were diluted in 40 mM Hepes pH 7.5, 3.2 mM MgCl2, 50 mM monopotassium glutamate, 50 mM KCl, 0.1 mM EDTA and 0.5 mM DTT, loaded into the sample chamber and incubated for 5 min.
Data acquisition and analysis
A home-built objective-based total internal reflection fluorescence microscope was used for the single-molecule experiments. The central regions of both the laser beam and the CCD chip were used to reduce intensity variation. A stable focus was achieved by active feedback from the reflected beam. Nine fields were acquired in succession automatically, and from each experiment ∼50 fields in total were acquired. Each field was obtained by collection of 50 frames, each of 50 ms, with excitation at 633 nm, followed by the collection of 250 frames with excitation at 488 nm. The periods of irradiation resulted in the complete bleaching of Cy5 (pre-mRNA) and mEGFP in each case. Recording was continuous, minimizing the possibility of unrecorded bleaching of mEGFP during the switch between lasers. Each SRSF1 experiment was done on three separate occasions.
Data analysis
The data were analyzed using a MATLAB program. Spots produced by fluorescence were identified by detecting squares in which the maximum pixel intensity exceeded the mean by a threshold value, tested for the presence of only a single peak, and required to conform to a Gaussian distribution in each dimension. The use of maximum intensities during the recording period ensures that mEGFP molecules that were in a dark state at any point, even repeatedly, would be detected during the 12.5 s of irradiation at 488 nm because the dark state lifetime of mEGFP is only 1–2 s (57–59). Chromatic aberrations were corrected by a linear transformation, using parameters from a test molecule labelled with two fluorophores. Steps in the bleaching profile were identified using a recursive Bayesian strategy (Jobbins et al., submitted).
Each frequency histogram shows the result of measurements collated from around 150 fields (three experiments done on separate days, each comprising 50 fields). The colocalization values represent the mean from the three experiments, with the standard error of the mean. Apparent colocalization arising from random coincidence was calculated for each image and would involve 1–2% of the pre-mRNA molecules. This was considered to be insignificant and was not subtracted from the values measured. The frequency values are the total from the three experiments, with error bars showing the square root of the variance of the binomial probability that an RNA spot will be associated with the given number of protein bleaching steps. The distribution of frequencies corresponding to the presence of a single molecule of SRSF1 in a complex was taken to be the same as was seen under the same conditions but in the absence of pre-mRNA (Supplementary Figure S1). The level of dimerization of the protein was calculated from the proportion of complexes observed to contain two molecules of mEGFP-SRSF1 (12%). The distributions expected from two molecules in a complex (Supplementary Figure S1) were calculated from the proportion of SRSF1 in the extract that was labelled with mEGFP (58%) and the level of dimerization (45,47). The level of mis-folding of the mEGFP domain was assumed to be nil, both because it was N-terminal to the SRSF1 moiety and because the results in Figure 5 fit the prediction for binding by two molecules of SRSF1 in 100% of the complexes with no indication of the presence of larger complexes. 
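The binomial reasoning behind the expected step distributions and the error bars can be sketched in a few lines of Python. This is only an illustration and not the MATLAB program used for the analysis: the function names are hypothetical, each bound molecule is assumed to be fluorescent independently with probability equal to the labelled fraction (58%), and the intrinsic dimerization (12%) that the calculation for Supplementary Figure S1 also takes into account is deliberately ignored. The two-step to one-step ratio printed here is therefore not expected to match the value derived from Supplementary Figure S1.

```python
from math import comb, sqrt


def step_distribution(n_molecules: int, labelled_fraction: float) -> list[float]:
    """Probability of observing k bleaching steps (k = 0..n_molecules).

    Assumption: each bound molecule carries a fluorescent tag independently
    with probability labelled_fraction; dimerization is ignored.
    """
    p = labelled_fraction
    return [
        comb(n_molecules, k) * p**k * (1 - p) ** (n_molecules - k)
        for k in range(n_molecules + 1)
    ]


def binomial_se(count: int, total: int) -> float:
    """Square root of the binomial variance of an observed proportion."""
    f = count / total
    return sqrt(f * (1 - f) / total)


# Two bound molecules, 58% of the protein labelled (value from the text).
dist = step_distribution(2, 0.58)
print("P(0, 1, 2 steps):", [round(x, 4) for x in dist])
print("two-step / one-step ratio:", round(dist[2] / dist[1], 3))

# Hypothetical counts, used only to show the error-bar calculation.
print("SE for 50 of 200 spots:", round(binomial_se(50, 200), 4))
```

Complexes in which no molecule is labelled (zero steps) would not be scored as colocalized, so only the classes with one or more steps would appear in a histogram.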
RESULTS
The dependence of splicing on ESE sequences is linked to the level of ESE-dependent binding by SRSF1
To test whether there was a correlation between the efficacy of an ESE and binding of SRSF1, four different sequences were tested: ESE-A, the known SRSF1 binding site from Ron exon 12 (60,61); ESE-B, an SRSF1 binding motif identified by functional SELEX (62); ESE-C, the Tra2β binding site from SMN exon 7, which has been shown to promote splicing but should not bind SRSF1 (63,64); ESE-D, an SRSF1 motif established via RNA-seq (34). Since ESE-A was longer than the others (14 nt), the others were incorporated as tandem repeats with overall lengths of 20, 24 and 16 nt, respectively. Each ESE was attached to the 3′ end of SMN2 exon 7 in a β-globin/SMN2 chimeric pre-mRNA (47). In vitro splicing assays showed that all four ESEs increased the level of splicing significantly, but the ESE derived from Ron exon 12 was by far the most effective (Figure 3A).
Figure 3. Effects of different ESE sequences on splicing and the association of SRSF1. (A) Splicing activity in vitro of pre-mRNA with different 3’ESEs. The pre-mRNA is represented by a diagram showing the sequences from β-globin exon 2 (blue), SMN2 exon 7 (brown) and the ESE under test (green). The level of spliced mRNA (%, [mRNA]/[mRNA + pre-mRNA]) after incubation for 2 h was calculated for each of three reactions done in triplicate, and the mean and standard error of the mean are shown. The probabilities (P) that results with the ESE-containing reactions are from the same population as the results from the pre-mRNA lacking an extra ESE (BGSMN2) were calculated by a Student's t test. (B) Single molecule multicolour colocalization studies on the levels and stoichiometries of mEGFP-SRSF1 binding to molecules of labelled pre-mRNA in nuclear extracts. Histograms show the frequencies (%) of pre-mRNA (BGSMN2) molecules showing bleaching of colocalized labelled protein in n steps (Figure 2).
The number above each bar indicates the number of complexes in which complete bleaching was achieved in 1, 2, 3, etc., steps, and hence the number of complexes in which there were 1, 2, 3, etc., molecules of fluorescent fusion protein. > refers to complexes where more than 5 bleaching steps were measured; x represents complexes where the number could not be determined. The percentage value above each histogram is the percentage of labelled pre-mRNA molecules (RNA spots) that were associated with mEGFP-SRSF1 (Coloc. spots). Nuclear extracts contained ATP and an oligonucleotide complementary to U6 snRNA that blocks progression beyond complex A. (C) Frequency distributions as described above for pre-mRNA comprising BGSMN2 with a 3’ESE containing one motif of ESE-A from Ron exon 12 (60,61). The frequency for n = 1 is blue and that for n = 2 is green to assist in identifying these classes (see text). (D) As (C), but the pre-mRNA contained two 3′-terminal repeats of an SRSF1 site of action determined by functional SELEX (62). (E) As (C), but with two 3′ terminal repeats of the SMN2 exon 7 Tra2 β binding site (63,64). (F) As (C), but with two 3′ terminal repeats of an SRSF1 site of action inferred from RNA-seq (34).
Transcripts labelled with Cy5 were incubated in nuclear extract prepared from cells expressing functional mEGFP-SRSF1 (Jobbins et al., submitted), in the presence of a 2′-O-methyl oligonucleotide complementary to U6 snRNA that causes spliceosome assembly to stall at complex A (45,56). The reactions were then diluted and captured on cover slips.
The fluorescent RNA and proteins were imaged using TIRF at two or three wavelengths and fluorescence was recorded until the fluorophores were bleached. Co-localized single molecules of RNA with either protein were identified and the number of steps in which the mEGFP or mCherry signal bleached was recorded for each RNA molecule (Figure 2). The numbers of spots in which bleaching occurred in 1, 2, 3 etc. steps are shown as a percentage of the total number of RNA spots (Figure 3B–F). Each distribution is an accumulation of the spots from three separate experiments. In the absence of an ESE, there was a strong peak of complexes containing a single molecule of mEGFP-SRSF1, followed by a geometric distribution of higher order complexes (Figure 3B, BGSMN2; Pgeo(n = 2–5) = 0.15). We have shown elsewhere that U1 snRNPs recruit a single SRSF1 to a 5’SS in complex A but that failure to form complex A leads to a geometric distribution associated with non-productive complexes (Jobbins et al., submitted). We conclude that only a minority of BGSMN2 pre-mRNA had assembled into complex A, while the majority had not (see below). The addition of the ESEs reduced the levels of the non-productive complexes with 3 or 4 molecules of mEGFP-SRSF1 and produced an increase in the proportion with two molecules (Figure 3C–F). Supplementary Figure S1 shows the frequency distributions expected if all the complexes contained either one or two molecules of SRSF1, taking into account the proportions of labelled and unlabelled SRSF1 in the extract and the levels of dimerization observed in the absence of pre-mRNA. If all the colocalized complexes contained two molecules of SRSF1 then, as shown in Supplementary Figure S1, the number of molecules of RNA in complexes that bleached in two steps would be ∼90% of the number bleaching in one step. 
The ratio observed in the absence of an ESE, when many of the molecules are in non-productive complexes, is 37% (Figure 3B), but it increases to 58% with ESE-A (Figure 3C) and is around 40–43% with the other ESEs (Figure 3D–F). We conclude that the presence of an ESE reduces the proportion of non-productive complexes and increases the proportion of complexes containing two molecules of SRSF1. Significantly, the abundance of these complexes was highest with the most active ESE, ESE-A, consistent with the expectation that the efficiency of an ESE is linked to its ability to bind SRSF1. We have shown elsewhere that inclusion of an oligonucleotide complementary to the 5′ end of U1 snRNA eliminates the second peak with ESE-A, demonstrating that recruitment by U1 snRNP and the ESE are independent (Jobbins et al., submitted).
Multiple ESEs produce additive effects on splicing but do not increase the number of SRSF1 molecules bound
Introducing tandem repeats of ESE-A produced an increase in the level of splicing that was proportional to the number of ESEs (Figure 4A, R² = 0.98; Supplementary Figure S2A), very like that seen with dsx (20) or TOES oligonucleotides (42). There was a corresponding increase in the rates and levels of formation of splicing complexes at all stages (Figure 4B–E; Supplementary Figure S2B).
Figure 4. Effects of multiple copies of the Ron ESE-A sequence on splicing and complex assembly. (A) The level of spliced mRNA after incubation for two hours with BGSMN2 pre-mRNA terminating in 0, 1, 2, 3 or 4 copies of ESE-A. (B–E) Relative intensities of complexes containing labelled RNA after analysis by native gel electrophoresis of splicing reactions containing the indicated pre-mRNA that had been incubated for the times shown.
The observation of linearity is a fundamental starting point in assessing possible pathways for ESE action. It is clearly consistent with the dsx mechanism (Figure 1A), if each additional protein bound were to add to a very small probability of an interaction, but it might also be explained by other mechanisms, such as those in Figure 1B, involving multiple targets if each target contacted has an approximately equal effect on the rate; Figure 1C, in which binding is cooperative but each additional site occupied contributes progressively less to the reduction in ΔG; or Figure 1D, in which the probability of binding by a protein is very low but each additional site contributes independently. The extent of multiple occupancy was measured for each pre-mRNA as above. The results (Figure 5) showed that the increase in the number of ESEs was accompanied by a small increase in the levels of colocalization, i.e. the proportion of pre-mRNA molecules bound. There was a roughly linear relationship between the level of co-localization and the levels of both splicing and complex A formation (Figure 6; A, R² = 0.92 and Pcorrelation < 0.001; B, R² = 0.89 and Pno correlation < 0.05), but in both cases the intercept on the ordinate is positive due to the binding of mEGFP-SRSF1 via U1 snRNPs and background binding.
Figure 5. Single molecule multicolour colocalization studies on the levels and stoichiometries of mEGFP-SRSF1 binding in nuclear extracts to molecules of BGSMN2 pre-mRNA terminating in 0, 1, 2, 3 or 4 copies of ESE-A.
As in Figure 3, histograms show the frequencies (%) of pre-mRNA molecules showing bleaching of colocalized labelled protein in n steps, and the overall proportion of pre-mRNA showing colocalization is shown at the top. (A–E), pre-mRNA terminating in 0, 1, 2, 3 or 4 copies of ESE-A. Panels A and B are reproduced from Figure 3B and C.
Figure 6. The total proportion of pre-mRNA molecules colocalized with molecules of mEGFP-SRSF1 (from all the experiments shown in Figures 3 and 5) compared with the efficiencies of splicing (A; data from Figures 3A and 4A) or complex A formation (B; data from Figure 4C, 15 min time point). Points denoted by red diamonds were obtained with BGSMN2 with 0, 1, 2, 3 or 4 copies of ESE-A, as marked; the results with ESE-B, ESE-C and ESE-D are shown as green, orange and blue circles, respectively.
The most striking feature of Figure 5 is that the increase in the number of ESEs produced no increase at all in binding of three, four or five molecules of SRSF1, although there was an increase in the proportion of pre-mRNA molecules associated with two molecules of mEGFP-SRSF1. Specifically, the number of complexes in which mEGFP bleached in two steps compared to those in which it bleached in one step rose from 58% to 71%, 75% and 81% as the number of ESEs increased from one to four. Since the increased number of ESEs produced a clear functional effect on splicing, it is unlikely that the apparent restriction of stable binding to one molecule of SRSF1 results from steric occlusion of the extra copies of the ESE by, for example, G-quadruplex formation. However, we tested this possibility by UV cross-linking with the ESE sequences. If only a single molecule of SRSF1 were able to interact with the tandemly-repeated ESEs, then the addition of an equal number of molecules with 1, 2, 3 or 4 ESEs to nuclear extract would produce equal levels of cross-linking. However, the results showed a linear increase in cross-linking to SRSF1 and other proteins as the number of repeats increased (Supplementary Figure S3), meaning that there is no intrinsic barrier to protein interactions with the tandem repeats. We conclude that the affinity of SRSF1 for an ESE is low and that the probability of binding SRSF1 is rate-limiting, as illustrated in Figure 1D. The concurrent loss of background binding and the native gel results support the possibility that binding of the second SRSF1 strengthens the 3’SS, stimulating complex A formation. Finally, we can conclude that the activation step does not involve propagation of SRSF1 (Figure 1F), although it is possible that other proteins might be involved.
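The contrast between stable multiple occupancy (Figure 1A) and transient, low-probability binding (Figure 1D) can be made concrete with a toy calculation. The sketch below is illustrative only: the function names and the probabilities q and r are hypothetical values chosen for demonstration, not quantities measured in this work, and the sites are assumed to act independently.

```python
from math import comb


def stable_occupancy(n_sites: int, q: float) -> list[float]:
    """Figure 1A-like toy model: each ESE is stably occupied with probability q.

    Returns P(k ESE-bound molecules) for k = 0..n_sites.
    """
    return [
        comb(n_sites, k) * q**k * (1 - q) ** (n_sites - k)
        for k in range(n_sites + 1)
    ]


def transient_capture(n_sites: int, r: float) -> list[float]:
    """Figure 1D-like toy model: each ESE has a small, independent chance r of
    delivering one molecule that is then retained; at most one is retained.

    Returns [P(no molecule retained), P(one molecule retained)].
    """
    p_one = 1 - (1 - r) ** n_sites
    return [1 - p_one, p_one]


# q and r are hypothetical illustration values.
for n in range(1, 5):
    mean_stable = sum(k * p for k, p in enumerate(stable_occupancy(n, q=0.9)))
    p_capture = transient_capture(n, r=0.05)[1]
    print(f"{n} ESE(s): stable model mean occupancy = {mean_stable:.2f}; "
          f"transient model P(one retained) = {p_capture:.3f}")
```

With a small r, the probability of retaining a molecule grows almost linearly with the number of sites (approximately n × r), while the number of ESE-dependent molecules never exceeds one. The stable model instead predicts a mean occupancy of n × q, which would appear as complexes bleaching in three or more steps.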
ESE-A increases the association of U2AF35 and U2 snRNP
To confirm that the ESE repeats affect the recruitment of 3’SS components, the Cy5-labelled BGSMN2 and BGSMN2 + ESE-Ax4 pre-mRNAs were incubated in complex A-forming conditions in nuclear extracts containing either mCherry-U2AF35 + mEGFP-U2AF65 or mEGFP-U2B’’ (47). The results in Supplementary Figure S4 show that the four copies of ESE-A produced a marked (about two-fold) increase in the co-localization of U2AF and U2 snRNPs with pre-mRNA. We conclude that the single molecule results are consistent with the results of conventional experiments.
A strong ESE functions by looping to its target site
The results described above are consistent with activation via looping, although other mechanisms are possible, such as propagation of other proteins from SRSF1 or even some form of sliding or reeling of a protein complex relative to the RNA. To test these, we introduced non-RNA linkers between the 3’SS and ESE-Ax2 (Reichenbach et al., manuscript in preparation). The linkers comprised tandem repeats of hexaethyleneglycol (HEG) or 12 (deoxy)ribose-phosphodiester linkages of abasic DNA or abasic RNA, which were intended to provide flexibility while disrupting the interactions of RNA-binding proteins that might propagate or slide from SRSF1. Each linker was prepared by solid phase synthesis with a portion of the 3′ exon of SMN2 exon 7 at the 5′ end and two copies of ESE-A at the 3′ end. The linkers were attached by ligation to the body of BGSMN2. The distance from the 3’SS to the ESE was maintained by deleting 12 nts from the exon. The sequences removed contained the Tra2β enhancer region, which is superfluous in the presence of an additional SRSF1-dependent enhancer (42) (Psame population = 0.12; Supplementary Figure S5).
The 2 × HEG linkers would be expected to be at least as flexible as the RNA sequence, since their length of ∼4.4 nm is about eight times the persistence length (65,66) and an unfolded RNA chain of 12 nts would have a contour length of up to ∼7 nm, which is about 3.4 persistence lengths (67). Splicing assays showed that the abasic RNA linker and HEG linkers did not block the ESE, even though its activity was reduced (Figure 7A and B). In contrast, the abasic DNA linker reduced splicing to the level seen in the absence of an ESE. Mass spectrometric analysis of proteins bound to the linkers in nuclear extract showed that the abasic DNA could bind known DNA damage repair proteins (manuscript in preparation). It is possible that the recruitment of these non-splicing factors interferes with ESE activity.
Figure 7. Effects of introducing non-RNA linkers between the 3’SS and the ESE or a 5’SS. (A) In vitro splicing reactions done in triplicate for 2 h with pre-mRNA containing hexaethyleneglycol, abasic RNA or abasic DNA linkers, or without a linker. In all cases, two copies of the ESE-A motif were attached at the 3′ terminus. (B) Quantitative results from the experiment in (A). The histogram shows the mean values and error bars show the S.E.M.
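The comparison of linker and RNA flexibility made above is simple arithmetic on the quoted lengths. As a rough check, using only the figures given in the text (the helper name is hypothetical), the implied persistence lengths can be recovered as follows:

```python
def implied_persistence_length(length_nm: float, n_persistence_lengths: float) -> float:
    """Persistence length (nm) implied by a chain length and the number of
    persistence lengths that the chain is stated to span."""
    return length_nm / n_persistence_lengths


# 2 x HEG linker: ~4.4 nm, about eight persistence lengths (values from the text).
heg_lp = implied_persistence_length(4.4, 8.0)

# Unfolded 12-nt RNA: contour length up to ~7 nm, about 3.4 persistence lengths.
rna_lp = implied_persistence_length(7.0, 3.4)

print(f"HEG linker: ~{heg_lp:.2f} nm per persistence length")
print(f"Unfolded RNA: ~{rna_lp:.2f} nm per persistence length")
```

On these numbers the HEG linker spans more persistence lengths than the RNA despite being shorter, which is the basis for expecting it to be at least as flexible.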
We conclude from the ability of the ESE to act across a bridge of abasic RNA or HEG, without bases or even phosphodiester linkages, that the bound SR protein is most likely to act by looping (Figure 1E) rather than by propagation of protein complexes along the RNA or by sliding.

DISCUSSION

The results described here show that tandemly repeated ESEs increased splicing efficiency linearly, but that only a single molecule of SRSF1 was recruited by the ESEs in addition to the one recruited by the 5’SS-bound U1 snRNP. The activity of the ESEs was not blocked by the introduction of non-RNA linkers between the 3’SS and the ESE. The best explanation for these observations is that SRSF1 binds very weakly, associating only transiently with an ESE, but that increasing the number of ESE repeats increases the probability that a molecule of SRSF1 bound to an ESE will interact with its target 3’SS (Figure 1D). It is this molecule of SRSF1 that is then bound stably and detected in our experiments. The interaction step is likely to be direct, involving 3D diffusion and collision (Figure 1E).

The results shown here demonstrate the reasons why conventional, ensemble methods could not be used to test the hypotheses described. There is a high level of apparently non-specific binding that follows a geometric distribution (e.g. Figure 3B), and the addition of ESEs has a relatively small effect on the proportion of RNA bound by SRSF1; instead, the distribution of numbers of molecules of SRSF1 on a molecule of pre-mRNA changes, and the geometric distribution shrinks in importance. On pre-mRNA with a strong 3’SS, a similar geometric distribution is seen when U1 binding is prevented (Jobbins et al., submitted). Thus, single molecule measurements of binding stoichiometry enable novel insights. One of the limitations in our results is that we cannot determine the exact proportion of complexes associated with two molecules of SRSF1.
This is the result of an under-representation of complexes bleaching in three steps, which should arise from the intrinsic dimerization of mEGFP-SRSF1 when two molecules of SRSF1 are bound (c.f. Supplementary Figure S1). It is possible that the interactions resulting in dimerization are occluded in complex A. This limitation does not, however, affect the obvious conclusions that the proportions of complexes containing two molecules of SRSF1 depend on the identity of the ESE and increase with the number of ESE repeats. One difficulty with single molecule methods is that their most radical findings of necessity cannot be confirmed by conventional methods (48). However, we have repeatedly demonstrated that our results fit those obtained previously by ensemble methods where the strategies intersect. Examples include: our discovery that 3 molecules of PTB are bound to either side of the repressed exon 3 of Tpm1 matched the number of potential binding sites for this multi-domain protein (44); the demonstration that PTB binding sites antagonised U2AF65 binding in Tpm1 (47); the numbers of U1 snRNPs bound to alternative 5′ splice sites were exactly in agreement with our earlier ribonuclease H protection assays showing that both alternative sites could be occupied even when only one site was used (45); the finding that in a complete three-exon transcript the difference in sequence between exon 7 of SMN1 and SMN2 produced a major effect on U2 snRNP binding (47); the demonstration that an SRSF1-dependent ESE bound SRSF1 (this work); the observation that ESEs that stimulated splicing and complex A formation also stimulated the recruitment of a U2 snRNP (this work). Thus, while ensemble methods may not be able to directly validate single molecule results, the fact that single molecule results can replicate the results of ensemble methods, often with better precision, supports the validity of the unexpected findings too. 
The weakness of the initial interactions of SRSF1 with the ESE suggests that the protein binds independently, rather than as part of a complex such as the Tra/Tra2/SR complex that binds dsx repeats. The Kd of SRSF1 (lacking the RS domain) for an optimal site was ∼0.2 μM (34), and in functional conditions competition with other proteins may reduce the rate of binding or even facilitate dissociation. The transience of such interactions has interesting implications. One is that, prior to the interactions between a bound SR protein and 3’SS components, it seems improbable that stable or stoichiometrically defined complexes will form on an exon. If other proteins behave like SRSF1, then the multiple proteins associating with an exon are likely to do so dynamically, binding in numerous combinations over short periods of time. This would resolve the dilemma that more proteins appear to be able to bind to SMN2 exon 7 and affect its splicing than could possibly bind concurrently (68). On the other hand, this would complicate our ability to determine which sets of proteins are consistent with activation. The other question that arises is in regard to cross-linking assays, including transcriptome-wide CLIP. These will naturally tend to detect SRSF1 stabilized by post-binding interactions rather than molecules engaged in transient initial binding to the RNA. Thus, if multiple ESEs contribute towards an outcome, then it is possible that only one ESE would be bound stably on each transcript and it would be by far the most likely to be detected. The overall signal for that event for many molecules would be spread among all the contributing ESEs, and the apparent significance of the ESEs is likely to be under-estimated. This might have contributed to the difficulty of identifying the connection between the enrichment of SRSF1 binding sites in the transcriptome near 3’SS and the likelihood of skipping or inclusion (34,37). 
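The argument developed in the Discussion, that each weakly bound ESE contributes independently to the chance that one SRSF1 molecule makes a productive contact, and that cross-linking then reports mainly the stably bound molecule, can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions and not the authors' model: the per-ESE probability `p_single` is a hypothetical parameter, the ESEs are assumed to be independent and equivalent, and the function names are illustrative.

```python
def activation_probability(n_ese: int, p_single: float) -> float:
    """Probability that at least one of n independent ESEs yields a productive
    SRSF1 contact with the 3'SS (p_single is a hypothetical per-ESE probability)."""
    return 1.0 - (1.0 - p_single) ** n_ese


def per_ese_signal_share(n_ese: int) -> float:
    """Share of a stable-binding (CLIP-like) signal attributed to each ESE if
    exactly one of n equivalent ESEs is stably occupied per activated transcript."""
    return 1.0 / n_ese


p = 0.05  # hypothetical value, small enough to represent weak, transient binding
for n in (1, 2, 4):
    exact = activation_probability(n, p)
    linear = n * p  # small-p approximation
    print(n, round(exact, 4), round(linear, 4), round(per_ese_signal_share(n), 2))
```

For small `p_single`, 1 - (1 - p)^n stays close to n·p, so in this sketch the activation probability grows almost linearly with the number of repeats while the signal attributable to any single ESE falls as 1/n.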
The use of non-RNA linkers to differentiate between direct through-space interactions and processive processes (propagation or scanning) in splicing was applied first to determine the basis of the interaction between splice sites across an intron (69). We used the strategy subsequently to investigate the mechanisms of action of ESEs in a trans-acting TOES oligonucleotide that stimulated inclusion of SMN2 exon 7 in a three-exon pre-mRNA (70). Like the ESE investigated here, it recruited SRSF1 and stimulated the recruitment of U2 snRNP and associated proteins (42). However, the hexaethyleneglycol linker was inserted via copper-catalyzed alkyne-azide cycloaddition (CuAAC) ligation, forming exclusively a 1,4-triazole between the annealing domain and the ESE domain. The activity of the TOES oligonucleotide was actually increased by the inclusion of one or two units of hexaethylene glycol, but this improvement was lost if ten, eleven or twenty units were included, which is consistent with interactions by diffusion (70). The work described here with 3’ESE sequences that act in cis confirms that intervening non-RNA linkers do not prevent the activation of a 3’SS by an ESE. Our conclusion that the activation step involves looping is in agreement with that of Shen & Green (2004), who used an MS2-tethered RS domain with an intervening TEV cleavage site to show that the RS domain tethered to the 3′ exon could be cross-linked to the branch site in complex A (38). A different outcome was found when we tested the effects of a non-RNA linker on a 5’ESE that affects 5’SS selection (71). The activity of the 5’ESE was ablated by inclusion of hexaethyleneglycol, inserted via CuAAC ligation between the ESE and the exon upstream of the target 5’SS. It is hard to resist the inference that the effects of an ESE on a 5′ splice site are mediated by a process involving contact with the intervening RNA, and thus are quite different from the looping process in which an ESE activates a 3’SS. 
The possibility that these are quite different processes is in agreement with the observations that the ability of SRSF1 to stimulate U1 snRNP binding to a 5’SS does not require the RS domain (61,72,73), whereas the tethering experiments have shown that the activation step for 3’SSs requires only the RS domain (19,38). It remains to be seen whether such differences are characteristic of the mechanisms by which ESEs affect 5’SSs as opposed to 3’SSs or whether there are other important determinants of the activation step mechanism. The picture that emerges from these experiments is that SRSF1 binds transiently to SRSF1-dependent ESEs, and that each ESE makes an independent but effectively additive contribution to the probability that one SRSF1 molecule will be locked into place in complex A by interactions with other components of the early splicing complexes. It is not clear yet to what extent this new model applies to the many other proteins that interact with exons. It has been estimated that most positions in a short exon contribute significantly to the outcome of splicing (25,27,74) and that mutations act independently and additively (74,75). This is consistent with the possibility that most interacting proteins bind transiently (74), as we have inferred for SRSF1, and that the exon–protein complex is in a state of rapid flux. If so, it may be that none of the many possible combinations of proteins bound to the exon is uniquely required for activation. If repressor proteins behave similarly and are purely competitive antagonists, then even an exon with a heavy preponderance of silencers would be bound eventually by an SR protein and make contact with the 3′ splice site. To prevent this, there may be a time limit, creating a window of opportunity within which an ESE-bound SR protein might contact the 3′ splice site. 
The limit might be imposed by the slower formation of stable repressor complexes, such as those of PTBP1 around Tpm1 exon 3 (44) or exon N1 of Src (76), or of Rbfox in LASR complexes (77), or more generally by a transition into an alternative pathway, both in vivo and in vitro. The existence and nature of any limiting events have yet to be determined.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We thank Dr A.R. Krainer for the anti-SRSF1 antibody.

FUNDING

Leverhulme Trust [RPG-2014-001 to G.A.B. and I.C.E.]. Funding for open access charge: Universities of Leicester and Strathclyde.

Conflict of interest statement. None declared.

REFERENCES

1. Mardon H.J., Sebastio G., Baralle F.E. A role for exon sequences in alternative splicing of the human fibronectin gene. Nucleic Acids Res. 1987; 15:7725–7733.
2. Helfman D.M., Ricci W.M., Finn L.A. Alternative splicing of tropomyosin pre-mRNAs in vitro and in vivo. Genes Dev. 1988; 2:1627–1638.
3. Streuli M., Saito H. Regulation of tissue-specific alternative splicing: exon-specific cis-elements govern the splicing of leukocyte common antigen pre-mRNA. EMBO J. 1989; 8:787–796.
4. Hedley M.L., Maniatis T. Identification of exon sequences required for female-specific splicing of drosophila doublesex pre-messenger RNA. FASEB J. 1991; 5:A1505.
5. Graham I.R., Hamshere M., Eperon I.C. Alternative splicing of a human alpha-tropomyosin muscle-specific exon: identification of determining sequences. Mol. Cell. Biol. 1992; 12:3872–3882.
6. Watakabe A., Tanaka K., Shimura Y. The role of exon sequences in splice site selection. Genes Dev. 1993; 7:407–418.
7. Yeakley J.M., Hedjran F., Morfin J.P., Merillat N., Rosenfeld M.G., Emeson R.B. Control of calcitonin/calcitonin gene-related peptide pre-mRNA processing by constitutive intron and exon elements. Mol. Cell. Biol. 1993; 13:5999–6011.
8. Lavigueur A., La Branche H., Kornblihtt A.R., Chabot B. A splicing enhancer in the human fibronectin alternate ED1 exon interacts with SR proteins and stimulates U2 snRNP binding. Genes Dev. 1993; 7:2405–2417.
9. Sun Q., Mayeda A., Hampson R.K., Krainer A.R., Rottman F.M. General splicing factor SF2/ASF promotes alternative splicing by binding to an exonic splicing enhancer. Genes Dev. 1993; 7:2598–2608.
10. Tian M., Maniatis T. A splicing enhancer complex controls alternative splicing of doublesex pre-mRNA. Cell. 1993; 74:105–114.
11. Graveley B.R. Sorting out the complexity of SR protein functions. RNA. 2000; 6:1197–1211.
12. Long J.C., Caceres J.F. The SR protein family of splicing factors: master regulators of gene expression. Biochem J. 2009; 417:15–27.
13. Shepard P.J., Hertel K.J. The SR protein family. Genome Biol. 2009; 10:242.
14. Staknis D., Reed R. SR proteins promote the first specific recognition of Pre-mRNA and are present together with the U1 small nuclear ribonucleoprotein particle in a general splicing enhancer complex. Mol. Cell. Biol. 1994; 14:7670–7682.
15. Wang Z., Hoffmann H.M., Grabowski P.J. Intrinsic U2AF binding is modulated by exon enhancer signals in parallel with changes in splicing activity. RNA. 1995; 1:21–35.
16. Bouck J., Fu X.D., Skalka A.M., Katz R.A. Role of the constitutive splicing factors U2AF65 and SAP49 in suboptimal RNA splicing of novel retroviral mutants. J. Biol. Chem. 1998; 273:15169–15176.
17. Wu J.Y., Maniatis T. Specific interactions between proteins implicated in splice site selection and regulated alternative splicing. Cell. 1993; 75:1061–1070.
18. Zuo P., Maniatis T. The splicing factor U2AF35 mediates critical protein-protein interactions in constitutive and enhancer-dependent splicing. Genes Dev. 1996; 10:1356–1368.
19. Graveley B.R., Hertel K.J., Maniatis T. A systematic analysis of the factors that determine the strength of pre-mRNA splicing enhancers. EMBO J. 1998; 17:6747–6756.
20. Hertel K.J., Maniatis T. The function of multisite splicing enhancers. Mol. Cell. 1998; 1:449–455.
21. Fairbrother W.G., Yeh R.F., Sharp P.A., Burge C.B. Predictive identification of exonic splicing enhancers in human genes. Science. 2002; 297:1007–1013.
22. Lim K.H., Ferraris L., Filloux M.E., Raphael B.J., Fairbrother W.G. Using positional distribution to identify splicing elements and predict pre-mRNA processing defects in human genes. Proc. Natl Acad. Sci. U.S.A. 2011; 108:11093–11098.
23. Wang Z., Rolish M.E., Yeo G., Tung V., Mawson M., Burge C.B. Systematic identification and analysis of exonic splicing silencers. Cell. 2004; 119:831–845.
24. Ke S., Shang S., Kalachikov S.M., Morozova I., Yu L., Russo J.J., Ju J., Chasin L.A. Quantitative evaluation of all hexamers as exonic splicing elements. Genome Res. 2011; 21:1360–1374.
25. Singh N.N., Androphy E.J., Singh R.N. In vivo selection reveals combinatorial controls that define a critical exon in the spinal muscular atrophy genes. RNA. 2004; 10:1291–1305.
26. Xiong H.Y., Alipanahi B., Lee L.J., Bretschneider H., Merico D., Yuen R.K., Hua Y., Gueroussov S., Najafabadi H.S., Hughes T.R. et al. RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science. 2015; 347:1254806.
27. Julien P., Minana B., Baeza-Centurion P., Valcarcel J., Lehner B. The complete local genotype-phenotype landscape for the alternative splicing of a human exon. Nat. Commun. 2016; 7:11558.
28. Soemedi R., Cygan K.J., Rhine C.L., Wang J., Bulacan C., Yang J., Bayrak-Toydemir P., McDonald J., Fairbrother W.G. Pathogenic variants that alter protein code often disrupt splicing. Nat. Genet. 2017; 49:848–855.
29. Grodecka L., Buratti E., Freiberger T. Mutations of pre-mRNA splicing regulatory elements: are predictions moving forward to clinical diagnostics. Int. J. Mol. Sci. 2017; 18:E1668.
30. Tacke R., Manley J.L. The human splicing factors ASF/SF2 and SC35 possess distinct, functionally significant RNA binding specificities. EMBO J. 1995; 14:3540–3551.
31. Liu H.X., Zhang M., Krainer A.R. Identification of functional exonic splicing enhancer motifs recognized by individual SR proteins. Genes Dev. 1998; 12:1998–2012.
32. Ray D., Kazan H., Cook K.B., Weirauch M.T., Najafabadi H.S., Li X., Gueroussov S., Albu M., Zheng H., Yang A. et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature. 2013; 499:172–177.
33. Clery A., Sinha R., Anczukow O., Corrionero A., Moursy A., Daubner G.M., Valcarcel J., Krainer A.R., Allain F.H. Isolated pseudo-RNA-recognition motifs of SR proteins can regulate splicing using a noncanonical mode of RNA recognition. Proc. Natl Acad. Sci. U.S.A. 2013; 110:E2802–E2811.
34. Anczukow O., Akerman M., Clery A., Wu J., Shen C., Shirole N.H., Raimer A., Sun S., Jensen M.A., Hua Y. et al. SRSF1-regulated alternative splicing in breast cancer. Mol. Cell. 2015; 60:105–117.
35. Sanford J.R., Wang X., Mort M., Vanduyn N., Cooper D.N., Mooney S.D., Edenberg H.J., Liu Y. Splicing factor SFRS1 recognizes a functionally diverse landscape of RNA transcripts. Genome Res. 2009; 19:381–394.
36. Anko M.L., Muller-McNicoll M., Brandl H., Curk T., Gorup C., Henry I., Ule J., Neugebauer K.M. The RNA-binding landscapes of two SR proteins reveal unique functions and binding to diverse RNA classes. Genome Biol. 2012; 13:R17.
37. Pandit S., Zhou Y., Shiue L., Coutinho-Mansfield G., Li H., Qiu J., Huang J., Yeo G.W., Ares M. Jr., Fu X.D. Genome-wide analysis reveals SR protein cooperation and competition in regulated splicing. Mol. Cell. 2013; 50:223–235.
38. Shen H., Kan J.L., Green M.R. Arginine-serine-rich domains bound at splicing enhancers contact the branchpoint to promote prespliceosome assembly. Mol. Cell. 2004; 13:367–376.
39. Shen H., Green M.R. A pathway of sequential arginine-serine-rich domain-splicing signal interactions during mammalian spliceosome assembly. Mol. Cell. 2004; 16:363–373.
40. Shen H., Green M.R. RS domains contact splicing signals and promote splicing by a common mechanism in yeast through humans. Genes Dev. 2006; 20:1755–1765.
41. Owen N., Zhou H., Malygin A.A., Sangha J., Smith L.D., Muntoni F., Eperon I.C. Design principles for bifunctional targeted oligonucleotide enhancers of splicing. Nucleic Acids Res. 2011; 39:7194–7208.
42. Smith L.D., Dickinson R.L., Lucas C.M., Cousins A., Malygin A.A., Weldon C., Perrett A.J., Bottrill A.R., Searle M.S., Burley G.A. et al. A targeted oligonucleotide enhancer of SMN2 exon 7 splicing forms competing quadruplex and protein complexes in functional conditions. Cell Rep. 2014; 9:193–205.
43. Dreumont N., Hardy S., Behm-Ansmant I., Kister L., Branlant C., Stevenin J., Bourgeois C.F. Antagonistic factors control the unproductive splicing of SC35 terminal intron. Nucleic Acids Res. 2010; 38:1353–1366.
44. Cherny D., Gooding C., Eperon G.E., Coelho M.B., Bagshaw C.R., Smith C.W., Eperon I.C. Stoichiometry of a regulatory splicing complex revealed by single-molecule analyses. EMBO J. 2010; 29:2161–2172.
45. Hodson M.J., Hudson A.J., Cherny D., Eperon I.C. The transition in spliceosome assembly from complex E to complex A purges surplus U1 snRNPs from alternative splice sites. Nucleic Acids Res. 2012; 40:6850–6862.
46. Gooding C., Edge C., Lorenz M., Coelho M.B., Winters M., Kaminski C.F., Cherny D., Eperon I.C., Smith C.W. MBNL1 and PTB cooperate to repress splicing of Tpm1 exon 3. Nucleic Acids Res. 2013; 41:4765–4782.
47. Chen L., Weinmeister R., Kralovicova J., Eperon L.P., Vorechovsky I., Hudson A.J., Eperon I.C. Stoichiometries of U2AF35, U2AF65 and U2 snRNP reveal new early spliceosome assembly pathways. Nucleic Acids Res. 2017; 45:2051–2067.
48. Hoskins A.A., Friedman L.J., Gallagher S.S., Crawford D.J., Anderson E.G., Wombacher R., Ramirez N., Cornish V.W., Gelles J., Moore M.J. Ordered and dynamic assembly of single spliceosomes. Science. 2011; 331:1289–1295.
49. Shcherbakova I., Hoskins A.A., Friedman L.J., Serebrov V., Correa I.R. Jr, Xu M.Q., Gelles J., Moore M.J. Alternative spliceosome assembly pathways revealed by single-molecule fluorescence microscopy. Cell Rep. 2013; 5:151–165.
50. Skordis L.A., Dunckley M.G., Yue B., Eperon I.C., Muntoni F. Bifunctional antisense oligonucleotides provide a trans-acting splicing enhancer that stimulates SMN2 gene expression in patient fibroblasts. Proc. Natl Acad. Sci. U.S.A. 2003; 100:4114–4119.
51. Eritja R., Walker P.A., Randall S.K., Goodman M.F., Kaplan B.E. Synthesis of oligonucleotides containing the abasic site model-compound 1,4-anhydro-2-deoxy-D-ribitol. Nucleos. Nucleot. 1987; 6:803–814.
52. Taniho K., Nakashima R., Kandeel M., Kitamura Y., Kitade Y. Synthesis and biological properties of chemically modified siRNAs bearing 1-deoxy-D-ribofuranose in their 3′-overhang region. Bioorg. Med. Chem. Lett. 2012; 22:2518–2521.
53. Crawford D.J., Hoskins A.A., Friedman L.J., Gelles J., Moore M.J. Single-molecule colocalization FRET evidence that spliceosome activation precedes stable approach of 5′ splice site and branch site. Proc. Natl Acad. Sci. U.S.A. 2013; 110:6783–6788.
54. Ohrt T., Prior M., Dannenberg J., Odenwalder P., Dybkov O., Rasche N., Schmitzova J., Gregor I., Fabrizio P., Enderlein J. et al. Prp2-mediated protein rearrangements at the catalytic core of the spliceosome as revealed by dcFCCS. RNA. 2012; 18:1244–1256.
55. Lee K.A., Bindereif A., Green M.R. A small-scale procedure for preparation of nuclear extracts that support efficient transcription and pre-mRNA splicing. Gene Anal. Tech. 1988; 5:22–31.
56. Donmez G., Hartmuth K., Kastner B., Will C.L., Luhrmann R. The 5′ end of U2 snRNA is in close proximity to U1 and functional sites of the pre-mRNA in early spliceosomal complexes. Mol. Cell. 2007; 25:399–411.
57. Dickson R.M., Cubitt A.B., Tsien R.Y., Moerner W.E. On/off blinking and switching behaviour of single molecules of green fluorescent protein. Nature. 1997; 388:355–358.
58. Garcia-Parajo M.F., Segers-Nolten G.M., Veerman J.A., Greve J., van Hulst N.F. Real-time light-driven dynamics of the fluorescence emission in single green fluorescent protein molecules. Proc. Natl Acad. Sci. U.S.A. 2000; 97:7237–7242.
59. Vamosi G., Mucke N., Muller G., Krieger J.W., Curth U., Langowski J., Toth K. EGFP oligomers as natural fluorescence and hydrodynamic standards. Sci. Rep. 2016; 6:33022.
60. Ghigna C., Giordano S., Shen H., Benvenuto F., Castiglioni F., Comoglio P.M., Green M.R., Riva S., Biamonti G. Cell motility is controlled by SF2/ASF through alternative splicing of the Ron protooncogene. Mol. Cell. 2005; 20:881–890.
61. Cho S., Hoang A., Sinha R., Zhong X.Y., Fu X.D., Krainer A.R., Ghosh G. Interaction between the RNA binding domains of Ser-Arg splicing factor 1 and U1-70K snRNP protein determines early spliceosome assembly. Proc. Natl Acad. Sci. U.S.A. 2011; 108:8233–8238.
62. Smith P.J., Zhang C., Wang J., Chew S.L., Zhang M.Q., Krainer A.R. An increased specificity score matrix for the prediction of SF2/ASF-specific exonic splicing enhancers. Hum. Mol. Genet. 2006; 15:2490–2508.
63. Hofmann Y., Lorson C.L., Stamm S., Androphy E.J., Wirth B. Htra2-beta 1 stimulates an exonic splicing enhancer and can restore full-length SMN expression to survival motor neuron 2 (SMN2). Proc. Natl Acad. Sci. U.S.A. 2000; 97:9618–9623.
64. Clery A., Jayne S., Benderska N., Dominguez C., Stamm S., Allain F.H. Molecular basis of purine-rich RNA recognition by the human SR-like protein Tra2-beta1. Nat. Struct. Mol. Biol. 2011; 18:443–450.
65. Knowles D.B., LaCroix A.S., Deines N.F., Shkel I., Record M.T. Jr Separation of preferential interaction and excluded volume effects on DNA duplex and hairpin stability. Proc. Natl Acad. Sci. U.S.A. 2011; 108:12699–12704.
66. Mark J.E., Flory P.J. The configuration of the polyoxyethylene chain. J. Am. Chem. Soc. 1965; 87:1415–1423.
67. Caliskan G., Hyeon C., Perez-Salas U., Briber R.M., Woodson S.A., Thirumalai D. Persistence length changes dramatically as RNA folds. Phys. Rev. Lett. 2005; 95:268303.
68. Wee C.D., Havens M.A., Jodelka F.M., Hastings M.L. Targeting SR proteins improves SMN expression in spinal muscular atrophy cells. PLoS One. 2014; 9:e115205.
69. Pasman Z., Garcia-Blanco M.A. The 5′ and 3′ splice sites come together via a three dimensional diffusion mechanism. Nucleic Acids Res. 1996; 24:1638–1645.
70. Perrett A.J., Dickinson R.L., Krpetić Z., Brust M., Lewis H., Eperon I.C., Burley G.A. Conjugation of PEG and gold nanoparticles to increase the accessibility and valency of tethered RNA splicing enhancers. Chem. Sci. 2013; 4:257–265.
71. Lewis H., Perrett A.J., Burley G.A., Eperon I.C. An RNA splicing enhancer that does not act by looping. Angew. Chem. Int. Ed. Engl. 2012; 51:9800–9803.
72. Eperon I.C., Makarova O.V., Mayeda A., Munroe S.H., Caceres J.F., Hayward D.G., Krainer A.R. Selection of alternative 5′ splice sites: role of U1 snRNP and models for the antagonistic effects of SF2/ASF and hnRNP A1. Mol. Cell. Biol. 2000; 20:8303–8318.
73. Roca X., Krainer A.R., Eperon I.C. Pick one, but be quick: 5′ splice sites and the problems of too many choices. Genes Dev. 2013; 27:129–144.
74. Ke S., Anquetil V., Zamalloa J.R., Maity A., Yang A., Arias M.A., Kalachikov S., Russo J.J., Ju J., Chasin L.A. Saturation mutagenesis reveals manifold determinants of exon definition. Genome Res. 2018; 28:11–24.
75. Rosenberg A.B., Patwardhan R.P., Shendure J., Seelig G. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell. 2015; 163:698–711.
76. Wongpalee S.P., Vashisht A., Sharma S., Chui D., Wohlschlegel J.A., Black D.L. Large-scale remodeling of a repressed exon ribonucleoprotein to an exon definition complex active for splicing. Elife. 2016; 5:e19743.
77. Damianov A., Ying Y., Lin C.H., Lee J.A., Tran D., Vashisht A.A., Bahrami-Samani E., Xing Y., Martin K.C., Wohlschlegel J.A. et al. Rbfox proteins regulate splicing as part of a large multiprotein complex LASR. Cell. 2016; 165:606–619.

Author notes: These authors contributed equally to this work as first authors.

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics
Ardui, Simon; Ameur, Adam; Vermeesch, Joris R.; Hestand, Matthew S.
doi: 10.1093/nar/gky066 pmid: 29401301
Abstract

Short read massively parallel sequencing has emerged as a standard diagnostic tool in the medical setting. However, short read technologies have inherent limitations such as GC bias, difficulties mapping to repetitive elements, trouble discriminating paralogous sequences, and difficulties in phasing alleles. Long read single molecule sequencers resolve these obstacles. Moreover, they offer higher consensus accuracies and can detect epigenetic modifications from native DNA. The first commercially available long read single molecule platform was the RS system based on PacBio's single molecule real-time (SMRT) sequencing technology, which has since evolved into their RSII and Sequel systems. Here we summarize how SMRT sequencing is revolutionizing constitutional, reproductive, cancer, microbial and viral genetic testing.

INTRODUCTION

Modern medical genomics research and diagnostics rely heavily on DNA sequencing. Sequencing technologies are used in a wide range of applications during the entire human lifespan, from prenatal diagnostics, to newborn screening, to diagnosing rare diseases, hereditary forms of cancer, pharmacogenetics testing and predisposition testing for a plethora of diseases. These applications can even include testing for future generations in terms of carrier screening and pre-implantation genetic diagnoses (1,2).

The history of sequencing technologies can be broken up into three phases: first-, second- and third-generation sequencing (3). Though earlier first-generation technologies provided groundbreaking discoveries, the big revolution in sequencing began with the invention of the ‘chain-termination’ or dideoxy technique, or what is today called Sanger sequencing (3,4). Improvements in chemistry and switching from gels to capillary based electrophoresis led to the current Sanger machines that provide low-throughput, high quality reads of up to ∼1 kb.
Sanger sequencing is still often referred to as the gold standard and is commonly used for diagnosing Mendelian disorders (5) and targeted validation of higher-throughput sequencing results. The first decade of the 21st century brought forth the development of multiple new methods of DNA sequencing (6). As opposed to first-generation platforms, these new second-generation technologies have considerably shorter reads (up to a few hundred bps), but at massively higher throughput (up to billions of reads per run). Common short-read platforms based on fluorescence include Illumina's bridge amplification and sequencing by synthesis technologies (e.g. HiSeq and MiSeq), Roche 454 pyrosequencers, and Applied Biosystem's sequencing by oligonucleotide ligation and detection (SOLiD) platforms. Additional short-read platforms include the Ion Torrent sequencers that detect nucleotides by the difference in pH as a result of hydrogen ions emitted during polymerisation, as opposed to light signals. Though these short-read platforms have permitted scientists to quickly hunt for causative mutations in a panel of disease genes, the exome, or even the entire human genome in both research and clinical settings (7), they all share common pitfalls and drawbacks. The short read lengths hinder assigning reads to complex parts of the genome (8), phasing of variants (9), resolving repeat regions (10) and introduce gaps and ambiguous regions in de novo assemblies (11,12). The amplification steps during library preparation and/or the actual sequencing reaction also introduce chimeric reads (13), variation in repeat size, and an underrepresentation of GC-rich/poor regions. Taken together, these drawbacks hinder the utility of diagnostic variant detection. Third-generation is in general characterized as single molecule sequencing and is fundamentally different from clonal based second generation sequencing methods. 
Helicos provided the first commercial application of single molecule sequencing based on fluorescence detection and sequencing by synthesis. Though lacking amplification biases, such as underrepresentation of GC-rich/poor regions, this early single molecule sequencing still produced short (often 35 bp) read lengths (14). Two newer technologies, single molecule real-time (SMRT) sequencing by Pacific Biosciences (PacBio) (15) and nanopore sequencing by Oxford Nanopore Technologies (16), offer the advantages of single molecule sequencing, including exceptionally long read lengths (>20 kb). These platforms permit sequencing/assembly through repetitive elements, direct variant phasing, and even direct detection of epigenetic modifications (17,18). Sequencing also lasts only several hours, which is very appealing in a diagnostic setting. Though simple and low-cost nanopore based technologies (reviewed in (18–20)) are catching on and likely represent future platforms, SMRT sequencing is currently more mature and therefore more readily applicable in diagnostics. Here we review how SMRT sequencing is being implemented in human genetic diagnostics.
SMRT SEQUENCING TECHNOLOGY AND TERMINOLOGY
Before SMRT sequencing, a library needs to be prepared from double stranded DNA input material (Figure 1A). This typically requires five or more micrograms of DNA, which can limit some applications. The library preparation consists of simply ligating hairpin adapters onto DNA molecules, thereby circularizing them into a construct termed a SMRTbell (Figure 1B) (21). Next, a primer and a polymerase are annealed to the adapter whereupon the library is loaded on a SMRT Cell containing 150 000 nanoscale observation chambers (Zero Mode Waveguides (ZMWs)) for the RSII system and up to a million on the newer Sequel platform. The polymerase bound SMRTbells are then loaded into the ZMWs (Figure 1C).
Ideally, as many ZMWs as possible should be loaded with exactly one SMRTbell to maximise throughput and read lengths. For a good run, this is around one third to one half of the ZMWs per SMRT cell. Hence a SMRT cell typically produces ∼55 000 reads for the RSII system and 365 000 reads for the Sequel system (Table 1). The actual sequencing reaction occurs within each ZMW, whose small diameter only permits the smallest available volume for light detection (22). The polymerase within each ZMW incorporates fluorescently labeled nucleotides, emitting a fluorescent signal that is recorded by a camera in real-time (Figure 1C). These signals are converted to long sequences termed continuous long reads (CLR) (22), linear reads, or polymerase reads. For a short insert library, the circular structure of the molecule results in the insert sequence being covered multiple times by the CLR. Each pass of an original strand is termed a subread. In addition, all subreads from the same molecule can be combined into one highly accurate consensus sequence termed a circular consensus sequence (CCS) or reads-of-insert (ROI) (Figure 1F–H, left panel). These two terms are often used interchangeably, but by definition the difference is that CCS requires two full sequencing passes of the insert whereas ROI can be defined starting from even a partial pass.
Figure 1. Overview of SMRT Sequencing Technology. Sequencing starts with preparing a library from double stranded DNA (A) to which hairpin adapters are ligated (B). This library is thereafter loaded onto a SMRT Cell made up of nanoscale observation chambers (Zero Mode Waveguides (ZMWs)). The DNA molecules in the library will be pulled to the bottom of the ZMW where the polymerase will incorporate fluorescently labelled nucleotides (C). Note that not all ZMWs will contain a DNA molecule because the library is loaded by diffusion. The fluorescence emitted by the nucleotides is recorded by a camera in real-time. Hence, not only the fluorescence color can be registered, but also the time between nucleotide incorporation which is called the interpulse duration (IPD) (D, right panel). When a sequencing polymerase encounters nucleotides on the DNA strand containing an (epigenetic) modification, like for example a 6-methyl adenosine modification (E, left panel), then the IPD will be delayed (E, right panel) compared to non-methylated DNA (D, right panel). Due to the circular structure of the library, a short insert will be covered multiple times by the continuous long read (CLR). Each pass of the original DNA molecule is termed a subread, which can be combined into one highly accurate consensus sequence termed a circular consensus sequence (CCS) or reads-of-insert (ROI) (F–H, left panel). Though SMRT sequencing always uses a circular template, long insert libraries typically only have a single pass and hence generate a linear sequence with single pass error rates (black nucleotides) (F, G, right panel). Afterwards, overlapping single passes can be combined into one consensus sequence of high quality (H, right panel). Overall, CCS reads have the advantage of being very accurate while single passes stand out for their long read lengths (>20 kb).
Table 1. Comparison of PacBio sequencing platforms to two current industry standards
Platform                       Read length       Number of reads  Error rate  Run time
PacBio RSII (per SMRT cell)    Average 10–16 kb  ∼55 000          13–15%      0.5–6 hours
PacBio Sequel (per SMRT cell)  Average 10–14 kb  ∼365 000         13–15%      0.5–10 hours
Illumina HiSeq 4000            2 × 150 bp        5 billion        ∼0.1%       <1–3.5 days
Illumina MiSeq                 2 × 300 bp        25 million       ∼0.1%       4–55 hours
Numbers from personal experience and company websites (www.pacb.com and www.illumina.com), queried on 14 November 2017.
Due to the real-time detection of the nucleotide incorporation rate, the pace of the polymerase progressing through the DNA strand is registered during sequencing (23).
The time between nucleotide incorporations is termed the interpulse duration (IPD) and varies with epigenetic changes on the DNA (Figure 1D and E). Since a polymerase is not holding a single nucleotide during sequencing, but approximately twelve nucleotides, an epigenetic change on one nucleotide can actually affect the incorporation rate of surrounding nucleotides. This results in a ‘fingerprint’ (24); some fingerprints have been characterized, such as those for 6-mA, 4-mC and (Tet-converted) 5-mC. In addition to fewer but longer reads (Table 1), PacBio data differs from short read sequencing technologies in several aspects. Reads do not have a set length but follow a distribution of lengths that depends on how long each individual polymerase is active. Since there is no need for amplification during the library preparation, nor during the sequencing process, biases such as GC-skewing are nearly absent. In contrast to second-generation platforms, raw PacBio reads also differ in error types (more indels than mismatches) and have a much higher error rate (∼13–15%, Table 1), though the errors are spread randomly across the reads (25,26). This randomness enables highly accurate consensuses (>99%) to be built up rapidly by sequencing the same molecule multiple times (CCS reads) (15) or by combining different CLRs derived from the same locus (Figure 1G and H). Also, diffusion loading creates a preference towards shorter molecules which might negatively impact sequencing runs. This loading bias can be mitigated by using magbead loading which keeps molecules <1 kb from binding to the bottom of ZMWs, size selection to remove short molecules, and/or by adding polyethylene glycol during loading to enhance packing of large DNA molecules. It is possible that completely length-independent loading can be achieved in the (near) future by applying an electrical field to force charged molecules into ZMWs (27).
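To see why randomly distributed errors allow a consensus accuracy above 99% from passes with a raw error rate of ∼13–15%, consider a deliberately simplified model. The sketch below assumes (these are assumptions, not the PacBio consensus algorithm) that each pass miscalls a given base independently with the same probability, that the consensus is a plain majority vote over an odd number of passes, and, conservatively, that all wrong calls agree with each other. Real SMRT errors are mostly indels and real consensus callers are model based, so the numbers are only indicative.

```python
from math import comb


def consensus_error(n_passes: int, per_pass_error: float) -> float:
    """Probability that a majority vote over n_passes independent passes is wrong.

    Simplifying assumptions (hypothetical model, not the vendor's algorithm):
    errors are independent between passes and all erroneous calls agree,
    so the vote fails whenever more than half of the passes are wrong.
    """
    if n_passes < 1 or n_passes % 2 == 0:
        raise ValueError("use an odd, positive number of passes to avoid ties")
    p, q = per_pass_error, 1.0 - per_pass_error
    majority = n_passes // 2 + 1
    return sum(
        comb(n_passes, k) * p**k * q ** (n_passes - k)
        for k in range(majority, n_passes + 1)
    )


# Raw error rate taken from the upper end of the 13-15% range quoted in Table 1.
for passes in (1, 3, 5, 7, 9):
    err = consensus_error(passes, 0.15)
    print(f"{passes} passes: consensus accuracy {100 * (1 - err):.2f}%")
```

Under these assumptions the per-base accuracy rises from 85% for a single pass to above 99% at nine passes, which is the qualitative behaviour described above: because errors do not recur at the same position, additional passes outvote them.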
To address these inherently different reads, bioinformatic analyses require adapting current tools and/or developing new methods, such as for alignment (26,28–32) and assembly (33–39). Many PacBio specific tools and pipelines (including those for demultiplexing, creating CCS reads, long amplicon analyses, de novo assemblies (34) and epigenetic analyses) are available in PacBio's SMRT analysis suite (openly available, www.pacb.com/support/software-downloads/) via the command line or their SMRT Portal and SMRT Link graphical user interfaces.
CONSTITUTIONAL
Tandem repeat disorders
Tandem repeats cause more than 40 neurological, neurodegenerative or neuromuscular diseases when mutated (40). Unfortunately, sequencing those DNA elements is difficult with short-read platforms because the reads are too short to span most tandem repeats. The first tandem repeat studied by SMRT sequencing was the FMR1 CGG repeat (41). Healthy individuals carry around 30 CGG units, which are mostly interrupted by one or two AGG units. An expansion of the repeat to more than 200 units causes the Fragile X Syndrome (FXS), which is one of the most frequent causes of inherited intellectual disability and autism. Loomis et al. (41) showed they could sequence through a long full mutation allele of 750 units, which equals 2 kb of 100% GC and repetitive content. Interestingly, expansions to full mutations only occur upon maternal transmission, whereby the risk directly correlates with increasing repeat size and fewer AGG interruptions (42). SMRT sequencing can be used to determine the repeat size and the number of interrupting AGG units (43). A main advantage of this approach is the unambiguous separation of the two CGG repeats on the different X chromosomes of females, thereby outperforming all other (PCR) approaches. The information generated by SMRT sequencing is then used clinically for improved genetic counselling of women weighing the risk of having a child with FXS (43–45).
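As an illustration of what determining the repeat size and the number of interrupting AGG units amounts to, the following minimal sketch summarizes an FMR1 repeat tract. It assumes the tract has already been extracted from an accurate consensus read, is in frame, and contains only CGG and AGG units; the function name and the returned fields are hypothetical, and only the >200 unit full mutation threshold comes from the text.

```python
FULL_MUTATION_UNITS = 200  # expansions above this size cause FXS (see text)


def summarize_cgg_repeat(tract: str) -> dict:
    """Count repeat units and locate AGG interruptions in an in-frame CGG tract.

    Hypothetical helper: assumes flanking sequence was trimmed beforehand and
    that the consensus is error free, which real reads do not guarantee.
    """
    tract = tract.upper()
    if len(tract) % 3 != 0:
        raise ValueError("tract length must be a multiple of 3")
    units = [tract[i:i + 3] for i in range(0, len(tract), 3)]
    unexpected = sorted(set(units) - {"CGG", "AGG"})
    if unexpected:
        raise ValueError(f"unexpected repeat units: {unexpected}")
    return {
        "total_units": len(units),
        "agg_count": units.count("AGG"),
        # 1-based positions of the interrupting units within the tract
        "agg_positions": [i + 1 for i, u in enumerate(units) if u == "AGG"],
        "full_mutation": len(units) > FULL_MUTATION_UNITS,
    }


# A 30 unit allele with two AGG interruptions, as typical for healthy individuals.
normal = summarize_cgg_repeat("CGG" * 9 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 10)
print(normal)

# A 750 unit allele without interruptions spans 750 x 3 = 2250 bp, i.e. about 2 kb.
expanded = summarize_cgg_repeat("CGG" * 750)
print(expanded["total_units"] * 3, "bp, full mutation:", expanded["full_mutation"])
```

Running the same summary separately on the consensus of each phased allele would give the per-chromosome result that matters for female carriers.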
Another example of tackling a tandem repeat by SMRT sequencing is the ATTCT repeat embedded in intron 9 of the Spinocerebellar ataxia type 10 gene (SCA10) (10). For the first time the full length of an expanded ATTCT repeat was completely sequenced using SMRT technology. The repeat was reconstructed by assembly and both known and novel interruptions were detected (10). The presence of those interruptions influences the phenotype of SCA10 patients and hence knowing the exact repeat structure allows for better genotype-phenotype correlations. It will be interesting to use SMRT sequencing in the near future for other tandem repeats with interruptions, like those in Myotonic Dystrophy (46) and Friedreich's Ataxia (47), to increase our knowledge of tandem repeat configuration and its influence on the stability of the repeat and the phenotype of an individual. Whereas all of the above applications use PCR, novel amplification free enrichment methods are currently being developed. Methods using amplification are very error-prone, especially when amplifying (tandem) repeats (41), and remove all epigenetic marks (48). Thus using amplification impedes a complete genetic and epigenetic characterization of tandem repeats. Currently two methods are under development. The first method, presented by Pham et al. (48), is based on type IIS restriction enzyme digestions, customized hairpin adapters especially designed to anneal at the targeted digest overhangs, and a ‘capture-hook’ method. A second and more recent method (bioRxiv https://doi.org/10.1101/203919) is based on restriction enzyme digestion followed by cleavage of SMRTbells containing the target of interest using the CRISPR/Cas9 system. By ligating a specific capture adapter at the CRISPR/Cas9 DNA cleavage sites, the SMRTbell molecules of interest can then be selectively pulled down by magnetic beads targeting the capture adapter. The high throughput of SMRT sequencing enables different targets (e.g.
FMR1 CGG repeat, C9ORF72 GGGGCC repeat, HTT CAG repeat, SCA10 ATTCT repeat, etc.) from one DNA sample to be simultaneously enriched and sequenced in a single run (bioRxiv https://doi.org/10.1101/203919). Both methods have been used to target the FMR1 CGG repeat and showed for the first time the true biological CGG repeat variation in human cell lines (48) (bioRxiv https://doi.org/10.1101/203919). Besides avoiding amplification biases, these methods permit native DNA capture and hence direct detection of epigenetics. In the future, this technique can possibly be used diagnostically to screen for full mutations and assess the methylation status of the FMR1 CGG repeat, both of which influence the phenotype of FXS (49–51). Traditionally this would be determined by Southern blots, a labour intensive and inaccurate method. Thus replacing Southern blots with faster and more direct SMRT sequencing will greatly enhance FMR1 and additional repeat disorder diagnostics (49–52). PacBio's enrichment technique has also been used to study patients with expanded SCA10 ATTCT repeats (53). Here, SMRT sequencing revealed a complete absence of interruptions which could be linked to the parkinsonism phenotype of the patient.
Polymorphic regions
Genotyping the human leukocyte antigen (HLA) region, or the human major histocompatibility complex (MHC), is crucial for diagnosing autoimmune disorders and selection of donors in organ and stem cell transplantation. Genes in the region can be highly polymorphic, HLA-B being the most variable with >2000 alleles already annotated in 2012 (54). The high variability in sequence makes this region exceptionally difficult to map with short reads (54). HLA can be divided into three molecule classes and regions, termed class I, II and III, though the first two are primarily studied. Amplicons of ∼400–900 bp have been used with 454 sequencing to target specific exons of class I genes (55,56).
However, considering these genes are ∼3 kb in length, entire alleles, as opposed to exons, can be sequenced in a single PacBio read. Class II genes can exceed 10 kb, making them more difficult, but still possible. Full length class I HLA alleles have been targeted in humans with hybrid PacBio-Illumina approaches (57) and PacBio only approaches (58,59). Many large HLA typing labs, such as the Anthony Nolan Research Institute (58,59), are utilizing or developing SMRT sequencing pipelines of their own or using commercial kits, such as those offered by GenDx (Utrecht, The Netherlands), to now target class I, as well as many class II genes. This is rapidly expanding the number of known HLA alleles (57) and is becoming a gold standard for organ transplant genotyping and blood stem cell transplantation. Similarly complex regions can also be analyzed with these approaches. The killer cell immunoglobulin-like receptor (KIR) region, whose genes encode proteins with domains that recognize HLA proteins, was recently analyzed with SMRT sequencing and for the first time multiple haplotypes were phased without imputation (60).
Pseudogene discrimination
The high sequence similarity between pseudogenes and their homologous functional genes makes distinguishing variation between the two extremely difficult when using short read technologies. In general, long reads spanning the actual gene regions can be used to anchor to unique regions and/or phase variants to discriminate between the pseudogene and the actual gene. For diagnostics it is common to target a specific locus or set of loci of interest as a cost effective way to overcome the limited throughput of current generation SMRT sequencing platforms. The easiest option to enrich for specific loci is amplifying the targets by doing a (multiplex) long-range PCR (up to 10 kb).
To differentiate samples, barcodes can be added directly during PCR via primers (61,62), by a nested PCR approach (57,61,63,64), or by ligating hairpin adapters containing barcodes during library preparation (Pacific Biosciences Product Note: www.pacb.com/wp-content/uploads/2015/09/ProductNote-Barcoded-Adapters-Barcoded-Universal-Primers.pdf). Therefore, for multiplexed long-amplicon tests only a single library preparation is needed after pooling the barcoded amplicons, as opposed to fragmentation and multiple barcoded library preparations for short-read platforms. This enables fast, cheap library preparations that can be sequenced in just a few hours, permitting the next step in complex gene loci diagnoses. One application is using barcoded 6–8 kb amplicons, and potentially nested amplicons, to target the drug metabolism gene CYP2D6 (61,63). This gene has homologous pseudogenes and copy number variants which impair reliable genotyping with short-read platforms (61,63). After SMRT sequencing, reads can then be aligned and variants called using alignment based or ‘Long Amplicon Analysis’ (LAA, included in SMRT analysis) based pipelines. LAA is particularly powerful in that it enables reference free analyses and phasing of the two alleles (61). The pipeline first demultiplexes reads (if needed), then looks for overlap, performs clustering (i.e. determines different amplicons), phases the clustered reads (i.e. determines different alleles), and determines consensus sequences with Quiver (34). LAA may require optimization, such as the minimal number of reads used for clustering. Too many can result in false alleles and long run times, whereas too few may result in allelic dropouts. Once assembled, alleles can be compared to each other or to a reference genome for annotation.
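The order of operations in such an amplicon pipeline (demultiplex, separate alleles, build a consensus per allele) can be sketched with a toy example. This is not the LAA implementation and does not use Quiver: it assumes error-free barcodes at the start of each read, reads of equal length without indels, a single heterozygous site whose position is already known, and a plain per-column majority vote. All function names, barcodes and sequences are hypothetical.

```python
from collections import Counter, defaultdict


def demultiplex(reads, barcodes):
    """Assign each read to the sample whose barcode it starts with; strip the barcode."""
    by_sample = defaultdict(list)
    for read in reads:
        for sample, barcode in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break  # reads without a matching barcode are silently dropped
    return dict(by_sample)


def phase_by_site(reads, site):
    """Split reads into allele groups by the base observed at one heterozygous site."""
    groups = defaultdict(list)
    for read in reads:
        groups[read[site]].append(read)
    return dict(groups)


def consensus(reads):
    """Per-column majority vote over equal-length reads (stand-in for a real polisher)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


barcodes = {"sample1": "AAAA", "sample2": "CCCC"}  # hypothetical barcodes
pooled = [
    "AAAA" + "ACGTACGT", "AAAA" + "ACGTACGT", "AAAA" + "ACGTACGA",  # allele 1, one error
    "AAAA" + "ACGAACGT", "AAAA" + "ACGAACGT", "AAAA" + "ACCAACGT",  # allele 2, one error
    "CCCC" + "GGGGTTTT",
]

demux = demultiplex(pooled, barcodes)
alleles = {base: consensus(group)
           for base, group in phase_by_site(demux["sample1"], site=3).items()}
print(alleles)
```

The toy data also hints at the trade-off mentioned above: with only three reads per allele a single additional error at the same column would already corrupt the consensus, whereas more reads per cluster make the vote more robust.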
Overall, SMRT sequencing permits expanding from targeting specific CYP2D6 variants/exons, to identification of phased variants across the entire locus, including up/downstream and all introns, that will enhance identification of metabolizer phenotypes in tested individuals and enhance personalized medicine (61). Similar long-range PCR with PacBio applications have been used to genotype and discriminate other genes from pseudogenes (Table 2), including PKD1 for diagnosing autosomal-dominant polycystic kidney disease (64) and IKBKG for diagnosing primary immunodeficiency diseases in patients suffering from life-threatening invasive pyogenic bacterial infections (65).
Table 2. Applications of human SMRT sequencing and clinical utility
Target    Disease                                                                Ref.
Tandem repeat sequencing
FMR1      Fragile X Syndrome                                                     (43)a
HTT       Huntington's Disease                                                   a
C9orf72   Amyotrophic Lateral Sclerosis (ALS)                                    a
SCA10     Spinocerebellar ataxia type 10, Parkinson's disease                    (10,53)a
Highly polymorphic regions
HLA       Autoimmune disorders & transplantation                                 (57–59)
KIR       Autoimmune diseases & transplantation                                  (60)
Pseudogene discrimination
CYP2D6    Drug metabolism                                                        (61,63)
PKD1      Autosomal-dominant polycystic kidney disease                           (64)
IKBKG     Primary immunodeficiency diseases                                      (65)
Cancer
BCR-ABL1  Chronic Myeloid Leukemia (CML)                                         (69)
TP53      Myelodysplastic Syndromes (MDS) and Acute Myeloblastic Leukemia (AML)  (70)
Reproductive genomics
TCOF1     Treacher Collins syndrome                                              (67)
PTPN11    Noonan syndrome                                                        (67)
a bioRxiv https://doi.org/10.1101/203919.
REPRODUCTIVE GENOMICS
Reproductive genomic medicine and associated counseling, including pre-implantation genetic diagnosis (PGD), relies heavily on the ability to haplotype or phase alleles in embryos, patients, and parents. Long reads enable direct phasing of amplicons from targeted loci which can be used to determine parent-of-origin alleles in embryos or patients (66,67). In a family having one child with Treacher Collins syndrome, SMRT amplicon sequencing was used to confirm the paternal transmission of a TCOF1 variant that affects splicing of the gene and potentially causes the disease (67). For apparent de novo mutations that are a result of germ line mosaicism, determining the frequency of damaging alleles is informative in predicting recurrence in future offspring. For a couple with multiple miscarriages and suspected Noonan syndrome in the fetuses, SMRT amplicon sequencing identified a disease causing PTPN11 variant in 37% of the father's sperm (67). Digital Droplet PCR showed no signs of the variant in the father's blood, but confirmed the 40% frequency in the father's sperm (67). This therefore enabled an estimate of recurrence risk for subsequent pregnancies.
Whole-genome single-cell haplotyping based on arrays is already being used in practice for embryo selection before implantation, though phasing still requires additional family members (68). We envision a profound impact on future PGD applications by incorporating long-read whole-genome sequencing for direct phasing to eliminate the need for analyzing additional family members.
CANCER
During treatment of cancer patients, it is crucial to monitor low frequency mutations that can lead to a proliferative advantage of malignant cells. Chronic myeloid leukemia (CML) is a blood cancer that is caused by a translocation between chromosomes 9 and 22, giving rise to the BCR-ABL1 fusion protein. CML patients are normally treated with tyrosine kinase inhibitors (TKIs) to suppress BCR-ABL1, but the therapy can induce point mutations leading to drug resistance. It is therefore important to screen the BCR-ABL1 gene in CML patients responding poorly to TKI treatment and study the mutational landscape. In a study by Cavelier et al. (69), a ∼1.5 kb amplicon was constructed from BCR-ABL1 cDNA. SMRT sequencing allowed for detection of TKI resistance mutations down to a level of 1%, a significantly lower detection threshold as compared to the 15–20% reached by Sanger sequencing. Moreover, it was possible to phase co-existing mutations, thereby giving new information about the clonal distribution of resistance mutations in BCR-ABL1, and also to identify a number of distinct splice isoforms. Apart from BCR-ABL1, a number of other cancer genes are suitable targets for clinical SMRT sequencing (Table 2). In a study of loss-of-function mutations in the tumor suppressor TP53, SMRT sequencing revealed that tumors from acute myeloblastic leukemia (AML) and myelodysplastic syndrome (MDS) patients harbor multiple TP53 mutations distributed in different alleles (70). In the future, detailed information about the subclonal heterogeneity of TP53 could be used to guide the treatment of these patients.
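The difference between measuring mutation frequencies and phasing co-existing mutations can be made concrete with a small sketch. It assumes that every full-length amplicon read has already been reduced to the set of mutations it carries; the mutation labels, read counts and function names are invented for illustration and are not data from the cited studies.

```python
from collections import Counter


def mutation_frequency(reads_mutations, mutation):
    """Fraction of reads carrying a given mutation (unphased view)."""
    return sum(mutation in muts for muts in reads_mutations) / len(reads_mutations)


def clonal_distribution(reads_mutations):
    """Fraction of reads per exact mutation combination (phased view)."""
    counts = Counter(frozenset(muts) for muts in reads_mutations)
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}


# Hypothetical sample of 100 reads: 90 wild type, 6 with mutA only,
# 3 with mutB only and 1 read carrying both mutations on the same molecule.
reads = [set()] * 90 + [{"mutA"}] * 6 + [{"mutB"}] * 3 + [{"mutA", "mutB"}] * 1

print("mutA:", mutation_frequency(reads, "mutA"))
print("mutB:", mutation_frequency(reads, "mutB"))
for clone, fraction in clonal_distribution(reads).items():
    print(sorted(clone) or ["wild type"], fraction)
```

The per-mutation frequencies alone (7% and 4% here) cannot tell whether the two mutations sit in separate clones or on the same molecule; only reads spanning both sites reveal the 1% double-mutant population in this example.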
Minor variants can also be detected in other types of somatic variation, unrelated to cancer. Gudmunsson et al. (71) used SMRT sequencing to obtain phasing information of somatic mosaic mutations in GJB2 that led to the repair of skin lesions in a patient with keratitis-ichthyosis-deafness syndrome. Whole genome and transcriptome sequencing (addressed in later sections) are at the moment only affordable for research, but in the near future will become a diagnostic option. Already whole genome and transcriptome SMRT sequencing has been applied to breast cancer cell models, identifying novel gene fusion events with the known oncogene Her2 (Case Study: www.pacb.com/wp-content/uploads/Case-Study-Scientists-deconstruct-cancer-complexity-through-genome-and-transcriptome-analysis.pdf). Whole transcriptome sequencing of prostate cell models has also identified novel RLN1 and RLN2 gene fusions in prostate cancer (72). Importantly, SMRT sequencing can give a more precise view of the cancer gene structure, as was demonstrated in a study by Kohli et al. where a cryptic exon was detected in AR-V9 that was previously thought to be present only in AR-V7 (73). AR-V7 has been studied as a potential biomarker for drug resistance in prostate cancer, based on knockdown experiments that have in fact targeted both isoforms. Thus, AR-V9 may actually be a predictive biomarker for resistance. Global changes in epigenetics are also a hallmark of cancer. Single molecule real-time bisulfite sequencing (SMRT-BS) enables quantitative and highly multiplexed detection of methylation in 1.5–2 kb amplicons (74,75). This is an improvement over previous technologies that could only target typical bisulfite PCR sizes (∼300–500 bp) and potentially enables ∼91% of CpG islands in the human genome to be evaluated (75).
To date this has been applied to multiple cancer cell lines, including those from acute myeloid leukemia, chronic myeloid leukemia, anaplastic large cell lymphoma, plasma cell leukemia, Burkitt lymphoma, B-cell lymphoma and multiple myelomas (75). Expanding to genome wide diagnostics, when whole genome SMRT sequencing is performed on non-amplified material it is theoretically possible to determine epigenetic status across all nucleotides based on IPD ratios. Therefore, we envision that in the near future cancer genomes, transcriptomes and epigenomes will commonly be characterized at previously unparalleled resolution.
VIRAL AND MICROBIAL MEDICAL SEQUENCING
In infectious disease, SMRT sequencing has been used to analyse influenza viruses (76), hepatitis B viruses (HBV) (77), hepatitis C viruses (HCV) (77,78) and human immunodeficiency viruses (HIV) (79,80) (Table 3). HCV and HIV are RNA viruses with genomes of approximately 9 kb, while HBV is a circular DNA virus of size 3 kb. These viruses are suitable subjects for SMRT sequencing, since the entire virus genome can easily be contained in a single read. For example, Bull et al. (77) developed an assay where the resulting reads covered nearly the entire sequence for all six major HCV genotypes. In addition to determining the genome sequence of the infecting viruses, it is also possible to monitor mutations that are developing as a result of drug treatment. For HCV, resistance associated variants (RAVs) in the NS5A gene occurring at a frequency of <0.5% were successfully identified in samples from patients undergoing treatment by direct acting antiviral drugs (DAAs) (78). By full-length sequencing of the HIV-1 provirus, a 9700 bp molecule that encodes nine major proteins via alternative splicing, Ocwieja et al. (80) detected at least 109 different spliced RNAs, two of which encode new proteins.
The fact that this relatively small study could generate so much novel information about HIV-1, a virus that has already been studied in great detail, demonstrates the advantage of full-length RNA sequencing for studying the distribution of splicing isoforms in specific genes. Results from these types of experiments could possibly open up novel therapeutic opportunities in infectious disease.

Table 3. Medically relevant microbial SMRT sequencing

Target/disease                                                 Ref.
Hepatitis B/C virus                                            (77,78)
HIV                                                            (79,80)
Influenza viruses                                              (76)
Tuberculosis bacteria                                          (85)
E. coli / Hemolytic–Uremic Syndrome                            (86)
Salmonella enterica subsp. enterica serovar/gastroenteritis    (87)
Leishmania                                                     (88)
Leptospira interrogans/leptospirosis                           (90)
Helicobacter pylori strains/gastrointestinal diseases          (91)

For bacteria, a single SMRT Cell often provides enough data to de novo assemble Escherichia coli size genomes into single contigs. HGAP is the most widely used assembler; it selects the longest reads and error-corrects them using all reads, assembles the corrected reads with the Celera assembler (81,82), and finally polishes the assembly with all reads aligned to it (34). These long reads and new algorithms enable PacBio assemblies to be more complete and accurate than those from second-generation sequencing methods (83,84). Clinically relevant bacterial assemblies include a strain of the tuberculosis bacterium Mycobacterium tuberculosis (85), the E. coli strain that caused a Hemolytic–Uremic Syndrome outbreak in Germany in 2011 (86), and strains of Salmonella enterica subsp. enterica serovar that cause gastroenteritis in humans (87) (Table 3). PacBio sequencing and HGAP have also been used to assemble pathogenic single-cell eukaryote genomes that are more complex than a single chromosome, such as a new reference genome for Leishmania (88), a protozoan parasite that kills >30 000 people each year. Though long reads permit superb microbial assemblies, what truly differentiates SMRT sequencing from second-generation machines is the ability to directly determine the epigenetics of these organisms. DNA methylation is ubiquitous in bacterial genomes (89), which simplifies SMRT analysis of epigenetic characteristics in these organisms. Analyses can be performed using IPD ratios of cases versus controls, or versus an in silico control, and compared to known epigenetic signatures for 6-mA, 4-mC and (Tet-converted) 5-mC (available in SMRT analysis). This has been used to discriminate virulent from avirulent Leptospira interrogans, a cause of leptospirosis in humans (90). 
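The IPD-ratio comparison described above can be illustrated with a minimal Python sketch. It assumes that per-position inter-pulse durations (IPDs) have already been extracted for a case sample and a control sample; the function names, the dictionary layout and the fixed threshold of 2.0 are hypothetical choices made for illustration and are not taken from SMRT analysis, which relies on statistical models rather than a simple cutoff.

```python
from statistics import mean


def ipd_ratios(case_ipds, control_ipds):
    """Return {position: mean case IPD / mean control IPD}.

    Both inputs map a genomic position to a list of IPD values
    (hypothetical layout). Positions lacking data in either sample
    are skipped.
    """
    ratios = {}
    for pos, case_values in case_ipds.items():
        control_values = control_ipds.get(pos)
        if not case_values or not control_values:
            continue
        ratios[pos] = mean(case_values) / mean(control_values)
    return ratios


def flag_candidates(ratios, threshold=2.0):
    """List positions whose IPD ratio reaches an illustrative threshold."""
    return sorted(pos for pos, ratio in ratios.items() if ratio >= threshold)


if __name__ == "__main__":
    case = {10: [2.0, 4.0], 11: [1.0, 1.0], 12: [5.0]}
    control = {10: [1.0, 1.0], 11: [1.0, 1.0]}
    ratios = ipd_ratios(case, control)
    print(ratios)
    print(flag_candidates(ratios))
```

A ratio well above 1 at a position indicates that the polymerase paused longer in the case sample than in the control, which is the kind of signal that is then compared to known modification signatures.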
The genome sequences of the virulent and avirulent Leptospira interrogans strains show no major differences, but higher levels of methylation are found in the avirulent strain (90). Methylation analysis has also been used to identify virulence factor genotype-dependent motifs in eight different H. pylori strains, a bacterium that can lead to gastrointestinal diseases (91). The simplicity of sequencing, assembling, and calling nucleotide, structural and epigenetic variation for a complete genome from a single SMRT Cell makes SMRT sequencing a truly revolutionary technology in microbiology.

FUTURE: WHOLE TRANSCRIPTOME AND GENOME SEQUENCING

Traditionally, RNA is converted to cDNA and then fragmented for short-read sequencing (RNA-seq). Assembling the host of exons detected by RNA-seq into individual transcripts is extremely difficult and error prone. SMRT sequencing eliminates the need for fragmentation, instead sequencing cDNAs from the 5′ end of transcripts to the poly-A tail, an approach termed Iso-Seq. This is an ideal method for complete cDNA sequencing (92). Iso-Seq has been used to sequence full transcriptomes from the blood of a normal Chinese adult male (93), a pool of 20 RNAs from different normal human tissues and organs (92), and a trio of lymphoblastoid transcriptomes (94), and to analyse prostate and breast cancer cell models (73) (Case Study: www.pacb.com/wp-content/uploads/Case-Study-Scientists-deconstruct-cancer-complexity-through-genome-and-transcriptome-analysis.pdf). In contrast to complex short-read alignment and re-assembly, these papers demonstrate that long reads can easily detect splicing isoforms in human genes. Besides detecting a vast number of known isoforms, this method has also identified novel splicing forms and genes that have not previously been detected by short-read sequencing (93). Similar to genomic variant phasing, for gene loci with transcribed single nucleotide variants, the variants can be used to determine precisely which allele each isoform is expressed from (94). 
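The allele assignment of isoforms described above can be sketched in a few lines of Python. This is a simplified illustration only, assuming each full-length read has already been aligned and reduced to the bases it shows at known heterozygous single nucleotide variants; the function names, the `hap1`/`hap2` labels and the majority-vote rule are hypothetical and do not reproduce the method of the cited study.

```python
from collections import Counter, defaultdict


def assign_allele(read_bases, het_snvs):
    """Assign one read to 'hap1' or 'hap2' by majority vote over SNVs.

    read_bases: {position: base} observed in the read.
    het_snvs:   {position: (base_on_hap1, base_on_hap2)}.
    Returns None when the read covers no informative SNV or is tied.
    """
    votes = Counter()
    for pos, (hap1_base, hap2_base) in het_snvs.items():
        base = read_bases.get(pos)
        if base == hap1_base:
            votes["hap1"] += 1
        elif base == hap2_base:
            votes["hap2"] += 1
    if votes["hap1"] > votes["hap2"]:
        return "hap1"
    if votes["hap2"] > votes["hap1"]:
        return "hap2"
    return None


def isoforms_per_allele(reads, het_snvs):
    """Count isoforms per allele from (isoform_id, read_bases) pairs."""
    counts = defaultdict(Counter)
    for isoform_id, read_bases in reads:
        allele = assign_allele(read_bases, het_snvs)
        if allele is not None:
            counts[allele][isoform_id] += 1
    return counts


if __name__ == "__main__":
    snvs = {100: ("A", "G"), 250: ("C", "T")}
    reads = [
        ("isoform_1", {100: "A", 250: "C"}),
        ("isoform_1", {100: "A"}),
        ("isoform_2", {100: "G", 250: "T"}),
        ("isoform_2", {300: "A"}),  # covers no informative SNV
    ]
    print(dict(isoforms_per_allele(reads, snvs)))
```

Because a single long read spans both the splice junctions and the variants, the isoform and its allele of origin are observed on the same molecule, so no statistical phasing across reads is needed in this sketch.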
Though Iso-Seq is exceptional for transcript structure determination, its lower throughput compared to second-generation platforms currently limits its usage for expression analysis. However, as costs drop and throughput increases, unbiased PacBio expression and isoform detection will become routine in the near future.

Whole genome sequencing (WGS) has become a widely used method to study variation in the human genome, and several hundred thousand human genomes have been sequenced with short reads during the last few years. However, the nature of these reads permits only relatively small assemblies, and alignments provide only limited information on variation outside of SNPs and small insertions/deletions. SMRT sequencing is greatly expanding the utility of WGS, permitting substantially greater assembly completeness (93,95) (BioRxiv: https://doi.org/10.1101/067447), even nearing reference genome contig sizes and including diploid-aware assemblies by applying algorithms like FALCON-unzip (37). These PacBio WGS studies also demonstrate a vast repertoire of variation missed by short-read WGS. Low coverage (4–8×) sequencing was recently used to characterize structural variation in chromothripsis-like chromosomes (96) and to identify a pathogenic heterozygous 2184 bp deletion in a patient who presented with Carney complex that could not be identified by short-read sequencing (97). Higher coverage sequencing (∼60×) of two haploid genomes has also been used to identify a vast array of structural variations (461 553, from 2 bp to 28 kb in length), >89% of which were missed in the analysis of data from the 1000 Genomes Project (98). From this study, Huddleston et al. (98) estimate a 5× increase in the discovery of indels >7 bp and additional SVs <1 kb, which in total base pairs represent a majority of the difference between genomes. Another remarkable finding from individual human de novo assemblies is that there seem to exist several megabases of novel sequence, i.e. 
sequences that are absent from the current (GRCh38) version of the human reference. For example, Shi et al. (93) reported 12.8 Mb of novel sequence in their de novo assembled individual genome, which corresponds to over 0.4% of the entire human genome of size ∼3 Gb (12.8 Mb / 3000 Mb ≈ 0.43%). At this point, it is not known whether this novel sequence is common to all human individuals (and thereby missing from GRCh38) or whether it mainly represents sequence variation found only in some specific individuals or population groups. Overall, these WGS studies demonstrate that long-read sequencing can identify a substantial amount of variation missed by short-read platforms, including variation relevant to clinical diagnoses.

CONCLUSIONS

The myth that SMRT sequencing is too error prone to be diagnostically useful is being dispelled and replaced by evidence that it offers advantages over short-read sequencers. SMRT sequencing is opening up new diagnostic avenues, such as the ability to determine tandem repeat lengths, interruptions, and even epigenetics in a single test at base pair resolution. Long-read sequencing is already considered the gold standard for some applications, such as HLA genotyping for tissue transplants. While large scale implementation appears to be hampered by cost and limited community expertise, this is likely to change rapidly. In addition to systematic price reductions and a growing customer base, new single molecule technologies such as nanopore-based systems are likely to propel the field. Just as second-generation platforms stepped beyond Sanger sequencing and enabled a revolution in genomic medicine, third-generation single molecule sequencing platforms will likely be the next genetic diagnostic revolution.

ACKNOWLEDGEMENTS

We wish to thank Vicky Van Sandt (Belgian Red Cross-Flanders) for constructive comments on HLA typing. 
FUNDING

KU Leuven [SymBioSys PFV/10/016, GOA/12/015 to J.R.V.]; Hercules foundation [ZW11–14]; Agency for Innovation by Science and Technology (IWT) (PhD grant) [SB/131787]. Funding for open access charge: KU Leuven [SymBioSys PFV/10/016, GOA/12/015 to J.R.V.]; Hercules foundation [ZW11–14]; Agency for Innovation by Science and Technology (IWT) (PhD grant) [SB/131787].

Conflict of interest statement. None declared.

REFERENCES

1. Katsanis S.H., Katsanis N. Molecular genetic testing and the future of clinical genomics. Nat. Rev. Genet. 2013; 14: 415–426.
2. Vermeesch J.R., Voet T., Devriendt K. Prenatal and pre-implantation genetic diagnosis. Nat. Rev. Genet. 2016; 17: 643–656.
3. Heather J.M., Chain B. The sequence of sequencers: The history of sequencing DNA. Genomics. 2016; 107: 1–8.
4. Sanger F., Nicklen S., Coulson R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 1977; 74: 5463–5467.
5. Krier J.B., Kalia S.S., Green R.C. Genomic sequencing in clinical practice: applications, challenges, and opportunities. Dialogues Clin. Neurosci. 2016; 18: 299–312.
6. Levy S.E., Myers R.M. Advancements in next-generation sequencing. Annu. Rev. Genomics Hum. Genet. 2016; 17: 95–115.
7. Koboldt D.C., Steinberg K.M., Larson D.E., Wilson R.K., Mardis E.R. The next-generation sequencing revolution and its impact on genomics. Cell. 2013; 155: 27–38.
8. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011; 27: 2987–2993.
9. Browning S.R., Browning B.L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 2011; 12: 703–714.
10. McFarland K.N., Liu J., Landrian I., Godiska R., Shanker S., Yu F., Farmerie W.G., Ashizawa T. SMRT Sequencing of Long Tandem Nucleotide Repeats in SCA10 Reveals Unique Insight of Repeat Expansion Structure. PLoS One. 2015; 10: e0135906.
11. Schatz M.C., Delcher A.L., Salzberg S.L. Assembly of large genomes using second-generation sequencing. Genome Res. 2010; 20: 1165–1173.
12. Alkan C., Sajjadian S., Eichler E.E. Limitations of next-generation genome sequence assembly. Nat. Methods. 2011; 8: 61–65.
13. Guan P., Sung W.K. Structural variation detection using next-generation sequencing data: a comparative technical review. Methods. 2016; 102: 36–49.
14. Harris T.D., Buzby P.R., Babcock H., Beer E., Bowers J., Braslavsky I., Causey M., Colonell J., Dimeo J., Efcavitch J.W. et al. Single-molecule DNA sequencing of a viral genome. Science. 2008; 320: 106–109.
15. Eid J., Fehr A., Gray J., Luong K., Lyle J., Otto G., Peluso P., Rank D., Baybayan P., Bettman B. et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009; 323: 133–138.
16. Clarke J., Wu H.C., Jayasinghe L., Patel A., Reid S., Bayley H. Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol. 2009; 4: 265–270.
17. Flusberg B.A., Webster D.R., Lee J.H., Travers K.J., Olivares E.C., Clark T.A., Korlach J., Turner S.W. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods. 2010; 7: 461–465.
18. Jain M., Olsen H.E., Paten B., Akeson M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 2016; 17: 239.
19. Deamer D., Akeson M., Branton D. Three decades of nanopore sequencing. Nat. Biotechnol. 2016; 34: 518–524.
20. Lu H., Giordano F., Ning Z. Oxford Nanopore MinION Sequencing and Genome Assembly. Genomics Proteomics Bioinformatics. 2016; 14: 265–279.
21. Travers K.J., Chin C.S., Rank D.R., Eid J.S., Turner S.W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 2010; 38: e159.
22. Rhoads A., Au K.F. PacBio sequencing and its applications. Genomics Proteomics Bioinformatics. 2015; 13: 278–289.
23. Eid J., Fehr A., Gray J., Luong K., Lyle J., Otto G., Peluso P., Rank D., Baybayan P., Bettman B. et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009; 323: 133–138.
24. Schadt E.E., Banerjee O., Fang G., Feng Z., Wong W.H., Zhang X., Kislyuk A., Clark T.A., Luong K., Keren-Paz A. et al. Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases. Genome Res. 2013; 23: 129–141.
25. Chaisson M.J., Wilson R.K., Eichler E.E. Genetic variation and the de novo assembly of human genomes. Nat. Rev. Genet. 2015; 16: 627–640.
26. Carneiro M.O., Russ C., Ross M.G., Gabriel S.B., Nusbaum C., Depristo M.A. Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics. 2012; 13: 375.
27. Larkin J., Henley R.Y., Jadhav V., Korlach J., Wanunu M. Length-independent DNA packing into nanopore zero-mode waveguides for low-input DNA sequencing. Nat. Nanotechnol. 2017; 12: 1169–1175.
28. Li H., Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010; 26: 589–595.
29. Chaisson M.J., Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012; 13: 238.
30. Krizanovic K., Echchiki A., Roux J., Sikic M. Evaluation of tools for long read RNA-seq splice-aware alignment. Bioinformatics. 2017; doi:10.1093/bioinformatics/btx668.
31. Wu T.D., Reeder J., Lawrence M., Becker G., Brauer M.J. GMAP and GSNAP for genomic sequence alignment: enhancements to speed, accuracy, and functionality. Methods Mol. Biol. 2016; 1418: 283–334.
32. Liu B., Guan D., Teng M., Wang Y. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics. 2016; 32: 1625–1631.
33. Koren S., Schatz M.C., Walenz B.P., Martin J., Howard J.T., Ganapathy G., Wang Z., Rasko D.A., McCombie W.R., Jarvis E.D. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 2012; 30: 693–700.
34. Chin C.S., Alexander D.H., Marks P., Klammer A.A., Drake J., Heiner C., Clum A., Copeland A., Huddleston J., Eichler E.E. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods. 2013; 10: 563–569.
35. Vaser R., Sovic I., Nagarajan N., Sikic M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017; 27: 737–746.
36. Kamath G.M., Shomorony I., Xia F., Courtade T.A., Tse D.N. HINGE: long-read assembly achieves optimal repeat resolution. Genome Res. 2017; 27: 747–756.
37. Chin C.S., Peluso P., Sedlazeck F.J., Nattestad M., Concepcion G.T., Clum A., Dunn C., O’Malley R., Figueroa-Balderas R., Morales-Cruz A. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods. 2016; 13: 1050–1054.
38. Xiao C.L., Chen Y., Xie S.Q., Chen K.N., Wang Y., Han Y., Luo F., Xie Z. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat. Methods. 2017; 14: 1072–1074.
39. Koren S., Walenz B.P., Berlin K., Miller J.R., Bergman N.H., Phillippy A.M. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017; 27: 722–736.
40. Schmidt M.H., Pearson C.E. Disease-associated repeat instability and mismatch repair. DNA Repair (Amst.). 2016; 38: 117–126.
41. Loomis E.W., Eid J.S., Peluso P., Yin J., Hickey L., Rank D., McCalmon S., Hagerman R.J., Tassone F., Hagerman P.J. Sequencing the unsequenceable: expanded CGG-repeat alleles of the fragile X gene. Genome Res. 2013; 23: 121–128.
42. Yrigollen C.M., Martorell L., Durbin-Johnson B., Naudo M., Genoves J., Murgia A., Polli R., Zhou L., Barbouth D., Rupchock A. et al. AGG interruptions and maternal age affect FMR1 CGG repeat allele stability during transmission. J. Neurodev. Disord. 2014; 6: 24.
43. Ardui S., Race V., Zablotskaya A., Hestand M.S., Van Esch H., Devriendt K., Matthijs G., Vermeesch J.R. Detecting AGG interruptions in male and female FMR1 premutation carriers by single-molecule sequencing. Hum. Mutat. 2017; 38: 324–331.
44. Chen L., Hadd A., Sah S., Filipovic-Sadic S., Krosting J., Sekinger E., Pan R., Hagerman P.J., Stenzel T.T., Tassone F. et al. An information-rich CGG repeat primed PCR that detects the full range of fragile X expanded alleles and minimizes the need for southern blot analysis. J. Mol. Diagn. 2010; 12: 589–600.
45. Hayward B.E., Usdin K. Improved assays for AGG interruptions in fragile X premutation carriers. J. Mol. Diagn. 2017; 19: 828–835.
46. Musova Z., Mazanec R., Krepelova A., Ehler E., Vales J., Jaklova R., Prochazka T., Koukal P., Marikova T., Kraus J. et al. Highly unstable sequence interruptions of the CTG repeat in the myotonic dystrophy gene. Am. J. Med. Genet. A. 2009; 149: 1365–1374.
47. Holloway T.P., Rowley S.M., Delatycki M.B., Sarsero J.P. Detection of interruptions in the GAA trinucleotide repeat expansion in the FXN gene of Friedreich ataxia. Biotechniques. 2011; 50: 182–186.
48. Pham T.T., Yin J., Eid J.S., Adams E., Lam R., Turner S.W., Loomis E.W., Wang J.Y., Hagerman P.J., Hanes J.W. Single-locus enrichment without amplification for sequencing and direct detection of epigenetic modifications. Mol. Genet. Genomics. 2016; 291: 1491–1504.
49. Pretto D., Yrigollen C.M., Tang H.T., Williamson J., Espinal G., Iwahashi C.K., Durbin-Johnson B., Hagerman R.J., Hagerman P.J., Tassone F. Clinical and molecular implications of mosaicism in FMR1 full mutations. Front. Genet. 2014; 5: 318.
50. Pretto D.I., Eid J.S., Yrigollen C.M., Tang H.T., Loomis E.W., Raske C., Durbin-Johnson B., Hagerman P.J., Tassone F. Differential increases of specific FMR1 mRNA isoforms in premutation carriers. J. Med. Genet. 2015; 52: 42–52.
51. Usdin K., Hayward B.E., Kumari D., Lokanga R.A., Sciascia N., Zhao X.N. Repeat-mediated genetic and epigenetic changes at the FMR1 locus in the Fragile X-related disorders. Front. Genet. 2014; 5: 226.
52. Dion V., Wilson J.H. Instability and chromatin structure of expanded trinucleotide repeats. Trends Genet. 2009; 25: 288–297.
53. Schule B., McFarland K.N., Lee K., Tsai Y.C., Nguyen K.D., Sun C., Liu M., Byrne C., Gopi R., Huang N. et al. Parkinson's disease associated with pure ATXN10 repeat expansion. NPJ Parkinsons Dis. 2017; 3: 27.
54. Trowsdale J., Knight J.C. Major histocompatibility complex genomics and human disease. Annu. Rev. Genomics Hum. Genet. 2013; 14: 301–323.
55. Gabriel C., Danzer M., Hackl C., Kopal G., Hufnagl P., Hofer K., Polin H., Stabentheiner S., Proll J. Rapid high-throughput human leukocyte antigen typing by massively parallel pyrosequencing for high-resolution allele identification. Hum. Immunol. 2009; 70: 960–964.
56. Erlich R.L., Jia X., Anderson S., Banks E., Gao X., Carrington M., Gupta N., DePristo M.A., Henn M.R., Lennon N.J. et al. Next-generation sequencing for HLA typing of class I loci. BMC Genomics. 2011; 12: 42.
57. Albrecht V., Zweiniger C., Surendranath V., Lang K., Schofl G., Dahl A., Winkler S., Lange V., Bohme I., Schmidt A.H. Dual redundant sequencing strategy: Full-length gene characterisation of 1056 novel and confirmatory HLA alleles. HLA. 2017; 90: 79–87.
58. Mayor N.P., Robinson J., McWhinnie A.J., Ranade S., Eng K., Midwinter W., Bultitude W.P., Chin C.S., Bowman B., Marks P. et al. HLA typing for the next generation. PLoS One. 2015; 10: e0127153.
59. Turner T.R., Hayhurst J.D., Hayward D.R., Bultitude W.P., Barker D.J., Robinson J., Madrigal J.A., Mayor N.P., Marsh S.G.E. Single molecule real-time (SMRT(R)) DNA sequencing of HLA genes at ultra-high resolution from 126 International HLA and Immunogenetics Workshop cell lines. HLA. 2017; doi:10.1111/tan.13184.
60. Roe D., Vierra-Green C., Pyo C.W., Eng K., Hall R., Kuang R., Spellman S., Ranade S., Geraghty D.E., Maiers M. Revealing complete complex KIR haplotypes phased by long-read sequencing technology. Genes Immun. 2017; 18: 127–134.
61. Buermans H.P., Vossen R.H., Anvar S.Y., Allard W.G., Guchelaar H.J., White S.J., den Dunnen J.T., Swen J.J., van der Straaten T. Flexible and scalable full-length CYP2D6 long amplicon PacBio sequencing. Hum. Mutat. 2017; 38: 310–316.
62. Hestand M.S., Van Houdt J., Cristofoli F., Vermeesch J.R. Polymerase specific error rates and profiles identified by single molecule sequencing. Mutat. Res. 2016; 784–785: 39–45.
63. Qiao W., Yang Y., Sebra R., Mendiratta G., Gaedigk A., Desnick R.J., Scott S.A. Long-read single molecule real-time full gene sequencing of cytochrome P450-2D6. Hum. Mutat. 2016; 37: 315–323.
64. Borras D.M., Vossen R., Liem M., Buermans H.P.J., Dauwerse H., van Heusden D., Gansevoort R.T., den Dunnen J.T., Janssen B., Peters D.J.M. et al. Detecting PKD1 variants in polycystic kidney disease patients by single-molecule long-read sequencing. Hum. Mutat. 2017; 38: 870–879.
65. Frans G., Meert W., Van der Werff Ten Bosch J., Meyts I., Bossuyt X., Vermeesch J.R., Hestand M.S. Conventional and single-molecule targeted sequencing method for specific variant detection in IKBKG whilst bypassing the IKBKGP1 pseudogene. J. Mol. Diagn. 2017; doi:10.1016/j.jmoldx.2017.10.005.
66. Mensah M.A., Hestand M.S., Larmuseau M.H., Isrie M., Vanderheyden N., Declercq M., Souche E.L., Van Houdt J., Stoeva R., Van Esch H. et al. Pseudoautosomal region 1 length polymorphism in the human population. PLoS Genet. 2014; 10: e1004578.
67. Wilbe M., Gudmundsson S., Johansson J., Ameur A., Stattin E.L., Anneren G., Malmgren H., Frykholm C., Bondeson M.L. A novel approach using long-read sequencing and ddPCR to investigate gonadal mosaicism and estimate recurrence risk in two families with developmental disorders. Prenat. Diagn. 2017; 37: 1146–1154.
68. Dimitriadou E., Melotte C., Debrock S., Esteki M.Z., Dierickx K., Voet T., Devriendt K., de Ravel T., Legius E., Peeraer K. et al. Principles guiding embryo selection following genome-wide haplotyping of preimplantation embryos. Hum. Reprod. 2017; 32: 687–697.
69. Cavelier L., Ameur A., Haggqvist S., Hoijer I., Cahill N., Olsson-Stromberg U., Hermanson M. Clonal distribution of BCR-ABL1 mutations and splice isoforms by single-molecule long-read RNA sequencing. BMC Cancer. 2015; 15: 45.
70. Lode L., Ameur A., Coste T., Menard A., Richebourg S., Gaillard J.B., Le Bris Y., Bene M.C., Lavabre-Bertrand T., Soussi T. Single-molecule DNA sequencing of acute myeloid leukemia and myelodysplastic syndromes with multiple TP53 alterations. Haematologica. 2017; 103: e13–e16.
71. Gudmundsson S., Wilbe M., Ekvall S., Ameur A., Cahill N., Alexandrov L.B., Virtanen M., Hellstrom Pigg M., Vahlquist A., Torma H. et al. Revertant mosaicism repairs skin lesions in a patient with keratitis-ichthyosis-deafness syndrome by second-site mutations in connexin 26. Hum. Mol. Genet. 2017; 26: 1070–1077.
72. Tevz G., McGrath S., Demeter R., Magrini V., Jeet V., Rockstroh A., McPherson S., Lai J., Bartonicek N., An J. et al. Identification of a novel fusion transcript between human relaxin-1 (RLN1) and human relaxin-2 (RLN2) in prostate cancer. Mol. Cell Endocrinol. 2016; 420: 159–168.
73. Kohli M., Ho Y., Hillman D.W., Van Etten J.L., Henzler C., Yang R., Sperger J.M., Li Y., Tseng E., Hon T. et al. Androgen receptor variant AR-V9 Is coexpressed with AR-V7 in prostate cancer metastases and predicts abiraterone resistance. Clin. Cancer Res. 2017; 23: 4704–4715.
74. Yang Y., Scott S.A. DNA methylation profiling using long-read single molecule real-time bisulfite sequencing (SMRT-BS). Methods Mol. Biol. 2017; 1654: 125–134.
75. Yang Y., Sebra R., Pullman B.S., Qiao W., Peter I., Desnick R.J., Geyer C.R., DeCoteau J.F., Scott S.A. Quantitative and multiplexed DNA methylation analysis using long-read single-molecule real-time bisulfite sequencing (SMRT-BS). BMC Genomics. 2015; 16: 350.
76. Nakano K., Shiroma A., Shimoji M., Tamotsu H., Ashimine N., Ohki S., Shinzato M., Minami M., Nakanishi T., Teruya K. et al. Advantages of genome sequencing by long-read sequencer using SMRT technology in medical area. Hum. Cell. 2017; 30: 149–161.
77. Bull R.A., Eltahla A.A., Rodrigo C., Koekkoek S.M., Walker M., Pirozyan M.R., Betz-Stablein B., Toepfer A., Laird M., Oh S. et al. A method for near full-length amplification and sequencing for six hepatitis C virus genotypes. BMC Genomics. 2016; 17: 247.
78. Bergfors A., Leenheer D., Bergqvist A., Ameur A., Lennerstrand J. Analysis of hepatitis C NS5A resistance associated polymorphisms using ultra deep single molecule real time (SMRT) sequencing. Antiviral Res. 2016; 126: 81–89.
79. Dilernia D.A., Chien J.T., Monaco D.C., Brown M.P., Ende Z., Deymier M.J., Yue L., Paxinos E.E., Allen S., Tirado-Ramos A. et al. Multiplexed highly-accurate DNA sequencing of closely-related HIV-1 variants using continuous long reads from single molecule, real-time sequencing. Nucleic Acids Res. 2015; 43: e129.
80. Ocwieja K.E., Sherrill-Mix S., Mukherjee R., Custers-Allen R., David P., Brown M., Wang S., Link D.R., Olson J., Travers K. et al. Dynamic regulation of HIV-1 mRNA populations analyzed by single-molecule enrichment and long-read sequencing. Nucleic Acids Res. 2012; 40: 10345–10355.
81. Myers E.W., Sutton G.G., Delcher A.L., Dew I.M., Fasulo D.P., Flanigan M.J., Kravitz S.A., Mobarry C.M., Reinert K.H., Remington K.A. et al. A whole-genome assembly of Drosophila. Science. 2000; 287: 2196–2204.
82. Miller J.R., Delcher A.L., Koren S., Venter E., Walenz B.P., Brownley A., Johnson J., Li K., Mobarry C., Sutton G. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008; 24: 2818–2824.
83. Miyamoto M., Motooka D., Gotoh K., Imai T., Yoshitake K., Goto N., Iida T., Yasunaga T., Horii T., Arakawa K. et al. Performance comparison of second- and third-generation sequencers using a bacterial genome with two chromosomes. BMC Genomics. 2014; 15: 699.
84. Powers J.G., Weigman V.J., Shu J., Pufky J.M., Cox D., Hurban P. Efficient and accurate whole genome assembly and methylome profiling of E. coli. BMC Genomics. 2013; 14: 675.
85. Miyoshi-Akiyama T., Satou K., Kato M., Shiroma A., Matsumura K., Tamotsu H., Iwai H., Teruya K., Funatogawa K., Hirano T. et al. Complete annotated genome sequence of Mycobacterium tuberculosis (Zopf) Lehmann and Neumann (ATCC35812) (Kurono). Tuberculosis (Edinb.). 2015; 95: 37–39.
86. Rasko D.A., Webster D.R., Sahl J.W., Bashir A., Boisen N., Scheutz F., Paxinos E.E., Sebra R., Chin C.S., Iliopoulos D. et al. Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. N. Engl. J. Med. 2011; 365: 709–717.
87. Yao K., Muruvanda T., Roberts R.J., Payne J., Allard M.W., Hoffmann M. Complete Genome and Methylome Sequences of Salmonella enterica subsp. enterica Serovar Panama (ATCC 7378) and Salmonella enterica subsp. enterica Serovar Sloterdijk (ATCC 15791). Genome Announc. 2016; 4: e00133-16.
88. Dumetz F., Imamura H., Sanders M., Seblova V., Myskova J., Pescher P., Vanaerschot M., Meehan C.J., Cuypers B., De Muylder G. et al. Modulation of Aneuploidy in Leishmania donovani during Adaptation to Different In Vitro and In Vivo Environments and Its Impact on Gene Expression. MBio. 2017; 8: e00599-17.
89. Blow M.J., Clark T.A., Daum C.G., Deutschbauer A.M., Fomenkov A., Fries R., Froula J., Kang D.D., Malmstrom R.R., Morgan R.D. et al. The epigenomic landscape of prokaryotes. PLoS Genet. 2016; 12: e1005854.
90. Satou K., Shimoji M., Tamotsu H., Juan A., Ashimine N., Shinzato M., Toma C., Nohara T., Shiroma A., Nakano K. et al. Complete genome sequences of low-passage virulent and high-passage avirulent variants of pathogenic Leptospira interrogans Serovar Manilae Strain UP-MMC-NIID, originally isolated from a patient with severe Leptospirosis, determined using PacBio single-molecule real-time technology. Genome Announc. 2015; 3: e00882-15.
91. Satou K., Shiroma A., Teruya K., Shimoji M., Nakano K., Juan A., Tamotsu H., Terabayashi Y., Aoyama M., Teruya M. et al. Complete genome sequences of eight Helicobacter pylori strains with different virulence factor genotypes and methylation profiles, isolated from patients with diverse gastrointestinal diseases on Okinawa Island, Japan, determined using PacBio single-molecule real-time technology. Genome Announc. 2014; 2: e00286-14.
92. Sharon D., Tilgner H., Grubert F., Snyder M. A single-molecule long-read survey of the human transcriptome. Nat. Biotechnol. 2013; 31: 1009–1014.
93. Shi L., Guo Y., Dong C., Huddleston J., Yang H., Han X., Fu A., Li Q., Li N., Gong S. et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat. Commun. 2016; 7: 12065.
94. Tilgner H., Grubert F., Sharon D., Snyder M.P. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl. Acad. Sci. U.S.A. 2014; 111: 9869–9874.
95. Seo J.S., Rhie A., Kim J., Lee S., Sohn M.H., Kim C.U., Hastie A., Cao H., Yun J.Y., Kim J. et al. De novo assembly and phasing of a Korean human genome. Nature. 2016; 538: 243–247.
96. Masset H., Hestand M.S., Van Esch H., Kleinfinger P., Plaisancie J., Afenjar A., Molignier R., Schluth-Bolard C., Sanlaville D., Vermeesch J.R. A distinct class of chromoanagenesis events characterized by focal copy number gains. Hum. Mutat. 2016; 37: 661–668.
97. Merker J.D., Wenger A.M., Sneddon T., Grove M., Zappala Z., Fresard L., Waggott D., Utiramerur S., Hou Y., Smith K.S. et al. Long-read genome sequencing identifies causal structural variation in a Mendelian disease. Genet Med. 2018; 20: 159–163.
98. Huddleston J., Chaisson M.J.P., Steinberg K.M., Warren W., Hoekzema K., Gordon D., Graves-Lindsay T.A., Munson K.M., Kronenberg Z.N., Vives L. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 2017; 27: 677–685.
Author notes Present address: Matthew S. Hestand, Division of Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA. © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Splicing regulation by long noncoding RNAs
Romero-Barrios, Natali; Legascue, Maria Florencia; Benhamed, Moussa; Ariel, Federico; Crespi, Martin
doi: 10.1093/nar/gky095 pmid: 29425321
Abstract
Massive high-throughput sequencing techniques allowed the identification of thousands of noncoding RNAs (ncRNAs) and a plethora of different mRNA processing events occurring in higher organisms. Long ncRNAs can act directly as long transcripts or can be processed into active small si/miRNAs. They can modulate mRNA cleavage, translational repression or the epigenetic landscape of their target genes. Recently, certain long ncRNAs have been shown to play a crucial role in the regulation of alternative splicing in response to several stimuli or during disease. In this review, we focus on recent discoveries linking gene regulation by alternative splicing and its modulation by long and small ncRNAs.

INTRODUCTION
Eukaryotic genes are often discontinuous, with noncoding (intronic) and coding (exonic) DNA sequences. Therefore, in order to produce a mature mRNA that can be translated into a protein, introns need to be removed and exons joined by a process called splicing. In certain circumstances, splicing reactions can be modulated, generating two or more mRNA isoforms from a single pre-mRNA by a process called alternative splicing (AS). In recent years, improvements in our ability to sequence entire genomes and complete pools of transcripts allowed the identification of a plethora of different mRNA isoforms in higher organisms. More than 90% of intron-containing genes in humans and over 60% in plants are alternatively spliced in controlled conditions (1–4). The significant diversity in the number of transcripts in comparison to the number of genes emphasizes the extensive regulation occurring at transcriptional and post-transcriptional levels (5). Many genes have been found to produce tissue- or condition-dependent isoforms (6). In humans, numerous studies point out a link between RNA splicing misregulation and several diseases (7–10).
In plants, under stress conditions, AS plays an important role in the control of gene expression for an adequate response (11–14). Interestingly, the identification of alternatively spliced transcripts revealed a high number of genes encoding multiple proteins which may perform relevant functions in gene expression, such as splicing regulators and transcription factors (15–20). In fact, some AS features seem to be conserved across species, and genes homologous to common splicing regulators are also conserved (20). Growing evidence indicates that AS is an important mechanism that controls gene expression by (i) increasing gene-coding capacities, and thus proteome complexity, through the generation of different mRNA isoforms, and/or (ii) facilitating mRNA degradation through the introduction of a premature termination codon in specific isoforms, which leads to nonsense-mediated decay (NMD). Recognition of the splicing site can be modulated by cis-regulatory sequences, known as splicing enhancers or silencers, which contribute to the generation of two or more alternatively spliced mRNAs from the same pre-mRNA. Computational analyses of genome-wide RNA sequencing have allowed the identification of splicing patterns in different organisms under various specific conditions, showing the dynamic nature of AS regulation in gene expression (21–23). AS defects are also related to various diseases in mammals (for review, see (24–26)), and to perturbations of plant responses to environmental cues (11,17,27,28). Besides the identification of numerous AS events on mRNAs, next generation sequencing technologies have facilitated the identification of thousands of RNAs with no or low coding potential (the so-called noncoding RNAs, ncRNAs). Globally, ncRNAs are classified by their size.
The long ncRNAs (lncRNAs, over 200 nt) act directly in a long form by lncRNA-protein interactions, whereas the small ncRNAs (smRNAs) act by smRNA-protein interactions and base-pairing recognition of their mRNA targets. They were found to be modulated by different stimuli, exerting regulatory roles in gene expression through their involvement in mRNA cleavage, translational repression or epigenetic DNA/chromatin modification of their targets, hence leading to a wide range of biological outputs required for cell viability and function (29–31). It has been known for a long time that splicing mechanisms are regulated by small ncRNAs named nuclear uridine (U)-rich RNAs (U snRNAs). This group of noncoding RNAs catalyzes each step of the splicing reaction (for review, see (32)) in collaboration with core small nuclear ribonucleoprotein complex subunits (snRNPs), whereas more than 200 non-snRNP splicing factors (SFs) (33–35) fine-tune complex splicing regulations. Recently, a growing number of ncRNAs have emerged as modulators of AS of specific genes. In this review, we discuss our current knowledge of long and small ncRNAs identified as splicing regulators and their mechanisms of action.

REGULATION OF ALTERNATIVE SPLICING BY LONG NONCODING RNAs
In the last few years, the identification of large numbers of lncRNAs induced or repressed in particular conditions or diseases suggested their possible implication in a wide range of biological processes. Detailed studies of some of them in different eukaryotic organisms have uncovered several lncRNA-mediated mechanisms fine-tuning almost every step of gene expression, including chromatin remodeling, transcriptional control, co- and posttranscriptional regulation, miRNA processing, and protein stability during different developmental processes (36–40). In particular, a growing number of lncRNAs have been linked to the modulation of AS in both plants and animals (Table 1).
The main mechanisms involving lncRNAs in AS modulation can be grouped into three classes: (i) lncRNAs interacting with specific SFs; (ii) lncRNAs forming RNA-RNA duplexes with pre-mRNA molecules; and (iii) lncRNAs affecting chromatin remodelling, thus fine-tuning the splicing of target genes. Sub-classifications based on similarities and differences among mechanisms are proposed below.

Table 1. Developmental processes and diseases related to splicing-associated lncRNAs from the animal (background in red) and plant (in green) kingdoms

LONG NONCODING RNAs AS SPLICING FACTOR INTERACTORS
One of the first indications of a role for lncRNAs in splicing came from a genome-wide screen in human and mouse cells that identified, in both cases, the following nuclear noncoding transcripts: NUCLEAR PARASPECKLE ASSEMBLY TRANSCRIPT 1 (NEAT1) and NUCLEAR PARASPECKLE ASSEMBLY TRANSCRIPT 2 (NEAT2) / METASTASIS ASSOCIATED LUNG ADENOCARCINOMA TRANSCRIPT 1 (MALAT1). RNA FISH analyses revealed an intimate association of NEAT1 and MALAT1 with the SC35 SF-containing nuclear speckles in both human and mouse cells, suggesting their participation in mRNA splicing. These studies also showed that NEAT1 localizes to the speckle periphery, whereas MALAT1 is part of the polyadenylated component of nuclear speckles (41). More recently, further nuclear-localized lncRNAs were linked to splicing regulation both in animals (e.g. GOMAFU, SAF) and plants (e.g. ASCO), whereas other lncRNAs have been found both in the nucleus and the cytoplasm (e.g. LINC01133 and ENOD40).
Among them, a subset of lncRNAs seems to be recognized by SFs, impacting their activity in two ways: (i) modulating their posttranslational modifications (e.g. phosphorylation), or (ii) regulating the interaction with other SFs, and/or with protein-coding (pre)mRNAs.

NEAT1 and MALAT1/NEAT2 regulate the phosphorylation status of splicing factors
Serine/arginine-rich (SR) proteins are a conserved family of proteins largely involved in splicing. These proteins are composed of two domains, the RNA recognition motif (RRM) and the SR domain (42). SR proteins are commonly localized in the nucleus, although several of them are known to shuttle between the nucleus and the cytoplasm. Basically, a continuous phosphorylation/dephosphorylation cycle of SR proteins is required for proper pre-mRNA splicing and the regulation of AS patterns. The hyperphosphorylation of the SR domain influences the binding of SR proteins to target pre-mRNA, impacting the splice site selection. Partially dephosphorylated SR proteins support the first steps of the transesterification reactions (43–45). Also, intranuclear trafficking of SR proteins between nuclear speckles and transcription sites is dependent on their phosphorylation status (46,47). Thus, it is of paramount importance to better understand how lncRNAs may modulate SR phosphorylation, thus affecting AS determination of SR targets. NEAT1 is a highly abundant 4 kb lncRNA found in paraspeckles, nuclear domains that control sequestration of related proteins. NEAT1 accumulation is dynamically regulated during adipocyte differentiation, modulating the AS profile of PPARγ mRNA into both isoforms, PPARγ1 and PPARγ2, which code for the major transcription factor driving adipogenesis. The SR protein SRp40 (SFRS5) is involved in the regulation of PPARγ2 splicing. During differentiation, SRp40 levels are significantly increased when the splicing of PPARγ2 is induced.
PPARγ mRNA splicing regulation by SRp40 is modulated by its phosphorylation status, which is determined by the Clk kinase (48). It was demonstrated that SRp40 directly recognizes NEAT1, exhibiting a dynamic association throughout differentiation. NEAT1 depletion causes a decrease of PPARγ1 and PPARγ2, in particular the PPARγ2 isoform. The siRNA-directed NEAT1 depletion impaired the cells' ability to phosphorylate SRp40. Furthermore, siRNA treatment for SRp40 resulted in deregulation of PPARγ1 and, primarily, PPARγ2 mRNA levels, whereas the overexpression of SRp40 increased PPARγ2, but not PPARγ1. Therefore, it was proposed that an increased concentration of phosphorylated SRp40 protein is present after being released from NEAT1, promoting the splicing of PPARγ2. This NEAT1-dependent mechanism fine-tunes the relative abundance of mRNA isoforms of key genes during adipogenesis (49) (Figure 1i).

Figure 1. Long noncoding RNAs regulate the phosphorylation status of splicing factors. (i) NEAT1 (blue) modulates SRp40 phosphorylation status by interaction with the Clk kinase. Phosphorylated SRp40 promotes the processing of the PPARγ pre-mRNA into the PPARγ2 mRNA, whereas the dephosphorylation of SRp40 favors the accumulation of the PPARγ1 isoform. (ii) MALAT1 (red) was proposed to modulate the phosphorylation status of SR proteins in the nucleus, including the MALAT1-interacting SRSF1, likely by interaction with PP1/2A phosphatases or with the SRPK1 kinase. Phosphorylated SRSF1 is accumulated in nuclear speckles (NS), whereas its dephosphorylation promotes the interaction with mRNAs (green), their transport and accumulation in the cytoplasm, likely also affecting protein translation and/or incorporation into P-bodies (PB) hosting the nonsense-mediated decay (NMD) machinery. The three question marks indicate that each step was proposed but has not been validated.

The lncRNA MALAT1 (NEAT2) acts as an oncogene transcript and its aberrant expression is involved in the development and progression of many types of cancers (50–52). Also, its accumulation was found to be associated with cancer patient resistance to chemotherapy and radiotherapy (53,54). Studies on human cells indicate that MALAT1 regulates splicing by modulating SR splicing factor distribution and phosphorylation (55). The depletion of MALAT1 enhances the dephosphorylated pool of SR proteins, which display a more homogeneous nuclear distribution, resulting in the mislocalization of speckle components and changes in AS of pre-mRNAs. It was proposed that the observed changes in AS of endogenous pre-mRNAs could be a consequence of the mislocalization of pre-mRNA processing factors, assuming a critical role exerted by MALAT1 in the shuttling of SR proteins between speckles and the sites of transcription.
The control of the levels of phosphorylated SR proteins impacts not only AS but also likely regulates other SR-dependent post-transcriptional regulatory mechanisms, including RNA export, NMD and translation (42,56). Interestingly, it was shown that MALAT1-depleted cells exhibit an increased cytoplasmic pool of poly(A)+ RNA, likely due to the altered cellular levels of dephosphorylated splicing factor SRSF1. It was previously shown that dephosphorylation of SRSF1 is critical for the export of mRNA-associated proteins (mRNPs) and is also known to enhance binding of SRSF1 to mRNAs in the cytoplasm (57,58). Furthermore, in hepatocellular carcinoma, MALAT1 oncogenic activity relies on the indirect modulation of AS by transcriptional induction of SRSF1 (51). This induction leads to the overaccumulation of active SRSF1 in the cell nucleus and the modulation of SRSF1 splicing targets, including the production of anti-apoptotic AS isoforms of S6K1 (51). However, the precise mechanisms by which MALAT1 depletion alters the ratios of phosphorylated to dephosphorylated SR proteins in the cell remain largely unknown (55). Possibly, MALAT1 modulates the activity of kinases (SRPKs or Clk/STY family), or of phosphatases (PP1 or PP2A), that modify SR proteins (56). Both SRPK1 and PP1 influence AS by regulating SR protein phosphorylation (59,60). Interestingly, the localization of SRPK1 is altered in MALAT1-depleted cells, indicating the potential involvement of MALAT1 in SRPK1 localization and activity. Alternatively, SR protein stability may be modulated by direct interaction with MALAT1, considering the increased cellular levels of SR proteins and the changes observed in AS patterns exhibited by MALAT1-depleted cells (55) (Figure 1ii).

Long noncoding RNAs as splicing factor hijackers
The growth of cancer cells is promoted by the proto-oncogene PTBP2 (POLYPYRIMIDINE TRACT-BINDING PROTEIN 2; 61,62).
SFPQ (proline- and glutamine-rich SF; or PSF for PTB-associated SF) is considered a tumor suppressor gene, which regulates the tumor-promoting effects of PTBP2 by direct protein–protein interaction (63–65). More recently, it was demonstrated that MALAT1 is directly recognized by SFPQ, but not by PTBP2, likely disrupting a splicing regulator complex containing this tumor suppressor (66). Considering that SFPQ contains two RNA-binding domains, it was proposed that MALAT1 or other lncRNAs may hijack SFPQ to partially inhibit the interaction between SFPQ and PTBP2, affecting the regulatory role of SFPQ on PTBP2 (67,68). The resulting release of PTBP2 from the SFPQ-PTBP2 complex would allow the promotion of tumor growth and metastasis (66) (Figure 2i).

Figure 2. Long noncoding RNAs as splicing factor hijackers. (i) MALAT1 can disrupt the formation of a splicing modulator complex, by directly hijacking the SFPQ factor, thus inhibiting its interaction with the tumor growth factor PTBP2. SFPQ-released PTBP2 promotes the proliferation of cancer cells. (ii) Celf3 and SF1 normally co-localize in the CS nuclear bodies. GOMAFU lncRNA is recognized by both proteins, although it is not accumulated in CS bodies. It was proposed that GOMAFU promotes the re-localization of Celf3 and SF1 out from CS bodies to the nucleoplasm, modulating their activity in splicing. (iii) SPA and sno-lncRNAs recruit RNA binding proteins such as FOX, TDP43 and hnRNP M, and may titrate their availability for splicing regulation throughout the nucleus. In PWS patients, SPA and sno-lncRNA loci are deleted or not expressed, thus the related proteins are more uniformly distributed, impacting the AS pattern of their target genes. (iv) ASCO lncRNA is directly recognized by NSRa and NSRb, competing with their binding to NSR-targeted pre-mRNAs. ASCO-mediated modulation of NSRs results in alternative processing of mature mRNAs, exhibiting events of intron retention, exon skipping and alternative 5′ or 3′ ends. (v) ENOD40 lncRNA is directly recognized by the nuclear speckle protein RBP1. ENOD40 participates in the nucleocytoplasmic trafficking of RBP1, inducing its accumulation into cytoplasmic granules, likely modulating RBP1-dependent splicing.

Recently, an SF-interacting lncRNA was also reported in cancer.
The lncRNA LINC01133 shows a tissue- and organ-specific expression pattern and was linked to different cancers (69,70). This LINC01133 RNA is directly recognized by the AS factor SRSF6. LINC01133 expression inhibits metastasis in an SRSF6-dependent manner. It was proposed that LINC01133 acts as a target mimic to titrate SRSF6 action on other mRNA substrates, leading to the inhibition of epithelial-mesenchymal transition and metastasis in colorectal cancer cells (69). Thus, this lncRNA modulates SRSF6 activity as a decoy element of AS, in order to shape the population of AS isoforms of SRSF6 mRNA targets. Another example of an SF-associated lncRNA is GOMAFU, which has been implicated in schizophrenia-associated AS (71). GOMAFU is expressed in a specific group of neurons in adult mice and has been implicated in retinal cell development (72,73), brain development (74) and post-mitotic neuronal function (75,76). Downregulation of GOMAFU leads to aberrant AS patterns resembling those observed for genes typically associated with schizophrenia, such as DISRUPTED IN SCHIZOPHRENIA 1 (DISC1), which exerts a role in the neuronal system, and ERYTHROBLASTIC LEUKEMIA VIRAL ONCOGENE HOMOLOG 4 (ERBB4), implicated in mental illness (71). Moreover, GOMAFU was found to be downregulated in post-mortem cortical gray matter from the superior temporal gyrus in schizophrenia, suggesting that aberrant splicing patterns of DISC1 and ERBB4 in schizophrenia could be mediated by GOMAFU. Furthermore, GOMAFU was found to directly interact with the SFs QUAKING homolog QKI and SRSF1 (71), likely modulating AS. QKI and SRSF1 deregulation could lead to mental disorders such as schizophrenia. GOMAFU is also recognized through a tandem array of UACUAAC motifs by the splicing factor SF1, which participates in the early stages of spliceosome assembly (77).
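The notion of a "tandem array of UACUAAC motifs" can be made concrete with a small motif scan. The sketch below is purely illustrative: the function names, the `max_gap` parameter and the toy sequence are assumptions made for this example and are not taken from the cited study (77). It lists every occurrence of the motif in an RNA sequence and reports the longest run of consecutive copies.

```python
import re
from typing import List

SF1_MOTIF = "UACUAAC"  # motif named in the text; everything else here is hypothetical


def find_motif_positions(rna: str, motif: str = SF1_MOTIF) -> List[int]:
    """Return 0-based start positions of every motif occurrence (overlaps allowed)."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    pattern = re.compile("(?=" + re.escape(motif) + ")")  # zero-width lookahead
    return [m.start() for m in pattern.finditer(rna)]


def longest_tandem_run(positions: List[int], motif_len: int, max_gap: int = 0) -> int:
    """Largest number of consecutive motif copies separated by at most max_gap nt."""
    best = run = 0
    prev_end = None
    for start in positions:
        if prev_end is not None and 0 <= start - prev_end <= max_gap:
            run += 1  # this copy extends the current array
        else:
            run = 1   # this copy starts a new array
        best = max(best, run)
        prev_end = start + motif_len
    return best


if __name__ == "__main__":
    # Toy sequence (an assumption): three adjacent copies, a 4-nt spacer, one more copy.
    toy = "GG" + SF1_MOTIF * 3 + "AAAA" + SF1_MOTIF
    hits = find_motif_positions(toy)
    print(hits, longest_tandem_run(hits, len(SF1_MOTIF)))
```

Allowing a non-zero `max_gap` treats copies separated by short spacers as part of the same array; whether that is appropriate depends on how the array is defined in a given analysis.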
More recently, Celf3 (CUGBP Elav-like family member 3, also referred to as Tnrc4, Brunol1, CAGH4 or ERDA4; reviewed in (78)) was also found to specifically interact with GOMAFU. Interestingly, Celf3 forms novel nuclear bodies (named CS bodies) in the neuroblastoma cell line Neuro2A, colocalizing with the GOMAFU-interacting protein SF1. However, GOMAFU was not observed in the CS bodies but separately distributed throughout the nucleus. Therefore, it was proposed that GOMAFU indirectly modulates the function of RNA-binding proteins in CS bodies by sequestering them to separate regions of the nucleus (79) (Figure 2ii). Apart from cancer and neurological disorders, other diseases may be related to AS modulated by lncRNAs. For example, the 205 kb lncRNA LINC-HELLP, which is implicated in the pregnancy-associated disease HELLP, has also been implicated in splicing regulation. LINC-HELLP purification followed by mass spectrometry revealed that this lncRNA is recognized by components of the splicing and ribosomal machineries (80), including the splicing-related factors Y-BOX BINDING PROTEIN 1 (YBX1), and POLY(RC) BINDING PROTEINS 1 and 2 (PCBP1 and PCBP2). Although the molecular mechanisms governing splicing regulation by LINC-HELLP remain largely unknown, it was demonstrated that upon the occurrence of mutations in HELLP patients, the 5′-end up to the middle of the LINC-HELLP transcript loses its ability to interact with its protein partners, whereas binding is gained with mutations at the far 3′-end (80). Small nucleolar RNAs (snoRNAs) are a family of conserved nuclear RNAs located in Cajal bodies or nucleoli. They participate in the modification of snRNAs or rRNA, or in the processing of rRNA during ribosome subunit maturation (81–83). SnoRNAs are processed from excised and debranched introns by exonucleolytic trimming (84).
The snoRNA HBII-52 was implicated in the congenital disease Prader-Willi syndrome (PWS), by modulating the AS of the Serotonin Receptor 2C (85). More recently, a novel class of nuclear-enriched intron-derived lncRNAs was identified, which are processed on both ends by the snoRNA machinery (sno-lncRNAs). Sno-lncRNAs derived from the PWS critical region of chromosome 15 are recognized by the FOX2 SF. It was proposed that these sno-lncRNAs can recruit FOX proteins, regulating their availability for splicing of their targets. In patients suffering from PWS, particular sno-lncRNAs are deleted or not expressed. As a result, FOX proteins are more uniformly distributed in the cell nucleus, and modify the AS of specific genes during early embryonic development and adulthood (84). An additional sub-class of snoRNA-derived lncRNAs generated from the PWS locus was recently identified. SPA lncRNAs are 5′ snoRNA-capped and 3′ polyadenylated, and accumulate in the nucleus. They were shown to recruit key RNA binding proteins, such as TDP43, RBFOX2, and hnRNP M, involved in multiple aspects of mRNA metabolism regulation (86). Therefore, SPA lncRNAs also fine-tune the availability of RNA binding proteins, like the previously characterized sno-lncRNAs (Figure 2iii). In Arabidopsis thaliana, the lncRNA ASCO (ALTERNATIVE SPLICING COMPETITOR) is recognized in vivo by the plant-specific NUCLEAR SPECKLE RNA-BINDING PROTEINS (NSRs), involved in splicing (39). A transcriptomic dataset served to identify RNA processing events in nsra/b double mutants using a bioinformatic approach. This analysis revealed a substantial number of intron retention events and differential 5′ start or 3′ end in a subset of genes in the nsra/b mutant compared to wild type plants (22). NSRs are positively regulated by the phytohormone auxin, to which the nsra/b mutant exhibits a decreased sensitivity, e.g. lower lateral root number than wild type plants in response to auxin treatment.
This phenotype was related to the one observed for ASCO overexpressing lines. Interestingly, the splicing of a high number of auxin-related genes was perturbed in nsra/b mutants and some of them behaved accordingly in the ASCO overexpressing lines (39). An in vitro approach showed that ASCO 'competes' with other mRNA targets for binding to the NSR regulators, suggesting that this competition would modulate the affinity of NSRs for their targets. The ASCO-NSR interaction could then regulate AS during auxin response in roots (39) (Figure 2iv). In the model legume Medicago truncatula, the closest homolog of NSRs, RNA-BINDING PROTEIN 1 (RBP1), is localized in nuclear speckles where many components of the splicing machinery are hosted in plant cells. Remarkably, RBP1 interacts with a highly structured lncRNA, EARLYNODULIN40 (ENOD40), which participates in root symbiotic nodule organogenesis (87). ENOD40 is highly conserved among legumes and was also found in other species such as rice (Oryza sativa; (88)). In contrast to the nuclear localization of Arabidopsis ASCO, ENOD40 was found both in the nucleus and the cytoplasm, and it is able to relocalize RBP1 from nuclear speckles into cytoplasmic granules during nodulation. These observations hint at a role of the lncRNA ENOD40 in nucleocytoplasmic trafficking, likely modulating RBP1-dependent splicing (89) (Figure 2v).

ANTISENSE lncRNAs FORMING RNA-RNA DUPLEXES
One particular class of lncRNAs is defined by natural antisense transcripts (NATs). NATs are transcribed from the opposite strand of a protein-coding gene and may or may not overlap with portions of coding sequences (90). Pioneering work in the early 1990s showed that the population of transcriptional isoforms of the oncogene N-myc correlates with the presence of a NAT encoded across exon 1 in cell cultures. RNA-RNA duplexes between sense and antisense transcripts were detected in vivo.
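Because a NAT is read from the opposite strand, the region where it can pair with the sense transcript is the region where the sense sequence matches the reverse complement of the antisense sequence. The sketch below illustrates that idea only; it is not the method of the cited studies. The function names and toy sequences are assumptions, and only perfect Watson-Crick pairs are considered (G-U wobble pairs and RNA structure are ignored).

```python
from typing import Tuple

# Watson-Crick complement table for RNA
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence (DNA-style T is read as U)."""
    return rna.upper().replace("T", "U").translate(COMPLEMENT)[::-1]


def longest_duplex(sense: str, antisense: str) -> Tuple[int, int]:
    """Return (start in sense, length) of the longest perfectly paired stretch.

    The antisense transcript is rewritten in sense orientation, then the longest
    common substring is found by dynamic programming.
    """
    sense = sense.upper().replace("T", "U")
    target = reverse_complement(antisense)
    best_len, best_end = 0, 0
    prev = [0] * (len(target) + 1)
    for i, s in enumerate(sense, 1):
        cur = [0] * (len(target) + 1)
        for j, t in enumerate(target, 1):
            if s == t:
                cur[j] = prev[j - 1] + 1  # extend the match ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return best_end - best_len, best_len


if __name__ == "__main__":
    # Toy example (an assumption): the antisense RNA overlaps positions 3-12 of the sense RNA.
    sense = "GGGAUGCCAUUGACCC"
    antisense = reverse_complement(sense[3:13])
    print(longest_duplex(sense, antisense))
```

The quadratic scan is adequate for short illustrative sequences; transcript-scale comparisons would normally rely on dedicated alignment tools.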
Furthermore, duplex formation appeared to occur with only a subset of the multiple isoforms of the N-myc mRNA. The precise transcriptional initiation site of the NAT was suggested to play a role in determining this selectivity. It was proposed that duplex formation could modulate pre-mRNA processing by preserving a population of intron 1-retained N-myc mRNA (91). It was recently shown that the N-myc NAT is in fact a protein-coding RNA, the resultant product of which acts as an onco-promoting factor in human cancer (92). Soon after the association of N-myc NAT with AS, it was demonstrated in rats that transcription of overlapping antisense transcripts can modulate the preference of splicing sites, thus impacting the population of alternatively spliced variants of a single locus (93). The erbAα locus encodes two overlapping mRNAs, corresponding to the two α-thyroid hormone receptor isoforms, TRα1 and TRα2, which arise from alternative polyadenylation sites of the locus. The erbAα locus also codes for a third mRNA, Rev-erbAα, which is transcribed in the opposite direction from α1 and α2. Rev-erbAα encodes another protein belonging to the same steroid/thyroid hormone receptor superfamily. Rev-erbAα overlaps with the 3′ end of α2, but not α1 mRNA. Interestingly, in vitro assays demonstrated that the relative abundance of the Rev-erbAα RNA is capable of blocking the accurate splicing of erbAα2, likely by RNA-RNA base pairing. This is consistent with the observation that in tissues where Rev-erbAα mRNA levels are high, the ratio of α2/α1 isoforms is relatively low (93,94). More recently, it was shown that TRα2 is not expressed in marsupials and that the antisense overlap between erbAα and Rev-erbAα is unique to eutherian mammals (95). Although Rev-erbAα and N-myc NAT are both protein-coding genes, these paradigmatic cases of co-regulation of overlapping antisense transcripts hinted at the existence of RNAs with dual roles (96), e.g.
to code for proteins on the one hand, and to exert a role as an RNA molecule itself, modulating the splicing of a neighboring gene, on the other.

More recently, the NAT named SAF was linked to programmed cell death or apoptosis. One major apoptotic pathway is triggered by the interaction between the cellular Fas receptor (Fas) and its Fas ligand (FasL). Tumor cells are frequently resistant to Fas-mediated apoptosis, because they produce soluble Fas proteins that bind FasL, blocking apoptosis. Soluble Fas (sFas) is encoded by the exon 6-skipped alternatively spliced variant of Fas pre-mRNA (FasΔEx6), which lacks the transmembrane domain. It was shown that the nuclear-enriched NAT SAF is transcribed in reverse orientation from the opposite strand of intron 1 of the Fas locus. An RNA pull-down assay using in vitro transcribed biotin-labeled SAF demonstrated that it can directly interact with the sense transcript Fas. Furthermore, an RNase A protection assay revealed that the interaction between SAF lncRNA and Fas pre-mRNA occurs predominantly at the exon 5–6 and exon 6–7 junctions. Strikingly, RNA pull-down followed by mass spectrometry revealed that SAF is recognized by the human splicing factor 45 (SPF45), likely facilitating AS and exclusion of exon 6. The regulation of FasΔEx6 mRNA accumulation leads to the production of soluble sFas protein, which protects tumor cells against FasL-induced apoptosis (97) (Figure 3i).

Figure 3. Long noncoding RNAs form RNA-RNA duplexes with pre-mRNAs. (i) SAF NAT is encoded in the first intron of the Fas locus. In tumor cells, SAF is transcribed and specifically interacts with the exon 6 flanking regions of the pre-mRNA, forming RNA-RNA duplexes. SAF recruits SPF45, promoting the exclusion of exon 6. The resulting mRNA encodes soluble Fas (sFas), which lacks the transmembrane domain. As a result, the presence of the Fas receptor is reduced at the cell surface whereas sFas sequesters Fas ligand (FasL), rendering cells less sensitive to Fas-mediated apoptosis. (ii) In epithelial cells, the Zeb2 locus is transcribed into a full 5′UTR-including pre-mRNA. The spliceosome mediates the removal of a 3 kb-long intron in the 5′UTR. The resulting mRNA contains a stable secondary structure before the AUG, which is capable of blocking translation in the polysomes. On the other hand, after EMT, the Snail1 transcription factor induces the transcription of Zeb2 NAT in mesenchymal cells. A specific RNA-RNA duplex encompassing the 5′ splice site of the 5′UTR intron prevents the binding of the spliceosome. Thus, the mRNA contains the full isoform of the 5′UTR, including an internal ribosome entry site (IRES) proximal to the Zeb2 AUG, favoring translation. The Zeb2 transcription factor extends the repression of the E-cadherin gene initiated by Snail1 during EMT.

Epithelial–mesenchymal transition (EMT) is first triggered by the expression of the Snail1 transcription factor in epithelial cells. Snail1 down-regulates specific epithelial genes, including E-cadherin, and induces the expression of the Zeb1 and Zeb2 transcriptional regulators, which might extend the repression of E-cadherin initiated by Snail1. Normally, the Zeb2 5′-UTR is spliced, retaining a structured region that inhibits scanning by the ribosomes and therefore prevents translation of the Zeb2 protein. However, Snail1 promotes the transcription of a NAT encoded in the opposite strand of the Zeb2 locus, covering the 5′ splice site of the Zeb2 5′-UTR. It was proposed that Zeb2 NAT prevents recognition by the spliceosome through RNA-RNA duplex formation, promoting the inclusion of the intron present in the Zeb2 5′-UTR. This intron contains an internal ribosome entry site (IRES) located close to the start of translation. IRES recognition by the ribosomes promotes Zeb2 translation, activating EMT (98) (Figure 3ii).

Although the underlying molecular mechanisms remain largely unknown, another NAT modulating AS also came to light in relation to brain disease. The Pol III-dependent lncRNA 17A is transcribed in antisense orientation from intron 3 of the GPR51 gene (G-PROTEIN-COUPLED RECEPTOR 51, coding for the GABA B2 receptor or GABAB R2) and is induced by inflammatory molecules. This lncRNA was found to regulate AS of GPR51 (99).
Expression of 17A leads to the production of the GABAB R2 protein isoform without transduction activity together with a dramatic down-regulation of the expression of the canonical full-length GABAB R2 variant, thus impairing GABAB signaling. This change in the ratio of AS variants was found to be linked to Alzheimer's disease. Also, 17A expression was increased in patient brains, suggesting a possible role of this lncRNA in GPR51 splicing regulation to preserve cerebral function (99).

LONG NONCODING RNAs AS CHROMATIN REMODELERS

In the last decade, chromatin structure and histone modifications have emerged as key regulators of AS. The interaction between histone modifications, chromatin-binding proteins and SFs possibly constitutes a complex network of communication between chromatin and RNA (100). Also, it was demonstrated that the chromatin context influences the RNA polymerase II (Pol II) elongation rate, which in turn affects AS (101,102). This means that epigenetic regulation not only determines which parts of the genome are expressed, but also how they are spliced (100). Several lncRNAs participate in chromatin structure determination and dynamics, which may then impact the splicing output, notably by: (i) direct interaction between lncRNAs and DNA forming heteroduplexes, (ii) recruitment of chromatin modifiers to specific loci, or (iii) shaping the 3D organization of chromatin across the cell nucleus. Finally, we discuss how lncRNA-derived small RNAs control chromatin remodeling and may determine AS patterns through this mechanism.

Splicing regulation by lncRNA-driven DNA–RNA duplexes

Circular RNAs (circRNAs) are covalently closed circular molecules of single-stranded RNA, resulting from a non-canonical splicing event, the so-called back-splicing. This event consists of the ligation of a downstream splice donor site to an upstream splice acceptor site of the pre-mRNA, in reverse order, generating a circular lncRNA molecule.
These circular transcripts are abundant and highly stable, and they may efficiently compete with the linear pre-mRNA for the recognition of related splicing protein complexes (103). For instance, in flies and humans, the SF MUSCLEBLIND can strongly and specifically bind to the circRNA derived from its own locus, called circMbl (104). In human cells, circRNAs are dynamically modulated by the SF QKI during EMT (105), and it was shown that in human endothelial cells circRNA occurrence correlates with exon skipping throughout the genome (106). However, the molecular mechanisms involving circRNAs in animals remain largely unknown. It was recently demonstrated in Arabidopsis that a circRNA can modulate the AS of its own parent gene by directly interacting with the DNA, forming an RNA-DNA hybrid known as an R-loop (107). The overexpression of the circRNA from exon 6 of the SEPALLATA 3 (SEP3) gene enhances the accumulation of the naturally occurring SEP3.3 isoform, which consists of the exon 6-skipped transcript. SEP3 belongs to the MADS-box family (named after the founder members MCM1-AGAMOUS-DEFICIENS-SRF) of DNA-binding proteins, and it is involved in flower development in Arabidopsis. The modulation of SEP3 splicing gives rise to homeotic phenotypes in the flower. Strikingly, the exon 6 circRNA is capable of generating an R-loop by direct interaction with its own genomic locus, further supporting the idea that chromatin conformation plays a major role in splicing pattern determination (Figure 4i). Genome-wide characterization of R-loops will be needed to assess how widely this mechanism occurs throughout the Arabidopsis genome (108).

Figure 4. Long noncoding RNAs as chromatin remodelers. (i) Exon 6 of the SEP3 gene is transcribed and back-spliced into a circular RNA. The SEP3 circRNA directly interacts with its parent gene DNA, forming a DNA–RNA duplex known as an R-loop and promoting exon 6 skipping, and thus the accumulation of the SEP3.3 mRNA isoform. (ii) The antisense transcript of the FGFR2 gene, called asFGFR2 (in red), recruits the PRC2 proteins EZH2 and SUZ12 to its parent locus, triggering the deposition of H3K27me3 and the recruitment of the H3K36 demethylase KDM2a. This complex keeps H3K36me3 levels low and impairs the binding of the chromatin-splicing adaptor complex MRG15–PTB to exon IIIb, which is finally included in the mature mRNA (in green). (iii) NEAT1 and MALAT1 bind to common and distinct actively transcribed loci across the genome. Their binding patterns along the gene body differ: NEAT1 binding peaks at the transcription start site as well as at the end of the locus, whereas MALAT1 preferentially binds only at the end of the gene. It was proposed that MALAT1 and NEAT1 promote the formation of splicing-related nuclear speckles and paraspeckles, respectively, around the transcription sites of their target loci.

Long noncoding RNAs as recruiters of chromatin remodelers

An example of cell-specific AS mediated by a lncRNA was linked to the antisense transcript called asFGFR2. The lncRNA asFGFR2 is generated from the human FGFR2 locus and induces epithelial-specific AS of FGFR2 by promoting chromatin modifications at the FGFR2 locus itself. It was proposed that asFGFR2 recruits chromatin modifiers specifically to this locus, perhaps via RNA-DNA heteroduplexes. Interestingly, chromatin pulldown of a biotinylated asFGFR2 RNA showed that upon its overexpression, asFGFR2 was targeted to the FGFR2 locus precisely around the differentially spliced intron (109). In epithelial cells, asFGFR2 was found to recruit chromatin modifiers such as the Polycomb-group proteins and the H3K36 demethylase KDM2a to the FGFR2 locus. As a result, it generates a chromatin environment that prevents binding of inhibitory splicing regulators and favors exon IIIb inclusion (109). Polycomb-related proteins and KDM2a are differentially recruited along FGFR2 in a cell type–specific manner, correlating with FGFR2 splicing outcome. The hypothesis of a direct role of H3K27me3 and the PRC2 component EZH2 in FGFR2 AS was discarded by expressing EZH2 in cells after knockdown of KDM2a. EZH2 failed to induce exon IIIb inclusion in the absence of KDM2a, indicating that PRC2 promotes exon IIIb inclusion by maintaining low H3K36me2/3 levels via recruitment of KDM2a. This epigenetic landscape impairs binding of the chromatin-splicing adaptor complex MRG15–PTB, which normally inhibits the inclusion of exon IIIb.
The antagonistic effect of H3K36me3 and H3K27me3 on FGFR2 splicing points to a lncRNA-mediated cross-talk between these histone modifications (109) (Figure 4ii). In addition, the aforementioned splicing-related lncRNA MALAT1 was identified as a key regulator of the methylation status of the Polycomb 2 protein (Pc2), impacting chromatin conformation (110). It was reported that the methylation/demethylation of Pc2 determines the relocation of growth control genes between Polycomb bodies (PcGs) and interchromatin granules (ICGs). This behavior is governed by the binding of methylated and unmethylated Pc2 to two lncRNAs, TUG1 and MALAT1, located in PcGs and ICGs, respectively. TUG1- and MALAT1-associated proteins were identified by pull-down using biotinylated RNAs followed by mass spectrometric analysis. This approach revealed that MALAT1 RNA binds not only to pre-mRNA SFs, but also to transcriptional co-activators and histone methyltransferases/demethylases associated with active histone marks, whereas TUG1 RNA specifically binds to a number of proteins involved in transcriptional repression, including histone methyltransferases/demethylases and chromatin modifiers. It emerged that these lncRNAs mediate the assembly of multiple co-repressors/co-activators, and can alter the histone marks read by Pc2 in vitro. Additionally, binding of MALAT1 to unmethylated Pc2 promotes SUMOylation of the transcription factor E2F1, leading to activation of the growth control gene program. Therefore, MALAT1 also participates in the modulation of the chromatin remodeling environment by selectively interacting with chromatin modifier proteins (110). There is as yet no evidence of any direct link between MALAT1- or TUG1-mediated chromatin modulation and splicing; future studies will be needed to determine whether the chromatin-related function of these lncRNAs affects AS of target genes.
Long noncoding RNAs shape the three-dimensional genome organization

The molecular pathway linking the actions of subnuclear structure-specific lncRNAs, such as TUG1 and MALAT1, and non-histone protein methylation to the spatial relocation of transcription units in the nucleus hints at a role of lncRNAs in the dynamic 3D configuration of the genome in the cell nucleus. Nuclear spatial organization and chromatin 3D modulation by lncRNAs (37,111–116) as well as AS modulation by chromatin modifications (for review see (100,117)) have long been described in mammalian cells as well as in plants. The splicing-related lncRNAs NEAT1 and MALAT1 were used as baits to map their binding sites across the human genome (118) by Capture Hybridization Analysis of RNA Targets (CHART; (119)). Strikingly, NEAT1 and MALAT1 localize to hundreds of loci in human cells, primarily on actively transcribed genes. Many of these loci were co-enriched in NEAT1 and MALAT1 CHARTs, although displaying distinct gene body binding patterns, suggesting independent but complementary functions for both lncRNAs. CHART followed by mass spectrometry was also performed to identify NEAT1 and MALAT1 interactors, revealing common nuclear speckle and paraspeckle components. The elucidation of ribonucleoprotein complexes further supports complementary binding and functions exerted by both lncRNAs. The dynamic interactions between nuclear speckles and gene bodies indicate that speckles may serve as a concentrated reservoir of SFs that shuttle to transcribed genes (120,121). Considering that speckles frequently localize within the vicinity of actively transcribed genes undergoing co-transcriptional splicing (122,123), it was proposed that nuclear bodies may be organized around genes regulated by NEAT1 and MALAT1 (118). The previous elucidation of the 3D chromatin organization of human cells indicated that the MALAT1 and NEAT1 genomic loci are located in close proximity in the nucleus (124).
According to this model, NEAT1 and MALAT1 could shape the structure of nuclear bodies at highly transcribed loci, as NEAT1 also participates in the organization of paraspeckle formation around its site of transcription (111,112) (Figure 4iii). Alternatively, NEAT1 and MALAT1 may serve as scaffolds, like Xist or HOTAIR (119,125,126), bringing proteins that also interact with components of nuclear speckles and paraspeckles together with RNA and/or DNA binding proteins. This model considers the action of lncRNAs as molecular bridges between specific chromosomal locations and nuclear speckles and paraspeckles (118).

Small RNAs in the interplay between splicing and chromatin compaction

Small ncRNAs (smRNAs) derived from lncRNA precursors act as small molecules of <50 nt. Since the discovery of RNA-mediated gene silencing by small interfering RNAs (siRNAs) (127,128), other classes of small RNAs with multiple functions have been identified, such as microRNAs (miRNAs) (129), small RNA fragments derived from tRNAs (tsRNAs) and small RNA fragments derived from small nucleolar (sno)RNAs (sdRNAs, sno-derived RNAs) (130,131). It has been demonstrated that they regulate gene expression by multiple mechanisms, such as targeting mRNA cleavage, translational or transcriptional repression, acting as decoys of mRNAs or generating other secondary smRNAs (132–134), and protein sequestering or titration (135). During transcriptional gene silencing, siRNAs trigger heterochromatin formation at DNA target sequences. Various studies in plants and yeast have reported a relationship between splicing and silencing mediated by smRNA-directed heterochromatin formation. In particular, several mutants in SF-encoding genes turned out to be also impaired in the silencing of certain genes (136–140).
In Schizosaccharomyces pombe, mutation in either of the two SF-encoding genes Cwf10 or Prp39 was found to reduce centromeric siRNA accumulation and to increase the levels of repeat transcripts such as dg and dh (136). In the same way, mutation of a single nucleotide in the U4 snRNA gene impairs centromere silencing (137). In both cases, some SFs were found to facilitate siRNA production to modulate heterochromatin formation and induce centromere silencing. Similarly, in Arabidopsis the SF SR45 was found to be involved in de novo methylation by the RNA-directed DNA methylation (RdDM) pathway. In fact, the sr45 mutant showed decreased siRNA levels and DNA methylation in transgenic FWA (FLOWERING WAGENINGEN), together with the associated late flowering phenotype (138). Another example is the Arabidopsis SMALL NUCLEAR RIBONUCLEOPROTEIN D1 (SmD1), which was proposed to play a role in splicing in plants since a mutant exhibits altered AS of certain genes. This protein was also found to facilitate post-transcriptional gene silencing (PTGS) by protecting transgene aberrant RNAs from degradation by the NMD pathway. As a result, enough template is provided for siRNA production, establishing a link between aberrant RNA and AS (141). Apart from these examples of interaction between the splicing machinery and smRNA-directed heterochromatin formation, recent studies also pointed to a role for smRNAs in the regulation of AS through chromatin remodelling. It was shown in humans, flies and worms that nucleosome density is higher over exons than introns, suggesting that nucleosome positioning defines exons at the chromatin level (142–146). The chromatin context influences the RNA polymerase II (Pol II) elongation rate, which in turn affects AS (100–102).
Rapid transcription favors exon skipping whereas slower transcription stimulates the use of weak splice sites of variant exons, promoting intron inclusion or the use of other alternative sites for splicing reactions (146,147). For instance, the FIBRONECTIN 1 gene (FN1) produces different protein isoforms through AS of the exon extra domain I (EDI). In hepatoma and HeLa cells, it was described that exogenously applied siRNAs targeting gene sequences located close to the EDI alternative exon lead to a heterochromatic state at this site, which affects Pol II elongation efficiency and mediates AS of EDI (148). This regulation was found to be dependent on ARGONAUTE 1 (AGO1), a crucial actor in RNA silencing that binds siRNAs to recognize their target RNAs. Another example of AS mediated by siRNAs is the inclusion of exon 18 of the NEURAL CELL ADHESION MOLECULE (NCAM) gene, which is regulated by heterochromatin marks after differentiation of mouse N2a neural cells. This process could also be induced by exogenous application of exon-targeted siRNAs in undifferentiated N2a cells (149). These examples suggest that siRNAs could regulate AS through the modulation of heterochromatin at specific sites in order to fine-tune the Pol II elongation rate. Beyond the mechanistic implications of targeting exogenously applied siRNAs to specific alternative exons, a genome-wide approach hinted at the potential relevance of this mechanism in physiological conditions (150). Remarkably, purification of AGO1 and AGO2 chromatin-associated complexes revealed their interaction with SFs. Furthermore, approximately one-third of the smRNAs loaded in these AGO1 and AGO2 complexes align specifically with the 3′ ends of introns, at the intron–exon junctions (150). These observations suggest that these intron-related smRNAs and the RNAi protein machinery could have a function in AS regulation.
Moreover, genome-wide exon arrays on embryonic fibroblasts of Ago2- or Dicer-null mice showed that they have similar altered AS events (150). In this work, the CD44 gene was taken as a model to characterise the underlying mechanism because inclusion of several of its alternative exons is strongly induced by phorbol-12-myristate 13-acetate (PMA) treatment (150,151). In mammalian cells treated with PMA, smRNAs recruited AGO1 and AGO2 to the transcribed regions of CD44 in a Dicer- and HP1 (HETEROCHROMATIN PROTEIN 1)-dependent manner, which increased local H3K9me3 levels at the region corresponding to the variable exons. Subsequently, the AGO proteins facilitate spliceosome recruitment and modulation of Pol II processivity in order to shape CD44 AS (150), further suggesting an involvement of smRNAs in splicing regulation. Heterochromatin regulation by smRNAs is widely found in different organisms, including yeast, plants and mammals. Therefore, it is possible that the interplay between splicing and the smRNA/heterochromatin pathway is a widespread mechanism in eukaryotes to rapidly regulate splicing and AS rates of specific genes throughout growth and differentiation.

CONCLUSIONS

In general, regulation of AS by lncRNAs is emerging as a common event in many species, although the underlying mechanisms differ among them. In this review, we have classified and sub-classified splicing-related lncRNAs characterized from different species and kingdoms, according to the commonalities and singularities of their interaction with SFs, their direct association with pre-mRNAs, their roles in nuclear localization processes or their impact on chromatin conformation. The involvement of lncRNAs in AS regulation seems to be critical under specific conditions, as impairing the production of lncRNAs could lead to diseases in mammals or developmental disorders in plants.
LncRNA-derived small RNAs have also been described here, as their modulation of the chromatin context in the gene body, which modifies the Pol II elongation rate, may favor the inclusion or exclusion of specific exons or introns. The identification of certain proteins involved in both splicing and siRNA production supports the implication of smRNAs in the regulation of AS. Some lncRNAs and their mode of action are conserved in closely related species, as is the case for MALAT1. Nevertheless, the sequences of the majority of lncRNAs are not conserved across species, and the common roles of different lncRNAs in AS regulation suggest that structural patterns may be conserved for proper recognition by SFs. Further studies on the structure of lncRNAs interacting with SFs across species should shed light on the evolution of these mechanisms. Novel methods developed in the last few years will certainly help to address this challenge. For instance, Cross-Linking and ImmunoPrecipitation (CLIP) assays are used to identify specific RNA sequences bound by protein partners (152,153). Also, RNA Hybrid and Individual-nucleotide resolution UV Cross-Linking and ImmunoPrecipitation (hiCLIP) allows us to identify RNA duplexes, particular RNA intramolecular structures or intermolecular interactions bound by specific RNA-associated proteins, at nucleotide resolution (154). It is worth considering the differential accumulation and sub-cellular localization of lncRNAs in response to external or intrinsic stimuli. This suggests that, for a given organism, different lncRNAs with similar structures may be recognized by common proteins and exert the same role in alternative developmental contexts. The link between ncRNAs and diseases clearly highlights the interest in further studies of the so-called dark matter of the genome. A deeper knowledge of lncRNA activity will help us to detect and hopefully treat a wider range of diseases.
For instance, the antisense transcript of the EPIDERMAL GROWTH FACTOR RECEPTOR (EGFR) coding locus, the NAT EGFR-AS1, was recently shown to modulate the abundance of EGFR isoforms. Targeting EGFR is a validated approach in the treatment of squamous-cell cancers (SCCs). Notably, a silent point mutation impacts the accumulation of EGFR-AS1, which was proposed as a predictive biomarker for SCCs. However, the molecular mechanisms governing EGFR splicing regulation by EGFR-AS1 remain unknown (155). In plant biology, the identification of differentially expressed lncRNAs in crops, including variation in different ecotypes, may help us decipher how different varieties can adapt to changing environments. A more comprehensive knowledge of the plasticity of plant genomes and the impact of lncRNAs on the modulation of the protein-coding genome will certainly allow us to develop novel strategies for sustainable agriculture. As the noncoding transcriptome, mainly composed of introns and ncRNAs, is the major component of the eukaryotic genome, there may be a large reservoir of mechanisms through which ncRNAs can interact with the splicing machinery to modulate and boost proteome plasticity, contributing to the expansion of diversity in life. Mechanistic insight into how long and small noncoding RNAs are recruited and fine-tune AS may open broad perspectives for novel tools in gene therapy for several diseases including cancer, as well as in biotechnological applications in agriculture and human health.

FUNDING

Funding for open access charge: grant from ANPCyT, Argentina, PICT 2016-0289.

Conflict of interest statement. None declared.

REFERENCES

1. Marquez Y., Brown J.W.S., Simpson C., Barta A., Kalyna M. Transcriptome survey reveals increased complexity of the alternative splicing landscape in Arabidopsis. Genome Res. 2012; 22:1184–1195.
2. Gerstein M.B., Rozowsky J., Yan K.-K., Wang D., Cheng C., Brown J.B., Davis C.A., Hillier L., Sisu C., Li J.J. et al. Comparative analysis of the transcriptome across distant species. Nature. 2014; 512:445–448.
3. Pan Q., Shai O., Lee L.J., Frey B.J., Blencowe B.J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 2008; 40:1413–1415.
4. Wang E.T., Sandberg R., Luo S., Khrebtukova I., Zhang L., Mayr C., Kingsmore S.F., Schroth G.P., Burge C.B. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008; 456:470–476.
5. Syed N.H., Kalyna M., Marquez Y., Barta A., Brown J.W.S. Alternative splicing in plants–coming of age. Trends Plant Sci. 2012; 17:616–623.
6. Djebali S., Davis C.A., Merkel A., Gingeras T.R. Landscape of transcription in human cells. Nature. 2012; 489:101–108.
7. Boon K.-L., Grainger R.J., Ehsani P., Barrass J.D., Auchynnikava T., Inglehearn C.F., Beggs J.D. prp8 mutations that cause human retinitis pigmentosa lead to a U5 snRNP maturation defect in yeast. Nat. Struct. Mol. Biol. 2007; 14:1077–1083.
8. Tanackovic G., Ransijn A., Thibault P., Elela S.A., Klinck R., Berson E.L., Chabot B., Rivolta C. PRPF mutations are associated with generalized defects in spliceosome formation and pre-mRNA splicing in patients with retinitis pigmentosa. Hum. Mol. Genet. 2011; 20:2116–2130.
9. Yoshida K., Sanada M., Shiraishi Y., Nowak D., Nagata Y., Yamamoto R., Sato Y., Sato-Otsubo A., Kon A., Nagasaki M. et al. Frequent pathway mutations of splicing machinery in myelodysplasia. Nature. 2011; 478:64–69.
10. Scotti M.M., Swanson M.S. RNA mis-splicing in disease. Nat. Rev. Genet. 2015; 17:19–32.
11. Yan K., Liu P., Wu C.-A., Yang G.-D., Xu R., Guo Q.-H., Huang J.-G., Zheng C.-C. Stress-induced alternative splicing provides a mechanism for the regulation of microRNA processing in Arabidopsis thaliana. Mol. Cell. 2012; 48:521–531.
12. Reddy A.S.N., Marquez Y., Kalyna M., Barta A. Complexity of the alternative splicing landscape in plants. Plant Cell. 2013; 25:3657–3683.
13. Ding F., Cui P., Wang Z., Zhang S., Ali S., Xiong L. Genome-wide analysis of alternative splicing of pre-mRNA under salt stress in Arabidopsis. BMC Genomics. 2014; 15:1–14.
14. Zhan X., Qian B., Cao F., Wu W., Yang L., Guan Q., Gu X., Wang P., Okusolubo T.A., Dunn S.L. et al. An Arabidopsis PWI and RRM motif-containing protein is critical for pre-mRNA splicing and ABA responses. Nat. Commun. 2015; 6:8139.
15. Palusa S.G., Reddy A.S.N. Differential recruitment of splice variants from SR Pre-mRNAs to polysomes during development and in response to stresses. Plant Cell Physiol. 2015; 56:421–427.
16. Rauch H.B., Patrick T.L., Klusman K.M., Battistuzzi F.U., Mei W., Brendel V.P., Lal S.K. Discovery and Expression Analysis of Alternative Splicing Events Conserved among Plant SR Proteins. Mol. Biol. Evol. 2013; 31:605–613.
17. Palusa S.G., Ali G.S., Reddy A.S.N. Alternative splicing of pre-mRNAs of Arabidopsis serine/arginine-rich proteins: regulation by hormones and stresses. Plant J. 2007; 49:1091–1107.
18. Seo P.J., Park M.-J., Park C.-M. Alternative splicing of transcription factors in plant responses to low temperature stress: mechanisms and functions. Planta. 2013; 237:1415–1424.
19. Severing E.I., Van Dijk A.D.J., Morabito G., Busscher-lange J. Predicting the Impact of Alternative Splicing on Plant MADS Domain Protein Function. PLoS One. 2012; 7:e30524.
20. Lareau L.F., Brenner S.E. Regulation of splicing factors by alternative splicing and NMD is conserved between kingdoms yet evolutionarily flexible. Mol. Biol. Evol. 2015; 32:1072–1079.
21. Reddy A.S.N., Rogers M.F., Richardson D.N., Hamilton M., Ben-Hur A. Deciphering the Plant Splicing Code: Experimental and Computational Approaches for Predicting Alternative Splicing and Splicing Regulatory Elements. Front. Plant Sci. 2012; 3:18.
22. Tran V.D.T., Souiai O., Romero-Barrios N., Crespi M., Gautheret D. Detection of generic differential RNA processing events from RNA-seq data. RNA Biol. 2016; 13:59–67.
23. Matlin A.J., Clark F., Smith C.W.J. Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell Biol. 2005; 6:386–398.
24. Blencowe B.J. Alternative splicing: new insights from global analyses. Cell. 2006; 126:37–47.
25. Wang G.-S., Cooper T.A. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet. 2007; 8:749–761.
26. Faial T. RNA splicing in common disease. Nat Genet. 2015; 47:105.
27. Filichkin S.A., Priest H.D., Givan S.A., Shen R., Bryant D.W., Fox S.E., Wong W., Mockler T.C. Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res. 2010; 20:45–58.
28. Tanabe N., Yoshimura K., Kimura A., Yabuta Y. Differential Expression of Alternatively Spliced mRNAs of Arabidopsis SR Protein Homologs, atSR30 and atSR45a, in Response to Environmental Stress. Plant Cell Physiol. 2007; 48:1036–1049.
29. Wang K.C., Chang H.Y. Molecular mechanisms of long noncoding RNAs. Mol. Cell. 2016; 43:904–914.
30. Guttman M., Rinn J.L. Modular regulatory principles of large non-coding RNAs. Nature. 2012; 482:339–346.
31. Moran V.A., Perera R.J., Khalil A.M. Emerging functional and mechanistic paradigms of mammalian long non-coding RNAs. 2012; 40:6391–6400.
32. Matera A.G., Wang Z. A day in the life of the spliceosome. Nat. Rev. Mol. Cell Biol. 2014; 15:108–121.
33. Rappsilber J., Ryder U., Lamond A.I., Mann M. Large-scale proteomic analysis of the human spliceosome. Genome Res. 2002; 12:1231–1245.
34. Herold N., Will C.L., Wolf E., Kastner B., Urlaub H., Lu R. Conservation of the protein composition and electron microscopy structure of Drosophila melanogaster and human spliceosomal complexes. Mol. Cell Biol. 2009; 29:281–301.
35. Wang B., Brendel V. The ASRG database: identification and survey of Arabidopsis thaliana genes involved in pre-mRNA splicing. Genome Biol. 2004; 5:R102.
36. Zhang Y., Zhang X.-O., Chen T.
, Xiang J.-F. , Yin Q.-F. , Xing Y.-H. , Zhu S. , Yang L. , Chen L.-L. Circular intronic long noncoding RNAs . Mol. Cell . 2013 ; 51 : 792 – 806 . Google Scholar Crossref Search ADS PubMed WorldCat 37. Ariel F. , Jegu T. , Latrasse D. , Romero-Barrios N. , Christ A. , Benhamed M. , Crespi M. Noncoding transcription by alternative rna polymerases dynamically regulates an auxin-driven chromatin loop . Mol. Cell . 2014 ; 55 : 383 – 396 . Google Scholar Crossref Search ADS PubMed WorldCat 38. Ariel F. , Romero-Barrios N. , Jegu T. , Benhamed M. , Crespi M. Battles and hijacks: noncoding transcription in plants . Trends Plant Sci. 2015 ; 20 : 362 – 371 . Google Scholar Crossref Search ADS PubMed WorldCat 39. Bardou F. , Ariel F. , Simpson C.G. , Romero-Barrios N. , Laporte P. , Balzergue S. , Brown J.W.S. , Crespi M. Long noncoding RNA modulates alternative splicing regulators in Arabidopsis . Dev. Cell . 2014 ; 30 : 166 – 176 . Google Scholar Crossref Search ADS PubMed WorldCat 40. Song P. , Ye L.-F. , Zhang C. , Peng T. , Zhou X.-H. Long non-coding RNA XIST exerts oncogenic functions in human nasopharyngeal carcinoma by targeting miR-34a-5p . Gene . 2016 ; 592 : 8 – 14 . Google Scholar Crossref Search ADS PubMed WorldCat 41. Hutchinson J.N. , Ensminger A.W. , Clemson C.M. , Lynch C.R. , Lawrence J.B. , Chess A. A screen for nuclear transcripts identifies two linked noncoding RNAs associated with SC35 splicing domains . BMC Genomics . 2007 ; 8 : 39 . Google Scholar Crossref Search ADS PubMed WorldCat 42. Long J.C. , Caceres J.F. The SR protein family of splicing factors: master regulators of gene expression . Biochem. J. 2009 ; 417 : 15 – 27 . Google Scholar Crossref Search ADS PubMed WorldCat 43. Cao W. , Jamison S.F. , Garcia-Blanco M.A. Both phosphorylation and dephosphorylation of ASF/SF2 are required for pre-mRNA splicing in vitro . RNA . 1997 ; 3 : 1456 – 1467 . Google Scholar PubMed WorldCat 44. Xiao S.H. , Manley J.L. 
Phosphorylation of the ASF/SF2 RS domain affects both protein-protein and protein-RNA interactions and is necessary for splicing . Genes Dev. 1997 ; 11 : 334 – 344 . Google Scholar Crossref Search ADS PubMed WorldCat 45. Xiao S.H. , Manley J.L. Phosphorylation-dephosphorylation differentially affects activities of splicing factor ASF/SF2 . EMBO J. 1998 ; 17 : 6359 – 6367 . Google Scholar Crossref Search ADS PubMed WorldCat 46. Cáceres J.F. , Misteli T. , Screaton G.R. , Spector D.L. , Krainer A.R. Role of the modular domains of SR proteins in subnuclear localization and alternative splicing specificity . J. Cell Biol. 1997 ; 138 : 225 – 238 . Google Scholar Crossref Search ADS PubMed WorldCat 47. Misteli T. , Cáceres J.F. , Clement J.Q. , Krainer A.R. , Wilkinson M.F. , Spector D.L. Serine phosphorylation of SR proteins is required for their recruitment to sites of transcription in vivo . J. Cell Biol. 1998 ; 143 : 297 – 307 . Google Scholar Crossref Search ADS PubMed WorldCat 48. Jiang K. , Patel N.A. , Watson J.E. , Apostolatos H. , Kleiman E. , Hanson O. , Hagiwara M. , Cooper D.R. Akt2 regulation of Cdc2-like kinases (Clk/Sty), serine/arginine-rich (SR) protein phosphorylation, and insulin-induced alternative splicing of PKCβJII messenger ribonucleic acid . Endocrinology . 2009 ; 150 : 2087 – 2097 . Google Scholar Crossref Search ADS PubMed WorldCat 49. Cooper D.R. , Carter G. , Li P. , Patel R. , Watson J.E. , Patel N.A. Long non-coding RNA NEAT1 associates with SRp40 to temporally regulate PPARγ2 splicing during Adipogenesis in 3T3-L1 cells . Genes (Basel). 2014 ; 5 : 1050 – 1063 . Google Scholar Crossref Search ADS PubMed WorldCat 50. Wang X. , Sehgal L. , Jain N. , Khashab T. , Mathur R. , Samaniego F. LncRNA MALAT1 promotes development of mantle cell lymphoma by associating with EZH2 . J. Transl. Med. 2016 ; 14 : 346 . Google Scholar Crossref Search ADS PubMed WorldCat 51. Malakar P. , Shilo A. , Mogilavsky A. , Stein I. , Pikarsky E. , Nevo Y. 
, Benyamini H. , Elgavish S. , Zong X. , Prasanth K. V et al. Long noncoding RNA MALAT1 promotes hepatocellular carcinoma development by SRSF1 up-regulation and mTOR activation . Cancer Res. 2016 ; 77 : 1155 – 1167 . Google Scholar Crossref Search ADS PubMed WorldCat 52. Li R.-Q. , Ren Y. , Liu W. , Pan W. , Xu F.-J. , Yang M. MicroRNA-mediated silence of onco-lncRNA MALAT1 in different ESCC cells via ligand-functionalized hydroxyl-rich nanovectors . Nanoscale . 2017 ; 9 : 2521 – 2530 . Google Scholar Crossref Search ADS PubMed WorldCat 53. Chen W. , Xu X. , Li J. , Kong K. , Li H. , Chen C. MALAT1 is a prognostic factor in glioblastoma multiforme and induces chemoresistance to temozolomide through suppressing miR-203 and promoting thymidylate synthase expression . Oncotarget . 2017 ; 8 : 22783 – 22799 . Google Scholar PubMed WorldCat 54. Li Z. , Zhou Y. , Tu B. , Bu Y. , Liu A. , Xie C. Long noncoding RNA MALAT1 affects the efficacy of radiotherapy for esophageal squamous cell carcinoma by regulating Cks1 expression . J. Oral Pathol. Med. 2016 ; 46 : 583 – 590 . Google Scholar Crossref Search ADS WorldCat 55. Tripathi V. , Ellis J.D. , Shen Z. , Song D.Y. , Pan Q. , Watt A.T. , Freier S.M. , Bennett C.F. , Sharma A. , Bubulya P.A. et al. The nuclear-retained noncoding RNA MALAT1 regulates alternative splicing by modulating SR splicing factor phosphorylation . Mol. Cell . 2010 ; 39 : 925 – 938 . Google Scholar Crossref Search ADS PubMed WorldCat 56. Stamm S. Regulation of alternative splicing by reversible protein phosphorylation . J. Biol. Chem. 2008 ; 283 : 1223 – 1227 . Google Scholar Crossref Search ADS PubMed WorldCat 57. Huang Y. , Yario T.A. , Steitz J.A. A molecular link between SR protein dephosphorylation and mRNA export . Proc. Natl. Acad. Sci. U.S.A. 2004 ; 101 : 9666 – 9670 . Google Scholar Crossref Search ADS PubMed WorldCat 58. Sanford J.R. , Ellis J.D. , Cazalla D. , Cáceres J.F. 
Reversible phosphorylation differentially affects nuclear and cytoplasmic functions of splicing factor 2/alternative splicing factor . Proc. Natl. Acad. Sci. U.S.A. 2005 ; 102 : 15042 – 15047 . Google Scholar Crossref Search ADS PubMed WorldCat 59. Shi Y. , Manley J.L. A complex signaling pathway regulates SRp38 phosphorylation and pre-mRNA splicing in response to heat shock . Mol. Cell . 2007 ; 28 : 79 – 90 . Google Scholar Crossref Search ADS PubMed WorldCat 60. Zhong X.Y. , Ding J.H. , Adams J.A. , Ghosh G. , Fu X.D. Regulation of SR protein phosphorylation and alternative splicing by modulating kinetic interactions of SRPK1 with molecular chaperones . Genes Dev. 2009 ; 23 : 482 – 495 . Google Scholar Crossref Search ADS PubMed WorldCat 61. Patton J.G. , Mayer S.A. , Tempst P. , Nadal-Ginard B. Characterization and molecular cloning of polypymiridine tract-binding protein: a component of a complex necessary for pre-mRNA splicing . Genes Dev. 1991 ; 5 : 1237 – 1251 . Google Scholar Crossref Search ADS PubMed WorldCat 62. He X. , Pool M. , Darcy K.M. , Lim S.B. , Auersperg N. , Coon J.S. , Beck W.T. Knockdown of polypyrimidine tract-binding protein suppresses ovarian tumor cell growth and invasiveness in vitro . Oncogene . 2007 ; 26 : 4961 – 4968 . Google Scholar Crossref Search ADS PubMed WorldCat 63. Patton J.G. , Porto E.B. , Galceran J. , Paul T. , Nadal-ginard B. Cloning and characterization of PSF, a novel pre-mRNA splicing factor . Genes Dev. 1993 ; 7 : 393 – 406 . Google Scholar Crossref Search ADS PubMed WorldCat 64. Gozani O. , Patton J.G. , Reed R. A novel set of spliceosome-associated proteins and the essential splicing factor PSF bind stably to pre-mRNA prior to catalytic step II of the splicing reaction . EMBO J. 1994 ; 13 : 3356 – 3367 . Google Scholar PubMed WorldCat 65. Meissner M. , Dechat T. , Gerner C. , Grimm R. , Foisner R. , Sauermann G. Differential nuclear localization and nuclear matrix association of the splicing factors PSF and PTB . J. 
Cell. Biochem. 2000 ; 76 : 559 – 566 . Google Scholar Crossref Search ADS PubMed WorldCat 66. Ji Q. , Zhang L. , Liu X. , Zhou L. , Wang W. , Han Z. , Sui H. , Tang Y. , Wang Y. , Liu N. et al. Long non-coding RNA MALAT1 promotes tumour growth and metastasis in colorectal cancer through binding to SFPQ and releasing oncogene PTBP2 from SFPQ / PTBP2 complex . Br. J. Cancer . 2014 ; 111 : 736 – 748 . Google Scholar Crossref Search ADS PubMed WorldCat 67. Wang G. , Cui Y. , Zhang G. , Garen A. , Song X. Regulation of proto-oncogene transcription, cell proliferation, and tumorigenesis in mice by PSF protein and a VL30 noncoding RNA . Proc. Natl. Acad. Sci. U.S.A. 2009 ; 106 : 16794 – 16798 . Google Scholar Crossref Search ADS PubMed WorldCat 68. Li L. , Feng T. , Lian Y. , Zhang G. , Garen A. , Song X. Role of human noncoding RNAs in the control of tumorigenesis . Proc. Natl. Acad. Sci. U.S.A. 2009 ; 106 : 12956 – 12961 . Google Scholar Crossref Search ADS PubMed WorldCat 69. Kong J. , Sun W. , Li C. , Wan L. , Wang S. , Wu Y. , Xu E. , Zhang H. , Lai M. Long non-coding RNA LINC01133 inhibits epithelial-mesenchymal transition and metastasis in colorectal cancer by interacting with SRSF6 . Cancer Lett. 2016 ; 380 : 476 – 484 . Google Scholar Crossref Search ADS PubMed WorldCat 70. Zang C. , Nie F. , Wang Q. , Sun M. , Li W. , He J. , Zhang M. , Lu K. Long non-coding RNA LINC01133 represses KLF2, P21 and E-cadherin transcription through binding with EZH2, LSD1 in non small cell lung cancer . Oncotarget . 2016 ; 7 : 11696 – 11707 . Google Scholar PubMed WorldCat 71. Barry G. , Briggs J. , Vanichkina D. , Poth E. , Beveridge N. , Ratnu V. , Nayler S. , Nones K. , Hu J. , Bredy T. et al. The long non-coding RNA Gomafu is acutely regulated in response to neuronal activation and involved in schizophrenia-associated alternative splicing . Mol. Psychiatry . 2013 ; 19 : 486 – 494 . Google Scholar Crossref Search ADS PubMed WorldCat 72. Rapicavoli N. a , Poth E.M. , Blackshaw S. 
The long noncoding RNA RNCR2 directs mouse retinal cell specification . BMC Dev. Biol. 2010 ; 10 : 49 . Google Scholar Crossref Search ADS PubMed WorldCat 73. Rapicavoli N.A. , Blackshaw S. New meaning in the message: Noncoding RNAs and their role in retinal development . Dev. Dyn. 2009 ; 238 : 2103 – 2114 . Google Scholar Crossref Search ADS PubMed WorldCat 74. Mercer T.R. , Qureshi I.A. , Gokhan S. , Dinger M.E. , Li G. , Mattick J.S. , Mehler M.F. Long noncoding RNAs in neuronal-glial fate specification and oligodendrocyte lineage maturation . BMC Neurosci. 2010 ; 11 : 14 . Google Scholar Crossref Search ADS PubMed WorldCat 75. Mercer T.R. , Dinger M.E. , Sunkin S.M. , Mehler M.F. , Mattick J.S. Specific expression of long noncoding RNAs in the mouse brain . Proc. Natl. Acad. Sci. U.S.A. 2007 ; 105 : 716 – 721 . Google Scholar Crossref Search ADS WorldCat 76. Sone M. , Hayashi T. , Tarui H. , Agata K. , Takeichi M. , Nakagawa S. The mRNA-like noncoding RNA Gomafu constitutes a novel nuclear domain in a subset of neurons . J. Cell Sci. 2007 ; 120 : 2498 – 2506 . Google Scholar Crossref Search ADS PubMed WorldCat 77. Tsuiji H. , Yoshimoto R. , Hasegawa Y. , Furuno M. , Yoshida M. , Nakagawa S. Competition between a noncoding exon and introns: Gomafu contains tandem UACUAAC repeats and associates with splicing factor-1 . Genes Cells . 2011 ; 16 : 479 – 490 . Google Scholar Crossref Search ADS PubMed WorldCat 78. Ladd A.N. CUG-BP, Elav-like family (CELF)-mediated alternative splicing regulation in the brain during health and disease . Mol. Cell Neurosci. 2013 ; 29 : 456 – 464 . Google Scholar Crossref Search ADS WorldCat 79. Ishizuka A. , Hasegawa Y. , Ishida K. , Yanaka K. , Nakagawa S. Formation of nuclear bodies by the lncRNA Gomafu-associating proteins Celf3 and SF1 . Genes Cells . 2014 ; 19 : 704 – 721 . Google Scholar Crossref Search ADS PubMed WorldCat 80. van Dijk M. , Visser A. , Buabeng K.M.L. , Poutsma A. , van der Schors R.C. , Oudejans C.B.M. 
Mutations within the LINC-HELLP non-coding RNA differentially bind ribosomal and RNA splicing complexes and negatively affect trophoblast differentiation . Hum. Mol. Genet. 2015 ; 24 : 5475 – 5485 . Google Scholar Crossref Search ADS PubMed WorldCat 81. Boisvert F.M. , Van Koningsbruggen S. , Navascués J. , Lamond A.I. The multifunctional nucleolus . Nat. Rev. Mol. Cell Biol. 2007 ; 8 : 574 – 585 . Google Scholar Crossref Search ADS PubMed WorldCat 82. Kiss T. , Agris P. , Bachellerie J. , Michot B. , Nicoloso M. , Balakin A. , Ni J. , Fournier M. , Balakin A. , Smith L. et al. Small nucleolar RNA-guided post-transcriptional modification of cellular RNAs . EMBO J. 2001 ; 20 : 3617 – 3622 . Google Scholar Crossref Search ADS PubMed WorldCat 83. Matera A.G. , Terns R.M. , Terns M.P. Non-coding RNAs: Lessons from the small nuclear and small nucleolar RNAs . Nat. Rev. Mol. Cell Biol. 2007 ; 8 : 209 – 220 . Google Scholar Crossref Search ADS PubMed WorldCat 84. Yin Q.F. , Yang L. , Zhang Y. , Xiang J.F. , Wu Y.W. , Carmichael G.G. , Chen L.L. Long Noncoding RNAs with snoRNA Ends . Mol. Cell . 2012 ; 48 : 219 – 230 . Google Scholar Crossref Search ADS PubMed WorldCat 85. Kishore S. , Stamm S. The snoRNA HBII-52 regulates alternative splicing of the serotonin receptor 2C . Science . 2006 ; 311 : 230 – 232 . Google Scholar Crossref Search ADS PubMed WorldCat 86. Wu H. , Yin Q.F. , Luo Z. , Yao R.W. , Zheng C.C. , Zhang J. , Xiang J.F. , Yang L. , Chen L.L. Unusual processing generates SPA LncRNAs that sequester multiple RNA binding proteins . Mol. Cell . 2016 ; 64 : 534 – 548 . Google Scholar Crossref Search ADS PubMed WorldCat 87. Crespi M.D. , Jurkevitch E. , Poiret M. , d’Aubenton-Carafa Y. , Petrovics G. , Kondorosi E. , Kondorosi A. enod40, a gene expressed during nodule organogenesis, codes for a non-translatable RNA involved in plant growth . EMBO J. 1994 ; 13 : 5099 – 5112 . Google Scholar PubMed WorldCat 88. Gultyaev A.P. , Roussis A. 
Identification of conserved secondary structures and expansion segments in enod40 RNAs reveals new enod40 homologues in plants . Nucleic Acids Res. 2007 ; 35 : 3144 – 3152 . Google Scholar Crossref Search ADS PubMed WorldCat 89. Campalans A. , Kondorosi A. , Crespi M. Enod40, a short open reading frame – containing mRNA, induces cytoplasmic localization of a nuclear RNA binding protein in Medicago truncatula . Plant Cell . 2004 ; 16 : 1047 – 1059 . Google Scholar Crossref Search ADS PubMed WorldCat 90. Khorkova O. , Myers A.J. , Hsiao J. , Wahlestedt C. Natural antisense transcripts . Hum. Mol. Genet. 2014 ; 23 : R54 – 63 . Google Scholar Crossref Search ADS PubMed WorldCat 91. Krystal G.W. , Armstrong B.C. , Battey J.F. N-myc mRNA forms an RNA-RNA duplex with endogenous antisense transcripts . Mol. Cell. Biol. 1990 ; 10 : 4180 – 4191 . Google Scholar Crossref Search ADS PubMed WorldCat 92. Suenaga Y. , Islam S.M.R. , Alagu J. , Kaneko Y. , Kato M. , Tanaka Y. , Kawana H. , Hossain S. , Matsumoto D. , Yamamoto M. et al. NCYM, a cis-antisense gene of MYCN, encodes a de novo evolved protein that inhibits GSK3β resulting in the stabilization of MYCN in human neuroblastomas . PLoS Genet. 2014 ; 10 : e1003996 . Google Scholar Crossref Search ADS PubMed WorldCat 93. Munroe S.H. , Lazar M.A. Inhibition of c-erbA mRNA splicing by a naturally occurring antisense RNA . J. Biol. Chem. 1991 ; 266 : 22083 – 22086 . Google Scholar PubMed WorldCat 94. Hastings M.L. , Milcarek C. , Martincic K. , Peterson M.L. , Munroe S.H. Expression of the thyroid hormone receptor gene, erbAalpha, in B lymphocytes: alternative mRNA processing is independent of differentiation but correlates with antisense RNA levels . Nucleic Acids Res. 1997 ; 25 : 4296 – 4300 . Google Scholar Crossref Search ADS PubMed WorldCat 95. Rindfleisch B.C. , Brown M.S. , VandeBerg J.L. , Munroe S.H. 
Structure and expression of two nuclear receptor genes in marsupials: insights into the evolution of the antisense overlap between the α-thyroid hormone receptor and Rev-erbα . BMC Mol. Biol. 2010 ; 11 : 97 . Google Scholar Crossref Search ADS PubMed WorldCat 96. Bardou F. , Merchan F. , Ariel F. , Crespi M. Dual RNAs in plants . Biochimie . 2011 ; 93 : 1950 – 1954 . Google Scholar Crossref Search ADS PubMed WorldCat 97. Villamizar O. , Chambers C.B. , Riberdy J.M. , Persons D.A. , Wilber A. Long noncoding RNA Saf and splicing factor 45 increase soluble Fas and resistance to apoptosis . Oncotarget . 2015 ; 7 : 1 – 17 . WorldCat 98. Beltran M. , Puig I. , Peña C. , García J.M. , Álvarez A.B. , Peña R. , Bonilla F. , De Herreros A.G. A natural antisense transcript regulates Zeb2 / Sip1 gene expression during Snail1-induced epithelial – mesenchymal transition . Genes Dev. 2008 ; 22 : 756 – 769 . Google Scholar Crossref Search ADS PubMed WorldCat 99. Massone S. , Vassallo I. , Fiorino G. , Castelnuovo M. , Barbieri F. , Borghi R. , Tabaton M. , Robello M. , Gatta E. , Russo C. et al. 17A, a novel non-coding RNA, regulates GABA B alternative splicing and signaling in response to inflammatory stimuli and in Alzheimer disease . Neurobiol. Dis. 2011 ; 41 : 308 – 317 . Google Scholar Crossref Search ADS PubMed WorldCat 100. Luco R.F. , Allo M. , Schor I.E. , Kornblihtt A.R. , Misteli T. Epigenetics in alternative pre-mRNA splicing . Cell . 2011 ; 144 : 16 – 26 . Google Scholar Crossref Search ADS PubMed WorldCat 101. Batsché E. , Yaniv M. , Muchardt C. The human SWI/SNF subunit Brm is a regulator of alternative splicing . Nat. Struct. Mol. Biol. 2006 ; 13 : 22 – 29 . Google Scholar Crossref Search ADS PubMed WorldCat 102. Schor I.E. , Rascovan N. , Pelisch F. , Alló M. , Kornblihtt A.R. Neuronal cell depolarization induces intragenic chromatin modifications affecting NCAM alternative splicing . Proc. Natl. Acad. Sci. U.S.A. 2009 ; 106 : 4325 – 4330 . 
Google Scholar Crossref Search ADS PubMed WorldCat 103. Chen L. The biogenesis and emerging roles of circular RNAs . Nat. Rev. Mol. Cell Biol. 2016 ; 17 : 205 – 211 . Google Scholar Crossref Search ADS PubMed WorldCat 104. Ashwal-Fluss R. , Meyer M. , Pamudurti N.R. , Ivanov A. , Bartok O. , Hanan M. , Evantal N. , Memczak S. , Rajewsky N. , Kadener S. circRNA biogenesis competes with pre-mRNA splicing . Mol. Cell . 2014 ; 56 : 55 – 66 . Google Scholar Crossref Search ADS PubMed WorldCat 105. Conn S.J. , Pillman K.A. , Toubia J. , Conn V.M. , Salmanidis M. , Phillips C.A. , Roslan S. , Schreiber A.W. , Gregory P.A. , Goodall G.J. The RNA binding protein quaking regulates formation of circRNAs . Cell . 2015 ; 160 : 1125 – 1134 . Google Scholar Crossref Search ADS PubMed WorldCat 106. Kelly S. , Greenman C. , Cook P.R. , Papantonis A. Exon skipping is correlated with exon circularization . J. Mol. Biol. 2015 ; 427 : 2414 – 2417 . Google Scholar Crossref Search ADS PubMed WorldCat 107. Conn V.M. , Hugouvieux V. , Nayak A. , Conos S.A. , Capovilla G. , Cildir G. , Jourdain A. , Tergaonkar V. , Schmid M. , Zubieta C. et al. A circRNA from SEPALLATA3 regulates splicing of its cognate mRNA through R-loop formation . Nat. Plants . 2017 ; 3 : 17053 . Google Scholar Crossref Search ADS PubMed WorldCat 108. Ariel F. , Crespi M. Alternative splicing: the lord of the rings . Nat. Plants . 2017 ; 3 : 17065 . Google Scholar Crossref Search ADS PubMed WorldCat 109. Gonzalez I. , Munita R. , Agirre E. , Dittmer T. a , Gysling K. , Misteli T. , Luco R.F. A lncRNA regulates alternative splicing via establishment of a splicing-specific chromatin signature . Nat. Struct. Mol. Biol. 2015 ; 22 : 370 – 376 . Google Scholar Crossref Search ADS PubMed WorldCat 110. Yang L. , Lin C. , Liu W. , Zhang J. , Ohgi K.A. , Grinstein J.D. , Dorrestein P.C. , Rosenfeld M.G. NcRNA- and Pc2 methylation-dependent gene relocation between nuclear structures mediates gene activation programs . Cell . 
2011 ; 147 : 773 – 788 . Google Scholar Crossref Search ADS PubMed WorldCat 111. Clemson C.M. , Hutchinson J.N. , Sara S.A. , Ensminger A.W. , Fox A.H. , Chess A. , Lawrence J.B. An architectural role for a nuclear noncoding RNA: NEAT1 RNA is essential for the structure of Paraspeckles . Mol. Cell . 2009 ; 33 : 717 – 726 . Google Scholar Crossref Search ADS PubMed WorldCat 112. Mao Y.S. , Sunwoo H. , Zhang B. , Spector D.L. Direct visualization of the co-transcriptional assembly of a nuclear body by noncoding RNAs . Nat. Cell Biol. 2011 ; 13 : 95 – 101 . Google Scholar Crossref Search ADS PubMed WorldCat 113. Engreitz J.M. , Sirokman K. , McDonel P. , Shishkin A.A. , Surka C. , Russell P. , Grossman S.R. , Chow A.Y. , Guttman M. , Lander E.S. RNA-RNA interactions enable specific targeting of noncoding RNAs to nascent pre-mRNAs and chromatin sites . Cell . 2014 ; 159 : 188 – 199 . Google Scholar Crossref Search ADS PubMed WorldCat 114. Ietswaart R. , Wu Z. , Dean C. Flowering time control: another window to the connection between antisense RNA and chromatin . Trends Genet. 2012 ; 28 : 445 – 453 . Google Scholar Crossref Search ADS PubMed WorldCat 115. Rodriguez-Granados N.Y. , Ramirez-Prado J.S. , Veluchamy A. , Latrasse D. , Raynaud C. , Crespi M. , Ariel F. , Benhamed M. Put your 3D glasses on: plant chromatin is on show . J. Exp. Bot. 2016 ; 67 : 3205 – 3221 . Google Scholar Crossref Search ADS PubMed WorldCat 116. Kim D.H. , Sung S. Vernalization-triggered intragenic chromatin loop formation by long noncoding RNAs . Dev. Cell . 2017 ; 40 : 302 – 312 . Google Scholar Crossref Search ADS PubMed WorldCat 117. Zhou H.-L. , Luo G. , Wise J.A. , Lou H. Regulation of alternative splicing by local histone modifications: potential roles for RNA-guided mechanisms . Nucleic Acids Res. 2014 ; 42 : 701 – 713 . Google Scholar Crossref Search ADS PubMed WorldCat 118. West J.A. , Davis C.P. , Sunwoo H. , Simon M.D. , Sadreyev R.I. , Wang P.I. , Tolstorukov M.Y. , Kingston R.E. 
The long noncoding RNAs NEAT1 and MALAT1 bind active chromatin sites . Mol. Cell . 2014 ; 55 : 791 – 802 . Google Scholar Crossref Search ADS PubMed WorldCat 119. Simon M.D. , Pinter S.F. , Fang R. , Sarma K. , Rutenberg-Schoenberg M. , Bowman S.K. , Kesner B.A. , Maier V.K. , Kingston R.E. , Lee J.T. High-resolution Xist binding maps reveal two-step spreading during X-chromosome inactivation . Nature . 2013 ; 504 : 465 – 469 . Google Scholar Crossref Search ADS PubMed WorldCat 120. Misteli T. , Cáceres J. , Spector D. The dynamics of a pre-mRNA splicing factor in living cells . Nature . 1997 ; 387 : 523 – 527 . Google Scholar Crossref Search ADS PubMed WorldCat 121. Zeng C. , Kim E. , Warren S.L. , Berget S.M. Dynamic relocation of transcription and splicing factors dependent upon transcriptional activity . EMBO J. 1997 ; 16 : 1401 – 1412 . Google Scholar Crossref Search ADS PubMed WorldCat 122. Huang S. , Spector D.L. Intron-dependent recruitment of pre-mRNA splicing factors to sites of transcription . J. Cell Biol. 1996 ; 133 : 719 – 732 . Google Scholar Crossref Search ADS PubMed WorldCat 123. Smith K.P. , Moen P.T. , Wydner K.L. , Coleman J.R. , Lawrence J.B. Processing of endogenous pre-mRNAs in association with SC-35 domains is gene specific . J. Cell Biol. 1999 ; 144 : 617 – 629 . Google Scholar Crossref Search ADS PubMed WorldCat 124. Jin F. , Li Y. , Dixon J.R. , Selvaraj S. , Ye Z. , Lee A.Y. , Yen C.-A. , Schmitt A.D. , Espinoza C.A. , Ren B. A high-resolution map of the three-dimensional chromatin interactome in human cells . Nature . 2013 ; 503 : 290 – 294 . Google Scholar Crossref Search ADS PubMed WorldCat 125. Engreitz J.M. , Pandya-Jones A. , McDonel P. , Shishkin A. , Sirokman K. , Surka C. , Kadri S. , Xing J. , Goren A. , Lander E.S. et al. The Xist lncRNA exploits three-dimensional genome architecture to spread across the X chromosome . Science . 2013 ; 341 : 1237973 . Google Scholar Crossref Search ADS PubMed WorldCat 126. Tsai M.-C. 
, Manor O. , Wan Y. , Mosammaparast N. , Wang J.K. , Lan F. , Shi Y. , Segal E. , Chang H.Y. Long noncoding RNA as modular scaffold of histone modification complexes . Science . 2010 ; 329 : 689 – 693 . Google Scholar Crossref Search ADS PubMed WorldCat 127. van der Krol A.R. , Mur L.A. , Beld M. , Mol J.N. , Stuitje A.R. Flavonoid genes in petunia: addition of a limited number of gene copies may lead to a suppression of gene expression . Plant Cell . 1990 ; 2 : 291 – 299 . Google Scholar Crossref Search ADS PubMed WorldCat 128. Napoli C. , Lemieux C. , Jorgensen R. Introduction of a chimeric chalcone synthase gene into petunia results in reversible co-suppression of homologous genes in trans . Plant Cell . 1990 ; 2 : 279 – 289 . Google Scholar Crossref Search ADS PubMed WorldCat 129. Ha M. , Kim V.N. Regulation of microRNA biogenesis . Nat. Rev. Mol. Cell Biol. 2014 ; 15 : 509 – 524 . Google Scholar Crossref Search ADS PubMed WorldCat 130. Taft R.J. , Glazov E.A. , Lassmann T. , Hayashizaki Y. , Carninci P. , Mattick J.S. Small RNAs derived from snoRNAs . RNA . 2009 ; 15 : 1233 – 1240 . Google Scholar Crossref Search ADS PubMed WorldCat 131. Thompson D.M. , Lu C. , Green P.J. , Parker R. tRNA cleavage is a conserved response to oxidative stress in eukaryotes . RNA . 2008 ; 14 : 2095 – 2103 . Google Scholar Crossref Search ADS PubMed WorldCat 132. Haussecker D. , Huang Y. , Lau A. , Parameswaran P. , Fire A.Z. , Kay M.A. Human tRNA-derived small RNAs in the global regulation of RNA silencing . RNA . 2010 ; 16 : 673 – 695 . Google Scholar Crossref Search ADS PubMed WorldCat 133. Catalanotto C. , Cogoni C. , Zardo G. MicroRNA in control of gene expression: an overview of nuclear functions . Int. J. Mol. Sci. 2016 ; 17 : e1712 . Google Scholar Crossref Search ADS PubMed WorldCat 134. Zhai J. , Bischof S. , Wang H. , Feng S. , Lee T.F. , Teng C. , Chen X. , Park S.Y. , Liu L. , Gallego-Bartolome J. et al. 
A one precursor one siRNA model for pol IV-dependent siRNA biogenesis. Cell. 2015; 163: 445–455.
135. Duss O., Michel E., Yulikov M., Schubert M., Jeschke G., Allain F.H.-T. Structural basis of the non-coding RNA RsmZ acting as a protein sponge. Nature. 2014; 509: 588–592.
136. Bayne E.H., Portoso M., Kagansky A., Kos-Braun I.C., Urano T., Ekwall K., Alves F., Rappsilber J., Allshire R.C. Splicing factors facilitate RNAi-directed silencing in fission yeast. Science. 2008; 322: 602–606.
137. Chinen M., Morita M., Fukumura K., Tani T. Involvement of the spliceosomal U4 small nuclear RNA in heterochromatic gene silencing at fission yeast centromeres. J. Biol. Chem. 2010; 285: 5630–5638.
138. Ausin I., Greenberg M.V.C., Li C.F., Jacobsen S.E. The splicing factor SR45 affects the RNA-directed DNA methylation pathway in Arabidopsis. Epigenetics. 2012; 7: 29–33.
139. Dou K., Huang C.F., Ma Z.Y., Zhang C.J., Zhou J.X., Huang H.W., Cai T., Tang K., Zhu J.K., He X.J. The PRP6-like splicing factor STA1 is involved in RNA-directed DNA methylation by facilitating the production of Pol V-dependent scaffold RNAs. Nucleic Acids Res. 2013; 41: 8489–8502.
140. Huang C.F., Miki D., Tang K., Zhou H.R., Zheng Z., Chen W., Ma Z.Y., Yang L., Zhang H., Liu R. et al. A pre-mRNA-splicing factor is required for RNA-directed DNA methylation in Arabidopsis. PLoS Genet. 2013; 9: e1003779.
141. Elvira-Matelot E., Bardou F., Ariel F., Jauvion V., Bouteiller N., Le Masson I., Cao J., Crespi M.D., Vaucheret H. The nuclear ribonucleoprotein SmD1 interplays with splicing, RNA quality control and post-transcriptional gene silencing in Arabidopsis. Plant Cell. 2016; 28: 426–438.
142. Spies N., Nielsen C.B., Padgett R.A., Burge C.B. Biased chromatin signatures around polyadenylation sites and exons. Mol. Cell. 2009; 36: 245–254.
143. Schwartz S., Meshorer E., Ast G. Chromatin organization marks exon-intron structure. Nat. Struct. Mol. Biol. 2009; 16: 990–995.
144. Andersson R., Enroth S., Rada-Iglesias A., Wadelius C., Komorowski J. Nucleosomes are well positioned in exons and carry characteristic histone modifications. Genome Res. 2009; 19: 1732–1741.
145. Tilgner H., Nikolaou C., Althammer S., Sammeth M., Beato M., Valcárcel J., Guigó R. Nucleosome positioning as a determinant of exon recognition. Nat. Struct. Mol. Biol. 2009; 16: 996–1001.
146. de la Mata M., Lafaille C., Kornblihtt A.R. First come, first served revisited: factors affecting the same alternative splicing event have different effects on the relative rates of intron removal. RNA. 2010; 16: 904–912.
147. Kornblihtt A.R. Chromatin, transcript elongation and alternative splicing. Nat. Struct. Mol. Biol. 2006; 13: 5–7.
148. Alló M., Buggiano V., Fededa J.P., Petrillo E., Schor I., De Mata M., Agirre E., Plass M., Eyras E., Elela S.A. et al. Control of alternative splicing through siRNA-mediated transcriptional gene silencing. Nat. Struct. Mol. Biol. 2009; 16: 717–725.
149. Schor I.E., Fiszbein A., Petrillo E., Kornblihtt A.R. Intragenic epigenetic changes modulate NCAM alternative splicing in neuronal differentiation. EMBO J. 2013; 32: 2264–2274.
150. Ameyar-Zazoua M., Rachez C., Souidi M., Robin P., Fritsch L., Young R., Morozova N., Fenouil R., Descostes N., Andrau J.-C. et al. Argonaute proteins couple chromatin silencing to alternative splicing. Nat. Struct. Mol. Biol. 2012; 19: 998–1004.
151. König H., Ponta H., Herrlich P. Coupling of signal transduction to alternative pre-mRNA splicing by a composite splice regulator. EMBO J. 1998; 17: 2904–2913.
152. Ule J., Jensen K.B., Ruggiu M., Mele A., Ule A., Darnell R.B. CLIP identifies nova-regulated RNA networks in the brain. Science. 2003; 302: 1212–1215.
153. Lambert N., Robertson A., Jangi M., McGeary S., Sharp P.A., Burge C.B. RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol. Cell. 2014; 54: 887–900.
154. Sugimoto Y., Chakrabarti A.M., Luscombe N.M., Ule J. Using hiCLIP to identify RNA duplexes that interact with a specific RNA-binding protein. Nat. Protoc. 2017; 12: 611–637.
155. Tan D.S.W., Chong F.T., Leong H.S., Toh S.Y., Lau D.P., Kwang X.L., Zhang X., Sundaram G.M., Tan G.S., Chang M.M. et al. Long noncoding RNA EGFR-AS1 mediates epidermal growth factor receptor addiction and modulates treatment response in squamous cell carcinoma. Nat. Med. 2017; 23: 1167–1175.
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]