Nucleic Acids Research

journal article

Open Access Collection

SUMOylation of the m6A-RNA methyltransferase METTL3 modulates its function

Du, Yuzhang;Hou, Guofang;Zhang, Hailong;Dou, Jinzhuo;He, Jianfeng;Guo, Yanming;Li, Lian;Chen, Ran;Wang, Yanli;Deng, Rong;Huang, Jian;Jiang, Bin;Xu, Ming;Cheng, Jinke;Chen, Guo-Qiang;Zhao, Xian;Yu, Jianxiu

2018 Nucleic Acids Research

doi: 10.1093/nar/gky156; pmid: 29506078

Abstract

Methyltransferase like 3 (METTL3) is a key component of the large N6-adenosine-methyltransferase complex in mammals, which is responsible for N6-methyladenosine (m6A) modification of diverse RNAs including mRNA, tRNA, rRNA, small nuclear RNA, microRNA precursors and long non-coding RNA. However, the regulation of METTL3 activity and its post-translational modifications (PTMs) are poorly understood. Here we find that METTL3 is modified by SUMO1 mainly at lysine residues K177, K211, K212 and K215, and that this modification can be reduced by the SUMO1-specific protease SENP1. SUMOylation of METTL3 does not alter its stability, localization or interaction with METTL14 and WTAP, but significantly represses its m6A methyltransferase activity, resulting in decreased m6A levels in mRNAs. Consistent with this, the abundance of m6A in mRNAs is increased upon re-expression of the mutant METTL3-4KR compared with wild-type METTL3 in the human non-small cell lung carcinoma (NSCLC) cell line H1299-shMETTL3, in which endogenous METTL3 was knocked down. The alteration of m6A in mRNAs and the subsequent change in gene expression profiles, which are mediated by SUMOylation of METTL3, may directly influence the soft-agar colony formation and xenografted tumor growth of H1299 cells. Our results uncover an important mechanism by which SUMOylation of METTL3 regulates its m6A RNA methyltransferase activity.

INTRODUCTION

More than 140 types of nucleotide modifications have been reported in different cellular RNAs, including mRNAs, tRNAs, rRNAs, snRNAs and snoRNAs (1). N6-methyladenosine (m6A) is one of the most common modifications in mRNA, rRNA, tRNA, microRNA and long noncoding RNA (2,3).
More recently, m6A methylation has been reported to play important roles in the regulation of the circadian clock, meiosis, mRNA degradation and translation as well as microRNA processing (3–7); moreover, dysregulation of this modification is linked with cancer (8) as well as neurogenesis, learning and memory (9). m6A methylation is a dynamic and reversible modification that tends to occur at a subset of RRACH motifs (R = G or A; H = A, C or U) (10). It is catalyzed by the methyltransferase complex (METTL3, METTL14 and WTAP) (11,12) and removed by two demethylases, FTO and ALKBH5 (13,14). METTL3 (methyltransferase like 3, also known as MTA70) has been identified as the main methyltransferase critical for m6A methylation (15). Deletion or over-expression of METTL3 changes the total m6A methylation level, which directly affects the decay and translation of mRNA and microRNA biogenesis, leading to human disease. However, neither the post-translational modifications (PTMs) nor the self-regulatory properties of METTL3 have been reported to date.

SUMOylation is the process of attaching a small ubiquitin-like modifier (SUMO) to protein substrates at specific lysine residues (16,17). This reversible PTM can change the stability, localization, protein-protein interactions and activity of the targeted substrate protein (18–21). Importantly, dysregulation of the SUMO pathway is closely related to various human diseases (22–27).

In this study we found that METTL3 was modified by SUMO1 at the major sites K177, K211, K212 and K215. Sequence analysis revealed that these potential SUMOylation sites are highly conserved among METTL3 orthologues in different species. SUMOylation of METTL3 had little effect on its stability, localization or interaction with METTL14 and WTAP. Interestingly, we found that SUMOylation of METTL3 might repress its methyltransferase activity for m6A RNA methylation.
Furthermore, we showed that SUMOylation of METTL3 promoted colony formation and tumor growth in human non-small cell lung carcinoma (NSCLC) H1299 cells. These results suggest that SUMOylation of METTL3 is a novel molecular mechanism underlying regulation of m6A RNA methylation and its related physiological functions.

MATERIALS AND METHODS

Cell cultures and transfection

Human cells were cultured in Dulbecco's modified Eagle's medium (Hyclone) supplemented with 10% fetal bovine serum (FBS) and antibiotics. Cells were grown in a 5% CO2 cell culture incubator at 37°C. Cell transfection was performed using Lipofectamine 2000 (Invitrogen).

Antibodies and reagents

The following antibodies were used in the study: mouse-anti-Flag and mouse-anti-HA (from Sigma); rabbit-anti-METTL3, rabbit-anti-m6A, mouse-anti-GAPDH and rabbit-anti-SENP1 (from Abcam); rabbit-anti-CBP80, rabbit-anti-EIF3B, rabbit-anti-EIF4E, mouse-anti-His and rabbit-anti-METTL3 (from ProteinTech Group); mouse-anti-β-Actin (from Santa Cruz); and rabbit-anti-SUMO1 (from CST). Puromycin (#P8833) was obtained from Sigma. Ni2+-NTA agarose beads were purchased from Qiagen (Hilden, Germany) and Protein G Plus/Protein A agarose suspension (#IP05) was purchased from Calbiochem.

Plasmids

The human METTL3 and METTL14 cDNAs were amplified using the KOD-plus Kit (TOYOBO), then subcloned into the vectors pEF5-HA and pCMV-Tag2b, respectively. Mutations and truncations of METTL3 were generated by PCR-directed mutagenesis. HA-METTL3 was subcloned into the lentiviral vector pCD513B to produce lentivirus by transfection of HEK-293FT cells. The Flag-WTAP plasmid was kindly provided by Dr Jianzhao Liu. The shRNA sequence 5′-GCTAAACCTGAAGAGTGATAT-3′ targeting the METTL3 3′-UTR (shMETTL3) was designed and cloned into the lentiviral vector pLKO.1.

SUMOylation assays by Ni2+-NTA pull down

METTL3 SUMOylation was analysed in HEK-293T cells by the in vivo SUMOylation assay using Ni2+-NTA beads, as previously described by our lab (19–21,26–28).
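As a small aside on the shMETTL3 target sequence quoted under Plasmids, the sketch below shows one way a reader might sanity-check such a sequence before ordering oligos: it reports the length, the GC fraction and the reverse complement. The helper names (`reverse_complement`, `gc_fraction`) are hypothetical and are not part of the authors' workflow; only the 21-nt sequence itself comes from the text.

```python
def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string (A/C/G/T only)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(dna))


def gc_fraction(dna: str) -> float:
    """Return the fraction of G and C bases in the sequence."""
    return (dna.count("G") + dna.count("C")) / len(dna)


# Target sequence quoted in the Plasmids section, written 5'->3'.
SH_METTL3_TARGET = "GCTAAACCTGAAGAGTGATAT"

target_length = len(SH_METTL3_TARGET)
target_gc = gc_fraction(SH_METTL3_TARGET)
target_revcomp = reverse_complement(SH_METTL3_TARGET)

if __name__ == "__main__":
    print(f"length: {target_length} nt")
    print(f"GC fraction: {target_gc:.3f}")
    print(f"reverse complement (5'->3'): {target_revcomp}")
```

Checks of this kind say nothing about knockdown efficiency or off-target effects; they only confirm that the sequence as printed is internally consistent.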
SUMOylation analysis by immunoprecipitation (IP)

Endogenous SUMOylated METTL3 was detected by immunoprecipitation following the published protocol (29) with minor changes. Briefly, 5 × 10^7 cells were lysed in 1 ml of lysis buffer (20 mM sodium phosphate pH 7.4, 150 mM NaCl, 1% SDS, 1% Triton, 0.5% sodium deoxycholate, 5 mM EDTA, 5 mM EGTA, 10 mM N-ethylmaleimide (NEM), protease inhibitors and phosphatase inhibitors). The viscous lysate was sonicated until it became fluid and then diluted 1:10 with RIPA buffer without SDS (20 mM sodium phosphate, pH 7.4, 150 mM NaCl, 1% Triton, 0.5% sodium deoxycholate, 5 mM EDTA, 5 mM EGTA, 20 mM NEM, protease inhibitors and phosphatase inhibitors) and incubated with antibody-coupled beads overnight at 4°C. Beads were washed three times with high-salt buffer (20 mM sodium phosphate, pH 7.4, 500 mM NaCl, 1% Triton, 0.5% sodium deoxycholate, 5 mM EDTA, 5 mM EGTA, 20 mM NEM, protease inhibitors and phosphatase inhibitors). Finally, beads were boiled for 10 min in SDS sample buffer, followed by western blotting analysis.

Extraction of cytoplasmic and nuclear proteins

Extraction of cytoplasmic and nuclear proteins was performed using the Nuclear/Cytosol Fractionation Kit (#266-100, BioVision) according to the manufacturer's instructions.

Ubiquitination analysis by immunoprecipitation (IP)

Cells transfected with HA-METTL3 or HA-METTL3-4KR, with or without Myc-Ub plasmid, were lysed in RIPA buffer (50 mM Tris–HCl, pH 7.5, 150 mM NaCl, 1% NP-40 and a protease inhibitor cocktail) and subjected to immunoprecipitation, followed by immunoblotting with the indicated antibodies.

Immunofluorescence staining

HeLa-shpLKO.1, HeLa-shUbc9 and HeLa-shSENP1 cells grown on coverslips were fixed with 4% paraformaldehyde at room temperature, permeabilized with 0.2% Triton X-100, and then blocked with 10% goat serum in PBS.
Next, coverslips were incubated with primary antibody diluted in 5% goat serum in PBS (rabbit anti-METTL3 1:100) at 4°C overnight. Cells were washed five times with PBS and then incubated with fluorescent dye-conjugated secondary antibody diluted in 5% goat serum in PBS for 2 h, protected from light. Cells were then washed three times with PBS and stained with DAPI for 1 h. Immunofluorescence images were recorded with a laser scanning confocal microscope.

Soft agar colony forming assay

The effect of METTL3-WT and METTL3-4KR on cellular transformation and tumorigenesis was assessed using a soft agar colony assay as previously described (26). The assay was performed in six-well plates with a base of 2 ml of medium containing 10% FBS and 0.6% Bacto agar (Amresco). Stable cells were seeded in 2 ml of medium containing 10% FBS and 0.35% agar at 1.0 × 10^3 cells/well and layered on the base gel. Colonies developed in soft agar were photographed after staining with 0.05% crystal violet at day 20, and the number of colonies was scored with ImageJ (NIH, USA). At least three independent experiments were performed in triplicate.

Xenografted tumor model

Mouse xenograft models were established as described previously (26). Briefly, stable H1299 cell lines were injected subcutaneously into 5-week-old nude mice (n = 5) in 100 μl Opti-MEM containing 2.5 × 10^6 cells. Starting two weeks later, the tumors were measured every 3 days. Mice were killed 4 weeks later, and tumors were dissected and weighed. All animal studies were conducted with the approval and guidance of the Shanghai Jiao Tong University Medical Animal Ethics Committees.

Analysis of mRNA m6A methylation by dot-blotting assay

Analysis of mRNA m6A methylation by dot-blotting was performed following a published procedure with minor changes (13,30). Briefly, total RNAs were isolated using the Trizol method and mRNAs were isolated using the GenElute™ mRNA Miniprep Kit (Sigma).
The concentration and purity of mRNAs were measured with a NanoDrop 2000. The mRNAs were denatured by heating at 95°C for 5 min and then chilled directly on ice. Next, the mRNAs (50∼100 ng) were spotted directly onto a positively charged nylon membrane (GE Healthcare, USA) and air dried for 5 min. The membrane was then UV crosslinked in an ultraviolet crosslinker, blocked with 5% nonfat milk in TBST, and incubated with anti-m6A antibody overnight at 4°C. HRP-conjugated anti-rabbit IgG secondary antibody was added to the membrane for 1 h at room temperature with gentle shaking, and the membrane was then developed with enhanced chemiluminescence. Methylene blue staining was used to verify that equal amounts of mRNA were spotted on the membrane.

mRNA m6A quantification by LC–MS/MS

Polyadenylated RNA from the indicated cells was isolated using the Dynabeads™ mRNA Purification Kit (Invitrogen), followed by removal of contaminating rRNA with the RiboMinus transcriptome isolation kit (Invitrogen). The isolated mRNAs were subsequently digested into nucleosides, and the amount of m6A was measured by LC–MS/MS following the published procedure (11). The total contents of m6A and A were quantified on the basis of the corresponding standard curves generated using pure standards (Supplementary Figure S4A), from which the m6A/A ratio was calculated. The nucleosides were quantified using the nucleoside to base ion mass transitions of 282 to 150 (m6A) and 268 to 136 (A). Quantification was performed by comparison with the standard curve obtained from pure nucleoside standards run in the same batch as the samples. The ratio of m6A to A was calculated from the resulting concentrations.

In vitro m6A methyltransferase activity assay

The in vitro methyltransferase activity assay was performed following the published procedure (11).
In brief, a standard 50 μl reaction mixture contained the following components: 1.5 nmol RNA probes (Seq1: ACGAGUCCUGGACUGAAACGGACUUGC, Seq2: ACGAGUCCUGGAUUGAAACGGAUUUGC), purified Flag-METTL3-WT, Flag-METTL3-4KR or SUMOylated Flag-METTL3 proteins in combination with purified Flag-METTL14, 1 mM SAM, 80 mM KCl, 1.5 mM MgCl2, 0.2 U/μl RNasin, 10 mM DTT, 4% glycerol and 15 mM HEPES (pH 7.9). The reaction was incubated at 16°C for 12 h. Methylation of the RNA probe was measured by immunoblotting with the m6A antibody (Abcam), and one-tenth of the RNA was extracted for northern blotting.

MeRIP-m6A-Seq, RNA-Seq and data analysis

The m6A-Seq was performed by Cloudseq Biotech Inc. (Shanghai, China) according to the published procedure (31) with slight modifications. Briefly, 5 μg of fragmented mRNAs were saved as input control for RNA-seq, and 500 μg of fragmented mRNAs were incubated with 5 μg of anti-m6A polyclonal antibody (Synaptic Systems, 202003) in IPP buffer (150 mM NaCl, 0.1% NP-40, 10 mM Tris–HCl, pH 7.4) for 2 h at 4°C. The mixture was then immunoprecipitated by incubation with protein-A beads (Thermo Fisher) at 4°C for an additional 2 h. Bound mRNAs were eluted from the beads with N6-methyladenosine (BERRY & ASSOCIATES, PR3732) in IPP buffer and then extracted with Trizol reagent (Thermo Fisher) following the manufacturer's instructions. Purified mRNAs were used for RNA-seq library generation with the NEBNext® Ultra™ RNA Library Prep Kit (NEB). Both the input sample (without immunoprecipitation) and the m6A IP sample were subjected to 150 bp paired-end sequencing on an Illumina HiSeq sequencer. Paired-end reads were harvested from the Illumina HiSeq 4000 sequencer and quality controlled by Q30. After 3′ adaptor trimming and removal of low-quality reads with cutadapt (v1.9.3), the reads were aligned to the reference genome (UCSC HG19) with Hisat2 (v2.0.4). Methylated sites on RNAs (peaks) were identified with MACS software.
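Returning briefly to the two RNA probes listed for the in vitro methyltransferase assay above: the sketch below scans each probe for the RRACH consensus given in the Introduction (R = G or A; H = A, C or U). It is an illustration only; the function name `find_rrach` is hypothetical, and the reading that Seq2 serves as a motif-free counterpart of Seq1 is an inference from the sequences, not something stated in the text.

```python
import re

# RRACH consensus from the Introduction: R = G or A; H = A, C or U.
# The zero-width lookahead lets overlapping occurrences be reported separately.
RRACH = re.compile(r"(?=([GA][GA]AC[ACU]))")


def find_rrach(rna: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched 5-mer) for every RRACH occurrence in an RNA string."""
    return [(m.start(), m.group(1)) for m in RRACH.finditer(rna.upper())]


# Probe sequences as printed in the Methods.
PROBES = {
    "Seq1": "ACGAGUCCUGGACUGAAACGGACUUGC",
    "Seq2": "ACGAGUCCUGGAUUGAAACGGAUUUGC",
}

hits = {name: find_rrach(seq) for name, seq in PROBES.items()}

if __name__ == "__main__":
    for name, found in hits.items():
        print(name, found)
```

Under this pattern, Seq1 contains two GGACU sites (0-based positions 9 and 19), whereas Seq2, in which the C of each site is replaced by U, contains none.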
Differentially methylated sites on RNAs were identified by diffReps. The identified peaks were mapped to the transcriptome using in-house scripts.

Statistical analysis

Experiments were performed at least three times, and representative results are shown. Data for the mouse xenograft model and the soft agar colony forming assay are presented as means ± S.E.M. Statistical analysis was performed with Microsoft Excel analysis tools. Differences between individual groups were analyzed using the t-test (two-tailed and unpaired) with triplicate or quadruplicate sets. A value of P < 0.05 was considered statistically significant; P-values < 0.05 were marked with (*), < 0.01 with (**) and < 0.001 with (***).

RESULTS

METTL3 is SUMOylated in vitro and in vivo

To determine whether METTL3 can be SUMOylated in cells, we transiently co-transfected 293T cells with HA-METTL3, the SUMO E2 conjugating enzyme Flag-Ubc9 and His-tagged SUMO1, SUMO2 or SUMO3. His-SUMO conjugates were pulled down by Ni2+-NTA resin precipitation as described before (26,28) and immunoblotted. The result showed that METTL3 was modified strongly by SUMO1, moderately by SUMO2 and only very weakly by SUMO3 (Figure 1A). We therefore focused on SUMO1 modification of METTL3 in the following studies. Since Sentrin/SUMO-specific protease 1 (Senp1) is a SUMO1 modification-specific protease (19), we asked whether Senp1 can remove the SUMO1 modification. Indeed, the marked increase in SUMOylation of exogenous METTL3 induced by Ubc9 was greatly weakened by co-transfection of a Senp1 plasmid (Figure 1B). Moreover, SUMOylation of METTL3 was enhanced when endogenous Senp1 in HEK293T cells was knocked down by a specific shRNA for SENP1 (Figure 1C). Furthermore, we transfected His-SUMO1 and Flag-Ubc9, with or without Senp1 plasmid, into HEK293T cells to verify whether endogenous METTL3 can be modified by SUMO1.
SUMOylated bands of METTL3 detected by anti-METTL3 antibody accumulated significantly with Flag-Ubc9 and were almost completely removed by Senp1 (Figure 1D). In addition, to examine whether cellular stresses induce SUMOylation of METTL3, 293T cells transfected with His-SUMO1, Flag-Ubc9 and HA-METTL3 were treated with the chemotherapy drugs Camptothecin, Cisplatin, Doxorubicin or Etoposide at the indicated concentrations for 12 h before being harvested for SUMOylation analysis by Ni2+-NTA pull down. All four drugs greatly induced SUMOylation of METTL3 (Figure 1E). Since SUMO1 does not form polymeric chains in vivo (32), and our data showed SUMO1-modified METTL3 as two major bands between 100 and 130 kDa plus additional weak bands above 130 kDa, we concluded that METTL3 could be modified by SUMO1 at multiple sites.

Figure 1. METTL3 is modified by SUMO1. (A) METTL3 is mainly modified by SUMO1 in 293T cells. Lysates from 293T cells transfected with HA-METTL3, Flag-Ubc9 and His-SUMO1, -SUMO2 or -SUMO3 were subjected to precipitation with Ni2+-NTA resin for the SUMOylation assay, followed by western blotting with the indicated antibodies. (B) METTL3 is modified by SUMO1 at multiple sites, and the modification can be removed by SENP1. HA-METTL3 with or without His-SUMO1, Flag-Ubc9 and EBG-Senp1 were transfected into 293T cells and the SUMOylation assay was conducted with Ni2+-NTA resin. (C) Knockdown of Senp1 enhances METTL3 SUMOylation. Senp1 was stably knocked down by shRNA in the lentiviral system in 293T cells. Plasmids as indicated were co-transfected into the stable cell lines. Lysates were used for the Ni2+-NTA resin precipitation and METTL3 SUMOylation was detected by anti-HA antibody. (D) Endogenous METTL3 is modified by SUMO1.
His-SUMO1 with or without Flag-Ubc9 and EBG-Senp1 were transfected into 293T cells, followed by the SUMOylation assay for detection of SUMOylated bands with anti-METTL3 antibody. (E) Chemotherapy drugs induce SUMOylation of METTL3. 293T cells were transfected with His-SUMO1, Flag-Ubc9 and HA-METTL3 for 24 h, and then treated with Camptothecin (20 μM), Cisplatin (10 μM), Doxorubicin (2 μM) or Etoposide (10 μM) for 12 h before cells were harvested. Ni2+-NTA pull down was performed to detect SUMO1 modification of METTL3. (F) SUMOylation of endogenous METTL3 was confirmed by the IP method. HeLa-pLKO.1, HeLa-shSenp1 and HeLa-shUbc9 cells were lysed for immunoprecipitation with anti-METTL3 antibody or normal IgG, followed by western blotting with anti-SUMO1 and anti-METTL3 antibodies. (G) SUMOylation of endogenous METTL3 occurs naturally in H1299 cells. Lysates from H1299 cells were used for immunoprecipitation with anti-SUMO1 antibody or normal IgG, followed by western blotting with anti-METTL3 antibody. (H) H1299-shMETTL3 cells stably re-expressing HA-METTL3-WT or 4KR were treated with Etoposide (10 μM) for 12 h, and then harvested for SUMOylation analysis by the IP method.

The above results showing that METTL3 could be SUMOylated were all based on the over-expression system and the Ni2+-NTA pull-down assay; we therefore asked whether endogenous METTL3 is modified by endogenous SUMO1. To this end, we performed a SUMOylation analysis using a slightly modified version of the immunoprecipitation (IP) method originally described by Barysch et al. (29). We generated UBC9- and SENP1-knockdown HeLa cells using shRNAs in the lentiviral vector pLKO.1 (Supplementary Figure S1).
These cells were lysed in the denaturing lysis buffer described in the Methods and immunoprecipitated with anti-METTL3 antibody or normal IgG, followed by western blotting with anti-SUMO1 and anti-METTL3 antibodies. The result showed that endogenous METTL3 was moderately modified by endogenous SUMO1 in HeLa-pLKO.1 cells. As expected, the SUMO1 modification of METTL3 was enhanced by knockdown of Senp1, whereas it was almost completely abolished by knockdown of Ubc9 (Figure 1F, upper panels). As shown by the input controls, knockdown of either SENP1 or UBC9 did not affect the protein levels of METTL3 (Figure 1F, lower panels). To further support the conclusion that endogenous METTL3 is modified by SUMO1 in vivo, we performed the reciprocal experiment, an IP with anti-SUMO1 followed by immunoblotting with anti-METTL3. We observed that METTL3 was naturally modified by SUMO1, with two major bands, in H1299 cells (Figure 1G). Moreover, the effect of Etoposide in inducing METTL3 SUMOylation was confirmed by the same IP method in the stable cell line H1299-shMETTL3-HA-METTL3 (H1299-shMETTL3 cells stably re-expressing HA-METTL3-WT) (Figure 1H). Taken together, these results showed that METTL3 was SUMOylated at multiple sites in vitro and in vivo.

K177/211/212/215 are major SUMO-sites of METTL3

To identify the major sites for SUMOylation of human METTL3, seven lysines (Ks) predicted by the SUMOplot software, namely K27, K132, K163, K164, K207, K513 and K530 (Supplementary Figure S2A), were individually mutated to arginine (R). The SUMOylation assays revealed that the single (or double) KR mutations did not change the band pattern of SUMOylated METTL3, indicating that none of these sites was the major SUMO acceptor site of METTL3 (Figure 2A). Because prediction did not reveal the major SUMO acceptor sites, we adopted the strategy of mutating all the other lysines of METTL3, which has 36 lysines in total.
We mutated these lysines individually, or together when they were adjacent. We co-transfected the resulting single-, double-, triple- and quadruple-lysine mutants with the plasmids His-SUMO1/Flag-Ubc9 into 293T cells for the SUMOylation assay. Compared with WT (wild-type) and with mutants including K12/13R, K62R, K80/81R, K122R, K327R, K345R, K388R (Figure 2B), K235/240/241R, K256/263R, K281/286R, K296/305R and K480R (Figure 2C), the mutants K177R and K211/212/215R (3KR) showed notably reduced SUMO1 modification of METTL3 (Figure 2C and D). However, single mutation of each lysine of K211/212/215 did not reduce the SUMOylation level of METTL3 compared with WT (Figure 2D), suggesting that only the triple mutation 3KR at this K-cluster could interfere with METTL3 SUMOylation. Furthermore, we generated a new mutant, K177/211/212/215R (4KR), and found that the SUMOylated bands of 4KR were reduced more strongly than those of 3KR (Figure 2D), although the mutation did not completely remove the two major bands of SUMO1-METTL3 and (SUMO1)2-METTL3, which are covalently conjugated with one and two molecules of SUMO1, respectively. To confirm the above results, H1299-shMETTL3 cells stably re-expressing HA-METTL3-WT or 4KR were harvested for SUMOylation analysis by the IP method. Cell lysates were used for IP with anti-HA antibody, followed by immunoblotting with anti-SUMO1 and anti-HA antibodies, which showed that the SUMOylation of the mutant METTL3-4KR was clearly reduced compared with that of METTL3-WT in H1299 stable cell lines (Figure 2E). Using the Ni2+-NTA method, we also confirmed that the SUMOylated bands of METTL3-4KR detected by both anti-METTL3 and anti-HA antibodies were significantly reduced compared with METTL3-WT (Supplementary Figure S2B). In addition, the lysines K177/211/212/215 of human METTL3 are highly conserved among its homologues in different species (Supplementary Figure S2C).
Collectively, these results suggested that K177/211/212/215 were major SUMOylation acceptor sites of human METTL3.

Figure 2. K177, K211, K212 and K215 are the major SUMOylation sites in METTL3. (A–D) The mutant 4KR (K177/211/212/215R) greatly reduces SUMOylation of METTL3. HA-tagged wild-type (WT) or different METTL3 mutants and His-SUMO1/Flag-Ubc9 were expressed in 293T cells. Lysates were prepared for Ni2+-NTA pull down, followed by western blotting with the indicated antibodies. (E) SUMOylation at K177, K211, K212 and K215 of METTL3 was confirmed in stable H1299 cell lines by the IP method. H1299-shMETTL3 cells re-expressing HA-METTL3-WT or -4KR were lysed for immunoprecipitation with anti-HA antibody or normal IgG, followed by western blotting with anti-SUMO1 and anti-HA antibodies.

SUMOylation of METTL3 does not affect its stability, localization and interaction with METTL14 and WTAP

As SUMOylation may alter the stability, localization and activity of target proteins (21,32), it is not easy to predict which properties of METTL3 are influenced by its SUMOylation. Firstly, we asked whether SUMOylation of METTL3 affects its stability.
To investigate whether METTL3 degradation depends mainly on the proteasomal or the lysosomal pathway, we treated 293T cells overexpressing HA-METTL3 with the proteasome inhibitor MG132 or the lysosome inhibitor chloroquine. METTL3 accumulated in cells treated with MG132 but not in cells treated with chloroquine (Figure 3A), indicating that METTL3 was mainly degraded via the proteasome pathway. To test whether SUMOylation affects the ubiquitination of METTL3, we transfected HA-METTL3 and Myc-Ub, with or without SUMO1/Ubc9, into 293T cells. Immunoblotting with anti-Myc antibody of the complexes immunoprecipitated with anti-HA antibody showed that SUMOylation of METTL3 had no significant effect on its ubiquitination (Figure 3B). Furthermore, we co-transfected HA-METTL3-WT or -4KR, with or without Myc-Ub, into 293T cells for the ubiquitination assay, which showed that the ubiquitination levels were comparable between METTL3-WT and METTL3-4KR (Figure 3C), suggesting that SUMOylation of METTL3 did not affect its stability.

Figure 3. SUMOylation of METTL3 does not influence its stability, localization or interaction with METTL14 and WTAP. (A) METTL3 is degraded mainly via the proteasome pathway. HA-METTL3 was transfected into 293T cells. At 42 h after transfection, cells were treated with 40 μM MG132 or 100 μM chloroquine for 6 h and lysates were subjected to western blotting analysis. The relative level of METTL3 was quantified with ImageJ (V1.45). (B, C) SUMOylation of METTL3 does not affect its ubiquitination. (B) 293T cells transfected with the indicated plasmids were lysed with RIPA buffer and lysates were used for IP with anti-Myc antibody, followed by western blotting with the indicated antibodies to detect ubiquitination of METTL3. (C) HA-METTL3-WT and HA-METTL3-4KR with or without Myc-Ub were transfected into 293T cells. Lysates were used for IP with anti-Myc antibody, and then immunoblotted with the indicated antibodies.
(D, E) METTL3 SUMOylation does not change its nuclear localization. (D) HA-METTL3 with or without His-SUMO1, Flag-Ubc9 or EBG-Senp1 were co-transfected into 293T cells. Forty-eight hours later, cells were fractionated into cytosolic and nuclear fractions and immunoblotted with the indicated antibodies. (E) Western blotting analysis of the distribution of METTL3 in nuclear and cytoplasmic fractions of HeLa-shControl, HeLa-shUbc9 or HeLa-shSenp1 cells. (F) Stable HeLa-shControl, HeLa-shUbc9 or HeLa-shSenp1 cells were immunostained with anti-METTL3 antibody and DAPI. Scale bar, 12.5 μm. (G, H) SUMOylation of METTL3 does not affect its interaction with METTL14 or WTAP. HA-METTL3-WT and HA-METTL3-4KR were transfected with or without Flag-METTL14 or Flag-WTAP into 293T cells. Lysates were used for IP with anti-HA antibody, followed by immunoblotting with the indicated antibodies.

Secondly, to determine whether SUMOylation of METTL3 influences its nuclear localization, 293T cells transfected with HA-METTL3, SUMO1/Ubc9 and SENP1 (Supplementary Figure S3A) were separated into cytoplasmic and nuclear protein fractions. Immunoblotting showed that METTL3 was mainly located in the nucleus and that its localization did not differ among the three transfection conditions (Figure 3D), indicating that SUMOylation of METTL3 might not affect its nuclear localization. To further confirm this, we used the stable cell lines HeLa-shUbc9 and HeLa-shSenp1 (Supplementary Figure S1) to prepare nuclear/cytosol fractions (Figure 3E) and to perform immunofluorescent staining (Figure 3F). Both approaches showed the same pattern: METTL3 was mainly located in the nucleus in both the highly and the weakly SUMOylated state. The above data demonstrated that SUMOylation of METTL3 did not influence its nuclear localization.

The m6A RNA methylation is catalyzed by a large methyltransferase complex containing METTL3, METTL14, WTAP and other components (12,15,33).
More importantly, in vitro studies show that METTL3 and METTL14 interact directly and stabilize each other at the protein level (11,34), and increasing evidence suggests that METTL3 is the catalytically active subunit while METTL14 plays a structural role critical for substrate recognition (35). There is also evidence that WTAP functions as a regulatory subunit of the m6A methyltransferase complex and interacts directly with METTL3 (12). As SUMOylation can regulate protein-protein interactions (27), we asked whether SUMOylation of METTL3 alters its interaction with METTL14 and WTAP. First, we investigated whether METTL14 is SUMOylated by using the Ni2+-NTA pulldown assay, which showed that METTL14 was not SUMOylated upon co-transfection with SUMO1/Ubc9/Senp1 in 293T cells (Supplementary Figure S3B). Next, we transfected HA-METTL3-WT or -4KR with or without Flag-METTL14 or Flag-WTAP into 293T cells. Cell lysates were used for IP with anti-HA antibody followed by immunoblotting analysis, which showed that the interactions of METTL3-WT and METTL3-4KR with METTL14 (Figure 3G) or WTAP (Figure 3H) were not obviously different. This result suggested that SUMOylation of METTL3 did not change its interaction with METTL14 and WTAP. Since METTL3 can interact with translation initiation factors such as CBP80, eIF4E and eIF3B to enhance translation (36), we examined whether SUMOylation influences the interaction between METTL3 and the translation initiation machinery. The results revealed that SUMOylation did not influence the interaction between METTL3 and the translation initiation factors CBP80, eIF4E and eIF3B (Supplementary Figure S3C), indicating that SUMOylation of METTL3 probably did not affect translation efficiency. Taken together, these data revealed that SUMOylation of METTL3 did not influence its stability, its localization, or its interaction with METTL14/WTAP and the translation initiation machinery.
SUMOylation of METTL3 inhibits its m6A RNA methyltransferase activity

Since SUMOylation can regulate the activity of some SUMO-targeted enzymes (18,37), we wondered whether SUMOylation of METTL3 changes its m6A RNA methyltransferase activity. First, we examined the effect of METTL3 on the total abundance of m6A methylation in mRNAs. Indeed, knockdown of METTL3 by a specific shRNA significantly reduced the m6A modification level in mRNAs in both 293T and H1299 cells (Figure 4A). Interestingly, the mRNA m6A abundance in 293T-shSenp1 stable cells was notably lower than that in 293T-shControl cells (Figure 4B), indicating that SUMO modification might regulate mRNA m6A levels, possibly by inhibiting the activity of one of the components of the large m6A RNA methyltransferase complex, such as METTL3. To verify this, we transfected HA-METTL3 with or without SUMO1/Ubc9 plasmids into 293T cells and performed a dot-blot assay to detect the m6A levels in mRNAs with anti-m6A antibody. As expected, over-expression of METTL3 increased the m6A modification level, whereas this enhancement was almost abolished by co-transfection with SUMO1/Ubc9 (Figure 4C), suggesting that SUMOylation of METTL3 might repress its m6A RNA methyltransferase activity. Furthermore, mRNAs from 293T cells transfected with the empty vector pEF5HA, HA-METTL3-WT or HA-METTL3-4KR were extracted for the same dot-blot assay. The m6A modification level in cells transfected with the SUMO-site mutant METTL3-4KR was higher than that in cells transfected with METTL3-WT when the protein expression levels of WT and mutant METTL3 were comparable (Figure 4D). Consistent with this, the stable cell line H1299-shMETTL3 re-expressing HA-METTL3-4KR showed a higher abundance of m6A modification in mRNAs than cells re-expressing HA-METTL3-WT (Figure 4E).
These results confirmed that SUMOylation-deficient METTL3 carrying the 4KR mutation displayed relatively higher m6A methyltransferase activity. Additionally, we transfected HA-METTL3-WT or the mutant HA-METTL3-4KR with or without His-SUMO1/Flag-Ubc9 into 293T cells. The dot-blot assay showed that co-transfection of SUMO1/Ubc9 with METTL3-WT reduced the abundance of m6A in mRNAs compared with METTL3-WT alone, whereas METTL3-4KR yielded a much higher m6A abundance that differed little with or without co-transfection of SUMO1/Ubc9 (Figure 4F).

Figure 4. SUMO1 modification of METTL3 represses its RNA m6A methyltransferase activity. (A–E) Polyadenylated mRNAs were purified for the dot-blot assay (upper panels), and cell lysates were used for immunoblotting with the indicated antibodies (lower panels). (A) METTL3 is a main component responsible for the abundance of m6A in mRNAs. The abundance of m6A in mRNAs from shControl or shMETTL3 293T and H1299 cells was detected by the dot-blot assay with anti-m6A antibody, and equal loading of the mRNAs was verified by methylene blue staining (upper panels). METTL3 knockdown efficiency in 293T and H1299 cells is shown (lower panels). (B) The level of m6A in mRNAs is low under the high-SUMOylation status of SENP1 knockdown cells. (C) SUMOylation of METTL3 reduces its m6A methyltransferase activity. HA-METTL3 with or without His-SUMO1/Flag-Ubc9 was transfected into 293T cells. (D–F) The SUMO-site mutation 4KR (K177/211/212/215R) of METTL3 significantly enhances its m6A methyltransferase activity. (D) HA-METTL3-WT or -4KR was transiently transfected into 293T cells, and (E) HA-METTL3-WT or -4KR was stably re-expressed in H1299-shMETTL3 cells by using the lentiviral system. (F) HA-METTL3-WT or -4KR was transfected with or without His-SUMO1/Flag-Ubc9 into 293T cells. The SUMOylation assays and dot-blot assays were performed as described before. (G) LC–MS/MS quantification of the m6A/A ratio in polyadenylated RNAs purified from H1299-shMETTL3 cells with METTL3-WT or METTL3-4KR. Error bars indicate mean ± S.D. (two technical replicates). (H) The in vitro RNA N6-adenosine methylation activity was tested using purified Flag-METTL3-WT, SUMOylated Flag-METTL3-WT or Flag-METTL3-4KR proteins in combination with purified Flag-METTL14 and an RNA probe (Seq1) containing the consensus sequence ‘GGACU’. The methylation of the RNA probe was measured by immunoblotting with the m6A antibody.

Furthermore, mRNAs from the H1299-shMETTL3 cells stably re-expressing HA-METTL3-WT or HA-METTL3-4KR were isolated and digested into nucleosides, and the amount of m6A was measured by LC–MS/MS. The total contents of m6A and A were quantified based on a standard curve generated using pure standards, from which the m6A/A ratio was calculated (Supplementary Figure S4). Consistent with the dot-blot results, LC–MS/MS showed that the mRNA m6A level upon re-expression of METTL3-4KR was significantly higher than that with METTL3-WT (Figure 4G).

To test whether SUMOylation of METTL3 can directly affect m6A formation, an in vitro methyltransferase activity assay was performed. Flag-METTL3-WT, SUMOylated Flag-METTL3-WT, Flag-METTL3-4KR and Flag-METTL14 proteins were purified from HEK293T cells. Purified METTL3 proteins in combination with METTL14 were incubated with an RNA probe oligo (Seq1) containing the consensus sequence ‘GGACU’ (Supplementary Figure S5A). Methylation of the RNA probe was measured by immunoblotting with anti-m6A antibody. As expected, the METTL3-METTL14 complexes exhibited m6A methyltransferase activity in vitro against the RNA probe Seq1 but not against the RNA probe Seq2, in which the consensus sequence was mutated to ‘GGAUU’ (Supplementary Figure S5B and C).
METTL3-4KR showed much higher activity than METTL3-WT; in contrast, SUMO1-modified METTL3-WT protein displayed very little methyltransferase activity (Figure 4H). Thus, our data demonstrated that SUMOylation of METTL3 repressed its m6A RNA methyltransferase activity.

SUMOylation of METTL3 promotes tumorigenesis in H1299 cells by decreasing the m6A level in mRNAs and subsequently changing gene expression profile

Recently, a growing number of studies have reported that m6A methylases and demethylases are associated with cancer (8,30,38,39). As METTL3 is the core component of the m6A methyltransferase complex, we wondered whether SUMOylation of METTL3 is connected with tumorigenesis. The stable cell lines H1299-shMETTL3 re-expressing HA-METTL3-WT or HA-METTL3-4KR described above (Figure 4E) were used in the soft agar colony-forming assay to evaluate the cellular transformation of each stable cell line. The number of colonies formed by cells re-expressing METTL3-4KR was lower than that formed by cells re-expressing METTL3-WT (Figure 5A). We also investigated whether SUMOylation of METTL3 affects xenograft tumour growth in vivo. The same stable cell lines were injected subcutaneously into the flanks of nude mice; the average sizes and weights of tumors in the METTL3-4KR group were significantly reduced compared with those in the METTL3-WT group at 35 days after injection (Figure 5B), consistent with the results of the colony formation assays. We also confirmed that the SUMOylation level of METTL3-WT in xenograft tumours was higher than that of METTL3-4KR (Figure 5C). Combined with the mRNA m6A abundance results (Figure 4E), these data indicated that tumor growth was negatively correlated with the total level of m6A in H1299 cells, and that the increased m6A level caused by the SUMO-site mutations in METTL3-4KR potentially suppressed tumor growth.

Figure 5. SUMOylation of METTL3 promotes tumorigenesis in H1299 cells. (A) The 4KR mutation in METTL3 reduces the soft-agar colony formation of H1299 cells. H1299-shMETTL3 cells stably re-expressing METTL3-WT or METTL3-4KR were seeded at 1000 cells per well in 2 ml of medium containing 10% FBS with 0.35% soft agarose and layered onto 0.6% solidified agarose. Photographs were taken 14 days after seeding, and the colonies were counted and analysed. Each value represents the mean ± s.e.m. of three independent experiments with triplicates. (B) SUMOylation of METTL3 promotes xenograft tumour growth. H1299-shMETTL3 stable cell lines re-expressing METTL3-WT or METTL3-4KR (2.5 × 10^6 cells each) were injected subcutaneously into male BALB/c nude mice. Mice were killed 35 days later, and tumors were dissected (left panels) and assessed by weight (right panels). (C) Xenograft tumour tissues were lysed in NEM-RIPA buffer and immunoprecipitated with SUMO1 antibody, followed by western blotting with anti-METTL3. One-tenth of the lysates was immunoblotted as Input with the indicated antibodies.

To validate the above hypothesis, we performed transcriptome-wide m6A-sequencing (m6A-Seq) and RNA-sequencing (RNA-Seq) assays (Supplementary Table S1) using the same stable cell lines H1299-shMETTL3 re-expressing METTL3-WT or METTL3-4KR. The MeRIP m6A-Seq showed that, overall, the SUMO-site mutations of METTL3-4KR increased the abundance of m6A modification in target transcripts compared with METTL3-WT (Figure 6A), especially in 3′ UTRs and around stop codons (Figure 6B), where mRNA m6A modification is reported to be highly enriched (31,40,41). Compared with re-expression of METTL3-WT, METTL3-4KR significantly increased the abundance of 3285 m6A peaks and decreased that of 2156 peaks (Figure 6C), which were termed hyper- and hypo-methylated m6A peaks, respectively. Moreover, the RNA-Seq showed that the levels of a subset of transcripts changed in H1299-shMETTL3 cells re-expressing METTL3-4KR compared with those re-expressing METTL3-WT (Figure 6D). Combined analysis of the two sequencing datasets revealed at least 90 genes with significant changes in both m6A peak abundance and post-transcriptional level upon re-expression of METTL3-4KR compared with METTL3-WT (Supplementary Table S2), suggesting that SUMOylation of METTL3 might down-regulate m6A modification in mRNAs and subsequently alter the gene expression profile in H1299 cells, thereby promoting tumorigenesis.

Figure 6. SUMOylation of METTL3 down-regulates m6A modification in mRNAs resulting in the alteration of gene expression profile.
(A) Cumulative distribution curve for the abundance of m6A modification across the transcriptome of H1299-shMETTL3 cells re-expressing METTL3-WT or METTL3-4KR. (B) Distribution of m6A peaks around stop codons and 3′ UTRs across the entire set of mRNA transcripts. (C) Comparison of the abundance of m6A peaks across the transcriptome of H1299-shMETTL3 cells re-expressing METTL3-WT or METTL3-4KR. A fold-change ≥2.0 in the abundance of m6A peaks of METTL3-4KR relative to METTL3-WT was considered significant. IP/Input refers to the abundance of an m6A peak in mRNAs detected in MeRIP m6A-Seq (IP) normalized by that detected in RNA-Seq (Input). (D) Heatmap showing the alteration of mRNA expression profiles in H1299-shMETTL3 cells re-expressing METTL3-WT or METTL3-4KR.

DISCUSSION

Increasing evidence has shown that m6A methylation plays important roles in regulating RNA metabolism and biological processes (4,42–45).
In eukaryotes, m6A RNA methylation is catalyzed by the methyltransferase complex containing METTL3, METTL14, WTAP and other unknown components, and is removed by the demethylases FTO and ALKBH5 (7,40,41). Although METTL3 is the most important component of the methyltransferase complex, its regulatory mechanisms are still largely unknown. Our data demonstrated for the first time that METTL3 was modified by SUMO1 at multiple sites in vitro and in vivo (Figure 1), of which K177, K211, K212 and K215 were the major SUMOylation sites (Figure 2). We attempted to explore the mechanism by which SUMOylation of METTL3 affects its m6A methyltransferase activity. SUMOylation is known to alter the localization and stability of target proteins, and to activate or inhibit enzyme activity by changing the inter- or intra-molecular interactions of SUMOylated proteins (18). In this study, we showed that SUMOylation of METTL3 did not alter its stability (Figure 3A–C), its localization (Figure 3D–E), its interactions with METTL14 and WTAP, two key components of the methyltransferase complex (Figure 3G–H), or its interactions with the translation initiation factors CBP80, eIF4E and eIF3B (Supplementary Figure S3C). Interestingly, however, we found that increased SUMOylation of METTL3 by co-expression of SUMO1/Ubc9 clearly reduced the RNA m6A level in 293T cells (Figure 4C), consistent with the result that knockdown of Senp1 significantly attenuated the m6A levels in mRNAs (Figure 4B). Overexpression of METTL3-4KR in 293T cells (Figure 4D) or re-expression of METTL3-4KR in H1299-shMETTL3 cells (Figure 4E–F) clearly increased the abundance of m6A in mRNAs compared with the METTL3-WT group. This finding was confirmed by LC–MS/MS (Figure 4G) and MeRIP-seq (Figure 6A–C).
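As a rough illustration of the two quantitative readouts cited here, the following Python sketch shows (i) how an m6A/A ratio can be derived from standard curves, as described for the LC–MS/MS measurement, and (ii) how a peak can be labelled hyper- or hypo-methylated from IP/Input enrichment using the fold-change threshold of 2.0 stated for Figure 6C. The function names, the linear least-squares form of the standard curve, and the reciprocal cut-off (≤0.5) for hypo-methylation are assumptions made for illustration; they are not taken from the authors' analysis pipeline.

```python
from statistics import mean


def fit_standard_curve(amounts, signals):
    """Fit signal = slope * amount + intercept by least squares.

    Assumption: the standard curve is linear over the measured range.
    """
    mx, my = mean(amounts), mean(signals)
    sxx = sum((x - mx) ** 2 for x in amounts)
    sxy = sum((x - mx) * (y - my) for x, y in zip(amounts, signals))
    slope = sxy / sxx
    return slope, my - slope * mx


def amount_from_signal(signal, curve):
    """Invert the fitted line to convert a measured signal into an amount."""
    slope, intercept = curve
    return (signal - intercept) / slope


def m6a_to_a_ratio(m6a_signal, a_signal, m6a_curve, a_curve):
    """m6A/A ratio from the two nucleoside signals and their standard curves."""
    return amount_from_signal(m6a_signal, m6a_curve) / amount_from_signal(a_signal, a_curve)


def enrichment(ip, input_):
    """Peak abundance: MeRIP m6A-Seq signal (IP) normalized by RNA-Seq (Input)."""
    return ip / input_


def classify_peak(enrich_4kr, enrich_wt, threshold=2.0):
    """Label a peak by the METTL3-4KR / METTL3-WT fold change.

    The >= 2.0 threshold follows the Figure 6C legend; treating a fold change
    <= 1 / threshold as hypo-methylated is an illustrative assumption.
    """
    fold = enrich_4kr / enrich_wt
    if fold >= threshold:
        return "hyper"
    if fold <= 1.0 / threshold:
        return "hypo"
    return "unchanged"


if __name__ == "__main__":
    # Hypothetical standards and signals, for demonstration only.
    m6a_curve = fit_standard_curve([1.0, 2.0, 4.0], [10.0, 20.0, 40.0])
    a_curve = fit_standard_curve([100.0, 200.0, 400.0], [50.0, 100.0, 200.0])
    print(m6a_to_a_ratio(30.0, 500.0, m6a_curve, a_curve))
    print(classify_peak(enrichment(40.0, 10.0), enrichment(10.0, 10.0)))
```

Both calculations are ratios of normalized quantities, so any systematic scaling of the raw signals that affects numerator and denominator equally cancels out; this is one reason such ratio readouts are convenient for comparing the METTL3-WT and METTL3-4KR groups.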
Recent studies have shown that the methyltransferase domain (MTD, residues 369–590) and the two Cys-Cys-Cys-His (CCCH)-type zinc finger motifs (ZnF, 259–340) of METTL3, along with the METTL14 MTD, are necessary for RNA m6A modification in in vitro methylation activity assays (46). Although the crystal structure of the core METTL3-METTL14 complex comprising the MTase domains has already been elucidated (47), the structure of full-length METTL3 remains elusive. Most convincingly, our in vitro m6A methylation activity assay showed that SUMO1 modification of METTL3 could directly reduce its methyltransferase activity in combination with METTL14 (Figure 4H). As the major SUMOylation acceptor sites K177/211/212/215 are located in the N-terminal domain, which is adjacent to the two ZnF motifs but distant from the MTD of METTL3, we speculated that SUMOylation might spatially influence the interaction of the ZnF and MTD with substrate mRNAs, thereby ultimately inhibiting m6A methyltransferase activity. Thus, our data revealed that SUMOylation might be an important mechanism for controlling METTL3 methyltransferase activity; however, the detailed mechanism by which SUMOylation affects METTL3 activity remains unclear.

It is becoming increasingly clear that SUMO modification plays important roles in the development and progression of cancer (19–21,26–28,48–50). Growing evidence has also revealed that key enzymes for m6A demethylation such as ALKBH5 (8,39) and FTO (38) have important regulatory roles in tumorigenesis. In this study, we found that the SUMO-site mutant METTL3-4KR repressed the anchorage-independent growth and xenograft tumor growth of H1299 cells (Figure 5), possibly as a result of the higher abundance of m6A methylation in mRNAs compared with METTL3-WT (Figures 4E, G and 6A–C).
These changes of m6A levels in mRNAs could subsequently affect some parts of transcripts (Figure 6D) and the gene expression pattern in H1299 cells (Supplementary Table S2), which influenced tumorigenesis. Thus, for the first time we reported this novel mechanism for SUMOylation of METTL3 promoted tumorigenesis in H1299 cells by controlling the m6A levels in mRNAs. However, we speculated that tumorigenesis regulated by SUMOylation of METTL3 is more probably dependent on cell types, in which lncRNAs and pri-miRNAs besides gene expression profiles and m6A levels in mRNAs, will be investigated further by high-throughput sequencing. SUPPLEMENTARY DATA Supplementary Data are available at NAR Online. ACKNOWLEDGEMENTS The authors thank Professor Jianzhao Liu in Zhejiang University for the experiment of mRNA m6A quantification by LC–MS/MS. Author contributions: Y.D., G.H. and X.Z. performed most of the experiments; H.Z., J.D., J.H., Y.G., L.L., R.C., Y.W., R.D. and J.H. helped with all experiments; J.Y., X.Z., G-Q C., J.C. and B.J. analyzed and discussed data; J.Y., X.Z. and Y.D. wrote the manuscript. All authors read and approved the final manuscript. FUNDING National Natural Science Foundation of China [31671345, 81630075, 81472571 to J.Y.; 81602251 to Y.W.; 81721004 to G.-Q.C.]. Funding for open access charge: National Natural Science Foundation of China. Conflict of interest statement. None declared. REFERENCES 1. Machnicka M.A. , Milanowska K. , Osman Oglou O. , Purta E. , Kurkowska M. , Olchowik A. , Januszewski W. , Kalinowski S. , Dunin-Horkawicz S. , Rother K.M. et al. MODOMICS: a database of RNA modification pathways–2013 update . Nucleic Acids Res. 2013 ; 41 : D262 – D267 . Google Scholar CrossRef Search ADS PubMed 2. Chandola U. , Das R. , Panda B. Role of the N6-methyladenosine RNA mark in gene regulation and its implications on development and disease . Brief. Funct. Genomics . 2015 ; 14 : 169 – 179 . Google Scholar CrossRef Search ADS PubMed 3. Alarcon C.R. 
, Lee H. , Goodarzi H. , Halberg N. , Tavazoie S.F. N6-methyladenosine marks primary microRNAs for processing . Nature . 2015 ; 519 : 482 – 485 . Google Scholar CrossRef Search ADS PubMed 4. Fustin J.M. , Doi M. , Yamaguchi Y. , Hida H. , Nishimura S. , Yoshida M. , Isagawa T. , Morioka M.S. , Kakeya H. , Manabe I. et al. RNA-methylation-dependent RNA processing controls the speed of the circadian clock . Cell . 2013 ; 155 : 793 – 806 . Google Scholar CrossRef Search ADS PubMed 5. Schwartz S. , Agarwala S.D. , Mumbach M.R. , Jovanovic M. , Mertins P. , Shishkin A. , Tabach Y. , Mikkelsen T.S. , Satija R. , Ruvkun G. et al. High-resolution mapping reveals a conserved, widespread, dynamic mRNA methylation program in yeast meiosis . Cell . 2013 ; 155 : 1409 – 1421 . Google Scholar CrossRef Search ADS PubMed 6. Wang X. , Lu Z. , Gomez A. , Hon G.C. , Yue Y. , Han D. , Fu Y. , Parisien M. , Dai Q. , Jia G. et al. N6-methyladenosine-dependent regulation of messenger RNA stability . Nature . 2014 ; 505 : 117 – 120 . Google Scholar CrossRef Search ADS PubMed 7. Meyer K.D. , Jaffrey S.R. The dynamic epitranscriptome: N6-methyladenosine and gene expression control . Nat. Rev. Mol. Cell Biol. 2014 ; 15 : 313 – 326 . Google Scholar CrossRef Search ADS PubMed 8. Zhang S. , Zhao B.S. , Zhou A. , Lin K. , Zheng S. , Lu Z. , Chen Y. , Sulman E.P. , Xie K. , Bogler O. et al. m6A demethylase ALKBH5 maintains tumorigenicity of glioblastoma stem-like cells by sustaining FOXM1 expression and cell proliferation program . Cancer Cell . 2017 ; 31 : 591 – 606 . Google Scholar CrossRef Search ADS PubMed 9. Li L. , Zang L. , Zhang F. , Chen J. , Shen H. , Shu L. , Liang F. , Feng C. , Chen D. , Tao H. et al. Fat mass and obesity-associated (FTO) protein regulates adult neurogenesis . Hum. Mol. Genet. 2017 ; 31 : 591 – 606 . 10. Dominissini D. , Moshitch-Moshkovitz S. , Schwartz S. , Salmon-Divon M. , Ungar L. , Osenberg S. , Cesarkas K. , Jacob-Hirsch J. , Amariglio N. , Kupiec M. et al. 
Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq . Nature . 2012 ; 485 : 201 – 206 . Google Scholar CrossRef Search ADS PubMed 11. Liu J. , Yue Y. , Han D. , Wang X. , Fu Y. , Zhang L. , Jia G. , Yu M. , Lu Z. , Deng X. et al. A METTL3-METTL14 complex mediates mammalian nuclear RNA N6-adenosine methylation . Nat. Chem. Biol. 2014 ; 10 : 93 – 95 . Google Scholar CrossRef Search ADS PubMed 12. Ping X.L. , Sun B.F. , Wang L. , Xiao W. , Yang X. , Wang W.J. , Adhikari S. , Shi Y. , Lv Y. , Chen Y.S. et al. Mammalian WTAP is a regulatory subunit of the RNA N6-methyladenosine methyltransferase . Cell Res. 2014 ; 24 : 177 – 189 . Google Scholar CrossRef Search ADS PubMed 13. Jia G. , Fu Y. , Zhao X. , Dai Q. , Zheng G. , Yang Y. , Yi C. , Lindahl T. , Pan T. , Yang Y.G. et al. N6-methyladenosine in nuclear RNA is a major substrate of the obesity-associated FTO . Nat. Chem. Biol. 2011 ; 7 : 885 – 887 . Google Scholar CrossRef Search ADS PubMed 14. Zheng G. , Dahl J.A. , Niu Y. , Fedorcsak P. , Huang C.M. , Li C.J. , Vagbo C.B. , Shi Y. , Wang W.L. , Song S.H. et al. ALKBH5 is a mammalian RNA demethylase that impacts RNA metabolism and mouse fertility . Mol. Cell . 2013 ; 49 : 18 – 29 . Google Scholar CrossRef Search ADS PubMed 15. Bokar J.A. , Shambaugh M.E. , Polayes D. , Matera A.G. , Rottman F.M. Purification and cDNA cloning of the AdoMet-binding subunit of the human mRNA (N6-adenosine)-methyltransferase . RNA . 1997 ; 3 : 1233 – 1247 . Google Scholar PubMed 16. Hay R.T. SUMO: a history of modification . Mol. Cell . 2005 ; 18 : 1 – 12 . Google Scholar CrossRef Search ADS PubMed 17. Gill G. SUMO and ubiquitin in the nucleus: different functions, similar mechanisms . Genes Dev. 2004 ; 18 : 2046 – 2059 . Google Scholar CrossRef Search ADS PubMed 18. Geiss-Friedlander R. , Melchior F. Concepts in sumoylation: a decade on . Nat. Rev. Mol. Cell Biol. 2007 ; 8 : 947 – 956 . Google Scholar CrossRef Search ADS PubMed 19. Chen C. , Zhu C. , Huang J. 
, Zhao X. , Deng R. , Zhang H. , Dou J. , Chen Q. , Xu M. , Yuan H. et al. SUMOylation of TARBP2 regulates miRNA/siRNA efficiency . Nat. Commun. 2015 ; 6 : 8899 . Google Scholar CrossRef Search ADS PubMed 20. Zhu C. , Chen C. , Huang J. , Zhang H. , Zhao X. , Deng R. , Dou J. , Jin H. , Chen R. , Xu M. et al. SUMOylation at K707 of DGCR8 controls direct function of primary microRNA . Nucleic Acids Res. 2015 ; 43 : 7945 – 7960 . Google Scholar CrossRef Search ADS PubMed 21. Zhu C. , Chen C. , Chen R. , Deng R. , Zhao X. , Zhang H. , Duo J. , Chen Q. , Jin H. , Wang Y. et al. K259-SUMOylation of DGCR8 promoted by p14ARF exerts a tumor-suppressive function . J.Mol. Cell Biol. 2016 ; 8 : 456 – 458 . Google Scholar CrossRef Search ADS 22. Steffan J.S. , Agrawal N. , Pallos J. , Rockabrand E. , Trotman L.C. , Slepko N. , Illes K. , Lukacsovich T. , Zhu Y.Z. , Cattaneo E. et al. SUMO modification of Huntingtin and Huntington's disease pathology . Science . 2004 ; 304 : 100 – 104 . Google Scholar CrossRef Search ADS PubMed 23. Eckermann K. SUMO and Parkinson's disease . Neuromol. Med. 2013 ; 15 : 737 – 759 . Google Scholar CrossRef Search ADS 24. McMillan L.E. , Brown J.T. , Henley J.M. , Cimarosti H. Profiles of SUMO and ubiquitin conjugation in an Alzheimer's disease model . Neurosci. Lett. 2011 ; 502 : 201 – 208 . Google Scholar CrossRef Search ADS PubMed 25. Henley J.M. , Craig T.J. , Wilkinson K.A. Neuronal SUMOylation: mechanisms, physiology, and roles in neuronal dysfunction . Physiol. Rev. 2014 ; 94 : 1249 – 1285 . Google Scholar CrossRef Search ADS PubMed 26. Huang J. , Yan J. , Zhang J. , Zhu S. , Wang Y. , Shi T. , Zhu C. , Chen C. , Liu X. , Cheng J. et al. SUMO1 modification of PTEN regulates tumorigenesis by controlling its association with the plasma membrane . Nat. Commun. 2012 ; 3 : 911 . Google Scholar CrossRef Search ADS PubMed 27. Qu Y. , Chen Q. , Lai X. , Zhu C. , Chen C. , Zhao X. , Deng R. , Xu M. , Yuan H. , Wang Y. et al. 
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

journal article

Open Access Collection

Antisense oligonucleotides correct the familial dysautonomia splicing defect in IKBKAP transgenic mice

Sinha, Rahul;Kim, Young Jin;Nomakuchi, Tomoki;Sahashi, Kentaro;Hua, Yimin;Rigo, Frank;Bennett, C Frank;Krainer, Adrian R

2018 Nucleic Acids Research

doi: 10.1093/nar/gky249 pmid: 29672717

Abstract Familial dysautonomia (FD) is a rare inherited neurodegenerative disorder caused by a point mutation in the IKBKAP gene that results in defective splicing of its pre-mRNA. The mutation weakens the 5′ splice site of exon 20, causing this exon to be skipped, thereby introducing a premature termination codon. Though detailed FD pathogenesis mechanisms are not yet clear, correcting the splicing defect in the relevant tissue(s), thus restoring normal expression levels of the full-length IKAP protein, could be therapeutic. Splice-switching antisense oligonucleotides (ASOs) can be effective targeted therapeutics for neurodegenerative diseases, such as nusinersen (Spinraza), an approved drug for spinal muscular atrophy. Using a two-step screen with ASOs targeting IKBKAP exon 20 or the adjoining intronic regions, we identified a lead ASO that fully restored exon 20 splicing in FD patient fibroblasts. We also characterized the corresponding cis-acting regulatory sequences that control exon 20 splicing. When administered into a transgenic FD mouse model, the lead ASO promoted expression of full-length human IKBKAP mRNA and IKAP protein levels in several tissues tested, including the central nervous system. These findings provide insights into the mechanisms of IKBKAP exon 20 recognition, and pre-clinical proof of concept for an ASO-based targeted therapy for FD. INTRODUCTION Familial dysautonomia (FD), a rare genetic disorder found almost exclusively in the Ashkenazi Jewish population (1,2), is an autosomal recessive condition caused by a single point mutation in intron 20 (IVS20+6T→C) of the IKBKAP gene (3). FD, also known as Riley–Day syndrome and hereditary sensory autonomic neuropathy type-III (HSAN-III), is characterized by poor development and progressive degeneration of sensory and autonomic neurons (4). 
Notable symptoms include anhidrosis, decreased taste, depressed deep-tendon reflexes, postural hypotension, loss of pain and temperature perception, alacrima, debilitating gastroesophageal reflux, and scoliosis (4). The extent and severity of the symptoms vary among patients, but even with advanced management, the disease leads to premature death due to respiratory and cardiac arrest, with only half of the patients surviving to 40 years of age (4). The IKBKAP gene consists of 37 exons spanning a region of approximately 68 kb that encodes the 150-kDa IKAP protein. The FD mutation weakens the 5′ splice site of intron 20, leading to the skipping of exon 20 during pre-mRNA splicing (3). Skipping of exon 20 causes a frameshift and introduces a premature termination codon (PTC) in exon 21 (5). The presence of a PTC in the skipped mRNA also makes it potentially susceptible to degradation via the nonsense-mediated mRNA decay (NMD) pathway (6). The mutation consequently results in reduced levels of full-length IKAP. IKAP plays an important role in the development and survival of peripheral neurons. In particular, IKAP depletion impairs the survival of autonomic and sensory neurons, and innervation of their target tissues (7). The precise molecular role of IKAP in the context of neurodevelopment has been elusive. IKAP and Elp1, the yeast (Saccharomyces cerevisiae), plant (Arabidopsis thaliana), and worm (Caenorhabditis elegans) ortholog of human IKAP, play an important role in tRNA modification, among their several proposed cellular functions (8–13). The tRNA modification defect caused by the mutation in FD patient cells can be rescued by restoring the correct IKBKAP mRNA splicing pattern (13). The tRNA modification defect can cause protein aggregation and protein misfolding that can lead to neuronal toxicity (14). However, how the loss of functional IKAP leads to neuronal death in FD patients remains to be elucidated. 
Though it is clear that FD involves a splicing defect, it is not fully understood why the disease manifests in a tissue-specific manner, considering that disruption of IKBKAP splicing is ubiquitous across different tissues. For example, while the peripheral afferent neurons are severely affected, the somatic motor nervous system is largely intact (15). Homozygous mutant cells derived from FD patients, as well as various patient tissues, express both included and skipped versions of the IKBKAP mRNA (8). The relative levels of IKBKAP splice variants differ among different tissues in patients, with the levels of the full-length mRNA being the lowest in central and peripheral nervous systems (8,16). This observation suggests that the phenotypic restriction to neuronal tissues may result from tissue-specific quantitative differences in splicing and/or NMD (8,16). Unfortunately, to date, prevention by prenatal screening remains the only weapon against FD. Most attempts to develop FD therapeutics have focused on correcting the splicing defect by increasing the levels of exon 20 inclusion through treatment with various small molecules, with varying degrees of success. Notable among these small molecules are rectifier of aberrant mRNA splicing (RECTAS) (13), epigallocatechin gallate (17), protease inhibitors (18), phosphatidylserine (5,19,20), tocotrienols (21), and kinetin (22,23). In clinical trials, tocotrienols did not demonstrate significant clinical efficacy (21), whereas kinetin administered orally moderately improved IKBKAP splicing in the white blood cells of the treated patients (22). Though kinetin is currently in a phase-2 clinical trial (NCT02274051), an effective treatment for FD remains unavailable. Synthetic antisense oligonucleotides (ASOs) provide an avenue towards therapy for various genetic disorders, including those caused by splicing mutations (24). 
We previously developed an ASO drug for treatment of spinal muscular atrophy (SMA) by correcting SMN2 exon 7 splicing (25–28); this ASO, dubbed nusinersen (Spinraza™), is currently the only approved treatment for SMA. Typical splice-switching ASOs carry either a 2′-O-methoxyethylribose-phosphorothioate or 2′-O-methoxyethylribose-phosphate backbone (referred to as PS-MOE-ASOs and PO-MOE ASOs, respectively) or a neutral phosphorodiamidate morpholino oligomer (PMO) backbone, instead of the ribose-phosphate backbone present in natural RNA. These modifications confer not only very high resistance to both exo- and endonucleases, but also higher affinity for the target sequences (24). Moreover, these modified ASOs prevent cleavage of the target RNA in the resulting heteroduplex by RNase H, and the duplexes are not recognized by the RNA interference machinery (24). Here, we applied a two-step ASO-screening approach to identify inhibitory cis-acting elements in exon 20 and flanking intron sequences of IKBKAP mRNA using MOE-ASOs. We used a tiling screen with overlapping MOE-ASOs to scan the entire sequence of exon 20 and the flanking upstream and downstream proximal intronic regions in the IKBKAP pre-mRNA. We identified lead MOE-ASOs that efficiently restored correct splicing of mutant IKBKAP in patient-derived fibroblasts. The lead MOE-ASO increased exon 20 inclusion and IKAP levels in transgenic FD mouse tissues. The systematic ASO walk with an IKBKAP minigene uncovered splicing silencer elements in the IKBKAP pre-mRNA, and provided information about the mechanisms of recognition of exon 20 by the spliceosome. MATERIALS AND METHODS Oligonucleotide synthesis MOE-ASOs were synthesized using an Applied Biosystems 380B automated DNA synthesizer, as described (29). We dissolved the ASOs in water and diluted them in saline before use. A list of oligonucleotide sequences is provided in Supplementary Table S1. 
List of ASOs (Name - Sequence: Purpose)
ASO 7–26 (PO-MOE) - GTCGCAAACAGTACAATGGC: Inclusion of IKBKAP exon 20
ASO 7–26S (PS-MOE) - GTCGCAAACAGTACAATGGC: Inclusion of IKBKAP exon 20
Control ASO (PO-MOE & PS-MOE) - TTAGTTTAATCACGCTCG: Non-targeting ASO for in vivo injection

Plasmids We amplified IKBKAP genomic fragments spanning exons 19–21 and 19–22 using specific primers with restriction sites, and human genomic DNA (Promega) as a template (Supplementary Table S2). We subcloned these fragments into pcDNA3.1 (Invitrogen) to generate the minigenes wt19–21 and wt19–22. We then introduced the major FD mutation (IVS20+6T→C) by site-directed mutagenesis to create the minigenes mt19–21 and mt19–22.

Cell culture and transfections We cultured HEK-293 cells in Dulbecco's modified Eagle's medium (Invitrogen) supplemented with 10% (v/v) fetal bovine serum (Invitrogen). We grew the normal fibroblast cell line IMR-90 (Sigma) and patient fibroblast cell line GM04899 (Coriell Cell Repository) in minimal essential medium (Invitrogen) supplemented with non-essential amino acids (Invitrogen) and 20% (v/v) fetal bovine serum. We used electroporation (Gene Pulser II apparatus, Bio-Rad) to co-transfect 3 μg of the minigenes and 7 pmol of the PO-MOE ASOs into 7 × 10⁵ HEK-293 cells resuspended in 70 μl of Opti-MEM (Invitrogen) for both the coarse walk and microwalk, and plated the cells in six-well plates, as described (30). We used 12 μl of Lipofectamine 2000 transfection reagent (Invitrogen) to transfect different amounts of PO-MOE ASOs, ranging from 0.01 to 1 nmol, into 40–50% confluent patient fibroblasts grown in 10-cm dishes, according to the manufacturer's recommendations. For cycloheximide treatment of patient fibroblasts, cycloheximide was dissolved in DMSO and added directly to the culture medium to achieve a final concentration of 100 μg/ml. For kinetin treatment of patient fibroblasts, kinetin solution was added to the culture medium. 
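As an illustration of how the listed ASO sequences relate to their target, the following Python sketch derives the pre-mRNA sequence that would pair with ASO 7–26. It assumes that the ASO is written 5′→3′ in DNA letters and is fully Watson-Crick complementary to its target; the function name `target_rna` is hypothetical and not part of the original methods.

```python
# Sketch only: assumes the ASO is listed 5'->3' in DNA letters and is
# fully complementary to its pre-mRNA target (an assumption, not a
# statement taken from the methods).
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}


def target_rna(aso: str) -> str:
    """Return the RNA sequence (5'->3') that base-pairs with the ASO."""
    return "".join(COMPLEMENT[base] for base in reversed(aso.upper()))


ASO_7_26 = "GTCGCAAACAGTACAATGGC"   # ASO 7-26 / 7-26S sequence from the list
CONTROL_ASO = "TTAGTTTAATCACGCTCG"  # non-targeting control from the list

print(target_rna(ASO_7_26))
```

Because ASO 7–26 and ASO 7–26S share one nucleotide sequence, the same predicted target applies to both; only the backbone chemistry differs.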
RT-PCR cDNA was synthesized from total RNA extracted from HEK-293 cells, patient fibroblasts (GM04899), and transgenic mouse tissues, as described (29). The cDNA from human cells or mouse tissues was amplified using either vector-specific (pcDNA3.1) or human-specific primers (Supplementary Table S3), respectively, as described (29). To calculate exon 20 inclusion levels, the 32P-labeled radioactive reverse transcription PCR (RT-PCR) amplicons were separated by native PAGE, followed by phosphorimage analysis on a FUJIFILM FLA-5100 instrument (Fuji Medical Systems USA, Inc.). We quantified the band intensities using Multi Gauge software Version 2.3 (FUJIFILM), and normalized the values for the G+C content according to the DNA sequence.

Western blotting Cells were harvested and lysed using RIPA buffer (150 mM NaCl, 50 mM Tris (pH 8.0), 1% NP-40, 0.1% SDS, 0.5% sodium deoxycholate) and protease inhibitor cocktail (Roche). Tissues were ground in a liquid-nitrogen-filled mortar, and protein was extracted using RIPA buffer and protease inhibitor cocktail (Roche). The cell and tissue lysates were cleared by centrifugation, and the protein concentration was measured by Bradford assay (Bio-Rad). Twenty micrograms of protein was resolved by 8% SDS-PAGE, transferred to a nitrocellulose membrane, and probed with mouse monoclonal anti-IKAP (1:1000; Abnova) and rabbit polyclonal anti–beta-tubulin (1:3000; Genescript) antibodies. The membranes were incubated with infrared-dye conjugated secondary antibodies (1:10 000; LI-COR Biosciences), and protein bands were visualized by quantitative fluorescence using Odyssey software (LI-COR Biosciences). Molecular weight markers confirmed the sizes of the bands.

ASO delivery in mice Intracerebroventricular (ICV) injection in neonate mice and ICV infusion in adult mice were carried out as described (29). For subcutaneous (SC) administration, we injected ASO in saline under the dorsal skin, using a 10-μl micro syringe (Hamilton) and a 33-gauge needle. 
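The exon 20 inclusion calculation described under RT-PCR can be sketched as follows. This is a minimal illustration, assuming that each band intensity is divided by the number of G and C residues in its amplicon before the ratio is taken; the function names, the placeholder sequences, and the intensities are hypothetical and are not taken from the study.

```python
def gc_count(seq: str) -> int:
    """Number of G and C residues in an amplicon sequence."""
    seq = seq.upper()
    return seq.count("G") + seq.count("C")


def percent_inclusion(included_intensity: float, skipped_intensity: float,
                      included_seq: str, skipped_seq: str) -> float:
    """Percent exon inclusion from two band intensities,
    each normalized by the G+C content of its amplicon."""
    inc = included_intensity / gc_count(included_seq)
    skp = skipped_intensity / gc_count(skipped_seq)
    return 100.0 * inc / (inc + skp)


# Placeholder amplicons (NOT the real RT-PCR products)
included = "GCGCATATGCGC"  # 8 G+C residues
skipped = "GCATGC"         # 4 G+C residues

print(percent_inclusion(800.0, 400.0, included, skipped))
```

In this toy case the raw intensities differ twofold, but the longer amplicon also carries twice as many G+C residues, so the normalized inclusion is 50%. This is the kind of length bias the normalization is meant to remove.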
Statistical analyses All statistical analyses were run using Graphpad Prism 5.0 software. The statistical comparisons of IKBKAP splicing and IKAP protein levels were performed by Student's t-test or one-way analysis of variance (ANOVA) with Tukey's post-test. RESULTS MOE-ASO walk reveals several splicing enhancer and silencer regions in IKBKAP The genomic region that spans the IKBKAP gene is ∼68 kb. For ease of manipulation, we created IKBKAP minigenes by cloning genomic fragments comprising either exon 19 to exon 21 (wt19–21) or exon 19 to exon 22 (wt19–22). We introduced the major mutation found in FD (IVS20+6T→C) into the wild-type minigenes to obtain the corresponding mutant minigenes, mt19–21 and mt19–22. We also introduced an in-frame ATG as the first codon, within a Kozak sequence, at the 5′ end, downstream of the cytomegalovirus (CMV) promoter, as well as a stop codon at the 3′ end, upstream of a poly(A) signal from the pcDNA3.1 vector (Figure 2). The minigenes were transfected into HEK-293 cells and the splicing patterns of the transiently expressed RNAs were analyzed by 32P-labeled radioactive reverse transcription PCR (RT-PCR) after 72 hr. We observed that the mutant versions of the minigenes (mt19–21 and mt19–22), but not their wild-type counterparts, consistently showed predominant skipping of exon 20, thus recapitulating the splicing defect observed in FD patients (Figure 3). We designed a total of 49 overlapping 15-mer PO-MOE ASOs, which together spanned the entire 74-nucleotide (nt) IKBKAP exon 20, as well as 100-nt proximal intronic regions upstream and downstream of exon 20 (Figure 1A & Supplemental Table S1). Each pair of consecutive ASOs had a 10-nt overlapping region, such that the intronic and exonic regions were screened at 5-nt resolution. An ASO with the same chemistry but unrelated sequence was used as a negative control (See Materials and Methods). Figure 1. 
Schematic representation of the effects of all tested ASOs on IKBKAP exon 20 inclusion. (A) A 274-nucleotide region including and flanking exon 20 was tiled by overlapping 15-mer ASOs at 5-nucleotide intervals. Each underlined nucleotide in the pre-mRNA marks the start of the sequence targeted by an ASO. Exon 20 sequence is shown in upper case and the flanking intronic sequences in lower case. Each horizontal line represents an ASO and is color-coded based on its effect on exon 20 inclusion (blue: neutral; green: positive; and red: negative). Thicker bars indicate stronger effects. The ISS-20, ISS-40, and ISE-20 regions are indicated below the corresponding intronic sequences. The 3′ and 5′ splice sites are labeled in blue and the FD point mutation is marked in red. (B) The high-resolution microwalk in the ISS-40 region (underlined) with 20-mer ASOs is shown with the same scheme as in (A). 
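The tiling geometry of the coarse walk in panel (A) can be sketched in Python. This is a minimal illustration, assuming a simple fixed-step walk along the target; the function `tile_asos` and the placeholder sequence are hypothetical, and the actual ASO positions are those listed in Supplemental Table S1.

```python
def tile_asos(target: str, length: int = 15, step: int = 5):
    """Return (start, end, window) tuples for target windows,
    using 1-based inclusive coordinates and walking 5'->3'.
    Each ASO would be complementary to its window."""
    windows = []
    for start in range(0, len(target) - length + 1, step):
        windows.append((start + 1, start + length,
                        target[start:start + length]))
    return windows


# Placeholder 40-nt sequence, NOT the IKBKAP sequence
placeholder = "ACGUACGUAC" * 4

for start, end, window in tile_asos(placeholder):
    print(f"{start}-{end}: {window}")
```

With 15-nt windows and a 5-nt step, consecutive windows share 10 nt, and six consecutive windows cover a 40-nt stretch.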
To test whether some of these ASOs could promote inclusion of exon 20 in the context of the major FD mutation in cells, we co-transfected each ASO individually with the mt19–21 minigene into HEK-293 cells by electroporation, and later assayed the splicing pattern of expressed RNAs by RT-PCR. The lanes with wt19–21 and mt19–21 minigenes alone were used as points of reference for exon 20 inclusion levels, whereas the lane corresponding to mt19–21 with the control ASO served as a control for non-specific effects of PO-MOE ASOs on splicing (Figure 3A–C). Six consecutive ASOs, which target a 40-nt intronic region immediately downstream of the 5′ splice site of exon 20, markedly increased inclusion of exon 20, suggesting the presence of multiple silencer elements or inhibitory secondary structure within this region, which we termed ISS-40 (Figures 1A and 3C). Consistent with our observation, Ohe et al. recently identified an inhibitory element in this region (31). The enhancement of IKBKAP exon 20 splicing was reduced as the ASOs targeted regions farther away from the 5′ splice site. Three additional ASOs, which target a 20-nt region in intron 19 (ISS-20), also enhanced exon 20 inclusion, by ∼2-fold (Figures 1A and 3A). In contrast, most ASOs targeting exon 20 resulted in nearly complete exon skipping, suggesting the presence of multiple exonic splicing enhancer (ESE) elements (Figures 1A and 3B). The skipping caused by the ASOs targeting the extreme 5′ or 3′ end of the exon is likely due in part to the fact that they occlude the 3′ and 5′ splice site, respectively. Besides these exonic ASOs, several other intronic ASOs also caused increased skipping, most likely because they targeted important cis-acting splicing elements. This was the case for ASOs complementary to the polypyrimidine-tract (Figure 3A), or to the 5′ splice site of intron 20 (Figure 3C). 
We also identified at least two ASOs that apparently target an intronic splicing enhancer (ISE-20) and decreased the levels of the full-length mRNA isoform by half, compared to the untreated control (Figure 3C).

ASO-walk with an NMD-responsive minigene As mentioned above, skipping of exon 20 causes a frameshift that introduces a PTC in exon 21, thereby making the mRNA potentially susceptible to degradation by the NMD surveillance pathway (7). In our initial ASO walk, we utilized wt and mt19–21 minigenes to cleanly assess the effects of ASOs on exon 20 splicing, as the exon-skipped isoform of mt19–21 is not a potential NMD target. In contrast, the exon-skipped isoform of minigene mt19–22 mRNA is a potential NMD target. If the exon 20-skipped isoform is efficiently targeted for NMD, the exon 20 inclusion rate after ASO treatment would be substantially different between mt19–21 and mt19–22 mRNA. To test this possibility, we carried out a similar ASO walk, but this time utilizing wt and mt19–22 minigenes (Figure 3D–F). The individual ASOs affected the exon 20 splicing pattern of the mt19–22 minigene similarly to that of the mt19–21 minigene (Figure 3G). This observation suggests that the exon-skipped isoform of mt19–22 is not an efficient target for NMD. To confirm this finding, we corrected the frameshift that occurs due to skipping of exon 20, by deleting a single nucleotide in exon 21 of the mt19–22 minigene, to make the mt19–22FC minigene (Figure 2B). The restored reading frame abolishes the PTC, which is expected to make the skipped mRNA isoform more stable. We transfected the different minigenes into HEK-293 cells and analyzed the expressed RNA by RT-PCR. Expression of the exon-skipped mRNA from the mt19–22FC minigene was 1.3-fold higher than that from the mt19–22 minigene, suggesting that the mt19–22 exon 20-skipped mRNA isoform is only weakly susceptible to NMD (Figure 4A). 
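The reading-frame logic behind the mt19–22FC design can be checked with simple arithmetic. The sketch below uses the 74-nt length of exon 20 stated in the Results; the function name `frame_offset` is hypothetical.

```python
EXON20_LENGTH = 74  # nt, as stated for IKBKAP exon 20 in the text


def frame_offset(nt_removed: int) -> int:
    """Remainder mod 3 of the nucleotides removed from the mRNA.
    0 means the downstream reading frame is preserved."""
    return nt_removed % 3


# Skipping exon 20 alone removes 74 nt: 74 % 3 == 2, so the frame shifts.
print(frame_offset(EXON20_LENGTH))      # 2

# Skipping exon 20 plus a 1-nt deletion in exon 21 removes 75 nt:
# 75 % 3 == 0, so the downstream frame is restored.
print(frame_offset(EXON20_LENGTH + 1))  # 0
```

This is why deleting a single nucleotide, rather than two, is enough to suppress the frameshift in the exon 20-skipped isoform.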
It was recently reported that the NMD efficiency of a PTC-containing gene expressed from a plasmid can be variable, depending on the transfection method and cell type (32). To investigate the regulation of exon 20-skipped IKBKAP isoform mRNA levels by NMD in a more natural context, we used the patient skin fibroblast line GM04899 (Coriell Cell Repository), which was isolated from an individual homozygous for the major FD mutation. The levels of the exon-skipped mRNA in GM04899 were measured with or without NMD inhibition by cycloheximide. The exon 20-skipped IKBKAP mRNA levels in GM04899 cells increased ∼1.4-fold in the presence of cycloheximide, similar to the expression change seen between mt19–22 and mt19–22FC (Figure 4B).

Figure 2. Schematic representation of the minigene constructs. (A) Both the wild-type (WT) and mutant (MT) versions of the 19–21 series of minigenes are shown. The major FD mutation (IVS20+6T→C) was introduced into the MT minigene by site-directed mutagenesis and is shown by a red arrow. The lengths of each individual exon and intron are depicted at the top and bottom, respectively. (B) Same as in (A), except that this panel shows the 19–22 series of minigene constructs. In the case of the mt19–22FC minigene, the red circle shows the T nucleotide deleted from exon 21, which occurs at position +4 in the natural context. The deletion of a T nucleotide suppresses the frameshift that is caused by skipping of exon 20 in the skipped splice variant. In (A) and (B), the broken arrows represent the locations of the sequences complementary to the minigene-specific primers, which were used for 32P-labeled radioactive reverse transcription PCR (RT-PCR) (Supplemental Table S3a).

Figure 3. ASO screening using 19–21 minigenes. HEK293 cells were co-transfected with the mt19–21 (A–C) or mt19–22 (D–F) minigene, along with individual 15-mer PO-MOE-ASOs by electroporation; two days later, the extent of exon 20 inclusion of the minigene reporter was quantified via RT-PCR, as indicated below each lane. The underlined % inclusion values indicate where an ASO is suspected to target a splicing-regulatory element. (A) Coarse ASO walk for intron 19 using 19–21 minigenes. As a reference, the first and the second lanes show exon 20 % inclusion of the wt19–21 and mt19–21 minigenes, respectively, without ASO transfection. The third lane shows exon 20 % inclusion when the mt19–21 minigene was co-transfected with a control ASO (CTRL ASO) of unrelated sequence. The remaining lanes show the splicing patterns of mt19–21 co-transfected with ASOs targeting intron 19. The ASOs targeting the ISS-20 region (Figure 1) are labeled in grey. (B) Coarse ASO walk for exon 20 using 19–21 minigenes. The first three lanes are the same as in (A). The remaining lanes show the splicing patterns of mt19–21 co-transfected with ASOs targeting exon 20. 
(C) Coarse ASO walk for intron 20 using 19–21 minigenes. The first three lanes are the same as in (A). In the remaining lanes, ASOs targeting intron 20 were co-transfected with mt19–21. The ASOs targeting the ISS-40 and the ISE-20 regions (Figure 1) are labeled in grey and black, respectively. (D) Coarse ASO walk for intron 19 using 19–22 minigenes. Same as in (A), except that wt19–22 and mt19–22 were used. (E) Coarse ASO walk for exon 20 using 19–22 minigenes. Same as in (B), except that wt19–22 and mt19–22 were used. (F) Coarse ASO walk for intron 20 using 19–22 minigenes. Same as in (C), except that wt19–22 and mt19–22 were used. (G) The exon 20 % inclusion values of the wt and mt19–21 minigenes are shown in blue bars; the corresponding values of the wt and mt19–22 minigenes are shown in orange bars.

Figure 4. Assessing the regulation of exon-20-skipped IKBKAP mRNA by NMD. (A) Empty vector (V), wt19–22, mt19–22, or mt19–22FC minigenes were transfected into HEK293 cells, and the levels of exon-20-skipped IKBKAP mRNA (Δ20) were compared by radioactive RT-PCR. GAPDH mRNA was used as an internal reference. (B) The levels of exon-20-skipped IKBKAP mRNA (Δ20) were compared between patient-derived GM04899 fibroblasts treated with DMSO or 100 μg/ml cycloheximide for 1 h (n = 3 independent treatments, **P< 0.01 versus DMSO, Student's t-test). 
ACTB mRNA was used as an internal reference.

High-resolution micro-walk in the ISS-40 region Because the ASOs targeting ISS-40 had the strongest stimulatory effects on inclusion of exon 20 in the context of the major FD mutation, we focused on this region to search for an optimal ASO. We designed 10 new 20-mer overlapping PO-MOE ASOs, complementary to the first 30-nucleotide stretch of the ISS-40 at 1-nt resolution, starting from the +6 position in intron 20 (Figure 1B & Supplemental Table S1). We co-transfected these ASOs with the mt19–21 minigene into HEK-293 cells, followed by RT-PCR to analyze the transiently expressed mRNAs. This microwalk led to the discovery of ASO 7–26, which, strikingly, almost completely restored exon 20 inclusion (up to 96%) with the mutant minigene (Figure 5).

Figure 5. High-resolution microwalk for the ISS-40 region using 19–21 minigenes. 
(A) HEK293 cells were co-transfected with the mt19–21 minigene and individual 20-mer PO-MOE-ASOs tiling the first 30 nucleotides of the ISS-40 region. The exon-20 inclusion pattern was analyzed by RT-PCR. The first two lanes with wt19–21 and mt19–21 transfected alone serve as references. Co-transfection of mt19–21 with an unrelated ASO in the third lane serves as a control (CTRL ASO). Co-transfections with the 15-mer ASOs (labeled in grey) that had positive effects in the coarse ASO walk (Figure 3C) were included for comparison. The exon 20 % inclusion values are shown below each lane. The top candidate ASO (7–26) is marked in bold letters. (B) Exon 20 % inclusion values of wt19–21 (WT), mt19–21 alone (NT), and mt19–21 co-transfected with an unrelated ASO (CTRL ASO) are shown in grey bars; the values of the mt19–21 minigene co-transfected with 20-mer and 15-mer ASOs are shown in empty bars and black bars, respectively.

Effect of ASO 7–26 in FD-patient-derived fibroblasts To investigate the effect of the lead ASO 7–26 on the endogenous mutant IKBKAP mRNA, we transfected GM04899 cells. We used Lipofectamine 2000 to transfect increasing amounts of the ASO, ranging from 0 to 100 nM (29). Even though the basal level of exon 20 inclusion was higher in FD fibroblasts (∼64%) than in the above experiments with mutant minigenes, we observed that ASO 7–26 almost completely suppressed the splicing defect, and also resulted in a statistically significant increase in IKAP protein levels, when assayed 3 days after transfection. As previously described, treatment with the plant hormone kinetin for 3 days was equally effective in increasing full-length mRNA levels; however, in this case we did not observe a corresponding increase in IKAP protein levels in patient fibroblasts (Figure 6). Kinetin treatment for longer than 72 h may be required to see the effect at the protein level.

Figure 6. Effect of ASO 7–26 in patient fibroblasts. 
(A) Exon-20 splicing patterns of endogenous IKBKAP mRNA were measured by RT-PCR in patient-derived GM04899 skin fibroblasts treated with 100 μM kinetin (right panel) or transfected with increasing doses of ASO 7–26 (left panel); RNA was extracted 3 days after the treatments. The IKBKAP mRNA level in the normal fibroblast cell line IMR-90 was used as reference. (B) Western blot showing IKAP levels after treatment with ASO 7–26 (left panel) or kinetin (right panel). (A) and (B), multiple lanes for each condition represent independent experiments. (C) Quantification of exon 20 % inclusion, based on the data shown in (A) and additional experiments (n = 3 independent transfections, ***P < 0.001 versus NT, one-way ANOVA). (D) Quantification of (B) showing IKAP levels normalized to β-actin. The IKAP levels in cells treated with ASO 7–26 were normalized to that of control-ASO-treated cells (n = 3 independent transfections, ***P < 0.001 versus CTRL ASO, Student's t-test); the IKAP levels of kinetin-treated cells were normalized to that of solvent-treated cells (n = 3 independent treatments, n.s. P > 0.05 versus solvent-treated, Student's t-test). In (C) and (D), NT = no-treatment, CTRL ASO = unrelated negative control ASO, and Kn = kinetin treatment. Error bars indicate standard deviation.
ASO 7–26S rescues IKBKAP exon 20 splicing in transgenic mice
The next logical step in pre-clinical development was to test whether ASO 7–26 can correct the IKBKAP splicing defect in an animal model. PS-MOE-ASOs, such as nusinersen, are used in animal models, as they are more stable and more efficiently internalized, compared to PO-MOE ASOs. Thus, we synthesized ASO 7–26S, which has the same nucleotide sequence as ASO 7–26, but has a PS backbone. To test ASO 7–26S, we obtained transgenic mice that carry the entire human IKBKAP gene with the major FD mutation, in addition to being homozygous wild type at the mouse Ikbkap locus (a generous gift from Drs James Pickel and Susan Slaugenhaupt). These transgenic mice do not show any overt disease phenotype, due to the presence of the wild-type mouse Ikbkap gene (33). However, the mRNA expressed from the mutant human IKBKAP transgene does show a pattern of skipping similar to that of FD patients (33), making this strain very useful for testing ASO target engagement in vivo.
Therefore, we tested ASO 7–26S in this transgenic mouse strain, and assessed its effects at the level of IKBKAP splicing by RT-PCR with human-specific primers, after ASO 7–26S intracerebroventricular (ICV) administration into the cerebrospinal fluid (CSF) using described procedures (29,34). There was a linear dose-response to ASO 7–26S in IKBKAP exon 20 inclusion in the brain and spinal cord, both in adult mice after a week of ICV infusion (Figure 7A and B), and in neonates after a single ICV injection (Figure 7C and D). We measured the effect of the ASO at the level of IKAP protein by Western blotting experiments using an anti-IKAP antibody specific for human IKAP. In previous studies, the truncated IKAP protein that is hypothetically translated from the exon 20-skipped IKBKAP mRNA could not be detected by Western blotting (5,23). Our results were consistent with that observation (Figure 7E). ICV injection of ASO 7–26S also resulted in a dose-dependent increase in IKAP levels (Figure 7F and G).
Figure 7. Effect of ASO 7–26S in the brain and spinal cord of transgenic mice. (A) Exon 20 splicing patterns of human IKBKAP mRNA in thoracic spinal cord of adult transgenic mice were analyzed via RT-PCR after a week of ICV infusion of ASO 7–26S with increasing doses. Multiple lanes for each condition represent independent experiments. (B) Quantification of exon 20 % inclusion in (A). (n = 3 independent ICV infusions, ***P < 0.001 versus saline, one-way ANOVA). (C) Human IKBKAP mRNA exon 20 splicing pattern was measured in mouse neonatal brain and spinal cord isolated at P8, after ICV injection of ASO 7–26S at P1. Each lane shows a representative RT-PCR. (D) Quantification of exon 20 % inclusion in (C). (n = 4 independent ICV injections, *P < 0.05, ***P < 0.001 versus control ASO (CTRL ASO), one-way ANOVA). (E) The specificity of anti-IKAP antibody for human IKAP protein was assessed by Western blotting using protein extracts from wild-type (WT) or transgenic (Tg) mouse brain. (F) Human IKAP protein level was measured in mouse neonatal brain and spinal cord isolated at P8, after ICV injection of ASO 7–26S at P1. (G) Quantification of (F) showing IKAP protein levels normalized to Tubulin. (n = 5–10 independent ICV injections, ***P < 0.001 versus control ASO, one-way ANOVA). In (B), (D) and (G), error bars = standard deviation.
Next, we compared the IKBKAP mRNA splicing patterns in various tissues after administering ASO 7–26S in neonate mice, either by ICV injection, or into peripheral tissues by subcutaneous (SC) injection (Figure 8A and B). In all cases, the ASO was injected at P1 and RNA was assayed at P8. We observed a clear compartmentalization of the ASO effect, depending upon the mode of delivery. As expected, ICV administration primarily resulted in increased full-length IKBKAP mRNA in brain and spinal cord, whereas SC administration affected expression mainly in liver, skeletal muscle, heart, and kidney. However, we also detected slight effects of ICV and SC injections in peripheral tissues and CNS, respectively, presumably reflecting CSF clearance and incomplete closure of the blood-brain barrier in neonates (35). To test whether the increase in IKBKAP exon 20 inclusion resulted in increased IKAP protein levels, we measured the protein in various tissues by Western blotting. SC administration of ASO 7–26S also increased IKAP protein in the liver, heart, and skeletal muscle, but not in kidney (Figure 8C and D).
Figure 8. Effect of ASO 7–26S in peripheral tissues of transgenic mice. (A) Splicing pattern of human IKBKAP mRNA exon 20 was analyzed via RT-PCR in various mouse neonate tissues (brain, total spinal cord, liver, heart, quadriceps, and kidney) at P8, after either ICV (20 μg) or subcutaneous (400 μg/g body weight) injection of ASO 7–26S at P1. Sal = Subcutaneous saline treated; SC = Subcutaneous injection; ICV = Intracerebroventricular injection. (B) Bar charts of (A) showing exon 20 % inclusion of human IKBKAP mRNA in different tissues following ICV or subcutaneous injections (n = 5 independent ICV or SC injections, ***P < 0.001 versus saline [Sal], one-way ANOVA). (C) Human IKAP protein levels in various mouse neonate tissues (brain, total spinal cord, liver, heart, quadriceps, and kidney), after SC injection of ASO 7–26S (400 μg/g body weight) were assessed by Western blotting. Representative immunoblots of IKAP and Tubulin are shown. (D) Quantification of (C) was performed by measuring IKAP protein levels normalized to Tubulin (n = 3 independent SC injections, n.s. P > 0.05, *P < 0.05, ***P < 0.001 versus saline treated [Sal] for each tissue, Student's t-test). In (B) and (D), error bars = standard deviation.
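As a hedged illustration of the microwalk design described in the Results (ten overlapping 20-mer ASOs, stepped 1 nt at a time from the +6 position of intron 20), the target coordinates can be enumerated with a few lines of Python. This is only a sketch, not the authors' procedure: the function name `tile_asos`, its parameters, and the inclusive start-end labelling (inferred from ASO 7–26 spanning +7 to +26) are assumptions made for illustration.

```python
def tile_asos(first_start=6, n_asos=10, length=20, step=1):
    """Return inclusive (start, end) intron positions for each tiled ASO.

    All defaults are illustrative assumptions taken from the text:
    10 ASOs, 20 nt each, 1-nt steps, first ASO starting at +6.
    """
    tiles = []
    for i in range(n_asos):
        start = first_start + i * step
        end = start + length - 1  # inclusive end position
        tiles.append((start, end))
    return tiles


tiles = tile_asos()
# Label each ASO by its start and end position, e.g. "7-26".
labels = [f"{start}-{end}" for start, end in tiles]
print(labels)
```

Under these assumptions the tiles run from 6-25 to 15-34 and include the lead ASO 7-26.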
DISCUSSION
Correction of the IKBKAP splicing defect is an attractive therapeutic strategy, as nearly all cases of FD are caused by the same intronic mutation that affects splicing of the IKBKAP pre-mRNA (7), and the disease phenotype is dependent on the levels of IKAP protein (36). We previously developed ASOs that increase the splicing of SMN2 exon 7 to relieve the disease phenotype of spinal muscular atrophy (SMA). The resulting drug, nusinersen (Spinraza™), was approved by the FDA, EMA, and other regulatory agencies to treat all SMA types. Nusinersen is a PS-MOE 18-mer ASO that is administered to patients by lumbar puncture, twice a month for two months, and every four months thereafter; it distributes widely within the CNS, where it has a very long half-life. Thus, our strategy to improve the splicing of IKBKAP using a similarly designed ASO may have therapeutic potential. The two-step ASO-screening is an effective method to identify lead ASOs that modulate splicing (29). We used a similar strategy to identify a lead ASO that increases splicing of IKBKAP exon 20. This screen identified several regions in intron 19 and intron 20 that promote exon inclusion. Our data suggest that the best region to target with an ASO is the region in intron 20 we termed ISS-40. In contrast, ASOs targeting various regions in exon 20 caused exon-skipping, suggesting the presence of various non-redundant ESE elements. The mechanism by which ASO 7–26 achieves almost complete IKBKAP exon 20 inclusion in transfected cells is unclear. We speculate that the ASO blocks a potent silencer element or structure, or perhaps more than one silencer element, in the region complementary to its sequence, which spans from +7 to +26 in intron 20 (29,37). Recently, Ohe et al. showed that the IVS20 +13 to +29 region, which overlaps with our ASO target site, contains a splicing silencer element that enhances IKBKAP exon 20 splicing when deleted (31).
Interestingly, this region also contains a binding site for RBM24, which enhances IKBKAP exon 20 inclusion in muscle cells and other tissues with high levels of RBM24 expression (31). This finding suggests that the tissue-specific IKBKAP exon 20 splicing defect caused by the major FD mutation may be affected by differential expression and binding of splicing regulators to intron 20. ASO 7–26 is a 20-mer PO-MOE ASO, which improved splicing more effectively than a 15-mer PO-MOE ASO targeting an overlapping region. In general, the 20-mer MOE-ASOs had a stronger positive effect on exon 20 splicing than the 15-mer ASOs targeting the same region, presumably because the longer ASOs were better able to block the activities of multiple silencer elements in the region, and/or because they hybridized more strongly to the target RNA (38). Kinetin's potential as a treatment option for FD is under current investigation in a phase-2 clinical trial. We tested how kinetin compares to our lead ASO in FD patient fibroblasts. Both the lead ASO and kinetin increased IKBKAP exon 20 inclusion after 3 days of treatment, but only ASO treatment led to increased IKAP protein. In a previous study by Hims et al., an increase in IKAP levels was observed in patient lymphoblast cell lines only after prolonged treatment with kinetin for 1–2 weeks (39). The absence of IKAP increase on day 3 of kinetin treatment in our hands is not likely due to global changes in transcription or translation, as expression of a control housekeeping gene did not change throughout the experiment. This observation suggests that kinetin may delay the expression of IKAP protein in a gene-specific manner, via an unknown mechanism. Further investigation may reveal an interesting aspect of IKAP-specific translational or post-translational regulatory mechanisms. The skipping of IKBKAP exon 20 leads to a frameshift that introduces a PTC in exon 21, and the resulting mRNA is a potential substrate for NMD (6,40).
However, although a small proportion of mis-spliced IKBKAP mRNA may be targeted to NMD (40,41), another study suggested that it may not be efficiently degraded by NMD (5). To test whether exon 20 skipping targets the IKBKAP mRNA to NMD, we compared the mRNA levels expressed from minigene mt19–22, which harbors a PTC, and minigene mt19–22FC, which does not (Figure 4). The difference in the mRNA levels between the minigenes was small. We also observed a relatively small increase after NMD inhibition in the endogenous exon 20-skipped IKBKAP mRNA isoform in patient-derived fibroblasts. These results suggest that the exon-20-skipped IKBKAP mRNA is not efficiently targeted by NMD. However, our NMD analysis of exon-20-skipped endogenous IKBKAP mRNA was performed only in patient-derived fibroblasts. Thus, further investigation is required to test whether NMD of the exon-20-skipped isoform is regulated in a tissue-specific manner. The ASOs that we identified in the in vitro screening significantly increased human IKBKAP exon 20 splicing in transgenic mice. Unlike in patient-derived fibroblasts, we did not achieve complete splicing restoration in mice. However, we demonstrated that a single ICV injection or infusion increased IKBKAP full-length mRNA and IKAP protein levels in the brain and spinal cord, and that subcutaneous injections increased exon 20 splicing and IKAP protein levels in peripheral tissues. The response to ASO treatment was tissue-specific, such that tissues with higher baseline IKBKAP exon 20 splicing showed stronger splicing enhancement upon ASO injection. Previous studies showed that the FD mutation causes defective IKBKAP splicing to varying degrees in different tissues (7). Future studies to elucidate the CNS-specific splicing regulators of IKBKAP mRNA may facilitate the development of more effective ASOs that restore correct IKBKAP splicing.
Whether the 3-fold increase in the level of IKAP protein in postnatal brain and spinal cord after ASO treatment can rescue the FD phenotype will require further investigation, as the mouse model we employed lacks an FD phenotype (33). However, the severity of the FD symptoms is inversely correlated with the levels of IKAP protein (8). For instance, Zeltner and colleagues assessed the relationship between IKAP levels and neuronal differentiation in induced pluripotent stem cells (iPSCs) derived from homozygous FD patients (42). Repairing one allele of IKBKAP by genome editing rescued the IKAP levels to those of a heterozygous carrier. Compared to the homozygous FD mutant cells, a higher percentage of the genetically rescued iPSCs differentiated into peripheral neurons and expressed higher levels of peripheral neuronal markers, demonstrating the positive effects of increased IKAP levels in neurodevelopment. Although FD symptoms can be seen in infants, FD patients experience progressive neurodegeneration (15). Thus, increasing IKAP levels may potentially help to alleviate the devastating symptoms of FD. A logical next step will be to test whether our ASO can alleviate FD symptoms in a mouse model with the FD phenotype. Dietrich and colleagues developed the Ikbkapflox/Δ20 mouse, which shows ∼90% reduction of mouse IKAP expression (36). This mouse model does not carry the major human FD mutation, but shares some pathologic phenotypes characteristic of FD patients. In this model, one Ikbkap allele has loxP sites in the introns flanking exon 20, and the other allele lacks exon 20 (36). Morini et al. recently developed the TgFD9;Ikbkapflox/Δ20 mouse by introducing the full human IKBKAP gene with the FD mutation into the Ikbkapflox/Δ20 mouse. TgFD9;Ikbkapflox/Δ20 mice show sensory and autonomic deficits seen in FD patients (43).
Testing ASO 7–26S in such a mouse model to attempt to rescue the FD-like phenotypes will be a key step in developing this antisense approach towards a potential clinical treatment for FD. In summary, this study is the first to utilize ASOs to correct defective IKBKAP gene expression in vitro and in vivo. Our results clearly indicate that the ASO engages its target in multiple tissues, including the CNS. As the FD disease phenotype correlates with the reduction in IKAP level, the increase in full-length IKAP protein levels resulting from restoration of IKBKAP exon 20 inclusion following ASO treatment could have a positive impact on FD symptoms. Our results with transgenic FD mice set the stage for phenotypic-rescue experiments in appropriate mouse models and for eventual clinical trials.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
We are grateful to James Pickel and Susan Slaugenhaupt for generously sharing their IKBKAP transgenic mouse strain, and to Greta Dornberg for many stimulating discussions.
FUNDING
National Institutes of Health [R37 GM42699]; Familial Dysautonomia Foundation. Funding for open access charge: National Institutes of Health [R37 GM42699].
Conflict of interest statement. F.R. and C.F.B. are employees of Ionis Pharmaceuticals and own stock options.
REFERENCES
1. Slaugenhaupt S.A., Gusella J.F. Familial dysautonomia. Curr. Opin. Genet. Dev. 2002; 12:307–311.
2. Maayan C., Kaplan E., Shachar S., Peleg O., Godfrey S. Incidence of familial dysautonomia in Israel 1977–1981. Clin. Genet. 1987; 32:106–108.
3. Anderson S.L., Coli R., Daly I.W., Kichula E.A., Rork M.J., Volpi S.A., Ekstein J., Rubin B.Y. Familial dysautonomia is caused by mutations of the IKAP gene. Am. J. Hum. Genet. 2001; 68:753–758.
4. Axelrod F.B., Gold-von Simson G. Hereditary sensory and autonomic neuropathies: types II, III, and IV. Orphanet. J. Rare Dis. 2007; 2:39.
5. Keren H., Donyo M., Zeevi D., Maayan C., Pupko T., Ast G. Phosphatidylserine increases IKBKAP levels in familial dysautonomia cells. PLoS One. 2010; 5:e15884.
6. He F., Jacobson A. Nonsense-mediated mRNA decay: degradation of defective transcripts is only part of the story. Annu. Rev. Genet. 2015; 49:339–366.
7. Dietrich P., Dragatsis I. Familial dysautonomia: mechanisms and models. Genet. Mol. Biol. 2016; 39:497–514.
8. Lefcort F., Mergy M., Ohlen S.B., Ueki Y., George L. Animal and cellular models of familial dysautonomia. Clin. Auton. Res. 2017; 27:235.
9. Xu H., Lin Z., Li F., Diao W., Dong C., Zhou H., Xie X., Wang Z., Shen Y., Long J. Dimerization of elongator protein 1 is essential for Elongator complex assembly. Proc. Natl. Acad. Sci. U.S.A. 2015; 112:10697–10702.
10. Huang B., Johansson M.J.O., Byström A.S. An early step in wobble uridine tRNA modification requires the Elongator complex. RNA. 2005; 11:424–436.
11. Esberg A., Huang B., Johansson M.J.O., Byström A.S. Elevated levels of two tRNA species bypass the requirement for elongator complex in transcription and exocytosis. Mol. Cell. 2006; 24:139–148.
12. Karlsborn T., Tükenmez H., Chen C., Byström A.S. Familial dysautonomia (FD) patients have reduced levels of the modified wobble nucleoside mcm5s2U in tRNA. Biochem. Biophys. Res. Commun. 2014; 454:441–445.
13. Yoshida M., Kataoka N., Miyauchi K., Ohe K., Iida K., Yoshida S., Nojima T., Okuno Y., Onogi H., Usui T. et al. Rectifier of aberrant mRNA splicing recovers tRNA modification in familial dysautonomia. Proc. Natl. Acad. Sci. U.S.A. 2015; 112:2764–2769.
14. Laguesse S., Creppe C., Nedialkova D.D., Prévot P.P., Borgs L., Huysseune S., Franco B., Duysens G., Krusy N., Lee G. et al. A dynamic unfolded protein response contributes to the control of cortical neurogenesis. Dev. Cell. 2015; 35:553–567.
15. Norcliffe-Kaufmann L., Slaugenhaupt S.A., Kaufmann H. Familial dysautonomia: History, genotype, phenotype and translational research. Prog. Neurobiol. 2017; 152:131–148.
16. Slaugenhaupt S.A., Blumenfeld A., Gill S.P., Leyne M., Mull J., Cuajungco M.P., Liebert C.B., Chadwick B., Idelson M., Reznik L. et al. Tissue-specific expression of a splicing mutation in the IKBKAP gene causes familial dysautonomia. Am. J. Hum. Genet. 2001; 68:598–605.
17. Anderson S.L., Liu B., Qiu J., Sturm A.J., Schwartz J.A., Peters A.J., Sullivan K.A., Rubin B.Y. Nutraceutical-mediated restoration of wild-type levels of IKBKAP-encoded IKAP protein in familial dysautonomia-derived cells. Mol. Nutr. Food Res. 2012; 56:570–579.
18. Hervé M., Ibrahim E.C. Proteasome inhibitors to alleviate aberrant IKBKAP mRNA splicing and low IKAP/hELP1 synthesis in familial dysautonomia. Neurobiol. Dis. 2017; 103:113–122.
19. Naftelberg S., Abramovitch Z., Gluska S., Yannai S., Joshi Y., Donyo M., Ben-Yaakov K., Gradus T., Zonszain J., Farhy C. et al. Phosphatidylserine ameliorates neurodegenerative symptoms and enhances axonal transport in a mouse model of familial dysautonomia. PLoS Genet. 2016; 12:1–27.
20. Bochner R., Ziv Y., Zeevi D., Donyo M., Abraham L., Ashery-Padan R., Ast G. Phosphatidylserine increases IKBKAP levels in a humanized knock-in IKBKAP mouse model. Hum. Mol. Genet. 2013; 22:2785–2794.
21. Cheishvili D., Maayan C., Holzer N., Tsenter J., Lax E., Petropoulos S., Razin A. Tocotrienol treatment in familial dysautonomia: open-label pilot study. J. Mol. Neurosci. 2016; 59:382–391.
22. Axelrod F.B., Liebes L., Gold-Von Simson G., Mendoza S., Mull J., Leyne M., Norcliffe-Kaufmann L., Kaufmann H., Slaugenhaupt S.A. Kinetin improves IKBKAP mRNA splicing in patients with familial dysautonomia. Pediatr. Res. 2011; 70:480–483.
23. Shetty R.S., Gallagher C.S., Chen Y.T., Hims M.M., Mull J., Leyne M., Pickel J., Kwok D., Slaugenhaupt S.A. Specific correction of a splice defect in brain by nutritional supplementation. Hum. Mol. Genet. 2011; 20:4093–4101.
24. Sridharan K., Gogtay N.J. Therapeutic nucleic acids: current clinical status. Br. J. Clin. Pharmacol. 2016; 82:659–672.
25. Finkel R.S., Chiriboga C.A., Vajsar J., Day J.W., Montes J., De Vivo D.C., Yamashita M., Rigo F., Hung G., Schneider E. et al. Treatment of infantile-onset spinal muscular atrophy with nusinersen: a phase 2, open-label, dose-escalation study. Lancet. 2016; 6736:2–11.
26. Hua Y., Sahashi K., Rigo F., Hung G., Horev G., Bennett C.F., Krainer A.R. Peripheral SMN restoration is essential for long-term rescue of a severe spinal muscular atrophy mouse model. Nature. 2011; 478:123–126.
27. Passini M.A., Bu J., Richards A.M., Kinnecom C., Sardi S.P., Stanek L.M., Hua Y., Rigo F., Matson J., Hung G. et al. Antisense oligonucleotides delivered to the mouse CNS ameliorate symptoms of severe spinal muscular atrophy. Sci. Transl. Med. 2011; 3:72ra18.
28. Finkel R.S., Kuntz N., Mercuri E. Primary efficacy and safety results from the phase 3 ENDEAR study of nusinersen in infants diagnosed with spinal muscular atrophy (SMA). 43rd Annual Congress of the British Paediatric Neurology Association. 2017.
29. Hua Y., Vickers T.A., Okunola H.L., Bennett C.F., Krainer A.R. Antisense masking of an hnRNP A1/A2 intronic splicing silencer corrects SMN2 splicing in transgenic mice. Am. J. Hum. Genet. 2008; 82:834–848.
30. Hua Y., Vickers T.A., Baker B.F., Bennett C.F., Krainer A.R. Enhancement of SMN2 exon 7 inclusion by antisense oligonucleotides targeting the exon. PLoS Biol. 2007; 5:e73.
31. Ohe K., Yoshida M., Nakano-Kobayashi A., Hosokawa M., Sako Y., Sakuma M., Okuno Y., Usui T., Ninomiya K., Nojima T. et al. RBM24 promotes U1 snRNP recognition of the mutated 5′ splice site in the IKBKAP gene of familial dysautonomia. RNA. 2017; 23:1393–1403.
32. Gerbracht J.V., Boehm V., Gehring N.H. Plasmid transfection influences the readout of nonsense-mediated mRNA decay reporter assays in human cells. Scientific Rep. 2017; 7:10616.
33. Hims M.M., Shetty R.S., Pickel J., Mull J., Leyne M., Liu L., Gusella J.F., Slaugenhaupt S.A., Angelica M.D., Fong Y. A humanized IKBKAP transgenic mouse models a tissue specific human splicing defect. Genomics. 2007; 90:389–396.
34. Hua Y., Sahashi K., Hung G., Rigo F., Passini M.A., Bennett C.F., Krainer A.R. Antisense correction of SMN2 splicing in the CNS rescues necrosis in a type III SMA mouse model. Genes Dev. 2010; 24:1634–1644.
35. Ek C.J., Habgood M.D., Dziegielewska K.M., Saunders N.R. Structural characteristics and barrier properties of the choroid plexuses in developing brain of the opossum (Monodelphis Domestica). J. Comp. Neurol. 2003; 460:451–464.
36. Dietrich P., Alli S., Shanmugasundaram R., Dragatsis I. IKAP expression levels modulate disease severity in a mouse model of familial dysautonomia. Hum. Mol. Genet. 2012; 21:5078–5090.
37. Warf M.B., Berglund J.A. The role of RNA structure in regulating pre-mRNA splicing. Trends Biochem. Sci. 2010; 35:169–178.
38. Stein C.A. The experimental use of antisense oligonucleotides: a guide for the perplexed. J. Clin. Invest. 2001; 108:641–644.
39. Hims M.M., Ibrahim E.C., Leyne M., Mull J., Liu L., Lazaro C., Shetty R.S., Gill S., Gusella J.F., Reed R. et al. Therapeutic potential and mechanism of kinetin as a treatment for the human splicing disease familial dysautonomia. J. Mol. Med. (Berl). 2007; 85:149–161.
40. Slaugenhaupt S.A., Mull J., Leyne M., Cuajungco M.P., Gill S.P., Hims M.M., Quintero F., Axelrod F.B., Gusella J.F. Rescue of a human mRNA splicing defect by the plant cytokinin kinetin. Hum. Mol. Genet. 2004; 13:429–436.
41. Boone N., Loriod B., Bergon A., Sbai O., Formisano-Tréziny C., Gabert J., Khrestchatisky M., Nguyen C., Féron F., Axelrod F.B. et al. Olfactory stem cells, a new cellular model for studying molecular mechanisms underlying familial dysautonomia. PLoS One. 2010; 5:e15590.
42. Zeltner N., Fattahi F., Dubois N.C., Saurat N., Lafaille F., Shang L., Zimmer B., Tchieu J., Soliman M.A., Lee G. et al. Capturing the biology of disease severity in a PSC-based model of familial dysautonomia. Nat. Med. 2016; 22:1–11.
43. Morini E., Dietrich P., Salani M., Downs H.M., Wojtkiewicz G.R., Alli S., Brenner A., Nilbratt M., LeClair J.W., Oaklander A.L. et al. Sensory and autonomic deficits in a new humanized mouse model of familial dysautonomia. Hum. Mol. Genet. 2016; 25:1116–1128.
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

journal article

Open Access Collection

Coupling of replisome movement with nucleosome dynamics can contribute to the parent–daughter information transfer

Bameta, Tripti;Das, Dibyendu;Padinhateeri, Ranjith

2018 Nucleic Acids Research

doi: 10.1093/nar/gky207 pmid: 29850895

Abstract
Positioning of nucleosomes along the genomic DNA is crucial for many cellular processes that include gene regulation and higher order packaging of chromatin. The question of how nucleosome-positioning information from a parent chromatin gets transferred to the daughter chromatin is highly intriguing. Accounting for experimentally known coupling between replisome movement and nucleosome dynamics, we propose a model that can obtain de novo nucleosome assembly similar to what is observed in recent experiments. Simulating nucleosome dynamics during replication, we argue that short pausing of the replication fork, associated with nucleosome disassembly, can be a crucial event for communicating nucleosome positioning information from parent to daughter. We show that the interplay of timescales between nucleosome disassembly (τp) at the replication fork and nucleosome sliding behind the fork (τs) can give rise to a rich ‘phase diagram’ having different inherited patterns of nucleosome organization. Our model predicts that the daughter chromatin can inherit the nucleosome positioning of the parent only when τp ≥ τs.
INTRODUCTION
The fate of a cell is controlled not by the DNA sequence alone but also by the organization and the kinetics of proteins along the DNA. In most eukaryotes, a huge fraction of the genomic DNA (e.g. >80% in yeast gene regions) is covered by histone proteins, leading to the formation of chromatin that appears like a ‘string of beads’ (1,2). Advances made over many years have confirmed that nucleosomes and their organization play an important role in nearly all cellular processes. For example, nucleosomes are known to cover transcription factor binding sites and restrict proteins from accessing those crucial sites along the genome and, hence, regulate gene expression (3–7).
There are very different nucleosome organizations in coding regions and promoter regions of genes, indicating the importance of the high diversity in nucleosome organization (3,8–10). Precise nucleosome organization is also crucial for higher order packaging of DNA, as the polymorphic chromatin structure depends on the linker length distribution (11,12). Since the precise positioning of nucleosomes is important, the natural question is: how do cells transfer this information about nucleosome positioning from one generation to another? How do daughter cells know about the nature of nucleosome positioning in the parent cells? This is an intriguing question for which we do not know the precise answer. One hypothesis argues that the DNA sequence determines the nucleosome positioning along the genome, and hence, the information is transferred with the DNA (8,13). However, various experiments have indicated that the DNA sequence alone does not determine the nucleosome positioning in the genome (9,14); ATP-dependent chromatin remodelling, statistical positioning and other factors play equally important roles (15–19). Moreover, different cell types (neuronal, muscle, epithelial cells, etc.) have exactly the same DNA, but they have very different organization of the chromatin, gene expression pattern and function (2). Another major drawback of the sequence-dictated model of self-organization of nucleosomes is that attaining an ‘equilibrium’ (steady state) nucleosome organization may take a long time (20), and hence, regulation of genes prior to attaining a desired nucleosome distribution may fail. An alternative hypothesis is that nucleosome positioning needs to be inherited, somehow, during replication so that the daughter cells can appropriately regulate their gene expression in an independent manner (21). This hypothesis is partially strengthened by recent experiments (22) which show that nucleosome positions are conserved at inactive sites behind the replication fork.
How does the de novo nucleosome assembly happen during DNA replication? Experiments have provided major insights into de novo nucleosome assembly in various gene regions (22–33). For example, Lucchini et al. have shown that nucleosomes are properly organized shortly after passage of the replication machinery and propose that nucleosome positioning is the initial step of chromosome maturation (24). Recently, Alabert et al. have shown that not just nucleosome positioning but also nucleosome modifications are inherited from parent to daughter (25). Additionally, Blythe et al. have shown that chromatin accessibility is also conserved throughout the cell cycle (26). Moreover, experiments from different groups over the years have shown that DNA replication is coupled with nucleosome assembly (27–29). In a recent publication (27), Smith and Whitehouse have shown that nascent chromatin plays a role in termination of Okazaki fragment synthesis. This indicates the importance of nucleosome positioning immediately behind the fork during replication. In another paper, Yadav and Whitehouse showed that the nucleosomes behind the replication fork also get repositioned via ATP-dependent chromatin remodeling machines, and such remodeling is essential for obtaining certain features associated with nucleosome organization (30). A requirement for ATP-dependent chromatin remodelling enzymes to reorganize nucleosomes after replication is also proposed by Fennessy et al. (31). Recently, in yeast, Vasseur et al. studied the maturation of nucleosome organization following genome replication (32) and analysed the role of transcription in the maturation of nucleosome organization toward its mid-log positions in active gene regions. They showed that soon after replication, downstream of the TSS, the nucleosome organization is not proper and takes time to mature.
Ramachandran and Henikoff found that after replication nucleosome occupancy at active gene regions may differ from the steady-state pattern owing to the competition between nucleosomes and various regulatory factors that bind DNA (22). However, they also found that nucleosome occupancy in inactive regions is very similar to that of the parent chromatin, suggesting inheritance of nucleosome positioning after replication at certain locations along the genome. There has been hardly any theoretical/computational study investigating de novo nucleosome assembly. To the best of our knowledge, only the work of Osberg et al. (34) investigates some aspects of de novo nucleosome assembly. However, they do not address the question of inheritance of precise nucleosome positioning from the parent chromatin to the daughter. In this work, we investigate the nucleosome organization immediately after replication, accounting for various experimentally known facts. We present a kinetic model incorporating replisome (replication fork) movement, nucleosome disassembly ahead of the fork, and nucleosome deposition and repositioning (sliding) of nucleosomes behind the fork. We show that pausing of the fork during disassembly of nucleosomes on parental chromatin and sliding/repositioning of nucleosomes on daughter chromatin behind the fork are crucial events dictating the nucleosome positioning after replication. We systematically explore the parameter space in the model and point out the parameter regime where inheritance of nucleosome positioning may be observed. We also study the competition between nucleosomes and non-histone proteins, and how they affect the nucleosome positioning during replication.
MATERIALS AND METHODS
Model for nucleosome kinetics during replication
Here we present a model to study the nucleosome re-organization following gene replication.
In this model we start by considering an initial (parental) chromatin, that is, DNA bound with nucleosomes, having a specific nucleosome organization. The DNA is considered as a one-dimensional lattice with each base pair marked with an index i. The nucleosome is modelled as a hard-core particle sitting on the lattice, occupying a space of k = 150 lattice sites (see Figure 1). At t = 0, the replisome starts the replication process from the replication origin (i = 0), and it moves with a bare rate $$v_{\rm r}$$ (rate of fork movement unhindered by nucleosomes) in the forward direction. As the replisome moves forward, it may encounter a nucleosome. Given that the nucleosome is a stable complex, there can be delays in fork progression as the remodeling enzymes try to disassemble the nucleosome ahead of the fork. This delay in fork progression will be referred to as a ‘pause’ event. This is a pause in fork progression and not in other processes. We consider τp as the typical timescale of this pause event (33,35). In other words, 1/τp is the eviction rate of nucleosomes at the replication fork. The replisome, as it moves, creates new double-stranded DNA (dsDNA) behind it; whenever the length of the newly synthesized dsDNA is larger than the size of a nucleosome (>150 bp), a new nucleosome can occupy that space with an intrinsic rate of kon. The effective nucleosome binding rate is proportional to the freely available space (ℓf) on the dsDNA for nucleosomes to bind, i.e. $$k_{\rm on}^{\rm eff}=k_{\rm on}\times \ell _f$$ (see Supporting Information (SI)). As the replisome moves further, the process repeats. At this point, it is important to note that, as mentioned earlier, recent experiments have indicated that the nucleosome deposition behind the fork happens soon after the fork movement (22,25,27) and is crucial for efficient replication (27).
Figure 1. Schematic diagram representing our model for nucleosome disassembly/assembly dynamics during replication. The replisome complex (red pentagon) at the fork moves with a velocity $$v_{\rm r}$$ in the direction shown by the arrow. As it encounters a nucleosome (blue oval) ahead, the fork movement pauses for a mean time τp, which is the timescale for the obstructing nucleosome to get disassembled (disassembly rate = 1/τp). When a sufficient length of double-stranded DNA (≥150 bp) is made, a nucleosome gets assembled behind the fork with an intrinsic rate kon, and the newly assembled nucleosome is slid for a time period τs at a rate rs. The sliding happens in such a way that the nucleosome is moved toward the middle of the available free region.
It has also been shown that the newly deposited nucleosomes get slid/repositioned with the help of appropriate ATP-dependent chromatin remodelers, and this is crucial for the formation of proper nucleosome positioning (30). In the model, taking cues from recent experiments (18,19), we assume that a nucleosome gets slid back and forth until it settles down at the middle of the available free dsDNA.
To achieve this repositioning, we use the following rule: each nucleosome has a rate of sliding given by rs = rs0|(i − i0)| toward the mid position i0, from the current location i, with a step of size 10 bp. Here, rs0 is the intrinsic rate of sliding and i0 is the mid position of the locally available free (linker) dsDNA at that instant; i0 will evolve as the nearest nucleosome or the fork is displaced. However, the nucleosome does not slide forever; it stops sliding after a time τs. The sliding could stop because the ATPase that facilitates sliding can disassemble or stop functioning after a certain time τs. What is given above is our basic model that describes nucleosome assembly dynamics. However, we have also extended the model to introduce binding of non-histone proteins such as gene regulatory factors (GRFs) or proteins that bind near replication origins. These proteins are considered as sterically interacting particles like a nucleosome, but with different sizes and different parameter values. For example, GRFs will have a size smaller than a nucleosome while their sliding rate will be zero. Using the same simulation setup we have developed here, we investigate the role of different non-histone protein factors and how they affect nucleosome organization post replication.
Parameters and their numerical values
There are five parameters (rates or timescales) in the model. However, many of them are constrained by known experimental data. The bare rate of replication ($$v_{\rm r}$$) and the pausing timescale of the replication fork are constrained by the time it takes to complete replication over a stretch. These rates are chosen such that 1.5–2 kb of dsDNA is replicated in a minute (33). Similarly, the nucleosome density of the parent constrains the nucleosome deposition rate and the fork velocity. In our simulations we have used a nucleosome density range of 60–90% as known in vivo (4,8).
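The sliding rule given in the model description, rs = rs0|i − i0| with 10 bp steps, can be illustrated with a minimal Python sketch. The function names, the use of the nucleosome centre as the coordinate i, the definition of the free region by its two bounding edges, and the choice to clip the final step so that the nucleosome lands on i0 rather than overshooting it are all assumptions made for illustration; the text does not specify these details.

```python
STEP_BP = 10  # sliding step size in bp (from the text)

def linker_midpoint(left_edge, right_edge):
    """Mid position i0 of the free region around a nucleosome.

    Assumption: left_edge is where the left neighbour ends and right_edge
    is where the right neighbour (or the fork) begins, so a nucleosome
    centred at i0 has equal linkers on both sides.
    """
    return (left_edge + right_edge) // 2

def sliding_rate(i, i0, rs0):
    """Sliding rate r_s = r_s0 * |i - i0| toward the midpoint i0."""
    return rs0 * abs(i - i0)

def slide_once(i, i0, step=STEP_BP):
    """Move the nucleosome centre one step toward i0.

    Assumption: the last step is clipped so the nucleosome stops at i0.
    """
    distance = i0 - i
    if abs(distance) <= step:
        return i0
    return i + step if distance > 0 else i - step
```

Because the rate is proportional to the distance from i0, it vanishes once the nucleosome reaches the midpoint, so in this sketch sliding ends either there or when the sliding window τs expires, whichever comes first.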
Apart from the above constraints, various experiments published in the literature also give us relevant ranges of parameter values. The forward movement rate of the replication fork is estimated to be in the range 10–500 bp/s (36). The binding rate of nucleosomes (kon) is estimated to be ≈0.1–10 (bp s)⁻¹ (20,37). The fork pausing timescale, τp, which arises from the delay in nucleosome disassembly ahead of the fork, can be estimated to be several seconds to tens of seconds (20,33,37). This is also comparable to the known timescale of similar pausing during transcription (38,39). In (33), the nucleosome disassembly timescale ahead of the replication fork (the pausing timescale) is estimated as 7 s, assuming a uniform disassembly rate everywhere. However, there will be heterogeneity (due to DNA sequence/nucleosome stability) and it may vary over a range around ∼7 s depending on the location/cell type. Hence we have done our simulations for a wide range of τp values. We do not know the sliding parameters precisely. Hence in this work, we vary the sliding duration parameter (τs) over a wide range and examine how this would affect nucleosome organization during replication. The other sliding parameter, rs0, is also varied from 0.05 to 1 bp⁻¹ s⁻¹.
Details of simulation
In this paper, given the events and rates, we simulate the system using kinetic Monte Carlo methods (Gillespie algorithm) (40). We start from a specified parental nucleosome profile (occupancy pattern), simulate replication as per the events discussed above, and produce the nucleosome organization in the daughter chromatin. We repeat this many times (typically 5000) and compute the average occupancy of nucleosomes on the daughter chromatin. Occupancy at any position i is defined as the probability that the site is covered by a nucleosome. The rates used for each figure are given in the text.
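To make the simulation procedure concrete, the following is a minimal Python sketch of a Gillespie-type kinetic Monte Carlo run restricted to two events, fork movement and nucleosome deposition, together with the ensemble-averaged occupancy defined above. It is not the authors' code. All function names are hypothetical, and two modelling choices are assumptions: the fork advances in 1 bp steps at rate v_r, and the free space ℓf is counted as the number of allowed start positions in the gaps behind the fork. Pausing and sliding are omitted to keep the sketch short.

```python
import math
import random

K = 150  # nucleosome footprint in bp (from the text)

def gillespie_step(rates, rng):
    """Pick the next event and its waiting time (Gillespie algorithm)."""
    total = sum(rates)
    if total <= 0:
        raise ValueError("no event can occur")
    dt = -math.log(1.0 - rng.random()) / total  # exponential waiting time
    threshold = rng.random() * total
    cumulative = 0.0
    for index, rate in enumerate(rates):
        cumulative += rate
        if threshold < cumulative:
            return index, dt
    # Rounding guard: fall back to the last event with a positive rate.
    return max(i for i, r in enumerate(rates) if r > 0), dt

def free_gaps(starts, fork, k=K):
    """List (gap_start, n_sites) for gaps behind the fork that fit a nucleosome."""
    gaps, left = [], 0
    for s in sorted(starts) + [fork]:
        n_sites = (s - left) - k + 1  # allowed start positions in this gap
        if n_sites > 0:
            gaps.append((left, n_sites))
        left = s + k
    return gaps

def replicate_minimal(length, v_r, k_on, rng, k=K):
    """Fork movement plus deposition only; returns (sorted starts, elapsed time)."""
    fork, starts, t = 0, [], 0.0
    while True:
        gaps = free_gaps(starts, fork, k)
        n_free = sum(n for _, n in gaps)
        rates = [v_r if fork < length else 0.0, k_on * n_free]
        if sum(rates) == 0:  # fork finished and no room left for deposition
            return sorted(starts), t
        event, dt = gillespie_step(rates, rng)
        t += dt
        if event == 0:
            fork += 1  # assumption: the fork advances by 1 bp per event
        else:
            pick = rng.randrange(n_free)  # uniform over allowed start sites
            for gap_start, n in gaps:
                if pick < n:
                    starts.append(gap_start + pick)
                    break
                pick -= n

def average_occupancy(configurations, length, k=K):
    """Probability that each site is covered, averaged over realizations."""
    counts = [0] * length
    for starts in configurations:
        for s in starts:
            for i in range(s, min(s + k, length)):
                counts[i] += 1
    return [c / len(configurations) for c in counts]
```

A typical use would be to call `replicate_minimal(2000, 500.0, 0.1, random.Random(0))` repeatedly, collect the returned start positions, and pass the list of realizations to `average_occupancy`. Extending the event list with a pause event of rate 1/τp and a sliding event of rate rs would be the natural next step toward the full model.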
RESULTS
A minimal model and its limitations
The simplest (or minimal) model for replication is to consider only two processes, namely the replisome movement and the nucleosome deposition. That is, imagine a one-dimensional problem of a replication fork moving at a rate $$v_{\rm r}$$ and nucleosomes being deposited behind the fork with a rate kon. This problem was considered by Osberg et al. (34). As a start, we also simulated replication with only these two processes and the results are presented in the SI text (Supplementary Figure S1). Our main findings from this simple study are: (i) the average density of nucleosomes, within this minimal model, is determined by the ratio of $$v_{\rm r}$$ to kon; (ii) within this model, the density of the nucleosomes (the fraction of DNA covered by nucleosomes) has to be between 75% and 100%; (iii) the occupancy pattern in this simple model will always be uniform, and one will never obtain a heterogeneous (space-dependent) nucleosome organization on average (see Figure 2A). The last two points are major limitations of the minimal model. Within this model, there is no mechanism that transfers the positional information from the parent to the daughter.
Figure 2. The grey regions represent nucleosome positions on the parental DNA having heterogeneous linker lengths. The resultant daughter nucleosome occupancy for various replication rules, averaged over 5000 realizations, starting from the same parent. (A) Using the minimal replication model having only two parameters: $$v_{\rm r}$$ = 500 bp/s and kon = 0.1 bp⁻¹ s⁻¹. (B) Same as (A) but with nucleosome sliding added to the minimal replication model, having sliding parameters τs = 1 s and rs0 = 1 bp⁻¹ s⁻¹. (C) Same as (A) but with fork pausing added to the minimal model, having τp = 10 s. (D) The full model; that is, when both pausing and sliding events are considered in addition to the fork velocity and nucleosome deposition. This replicates nucleosome positioning quite accurately. The parameter values are $$v_{\rm r}$$ = 500 bp/s, kon = 0.1 bp⁻¹ s⁻¹, τs = 1 s, τp = 10 s.
Heterogeneous nucleosome organization: role of fork pausing and nucleosome sliding
In the simulation of the minimal model, we did not account for the experimentally observed (30) nucleosome repositioning (sliding). We also assumed that nucleosomes ahead of the replication fork get disassembled infinitely fast, resulting in unhindered (no pause) movement of the fork. However, in reality the replication might pause until the nucleosome ahead of the fork is removed. Given that nucleosome insertion behind the fork is strongly coupled with the movement of the fork (27,30), we hypothesise that the timescale of such pausing, and hence the pausing in movement of the replication machinery, can be important in determining the nucleosome organization behind the fork. Therefore, as discussed in the model section, we introduce both sliding of nucleosomes behind the fork and pausing of the fork due to the removal of nucleosomes ahead of the fork.
Each nucleosome, after deposition behind the fork, will be slid for a time τs as discussed in the Model section. As the fork reaches a nucleosome on the parent strand, the fork will pause for a time τp, which is the time needed to clear the way for the machinery to go forward by removing the nucleosome ahead. Since we do not know the precise values of these two parameters, we will vary them systematically and investigate the parameter regime under which one can observe experimentally sensible results. We first take the bare sliding rate as rs0 = 1.0 bp⁻¹ s⁻¹. The precise value of rs0 may not be important, as we discuss later. We start our simulation with only three moves: replisome movement, nucleosome deposition and nucleosome sliding (i.e. minimal model + nucleosome sliding; assume pausing is negligible). The results are given in Figure 2B. One can see that, with sliding and no pausing, the resulting average occupancy is homogeneous in space, and looks very different from the parental nucleosome positioning. This means that sliding alone cannot produce heterogeneous nucleosome positioning. Then, we simulate another limit with no sliding but with pausing (i.e. minimal model + fork pausing; assume sliding is negligible). The results are in Figure 2C. Here, we find that the introduction of pausing brings some signature of the parental nucleosome organization. However, the occupancy pattern is not very similar to that of the parent. Further, we simulate the model by introducing all four events simultaneously: fork movement, nucleosome deposition, sliding and pausing. First, we take the pausing timescale longer than the sliding timescale (τp = 10 s, τs = 1 s). In this parameter regime, the parental nucleosome occupancy is nicely replicated in the daughter (Figure 2D). Note that even the heterogeneity in spacing is inherited in the next generation.
For example, near position 200, the gap between two nucleosomes in the parent is small (≈50 bp), and near position 800, the gap is large (≈100 bp). One can see that in the daughter cell (even after averaging over many cells) the gap variation is reproduced (Figure 2D). In Figure 3A, we present a natural scenario where nucleosome positioning on chromosome 1 of Saccharomyces cerevisiae starting at location 2708 bp is replicated. We started with data obtained from Kaplan et al.'s study (8) (blue curve in Figure 3A) as the parental nucleosome positioning profile, and performed the replication simulation on an ensemble of configurations; the resulting nucleosome occupancy of the daughter chromatin is shown as the red curve in Figure 3A (also see Figure 3B). Comparing the parental and daughter nucleosome occupancy, we note the following points: the daughter occupancy is not exactly the same as the parent; however, there is a good amount of similarity where the daughter occupancy profile captures essential signatures of the parent. For example, the peak positions (high occupancy regions) are largely similar, even though the heights of the peaks (and depths of the troughs) do not match well. This is qualitatively comparable to some of the recent experimental studies where there are some signatures of inheritance but the inheritance is not perfect (32).
Figure 3. Comparison of nucleosome organization in parent and daughter chromatin. (A) Nucleosome occupancy in parent (blue) and daughter (red) for S. cerevisiae (chromosome 1: 2708–7234). (B) Same data as in (A) shown as a heat map. The yellow regions represent higher nucleosome occupancy and the red regions represent lower nucleosome occupancy. (C) Nucleosome organization in the PHO5 promoter region when the promoter is in the ‘off’ (inactive) state. Parental nucleosomes are shown as grey bars and the TATA box (green bar) is covered with a nucleosome. Nucleosome occupancy after replication (red curve) results in a covered TATA box with ≈90% probability. On the other hand, if we switch off pausing and sliding events, the TATA box will be covered with probability 75% (magenta curve). The parameters used to generate daughter cell nucleosome occupancy are $$v_{\rm r}$$ = 500 bp/s, kon = 0.1 bp⁻¹ s⁻¹, τs = 5 s, τp = 10 s.
Further, we examined the promoter region of the PHO5 gene, which is known to show diverse behaviour (7,41). For example, if the TATA protein binding site is covered with a nucleosome, the promoter will mostly be in the ‘off’ (inactive) state; on the other hand, if the TATA site is exposed, then the promoter will mostly be in the ‘on’ (active) state. Based on recent experimental data (41), we started with a nucleosome occupancy pattern that represents the inactive (off) state of the promoter (see Figure 3C), that is, the TATA site is covered. After replication, if the nucleosome positioning is not faithfully inherited, it may lead to unwanted spurious gene expression.
In our simulations, we find that with large enough pausing, the nucleosome positioning can be inherited keeping the TATA box covered with a probability of 0.9, and hence the promoter is inactive. In comparison, in the absence of pausing and sliding, the inheritance is poor: it leads to reduced coverage of the TATA box (see Figure 3C). In our simulations, we rarely got configurations that are devoid of nucleosomes, implying that such nucleosome-free states are only possible with active remodeling (7,41). Going beyond single genes, to understand how parameter values affect the inheritance, we have systematically studied the inheritance of nucleosome positioning by taking a few different values of τp and τs. In Figure 4A, we have compared nucleosome occupancies in parent and daughter chromatin for different values of τp and τs. We observe that whenever both τp and τs are non-zero, and τp ≥ τs, the daughter cell inherits the parent positioning reasonably well. To compare the nucleosome occupancies, we define the deviation, χ, as a measure of the difference in nucleosome occupancy between the daughter and the parent,
\begin{equation} \chi = \sqrt{\dfrac{1}{L} \sum_{i=1}^{L} \left(m_i - d_i\right)^2}, \tag{1} \end{equation}
where $$m_i$$ and $$d_i$$ are the occupancies of the ith site in the parent and daughter strand, respectively, and L is the number of sites over which the two profiles are compared. If the nucleosome occupancy pattern of the parent and the daughter is identical, then we expect χ → 0; if the occupancy patterns are very different, we expect a large value of χ, close to 1. In Figure 4B, the deviation (χ) is plotted for different values of τp and τs as a heat map, with small values of χ represented by a dark violet color and large values of χ represented by a yellow color (see the colourbar on the side). This further verifies that for the parameter regime 0 < τs ≤ τp, the deviation is small. That is, for 0 < τs ≤ τp the daughter somewhat faithfully inherits the parental nucleosome occupancy.
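The deviation of Eq. (1) is a root-mean-square difference between two occupancy profiles, and a short Python sketch makes the computation explicit. The function name and the list-based representation of the profiles are assumptions for illustration.

```python
import math

def deviation(parent, daughter):
    """Root-mean-square difference chi between two occupancy profiles (Eq. 1).

    parent and daughter are sequences of per-site occupancies m_i and d_i,
    each between 0 and 1, defined over the same L sites.
    """
    if len(parent) != len(daughter) or not parent:
        raise ValueError("profiles must be non-empty and of equal length")
    total = sum((m - d) ** 2 for m, d in zip(parent, daughter))
    return math.sqrt(total / len(parent))
```

Since every occupancy lies between 0 and 1, each squared term is at most 1, so χ is bounded by 1; it reaches 0 only when the two profiles agree at every site.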
In SI Text (Supplementary Figure S2) we present similar results for a different set of parameter values, which suggests that the phenomenon of nucleosome positioning inheritance due to pausing is independent of the precise parameter values we use. Please note that even for the best inheritance, the deviation is non-zero, suggesting that the inheritance is not perfect. However, the process lays down a pattern of nucleosome positioning similar to the parent and this may help the post-replication maturation events in achieving a proper steady-state nucleosome organization.
Figure 4. Comparison of nucleosome organization between parent and daughter chromatin for various pairs of (τp, τs) values. (A) Blue sharp curves represent the parental nucleosome organization and the red curves represent daughter nucleosome organization averaged over an ensemble of realizations. (B) Heat map for the quantity ‘deviation’ (χ) as defined in Eq. (1); χ increases as the color varies from violet to yellow. For 0 < τs ≤ τp there is less deviation from parent to daughter nucleosomal organization. The parameters used to generate daughter cell nucleosome occupancy are $$v_{\rm r}$$ = 500 bp/s, kon = 0.1 bp⁻¹ s⁻¹, rs0 = 1 bp⁻¹ s⁻¹. For a different parameter value of rs0 = 0.05 bp⁻¹ s⁻¹, the results are shown in SI Supplementary Figure S2.
Role of strongly positioned nucleosomes and barrier-like proteins
In certain parts of chromatin, it is known that there are regions where nucleosomes are ‘strongly’ positioned, while other regions have weakly positioned nucleosomes (8,42,43). Even though the DNA sequence may influence the regions with strong positioning, it is well known that factors beyond the sequence also affect nucleosome stability. For example, action (or the lack of action) of certain remodellers, histone variants (H2A.Z, H3.3), various nucleosome-binding proteins (like H1 or HMG family proteins) and histone modifications are all known to affect the stability and positioning strength of nucleosomes (44–49). Does the stability/positioning strength of nucleosomes have any role in transferring the nucleosome positioning information into the daughter cells? We investigate the effect of strong vs weak nucleosome positioning and how they influence the occupancy pattern in daughter chromatin. Strongly positioned nucleosomes are defined as those nucleosomes that are more difficult to disassemble ahead of the fork; that is, nucleosomes having a higher value of τp are strongly positioned, while low τp would imply weakly positioned nucleosomes. We simulate such a system with heterogeneous (high and low) τp values, 0.01 s (weak) and 10 s (strong), keeping τs (= 1 s) fixed. In a long stretch of DNA, we consider two special regions with strongly positioned nucleosomes. In Figure 5A, the two grey-shaded regions (each of length 365 bp) contain two strongly positioned nucleosomes each, while the rest of the DNA has weakly positioned nucleosomes. All nucleosomes are arranged with a uniform linker length of 65 bp.
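The parental configuration just described can be written down in a few lines of Python. This is only a sketch of the setup: the function names are hypothetical, and which nucleosomes are marked as strong is passed in as an argument because the text does not give the positions of the grey regions.

```python
K = 150             # nucleosome footprint in bp (from the text)
LINKER = 65         # uniform linker length in bp (from the text)
TAU_STRONG = 10.0   # pause time in s for strongly positioned nucleosomes
TAU_WEAK = 0.01     # pause time in s for weakly positioned nucleosomes

def build_parent(n_nucleosomes, strong_indices, k=K, linker=LINKER):
    """Return (starts, tau_p) for a regular array with per-nucleosome pause times."""
    strong = set(strong_indices)
    starts = [n * (k + linker) for n in range(n_nucleosomes)]
    tau_p = [TAU_STRONG if n in strong else TAU_WEAK
             for n in range(n_nucleosomes)]
    return starts, tau_p

def region_length(starts, first, last, k=K):
    """Length in bp spanned by nucleosomes first..last (inclusive)."""
    return starts[last] + k - starts[first]
```

With these values, two adjacent nucleosomes and the linker between them span 150 + 65 + 150 = 365 bp, which matches the stated length of each grey-shaded region.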
The resulting nucleosome positioning in the daughter cells (averaged over 5000 cells) is shown as a red curve. We observe that strongly positioned parental nucleosomes give rise to regions in daughter chromatin with high nucleosome occupancy, inheriting the strong positioning. Also note that there is statistical positioning on either side of the strongly positioned nucleosomes, implying that the strongly positioned nucleosomes can influence the positioning of the neighboring nucleosomes, as in the case of the well-known statistical positioning near a strong ‘barrier’ (9,14,15). In SI text (Supplementary Figure S3), we show that a similar inheritance of nucleosome positioning is applicable even when just one nucleosome is strongly positioned (also see Supplementary Figure S4).
Figure 5. (A) The simulations are performed with parental nucleosomes in the grey-shaded regions (two nucleosomes in each grey region) that are strongly positioned (τp = 10 s) and other nucleosomes that are weakly positioned (τp = 0.01 s), with a uniform linker length of 65 bp. The daughter nucleosomal organization (occupancy) for such heterogeneous fork pausing times is shown as the red curve. Other parameters are kept constant as mentioned below. (B) Nucleosomal positioning information transfer in the vicinity of gene regulatory factors (GRF). In the top panel, the blue curve represents the parental nucleosome organization and the green solid bar shows the presence of the GRF. The middle panel shows nucleosome positioning in the daughter chromatin when the replication is performed from left to right, and the bottom panel shows nucleosome positioning in the daughter chromatin when the replication is performed from right to left. We have also performed similar simulations for symmetric nucleosome organization on either side of the GRF on the parent gene (see SI text, Supplementary Figure S5(A)) with asymmetric parameters on either side of the barrier (Supplementary Figure S5(B)). (C) Nucleosome occupancy reflecting competition between nucleosomes and non-histone proteins near the origin recognition complex (ORC) binding site. The parental curve (blue, in steady state) differs from the daughter curve (red, immediately after replication). The difference arises because nucleosomes compete with non-histone proteins binding at the nucleosome depleted region (NDR), which represents the ORC binding site. (D) Similar scenario as in (C) but with GRF-nucleosome competition near the transcription start site (TSS). Here too the parental curve (blue, in steady state) differs from the daughter curve (red, immediately after replication) because nucleosomes compete with GRFs. In (C) and (D), nucleosomes and non-histone proteins bind at the NDR with equal probability. Unless specified otherwise, in all four graphs (A–D) the common parameters used are: $$v_{\rm r}$$ = 500 bp/s, kon = 0.1 bp⁻¹ s⁻¹, τp = 10 s, rs0 = 1 bp⁻¹ s⁻¹, τs = 1 s (A, B) or 5 s (C, D).
Another aspect of such local nucleosome positioning influenced by various proteins arises in the context of gene-regulatory factors (GRF). We consider a situation where a certain non-histone GRF is present in the parental gene. It is known that when a bound GRF is highly stable, it can act like a ‘barrier’ and cause statistical positioning (9,14,15,50) of nucleosomes. Typically, it is known that the coding region will have statistical positioning of nucleosomes, while the regions upstream of the TSS often show different kinds of nucleosome organization (9). How the nucleosome positioning is inherited near a GRF is an interesting question, and recent works have probed this experimentally (22,30,32).
Here, we examine the prediction of our model given a certain nucleosome organization reminiscent of GRF locations on the parent DNA. On the parent DNA, on the left side of the GRF we start with the statistical positioning of nucleosomes, and on the right side with uniformly positioned nucleosomes (flat occupancy) with mean density ≈85% (see top panel of Figure 5B). We start with 5000 parent copies of the same gene, each having nucleosomes organized near the GRF in such a way that the mean occupancy of the parents is as given in the top panel of Figure 5B. Each of these 5000 copies is replicated once; we then look at the nucleosome positioning on each of the genes and compute the average occupancy, which is plotted as the red continuous curve in Figure 5B. When we carry out the replication from left to right with regard to the GRF (in the parent, the left side has statistical positioning, the right side has uniform occupancy), we find that on the left side the statistical positioning gets replicated fairly well (see middle panel of Figure 5B). However, on the right side, even though there was a flat positioning in the parent, the daughter chromatin has nucleosomes with non-uniform oscillatory occupancy in space. The physical reason for this is the following: on the left side, the daughter gene inherits the parental occupancy via pausing and sliding, whereas on the right side, due to the effect of the GRF barrier, one obtains oscillatory positioning; it is well known that nucleosomes near a barrier will have spatial oscillations in occupancy. This also indicates that physical barriers will have influence near the barrier site, even with pausing and sliding. In our simulations, since the GRF is bound immediately behind the replication fork, the nucleosome deposited after the GRF ‘feels’ (via steric exclusion) the GRF barrier, which generates the oscillatory pattern.
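The averaging step described above (one occupancy profile computed from many replicated copies of the same gene) can be illustrated with a minimal sketch. This is not the authors' simulation code: the function name `mean_occupancy`, the representation of each copy as a list of nucleosome start positions, and the 150 bp footprint are assumptions made only for illustration.

```python
import numpy as np

# Assumed footprint of one nucleosome in base pairs (illustrative value).
NUCLEOSOME_BP = 150


def mean_occupancy(copies, lattice_bp, footprint=NUCLEOSOME_BP):
    """Average per-base-pair nucleosome occupancy over many gene copies.

    copies     : iterable of lists; each list holds the 0-based start
                 positions of the nucleosomes on one copy (hypothetical format).
    lattice_bp : length of the simulated DNA region in base pairs.
    """
    total = np.zeros(lattice_bp)
    n_copies = 0
    for starts in copies:
        occupied = np.zeros(lattice_bp)
        for s in starts:
            # Mark every base pair covered by this nucleosome.
            occupied[s:s + footprint] = 1.0
        total += occupied
        n_copies += 1
    return total / n_copies


if __name__ == "__main__":
    # Three toy copies on a 300 bp lattice.
    profile = mean_occupancy([[0], [0], [100]], lattice_bp=300)
    print(profile[0], profile[120], profile[200], profile[260])
```

With 5000 copies, as in the text, the same function would return the kind of averaged curve that is plotted against position; peaks appear where most copies have a nucleosome at the same location.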
Please note that ATPase activity (here, sliding of nucleosomes) is an important factor in producing the oscillatory pattern as known in other contexts (9,14). When we carry out the replication from right to left with respect to GRF, we get the result as shown in Figure 5B, bottom panel. Since the machinery that is moving towards GRF is unaware of the presence of GRF until it reaches the location, the replicated chromatin will have very little influence of the barrier. However, after the GRF, the statistical positioning is reproduced. Within the short span of sliding, the nucleosome very close to the GRF feels the barrier and hence, one obtains a single peak on the right side (Figure 5B, bottom panel). We observe that the nucleosome organization immediately after the replication in the vicinity of GRF is tied to the replication fork progression direction (see Figure 5B). This positioning may change long after replication under the influence of other events such as transcription or action of various remodellers (32). These local remodelling events may destroy the spontaneous peak formed in Figure 5B and lead to parent-like nucleosome positioning as a result of these extra events. So far we have assumed that non-histone proteins like GRFs bind in the nucleosome depleted region (NDR) with large affinity and occupy their precise locations on the DNA. However, many of the recent experiments indicate that nucleosomes compete with non-histone protein binding and this may result in gain of nucleosomes in NDR region (22). To test this, we introduced the competition between nucleosome binding and binding of non-histone proteins (binding factors near replication origin and GRFs near promoters) in the following way. Whenever a non-histone protein binding region is replicated, that newly replicated region is free to be occupied by nucleosome and non-histone protein with probabilities (1 − α) and (α), respectively. 
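The competition rule stated above can be sketched as a single random choice per newly replicated binding region. The sketch below is only an illustration under stated assumptions: the function names, the string labels and the use of Python's `random` module are hypothetical and are not taken from the authors' implementation.

```python
import random


def occupy_replicated_site(alpha, rng):
    """Decide which species takes a freshly replicated binding region.

    With probability alpha the non-histone protein binds; otherwise
    (probability 1 - alpha) a nucleosome occupies the region.
    """
    return "non_histone" if rng.random() < alpha else "nucleosome"


def nucleosome_gain_fraction(alpha, n_copies=5000, seed=0):
    """Fraction of copies in which a nucleosome ends up in the region."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gained = sum(
        occupy_replicated_site(alpha, rng) == "nucleosome"
        for _ in range(n_copies)
    )
    return gained / n_copies


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, nucleosome_gain_fraction(alpha))
```

In this toy picture, a smaller α means that more copies gain a nucleosome in the formerly depleted region, which is the sense in which competition degrades inheritance of the parental pattern.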
Typical transcription factor binding free energy (≈5−15 kcal/mol) can be comparable to the nucleosome binding free energy at certain sequences (14,51,52). Experiments have also shown that, at biologically relevant concentrations, typical transcription factors can have binding rates comparable to that of nucleosomes (51). Hence we consider α = 0.5 here. In Figure 5C, we present our results of nucleosome positioning near the origin of replication and find that the inheritance is poor when the non-histone protein binding probability is small. This is similar to the experimental observations made by Ramachandran and Henikoff in their recent paper (22). We also find that the competition mostly leads to nucleosome gain in the nucleosome depleted regions and that it influences the inheritance. A similar picture is also obtained at the promoter region, where a GRF is competing with nucleosomes (see Figure 5D). This suggests that the inheritance of nucleosome positioning also depends on other factors such as the action of non-histone proteins. Therefore, in some contexts, transcription may also play an important role in the ‘maturation’ of nucleosome positioning, as indicated in (32).

DISCUSSION

In this paper, we have addressed the question of inheritance of nucleosome organization from parent to daughter, immediately after replication, by simulating a plausible physical model. We have used information from published experiments to construct a model for studying the effect of different replication-related processes on nucleosome organization in daughter cells. We first studied a bare minimum model of the fork movement and nucleosome deposition behind the fork, which can only produce a homogeneous nucleosome distribution in the daughter cell irrespective of parental organization.
Since the bare minimum model has no mechanism to transfer information about the heterogeneous parental nucleosome organization to the next generation, we introduced another physically important process: the pausing of the replication fork on encountering a nucleosome on the parental chromatin. This interaction of the fork with nucleosomes gave some signature of parental organization in the daughter strand, but the signature was not precise enough. Consequently, we introduced sliding of the newly deposited nucleosomes, as reported in recent experiments (53,54). Using computer simulations, we explore the parameter space and show that when pausing and sliding are both finite and have comparable timescales, the replicated daughter chromatin has a nucleosome occupancy similar to that of the parental chromatin. Our model argues that strongly positioned nucleosomes act as ‘barriers’ that make the replication fork pause for a short period (comparable to the nucleosome sliding timescale) at the site of the strongly positioned nucleosomes, and that this pause helps transfer the positioning information from parent to daughter.

Nucleosome positioning inheritance at ‘inactive’ gene regions

In the first part of our paper, we have only accounted for events that would typically occur in an ‘inactive’ gene region, namely, nucleosome disassembly and related pausing ahead of the fork, DNA replication, nucleosome deposition behind the fork and nucleosome sliding. With these events, we find that the nucleosome positioning can be inherited provided that the pausing timescale is sufficiently large (τp ≥ τs). As discussed in the context of Figure 3, a randomly selected typical gene region with no extra activity due to non-histone proteins, and an inactive (off) gene promoter region (e.g. PHO5), can inherit nucleosome positioning from the parental chromatin.
This is consistent with the recent experimental observation that ‘nucleosome positions are conserved at inactive sites behind replication fork’ (22). Our results do not imply that, with pausing, the inheritance is perfect. There is always some finite amount of deviation (e.g. Figures 3A and 5B). What our work suggests is that if pausing happens, it allows the chromatin to pass some information about the nucleosome positioning to the daughter chromatin. The duration of pausing will depend crucially on the local nucleosome stability and hence may be highly heterogeneous. As our results show, if the nucleosomes are not stable, the pausing and inheritance will be negligible. Interestingly, nearly all the experiments that study nucleosome positioning behind the fork report a non-homogeneous nucleosome occupancy pattern (having peaks and troughs) immediately after replication (22,32). As we show in our work, a minimal model would not give rise to this heterogeneity (Figure 2). Our results suggest that nucleosome pausing would lead to inheritance of the inhomogeneity seen in the parental chromatin. Therefore, one of the interesting predictions of our model is that pausing would play a role in giving rise to heterogeneity in nucleosome organization.

Lack of nucleosome positioning inheritance at ‘active’ gene regions

At ‘active’ gene regions, non-histone proteins (for example, gene regulatory factors) play an important role. In the second part of the paper, we extended our model by incorporating the binding of different non-histone proteins that compete with nucleosomes to occupy certain specific sites along the genome. Our results show that this competition will lead to imperfect inheritance of nucleosome positioning (Figure 5C and D). This is also consistent with (22), which reports a lack of inheritance in nucleosome positioning at such active sites. This again suggests that many other factors could influence the positioning of nucleosomes in the daughter chromatin.
Depending on the gene location and the factors involved, the precise nature of nucleosome positioning inheritance may vary.

Strength and limitations of the model

The strength of our model is that it incorporates various known experimental features such as nucleosome deposition behind the replication fork, sliding of newly deposited nucleosomes, and physically plausible events like nucleosome pausing. In our work we do not distinguish between replication of the leading and the lagging strand. We find that if the fork pauses for a sufficiently long time and there is sufficient time for remodelling machines to position nucleosomes, the results that we obtain should be similar for both strands. Since the mechanism of replication is different for the lagging strand, we tried to mimic that in a modified simulation relevant for the lagging strand (see SI text section 5 and Supplementary Figure S6). We find that the results do not change significantly. However, the model has various limitations. The first is that we have not considered the extended size of the replisome, which is ≈55 bp long (55). One reason we did not include the size of the replisome is that, during the pause, the replisome may partially unwrap or partially disassemble the nucleosome (by a few tens of base pairs, comparable to the size of the replisome) before pausing close to the dyad; this would offset the effect of the finite size of the replisome, and we would end up with a scenario very similar to what we have obtained here. In other words, we have not considered the size of the replisome and have assumed that the nucleosome at the fork occupies the full 150 bp; in reality, the nucleosome may unwrap and occupy only <100 bp, while the rest of the space is occupied by the replisome. In the case of transcription, it has been reported that the RNA polymerase pauses inside the unwrapped nucleosome near the dyad region (38).
This would be mathematically equivalent to what we did and would not change our results. The second limitation is that we have considered nucleosomes as stable hard-core particles, that is, particles with strong steric repulsion disfavouring any amount of overlap. However, partial unwrapping of DNA from nucleosomes has been observed experimentally (56); this feature, as discussed in earlier works (34,57,58), is not included in the current model and may be addressed in future work. Another limitation is that the rates of processes in vivo might be very different from those we have used in our simulations. However, we have explored the parameter space and found that our results do not depend on the precise values of the rates; rather, they hold for a range of rates.

Suggestions for new experiments to test our predictions

Our work predicts that strongly positioned nucleosomes will induce a pause in the progression of the replication fork, and that this pause will help in transferring nucleosome positioning information from parent to daughter. One way to test our predictions is to do experiments with and without strongly positioned nucleosomes in the parent chromatin. One possibility would be to make appropriate modifications to histones that would stabilise or destabilise the nucleosomes. This may be achieved by using appropriate histone variants or by using suitable chemical modifications along the histone tails. One can then test whether the positioning of a less stable nucleosome is inherited poorly and that of a more stable nucleosome is inherited better. Another way would be to stabilise nucleosomes on the parent chromatin by inserting artificial sequences (like the 601 sequence). Since the sliding machinery is known to slide nucleosomes away even from 601-like strongly positioning sequences (18), the pause will contribute to the inheritance of nucleosome positioning at such strongly positioned locations.
CONCLUSIONS

Our study can be a first step towards understanding the mechanism of inheritance of epigenetic information from parent to daughter, and it introduces strong physical arguments with predictive power. With our model we have been able to reproduce the parental nucleosome organization in the daughter cell with reasonable precision after the disruption due to replication. While in some regions the remodeling after replication (e.g. nucleosome rearrangement related to transcription (32)) might play an important role, in other regions (like heterochromatin or regions where the gene is ‘off’) the positioning of nucleosomes after replication may not change much. Hence, the inheritance of precise nucleosome positioning in these regions can be crucial; an erroneous gene activation due to incorrect epigenetic information transfer during replication may lead to various abnormalities and diseases (59,60). Our results are most directly relevant to these latter regions. Even for regions that may change their nucleosome positioning after transcription, it is important to have proper nucleosome positioning at all times, as incorrect nucleosome positioning may expose promoters, leading to unwanted transcription. We know that there are many other factors, such as DNA sequence and chemical modifications of histones, which also play significant roles in deciding the nucleosome organization. Further study is required to quantify the significance of these factors at various stages of the cell cycle.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We acknowledge useful discussions with Iestyn Whitehouse, Geeta Narlikar and Swati Patankar. We also thank DST-INSPIRE(TB), CSIR(DD), IIT Bombay(RP) for financial support.

FUNDING

Indian Institute of Technology Bombay [13IRAWD005]; Council of Scientific and Industrial Research [03(1326)/14/EMR-II]; Department of Science and Technology, Ministry of Science and Technology [IFA-13 PH-64].
Funding for open access charge: Department of Science and Technology, India.

Conflict of interest statement. None declared.

REFERENCES

1. Kornberg R.D. Chromatin structure: a repeating unit of histones and DNA. Science. 1974; 184: 868–871.
2. Alberts B., Bray D., Lewis J., Raff M., Roberts K., Watson J. Molecular Biology of the Cell. 4th edn. Garland. 2002.
3. Lee C.-K., Shibata Y., Rao B., Strahl B.D., Lieb J.D. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat. Genet. 2004; 36: 900–905.
4. Lee W., Tillo D., Bray N., Morse R.H., Davis R.W., Hughes T.R., Nislow C. A high-resolution atlas of nucleosome occupancy in yeast. Nat. Genet. 2007; 39: 1235–1244.
5. Lorch Y., Griesenbeck J., Boeger H., Maier-Davis B., Kornberg R.D. Selective removal of promoter nucleosomes by the RSC chromatin-remodeling complex. Nat. Struct. Mol. Biol. 2011; 18: 881–885.
6. Bai L., Morozov A.V. Gene regulation by nucleosome positioning. Trends Genet. 2010; 26: 476–483.
7. Kharerin H., Bhat P.J., Marko J.F., Padinhateeri R. Role of transcription factor-mediated nucleosome disassembly in PHO5 gene expression. Scientific Rep. 2016; 6: 20319.
8. Kaplan N., Moore I.K., Fondufe-Mittendorf Y., Gossett A.J., Tillo D., Field Y., LeProust E.M., Hughes T.R., Lieb J.D., Widom J. et al. The DNA-encoded nucleosome organization of a eukaryotic genome. Nature. 2009; 458: 362–366.
9. Zhang Z., Wippo C.J., Wal M., Ward E., Korber P., Pugh B.F. A packing mechanism for nucleosome organization reconstituted across a eukaryotic genome. Science. 2011; 332: 977–980.
10. Parmar J.J., Das D., Padinhateeri R. Theoretical estimates of exposure timescales of protein binding sites on DNA regulated by nucleosome kinetics. Nucleic Acids Res. 2016; 44: 1630–1641.
11. Correll S.J., Schubert M.H., Grigoryev S.A. Short nucleosome repeats impose rotational modulations on chromatin fibre folding. EMBO J. 2012; 31: 2416–2426.
12. Collepardo-Guevara R., Schlick T. Chromatin fiber polymorphism triggered by variations of DNA linker lengths. Proc. Natl. Acad. Sci. U.S.A. 2014; 111: 8061–8066.
13. van der Heijden T., van Vugt J.J., Logie C., van Noort J. Sequence-based prediction of single nucleosome positioning and genome-wide nucleosome occupancy. Proc. Natl. Acad. Sci. U.S.A. 2012; 109: E2514–E2522.
14. Parmar J.J., Marko J.F., Padinhateeri R. Nucleosome positioning and kinetics near transcription-start-site barriers are controlled by interplay between active remodeling and DNA sequence. Nucleic Acids Res. 2014; 42: 128–136.
15. Kornberg R.D., Stryer L. Statistical distributions of nucleosomes: nonrandom locations by a stochastic mechanism. Nucleic Acids Res. 1988; 16: 6677–6690.
16. Milani P., Chevereau G., Vaillant C., Audit B., Haftek-Terreau Z., Marilley M., Bouvet P., Argoul F., Arneodo A. Nucleosome positioning by genomic excluding-energy barriers. Proc. Natl. Acad. Sci. U.S.A. 2009; 106: 22257–22262.
17. Sadeh R., Allis C.D. Genome-wide ‘re’-modeling of nucleosome positions. Cell. 2011; 147: 263–266.
18. Yang J.G., Madrid T.S., Sevastopoulos E., Narlikar G.J. The chromatin-remodeling enzyme ACF is an ATP-dependent DNA length sensor that regulates nucleosome spacing. Nat. Struct. Mol. Biol. 2006; 13: 1078–1083.
19. Racki L.R., Yang J.G., Naber N., Partensky P.D., Acevedo A., Purcell T.J., Cooke R., Cheng Y., Narlikar G.J. The chromatin remodeller ACF acts as a dimeric motor to space nucleosomes. Nature. 2009; 462: 1016–1021.
20. Padinhateeri R., Marko J.F. Nucleosome positioning in a model of active chromatin remodeling enzymes. Proc. Natl. Acad. Sci. U.S.A. 2011; 108: 7799–7803.
21. Radman-Livaja M., Verzijlbergen K.F., Weiner A., van Welsem T., Friedman N., Rando O.J., van Leeuwen F. Patterns and mechanisms of ancestral histone protein inheritance in budding yeast. PLoS Biol. 2011; 9: e1001075.
22. Ramachandran S., Henikoff S. Transcriptional regulators compete with nucleosomes post-replication. Cell. 2016; 165: 580–592.
23. Probst A.V., Dunleavy E., Almouzni G. Epigenetic inheritance during the cell cycle. Nat. Rev. Mol. Cell Biol. 2009; 10: 192–206.
24. Lucchini R., Wellinger R.E., Sogo J. Nucleosome positioning at the replication fork. EMBO J. 2001; 20: 7294–7302.
25. Alabert C., Barth T.K., Reverón-Gómez N., Sidoli S., Schmidt A., Jensen O.N., Imhof A., Groth A. Two distinct modes for propagation of histone PTMs across the cell cycle. Genes Dev. 2015; 29: 585–590.
26. Blythe S.A., Wieschaus E.F. Establishment and maintenance of heritable chromatin structure during early Drosophila embryogenesis. eLife. 2016; 5: e20148.
27. Smith D.J., Whitehouse I. Intrinsic coupling of lagging-strand synthesis to chromatin assembly. Nature. 2012; 483: 434–438.
28. Mejlvang J., Feng Y., Alabert C., Neelsen K.J., Jasencakova Z., Zhao X., Lees M., Sandelin A., Pasero P., Lopes M. et al. New histone supply regulates replication fork speed and PCNA unloading. J. Cell Biol. 2014; 204: 29.
29. Weintraub H. A possible role for histone in the synthesis of DNA. Nature. 1972; 240: 449–453.
30. Yadav T., Whitehouse I. Replication-coupled nucleosome assembly and positioning by ATP-dependent chromatin-remodeling enzymes. Cell Rep. 2016; 15: 715–723.
31. Fennessy R.T., Owen-Hughes T. Establishment of a promoter-based chromatin architecture on recently replicated DNA can accommodate variable inter-nucleosome spacing. Nucleic Acids Res. 2016; 44: 7189–7203.
32. Vasseur P., Tonazzini S., Ziane R., Camasses A., Rando O.J., Radman-Livaja M. Dynamics of nucleosome positioning maturation following genomic replication. Cell Rep. 2016; 16: 2651–2665.
33. Alabert C., Jasencakova D., Groth A. Chromatin replication and histone dynamics. In: Masai H., Foiani M. (eds). DNA Replication. Advances in Experimental Medicine and Biology. Singapore: Springer. 2017; 1042: 311–333.
34. Osberg B., Nuebler J., Korber P., Gerland U. Replication-guided nucleosome packing and nucleosome breathing expedite the formation of dense arrays. Nucleic Acids Res. 2014; 42: 13633–13645.
35. Hall M.A., Shundrovsky A., Bai L., Fulbright R.M., Lis J.T., Wang M.D. High-resolution dynamic mapping of histone-DNA interactions in a nucleosome. Nat. Struct. Mol. Biol. 2009; 16: 124–129.
36. Raghuraman M.K., Winzeler E.A., Collingwood D., Hunt S., Wodicka L., Conway A., Lockhart D.J., Davis R.W., Brewer B.J., Fangman W.L. Replication dynamics of the yeast genome. Science. 2001; 294: 115–121.
37. Brown C.R., Mao C., Falkovskaia E., Jurica M.S., Boeger H. Linking stochastic fluctuations in chromatin structure and gene expression. PLoS Biol. 2013; 11: 1–15.
38. Jin J., Bai L., Johnson D.S., Fulbright R.M., Kireeva M.L., Kashlev M., Wang M.D. Synergistic action of RNA polymerases in overcoming the nucleosomal barrier. Nat. Struct. Mol. Biol. 2010; 17: 745–752.
39. Hodges C., Bintu L., Lubkowska L., Kashlev M., Bustamante C. Nucleosomal fluctuations govern the transcription dynamics of RNA polymerase II. Science. 2009; 325: 626–628.
40. Gillespie D.T. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 1977; 81: 2340–2361.
41. Small E.C., Xi L., Wang J.-P., Widom J., Licht J.D. Single-cell nucleosome mapping reveals the molecular basis of gene expression heterogeneity. Proc. Natl. Acad. Sci. U.S.A. 2014; 111: E2462–E2471.
42. Nikolaou C., Althammer S., Beato M., Guigó R. Structural constraints revealed in consistent nucleosome positions in the genome of S. cerevisiae. Epigenet. Chromatin. 2010; 3: 20.
43. Feng J., Dai X., Xiang Q., Dai Z., Wang J., Deng Y., He C. New insights into two distinct nucleosome distributions: comparison of cross-platform positioning datasets in the yeast genome. BMC Genomics. 2010; 11: 33.
44. Deal R.B., Henikoff J.G., Henikoff S. Genome-wide kinetics of nucleosome turnover determined by metabolic labeling of histones. Science. 2010; 328: 1161–1164.
45. Teif V.B., Ettig R., Rippe K. A lattice model for transcription factor access to nucleosomal DNA. Biophys. J. 2010; 99: 2597–2607.
46. Reeves R. Nuclear functions of the HMG proteins. Biochim. Biophys. Acta (BBA) - Gene Regul. Mech. 2010; 1799: 3–14.
47. Papamichos-Chronakis M., Watanabe S., Rando O.J., Peterson C.L. Global regulation of H2A.Z localization by the INO80 chromatin-remodeling enzyme is essential for genome integrity. Cell. 2011; 144: 200–213.
48. Ranjan A., Mizuguchi G., FitzGerald P.C., Wei D., Wang F., Huang Y., Luk E., Woodcock C.L., Wu C. Nucleosome-free region dominates histone acetylation in targeting SWR1 to promoters for H2A.Z replacement. Cell. 2013; 154: 1232–1245.
49. Becker P.B., Workman J.L. Nucleosome remodeling and epigenetics. Cold Spring Harbor Perspect. Biol. 2013; 5: a017905.
50. Möbius W., Gerland U. Quantitative test of the barrier nucleosome model for statistical positioning of nucleosomes up- and downstream of transcription start sites. PLoS Comput. Biol. 2010; 6: e1000891.
51. Perez-Howard G.M., Weil P.A., Beechem J.M. Yeast TATA binding protein interaction with DNA: fluorescence determination of oligomeric state, equilibrium binding, on-rate, and dissociation kinetics. Biochemistry. 1995; 34: 8005–8017.
52. Morozov A.V., Fortney K., Gaykalova D.A., Studitsky V.M., Widom J., Siggia E.D. Using DNA mechanics to predict in vitro nucleosome positions and formation energies. Nucleic Acids Res. 2009; 37: 4707–4722.
53. Clapier C.R., Cairns B.R. The biology of chromatin remodeling complexes. Annu. Rev. Biochem. 2009; 78: 273–304.
54. Leschziner A.E. Electron microscopy studies of nucleosome remodelers. Curr. Opin. Struct. Biol. 2011; 21: 709–718.
55. Baker T.A., Bell S.P. Polymerases and the replisome: machines within machines. Cell. 1998; 92: 295–305.
56. Li G., Widom J. Nucleosomes facilitate their own invasion. Nat. Struct. Mol. Biol. 2004; 11: 763–769.
57. Chereji R.V., Morozov A.V. Ubiquitous nucleosome crowding in the yeast genome. Proc. Natl. Acad. Sci. U.S.A. 2014; 111: 5236–5241.
58. Möbius W., Osberg B., Tsankov A.M., Rando O.J., Gerland U. Toward a unified physical model of nucleosome patterns flanking transcription start sites. Proc. Natl. Acad. Sci. U.S.A. 2013; 110: 5719–5724.
59. Egger G., Liang G., Aparicio A., Jones P.A. Epigenetics in human disease and prospects for epigenetic therapy. Nature. 2004; 429: 457–463.
60. Calvanese V., Lara E., Kahn A., Fraga M.F. The role of epigenetics in aging and age-related diseases. Ageing Res. Rev. 2009; 8: 268–276.

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

An origin of the immunogenicity of in vitro transcribed RNA

Mu, Xin;Greenwald, Emily;Ahmad, Sadeem;Hur, Sun

2018 Nucleic Acids Research

doi: 10.1093/nar/gky177; pmid: 29534222

Abstract

The emergence of RNA-based therapeutics demands robust and economical methods to produce RNA with few byproducts from aberrant activity. While in vitro transcription using the bacteriophage T7 RNA polymerase is one such popular method, its transcripts are known to display an immune-stimulatory activity that is often undesirable and uncontrollable. Here we show that the immune-stimulatory activity of T7 transcripts arises from an aberrant activity of the polymerase: initiation of transcription from a promoter-less DNA end. This activity results in the production of an antisense RNA that is fully complementary to the intended sense RNA product, and consequently a long double-stranded RNA (dsRNA) that can robustly stimulate a cytosolic pattern recognition receptor, MDA5. This promoter-independent transcriptional activity of the T7 RNA polymerase was observed for a wide range of DNA sequences and lengths, but can be suppressed by altering the transcription reaction with modified nucleotides or by reducing the Mg2+ concentration. The current work thus not only offers a previously unappreciated mechanism by which T7 transcripts stimulate the innate immune system, but also shows that the immune-stimulatory activity can be readily regulated.

INTRODUCTION

Recent advances in RNA technology have led to the emergence of RNA-based therapeutics (1–3). Promising therapeutic potential has been shown for a wide range of RNAs, such as small interfering RNAs (siRNAs), aptamers, catalytic ribozymes and mRNAs. Accordingly, there is an increasing demand for robust and cost-effective methods to prepare RNA on a large scale. While chemical synthesis can be utilized for relatively small RNAs, it is unsuitable for longer RNAs (>∼100–150 nt) because the yield decreases exponentially with increasing RNA length. Enzymatic production using a phage RNA polymerase, such as the T7 RNA polymerase (T7 pol), has been a popular method to prepare RNAs of various lengths.
Advantages of the T7 pol include its robust activity and the ease of protein production. However, T7 pol is known to generate various kinds of aberrant byproducts (4–8), and its transcripts are often known to stimulate the vertebrate innate immune system (9,10). While the immunogenicity of T7 transcripts can be beneficial for certain applications (such as cancer immunotherapy (11)), the current lack of understanding of the precise mechanism and the source of the immunogenicity limits harnessing such properties for therapeutic purposes.

RIG-I and MDA5 are two major cytosolic sensors that activate the innate immune system in response to viral dsRNAs (12). Upon viral dsRNA recognition, RIG-I and MDA5 activate antiviral signaling pathways that lead to the transcriptional up-regulation of the type I and III interferons (IFNs). Studies have shown that RIG-I and MDA5 have distinct RNA specificities, through which they recognize largely different groups of viruses (13–15). RIG-I recognizes dsRNA termini, in particular the 5′ triphosphate group (5′ppp), while MDA5 recognizes long (>∼0.5–1 kb) dsRNA in a manner that depends on the duplex length, not 5′ppp (16,17). More detailed structural and biochemical analyses showed that MDA5 forms a filament along the length of dsRNA, and that filament formation is essential for antiviral signal activation and dsRNA length detection (18,19). MDA5 also hydrolyzes ATP only upon binding to dsRNA, although ATP hydrolysis acts to regulate the MDA5 activity in seemingly complex ways (18). It has long been thought that the immune-stimulatory activity of T7 pol transcripts is due to the fact that they harbor 5′ppp, which can stimulate RIG-I (20–22). However, data suggest that removal of 5′ppp by a phosphatase does not completely suppress the immunogenicity (23). To the best of our knowledge, the potential MDA5-stimulatory activity of T7 transcripts has not been examined.
Here we demonstrate that T7 pol often generates a high level of unintended double-stranded RNAs (dsRNAs) that are highly stimulatory for MDA5 and consist of the intended sense transcript and its fully complementary antisense transcript. The current work offers a previously unappreciated mechanism by which T7 transcripts stimulate the innate immune system, and a method to suppress such dsRNA byproduct formation.

MATERIALS AND METHODS

Plasmids and proteins

Mammalian expression plasmids for RIG-I and MDA5 and the bacterial expression plasmid for MDA5ΔN (residues 287–1025) were described previously (24). Protocols to express and purify MDA5ΔN were reported in (25). Briefly, the protein was expressed in BL21(DE3) at 20°C for 16–20 h following induction with 0.5 mM IPTG. Cells were lysed by high pressure homogenization using an Emulsiflex C3 (Avestin), and the protein was purified by a combination of Ni-NTA and heparin affinity chromatography and size exclusion chromatography (SEC) in 20 mM HEPES, pH 7.5, 150 mM NaCl and 2 mM DTT. T7 polymerase was expressed in BL21(DE3) using the plasmid pT7–911Q. The T7 pol protein was purified by Ni-NTA followed by SEC in 50 mM Tris, pH 7.5, 100 mM NaCl and 1 mM EDTA, and stored in 50% glycerol.

T7 transcription and native gel electrophoresis

Sequences for the DNA templates are shown in Supplementary Table S1. Unless mentioned otherwise, templates for in vitro transcription were prepared by PCR. Transcription was performed in a reaction containing 250 mM HEPES (pH 7.5), 30 mM MgCl2, 2 mM spermidine, 40 mM DTT, 0.1 mg/ml BSA, 5 mM of each NTP (20 mM NTP in total), 0.5 mg/ml T7 pol and 0.25 mg/ml DNA template. The reaction was incubated at 37°C for 4 h, and the DNA template was digested with DNase I. Transcripts were purified by phenol:chloroform extraction and ethanol precipitation, followed by the QIAquick PCR purification kit (Qiagen). Transcripts (160 ng each) were analyzed on a 6% polyacrylamide gel in 1× TBE.
RNAs were visualized by SybrGold staining (ThermoFisher) or by acridine orange (AO, 25 μg/ml) staining (Sigma Aldrich). Fluorescent gel images were obtained using the FLA9000 gel scanner (GE Healthcare). For the AO fluorescent images, ssRNA-specific images were obtained using the 473 and 792 nm filters for excitation and emission, respectively. For dsRNA-specific fluorescent images, the 532 and 554 nm filters were used for excitation and emission, respectively.

5′- and 3′-RACE

For 5′-RACE, transcripts were subjected to reverse transcription (RT) using a primer specific to the RNA of interest. Prior to RT, the primer and RNA were incubated at 95°C for 5 min in the presence of 2 mM EDTA, and cooled to 4°C for 2 min. The High Capacity cDNA reverse transcription kit (Applied Biosystems) was used according to the manufacturer's protocol. After RT, RNA was removed with RNase A, and cDNA was purified using the QIAquick PCR purification kit. The 3′ end of the first strand cDNA was extended with a poly A tail using the Terminal dNTP transferase (NEB). The second strand cDNA was synthesized using the oligo d(T)-anchor primer. For 3′-RACE, the 3′ end of the RNA was first extended with a poly A tail using the poly A polymerase (NEB). The first strand and second strand cDNAs were synthesized using the oligo d(T)-anchor primer and a primer bearing the internal sequence of the RNA of interest. For both 5′- and 3′-RACE, the final product was PCR amplified and subjected to Sanger sequencing (Quintara Biosciences). Primers used in the analysis are shown in Supplementary Table S2.

IFNβ promoter-driven dual luciferase assay

293T cells were maintained in 48-well plates in Dulbecco's modified Eagle medium (Cellgro) supplemented with 10% heat-inactivated fetal calf serum and 1% penicillin/streptomycin.
At ∼90% confluence, cells were transfected with the pFLAG-CMV4 plasmids encoding RIG-I (10 ng) or MDA5 (5 ng), the IFNβ promoter-driven firefly luciferase reporter plasmid (100 ng) and a constitutively expressed Renilla luciferase reporter plasmid (pRL-CMV, 10 ng) using Lipofectamine 2000 (Life) according to the manufacturer's protocol. The medium was changed 6–8 h after the first transfection and the cells were additionally transfected with in vitro transcribed RNA (0.5 μg) or high molecular weight polyIC (0.5 μg, Invivogen). In vitro transcribed RNAs were filtered using a 0.1 μm spin-cup filter (Ultrafree-MC VV centrifugal filter, Merck Millipore) at 6000 rpm for 5 min at room temperature to remove any potential RNA aggregates. Cells were lysed ∼20 h post-stimulation and IFNβ promoter activity was measured using the Dual Luciferase Reporter assay (Promega) and a Synergy2 plate reader (BioTek). Firefly luciferase activity was normalized against Renilla luciferase activity.

ATPase assay

The ATP hydrolysis activity of MDA5 was measured using Green Reagent (Enzo Life Sciences). MDA5ΔN (500 nM) was pre-incubated with RNA (2 ng/μl) in buffer A (20 mM HEPES pH 7.5, 150 mM NaCl, 1.5 mM MgCl2 and 2 mM DTT), and the reaction was initiated by addition of 2 mM ATP at 37°C. Aliquots (10 μl) were withdrawn before and 15 min after ATP addition, and were quenched with 100 mM EDTA on ice. The Green Reagent (90 μl) was added to the quenched reaction at a ratio of 9:1, and OD650 was measured using a Synergy2 plate reader (BioTek).

Electron microscopy

MDA5ΔN (450 nM) was incubated with RNA (2 ng/μl) in buffer A for 10 min at room temperature, followed by addition of 1 mM ADP•AlFx on ice. ADP•AlFx was prepared by mixing ADP, AlCl3 and NaF in a molar ratio of 1:1:3. Prepared filaments were adsorbed to carbon-coated grids (Ted Pella) and stained with 0.75% uranyl formate as described (26). Images were collected using a Tecnai G2 Spirit BioTWIN transmission electron microscope at 49,000× magnification.
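The normalization step of the dual luciferase assay described above (firefly signal divided by Renilla signal, then summarized as mean ± S.D. over replicates) can be illustrated with a minimal Python sketch. The function names, the luminescence readings and the fold-induction step relative to unstimulated wells are hypothetical illustrations, not values or procedures taken from this study.

```python
from statistics import mean, stdev


def normalized_activity(firefly, renilla):
    """Return the firefly/Renilla ratio for each paired well."""
    if len(firefly) != len(renilla):
        raise ValueError("firefly and Renilla readings must be paired per well")
    return [f / r for f, r in zip(firefly, renilla)]


def summarize(firefly, renilla):
    """Return (mean, sample S.D.) of the normalized ratios across replicates."""
    ratios = normalized_activity(firefly, renilla)
    return mean(ratios), stdev(ratios)


# Hypothetical luminescence readings (arbitrary units), n = 3 replicates each.
mock_mean, mock_sd = summarize([1200.0, 1100.0, 1300.0],
                               [40000.0, 38000.0, 42000.0])
rna_mean, rna_sd = summarize([52000.0, 48000.0, 55000.0],
                             [39000.0, 41000.0, 40000.0])

# Fold induction relative to the unstimulated (mock) wells; this extra step
# is an assumption added for illustration.
fold_induction = rna_mean / mock_mean

print(f"mock: {mock_mean:.4f} (S.D. {mock_sd:.4f})")
print(f"RNA:  {rna_mean:.4f} (S.D. {rna_sd:.4f})")
print(f"fold induction: {fold_induction:.1f}")
```

Dividing by the constitutively expressed Renilla signal corrects each well for differences in transfection efficiency and cell number before replicates are compared.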
RNase III / RNase If digestion

RNA (100 ng/μl) was incubated with the indicated concentrations of RNase III (NEB) or RNase If (NEB) in 10 mM Tris, pH 8.5 and 2 mM MgCl2 at 37°C for 30 min. The reaction was terminated using 1/10 volume of proteinase K (NEB) and incubated at 25°C for 20 min. RNA was purified with the Direct-zol RNA MiniPrep Kit (Zymo Research) followed by the QIAquick PCR purification kit (Qiagen).

RESULTS

In vitro T7 transcription often generates dsRNA byproduct from a template designed for ssRNA

To examine the immune-stimulatory activity of T7 transcripts, we performed in vitro transcription of four independent RNAs, and tested their RIG-I/MDA5-stimulatory activities using the interferon-promoter driven dual luciferase assay in 293T cells. These transcripts were generated from templates that are designed to produce 512 nt ssRNAs with limited secondary structures (512A, 512B, 512C and 512D, see Supplementary Table S1). Since 293T cells express a low level of RIG-I and little or no MDA5, we transiently expressed RIG-I or MDA5, stimulated the cells by transfecting RNAs of interest, and measured the signaling activity of RIG-I or MDA5 by the level of the luciferase activity. A known stimulator, polyinosinic-polycytidylic acid (polyIC), was used for comparison. Consistent with previous reports about the immune-stimulatory activity of T7 transcripts, we observed that all four T7 transcripts robustly stimulated both RIG-I and MDA5 (Figure 1A). Treatment of these RNAs with Calf Intestinal Phosphatase (CIP), which removes 5′ppp, largely suppressed the RIG-I-stimulatory activities, but not the MDA5-stimulatory activities (Figure 1A).

Figure 1. T7 pol generates MDA5-stimulatory dsRNAs from templates designed to encode ssRNAs. (A) IFNβ promoter-driven dual luciferase assay (mean ± S.D., n = 3 biological replicates). * and ns indicate P < 0.05 and P > 0.05, respectively (one-tailed, unpaired t test, comparing values with and without CIP).
293T cells were transiently transfected with the plasmids expressing human RIG-I or MDA5 (or empty vector control, EV), and stimulated with T7 transcripts for 512A, 512B, 512C and 512D. The RIG-I- and MDA5-stimulatory activities were measured by the dual luciferase activity. The stimulatory activity was compared before and after CIP treatment to examine the effect of 5′ppp. (B) Native PAGE analysis of T7 transcripts for 512A, 512B, 512C and 512D. RNAs were visualized by SybrGold staining (upper panel) and Acridine Orange (AO) staining (lower panel). The 500 nt ssRNA and 500 bp dsRNA markers were used as controls. (C) RNase-susceptibility analysis. The T7 transcript for 512B was treated with an increasing concentration of RNase If (1/40, 1/20, 1/14 and 1/6 units/μl) or RNase III (1/4500, 1/3000, 1/2000 and 1/1000 units/μl), and was analyzed by native PAGE. (D) MDA5-stimulatory activity of upper and lower bands of the transcript for 512B, as measured by IFNβ promoter-driven dual luciferase assay (mean ± S.D., n = 3 biological replicates). * indicates P < 0.05 (one-tailed, unpaired t test). The upper and lower bands were isolated by either RNase If/III digestion or gel purification, and their MDA5-stimulatory activities were measured as in (A). All RNAs were treated with CIP to suppress the endogenous RIG-I activity. (E–F) MDA5-stimulatory activity of upper and lower bands of 512B, as measured by MDA5 filament formation (E) and ATP hydrolysis (F) (mean ± S.D., n = 3 biological replicates). Both assays were performed using the RNA binding domain (MDA5ΔN), which can form filaments and hydrolyze ATP as the full-length protein does. Filament formation was detected using 2 ng/μl of RNA and 450 nM MDA5 by negative stain electron microscopy (EM). The ATPase activity was measured using 2 ng/μl of RNA and 500 nM MDA5.

To identify the origin of the 5′ppp-independent MDA5-stimulatory activity, we analyzed the T7 transcripts by non-denaturing PAGE. Two bands were observed for all four RNAs tested, with the upper bands at an expected position near the 500 nt ssRNA marker, and the lower bands at a position near the 500 bp dsRNA marker (Figure 1B). While there were batch-to-batch variations in the relative levels of the two bands, they were the most prominent bands in all cases. The two bands displayed different fluorescence of acridine orange, a dye that binds RNA regardless of its secondary structure, but fluoresces differently depending on the RNA secondary structure (Figure 1B). The upper band showed fluorescence similar to that of the 500 nt ssRNA marker (rendered orange in the gel image), while the lower band resembled the 500 bp dsRNA marker (rendered green). The two bands also displayed different sensitivities to RNase If and RNase III, which are specific to ssRNA and dsRNA, respectively (Figure 1C, only shown for 512B as a representative example). The upper band was susceptible to RNase If, while the lower band was susceptible to RNase III. Consistent with the notion that the upper band is mostly single stranded, while the lower band has significant duplex structures, only the lower band RNA, as purified by gel extraction or RNase If digestion, had a significant stimulatory activity for MDA5-mediated antiviral signaling (Figure 1D). In vitro analysis of the interaction between the upper or lower band RNAs of 512B and MDA5 also showed that only the lower band can stimulate MDA5 filament formation and its ATPase activity (Figure 1E and F). Intriguingly, electron microscopy analysis of the MDA5 filament showed that the lower band RNA supported formation of ∼150–160 nm long MDA5 filaments (Figure 1E), a length expected for ∼500 bp dsRNA (24).
This suggested that the lower band is not the intended ssRNA with an intrinsic secondary structure, but is instead a ∼500 bp dsRNA byproduct.

dsRNA byproduct results from antisense transcription from the promoter-less DNA end

Previous studies reported that T7 pol can generate erroneous products by extending the 3′ end of the RNA with the sequence complementary to the intended RNA product (5). Such byproducts would fold onto themselves and form a hairpin structure with its duplex size equivalent to that of the intended ssRNA. To examine whether the ∼500 bp dsRNA byproduct is produced in this fashion, we examined the 3′ end sequence of the intended ssRNA transcript (512B) using 3′-RACE followed by Sanger sequencing (Figure 2A). The sequencing result shows that the ends are well-defined although a few nucleotide heterogeneities were also observed, a well-known behavior of T7 pol (27). This result is inconsistent with the idea that a long complementary 3′ extension is responsible for the observed ∼500 bp duplex byproduct.

Figure 2. The dsRNA byproduct is formed by sense and antisense RNAs generated in promoter-dependent and -independent manners, respectively. (A and B) Transcriptional start sites and end sites for the intended 512B product (A) and its complementary RNA byproduct (c512B) (B), as examined by 5′- and 3′-RACE. Transcriptional start and end sites are shown upstream of the poly A tail in the 5′- and 3′-RACE sequences, respectively. Cyan underscores in the sequence chromatograms indicate sequences matching those in the template. The location of the matching sequence in the template is shown in the schematic on the right. The red box in (B) indicates the reverse complement sequence of the T7 promoter. (C) Schematic illustrating the results in (A and B). Transcription using a template with a single T7 promoter results in the production of both sense and antisense transcripts, which differ in length by the size of the T7 promoter.
Solid and dotted lines indicate DNA and RNA, respectively. (D) Native PAGE analysis of T7 transcripts generated using DNA template with a single T7 promoter (1), DNA template without the T7 promoter (2), and gel-purified 512B ssRNA as a template (3). RNA template alone (4) was compared with (3). (E) Native PAGE analysis of T7 transcripts generated using fully duplexed DNA template (1), template strand ssDNA alone (512B-3end in Supplementary Table S1) (2), and partially duplexed DNA template (generated by annealing 512B-3end with T7promoter_top_strand, Supplementary Table S1) (3). * indicates unknown ssRNA byproduct from transcription.

It was also reported that T7 pol can generate RNA using another RNA as a template (6,7). To examine this possibility and to determine exactly how the dsRNA byproduct was made, we performed 3′- and 5′-RACE analyses of the RNA complementary to 512B (c512B) (Figure 2B). The results showed that c512B starts primarily from the 3′ end of 512B (Figure 2C). Note that the template sequence in this end has little resemblance to the canonical T7 promoter. 3′-RACE of the c512B further revealed that c512B ends with the sequence complementary to the T7 promoter. These results suggest that the c512B is produced by transcription in reverse orientation from the promoter-less DNA end of the 512B template, rather than by RNA-dependent RNA polymerization. To confirm that c512B is synthesized by promoter-independent, DNA-dependent transcription, we performed the T7 transcription using 512B RNA as a template and 512B DNA template without the T7 promoter at either end. Consistent with the notion that c512B is generated from the promoter-independent, DNA-dependent RNA polymerization, dsRNA was generated from the DNA even without the T7 promoter, but not from the RNA template (Figure 2D). The transcription reaction using 512B RNA template did not produce any detectable level of dsRNA or transcription product aside from the input RNA. To further examine the mechanism of dsRNA byproduct formation, we also used a DNA template that is duplexed only in the promoter, instead of fully duplexed DNA (Figure 2E).
Here, an 80 nt sequence from the 3′ end of 512B (512B-3end, Supplementary Table S1) was used instead of the full 512 nt sequence of 512B, owing to the size limit of ssDNA synthesis. The dsRNA byproduct was produced from the fully duplexed DNA (albeit less so than with the 512B template), but not from the partial duplex (Figure 2E). Thus, these results collectively suggest that the observed promoter-independent transcription of the antisense RNA requires the duplexed DNA end.

Promoter-independent antisense transcription occurs for a wide range of DNA with various end sequences and structures

Our finding that promoter-independent T7 pol transcription begins at the promoter-less DNA end raised the question of whether this activity depends on the DNA end sequence or structure. Note that the four RNAs tested above (512A-D) have different sequences but share a common 6-nucleotide sequence at the promoter-less end of the DNA template (Supplementary Table S1). The templates were also all prepared by PCR amplification, and thus are expected to harbor blunt ends. To examine the importance of the DNA end sequence and structure, we compared three sequence variants that differ from the original 512B template near the promoter-less end (i.e. transcriptional start site for c512B) (Figure 3A). In the varied end sequence, we embedded unique restriction sites (XmaI, BamHI and KpnI), which enabled us to generate ends containing 5′- and 3′-overhangs by restriction digestion. The Klenow reaction was also performed to ensure that the original 512B template harbors blunt ends.

Figure 3. Impact of the DNA termini sequence and structure on the promoter-independent antisense transcription. (A) Sequence and structure variants of DNA templates tested in this study. All have a single T7 promoter for transcription of 512B. The sequence near the promoter-less end (i.e. transcriptional start site for the antisense transcript, c512B) was varied and is shown in capital letters.
DNA end structural variants were generated by restriction digestion of the listed DNA. Restriction sites are colored blue, and –bd and –ad indicate before and after digestion, respectively. (B) Native PAGE analysis of the T7 transcripts produced using the original 512B template before and after the Klenow reaction, and its sequence variants before and after respective restriction digestions. (C) MDA5 ATPase assay of the T7 transcripts in (B) (mean ± S.D., n = 3 biological replicates).

Comparison of the transcripts from these DNA templates before and after Klenow or restriction digestion suggests that the end sequence and structures can influence the efficiency of promoter-independent antisense transcription, as measured by the dsRNA byproduct on the native gel (Figure 3B) and MDA5 ATPase activity (Figure 3C). Note that the XmaI sequence lowered the level of dsRNA, whereas restriction digestion increased the level of dsRNA regardless of the sequence or overhang type.
In particular, digestion with KpnI, which generates a 3′ overhang, increased the level of dsRNA most significantly. While the higher level of dsRNA with KpnI-digested DNA is consistent with the previous report that a 3′ overhang promotes antisense transcription (28), our finding suggests that even DNA with a 5′ overhang (BamHI-ad) or a blunt end (Klenow) has a significant level of antisense transcription. These results thus suggest that the promoter-independent antisense transcription is a more general phenomenon than was previously thought, and can be influenced, but not eliminated, by both DNA end sequence and structure.

Modified nucleotides affect the level of dsRNA byproduct, not its MDA5-stimulatory activity

It has been shown that co-transcriptional incorporation of modified nucleotides can decrease to a certain extent the immunogenicity of T7 transcripts (9,23,29,30). These modified nucleotides include pseudouridine (Ψ), 1-methylpseudouridine (m1Ψ), 6-methyladenosine (m6A), and 5-methylcytosine (m5C) (Figure 4A). We asked whether the presence of modified nucleotides suppresses the formation of dsRNA byproducts or the immune-stimulatory activity of the dsRNA. To test these hypotheses, we performed four independent transcription reactions, in each of which Ψ, m1Ψ, m6A or m5C completely replaced U, U, A or C, respectively. Intriguingly, Ψ, m1Ψ and m5C, but not m6A, suppressed dsRNA byproduct formation, as measured by native gel analysis (Figure 4B). The cellular MDA5 signaling activity and MDA5 ATPase activity also correlated with the level of dsRNA byproduct observed in the transcripts (Figure 4C and D).

Figure 4. Modified nucleotides affect the level of dsRNA byproduct, not its MDA5-stimulatory activity. (A) Chemical structures of the modified nucleotides used in this study (boxed) and their respective unmodified counterparts.
(B) Native PAGE analysis of the T7 transcripts for 512B or the dsRNA species (512B:c512B) prepared with or without the indicated modified nucleotides. The dsRNA was prepared by co-transcription of the two RNA strands using two independent templates and was purified using RNase If, which selectively degrades residual ssRNA while maintaining dsRNA. Modified nucleotides were used in place of the respective unmodified nucleotide. (C) MDA5-stimulatory activity of 512B transcripts in (B), as measured by IFNβ promoter-driven dual luciferase assay. (D) ATPase activity of MDA5 with 512B transcripts in (B). (E) MDA5-stimulatory activity of purified 512B:c512B dsRNA in (B), as measured by IFNβ promoter-driven dual luciferase assay. (F) ATPase activity of MDA5 with purified 512B:c512B dsRNA in (B). The results in C–F represent mean ± S.D., n = 3 biological replicates. In C and E, * and ns indicate P < 0.05 and P > 0.05, respectively (one-tailed, unpaired t test, compared with unmodified RNA).

We next asked whether incorporation of Ψ, m1Ψ, m6A or m5C has any effect on the dsRNA's ability to stimulate MDA5, in addition to the level of dsRNA byproduct formation. To address this question, we examined MDA5 signaling and ATPase activities using modified and unmodified dsRNA of 512B:c512B hybrids. The dsRNA was prepared by co-transcription of the two RNA strands using two independent templates and was purified using RNase If, which selectively degrades residual ssRNA while maintaining dsRNA (Figure 4B). The results show that Ψ, m1Ψ, m6A or m5C had minimal impact on the MDA5-stimulatory activity of the dsRNA (Figure 4E and F). It is noteworthy that another nucleotide modification, adenosine deamination (i.e. inosine) by the cellular enzyme ADAR1, significantly suppresses the MDA5 activity (31). Given that inosine weakens Watson-Crick base-pairs, while Ψ, m1Ψ, m6A and m5C do not, this result is consistent with the notion that MDA5 primarily recognizes dsRNA backbone structure with few specific contacts with RNA bases (25). Thus, the observed impact of Ψ, m1Ψ and m5C in lowering the immunogenicity of T7 transcripts is at least in part due to their ability to lower dsRNA byproduct formation by T7 pol, rather than to a decrease in the MDA5-stimulatory activity of the dsRNA byproduct.

Low concentrations of Mg2+ can suppress the promoter-independent antisense transcription

The above result suggested that the transcription reaction conditions can affect the dsRNA byproduct formation. We thus performed a more systematic analysis of the reaction condition using varying concentrations of individual reaction components.
These include T7 pol, DNA template, NTP and MgCl2, for each of which a range of concentrations has been used in the literature. We also tested an increasing concentration of NaCl, because non-specific DNA binding by T7 pol was previously shown to be suppressed in buffers with high ionic strength (32). With both in-house produced T7 pol (hT7 pol) and commercial T7 pol (cT7 pol), we observed little dependence of the dsRNA byproduct formation on the concentration of T7 pol (Figure 5A). Concentrations of the template, NTP and NaCl also had a minimal impact on the level of dsRNA byproduct formation (Figure 5B–D). Intriguingly, decreasing the concentration of MgCl2 from 30 to 5 mM significantly reduced dsRNA byproduct formation, to the point that dsRNA was barely detectable at 5 mM MgCl2 (Figure 5E). Note that the total yield of RNA was not significantly affected by the reduction in MgCl2. This was unexpected because it is often considered necessary to use MgCl2 in excess over NTP (which is 20 mM in total). The impact of MgCl2 on dsRNA production was also evident in the MDA5 ATPase assay, which showed that decreasing the concentration of MgCl2 reduced the level of the ATPase activity (Figure 5F).

Figure 5. Impact of the transcription reaction conditions on dsRNA byproduct formation. (A) Native PAGE analysis of the T7 transcripts for 512B prepared using a decreasing concentration of T7 pol. In-house T7 pol (hT7 pol) and commercial T7 pol (cT7 pol, ThermoFisher) were compared. (B–E) Native PAGE analysis of the T7 transcripts for 512B prepared using a decreasing concentration of the DNA template (B), NTP (C) and MgCl2 (E), and an increasing concentration of NaCl (D). (F) MDA5 ATPase analysis of the T7 transcripts in (E). (G) Native PAGE analysis of the T7 transcripts for 512B variants (BamHI-bd and KpnI-ad), 512A, 512C, 512D, 403A, 386A and 112A generated using MgCl2 of 30 and 5 mM. See Supplementary Table S1 for the template sequence.
Note that for shorter RNAs (∼150 nt), dsRNA migrates more slowly than ssRNA of an equivalent size. (H) MDA5 ATPase analysis of the T7 transcripts in (G) (mean ± S.D., n = 3 biological replicates). Figure 5. View largeDownload slide Impact of the transcription reaction conditions on dsRNA byproduct formation. (A) Native PAGE analysis of the T7 transcripts for 512B prepared using a decreasing concentration of T7 pol. In-house T7 pol (hT7 pol) and commercial T7 pol (cT7 pol, ThermoFisher) were compared. B-E. Native PAGE analysis of the T7 transcripts for 512B prepared using a decreasing concentration of the DNA template (B), NTP (C) and MgCl2 (E), and an increasing concentration of NaCl (D). (F) MDA5 ATPase analysis of the T7 transcripts in (E). (G) Native PAGE analysis of the T7 transcripts for 512B variants (BamHI-bd and KpnI-ad), 512A, 512C, 512D, 403A, 386A and 112A generated using MgCl2 of 30 and 5 mM. See Supplementary Table S1 for the template sequence. Note that for shorter RNAs (∼150 nt), dsRNA migrates more slowly than ssRNA of an equivalent size. (H) MDA5 ATPase analysis of the T7 transcripts in (G) (mean ± S.D., n = 3 biological replicates). To examine whether the observed impact of MgCl2 on dsRNA byproduct formation is generally applicable, we examined other templates with varying DNA sequences, structure and lengths. In all cases, including the one with a 3′ overhang (KpnI-ad), we found that reducing MgCl2 from 30 mM to 5 mM significantly lowered the level of dsRNA byproduct, as measured by both native gel analysis and MDA5 ATPase assay (Figure 5G and H). Although additional ssRNA bands appear for some RNAs at low MgCl2 (Figure 5G), the overall lack of dsRNA byproduct suggests that lowering the MgCl2 concentration can be used as a general method to suppress the aberrant dsRNA byproduct formation. DISCUSSION It has been known that T7 in vitro transcripts are often immunogenic. However, the precise origin of the immunogenicity—i.e. 
the identity of the RNA species and its immune-stimulatory mechanism−has been unclear. This study provides an answer to these questions by showing that T7 pol often produces long dsRNA byproducts that can robustly activate the cytosolic sensors RIG-I and MDA5 in 5’ppp-dependent and -independent manners, respectively. The dsRNA is formed by hybridization of the intended sense transcript and its fully complementary antisense transcript. The antisense RNA is produced by promoter-independent transcriptional initiation from a DNA end, which differs from previously reported erroneous behaviors of T7 pol, such as 3′-extension of RNA, RNA-dependent RNA polymerization and DNA-primed RNA synthesis (4–7). Our finding also shows that this promoter-independent transcription is observed with a wide range of DNA templates, and is not limited to the DNA end with 3′ overhangs as previously reported (28). Thus, our work highlights yet another level of complexity of T7 pol activity, and has broad implications for studies utilizing T7-transcribed RNAs. Structural mechanism by which T7 pol can initiate transcription from the very end of the DNA in a promoter independent manner is yet unclear. A series of elegant structures of T7 pol (33) showed that promoter-dependent transcription begins with the initiation phase, where the polymerase remains bound to the promoter and synthesizes short abortive transcripts. This is then followed by a large conformational change that releases T7 pol from the promoter and allows processive transcriptional elongation. It is temping to speculate that the promoter-independent transcription starts directly with T7 pol in this elongation phase conformation, possibly promoted by high concentration of Mg2+, with which we observed more frequent generation of dsRNA byproducts (Figure 5). In this aberrant ‘elongation’ conformation, T7 pol may mis-recognize fraying DNA ends as DNA bubbles, initiating RNA synthesis even in the absence of an RNA primer. 
It would be interesting to examine whether short abortive transcripts also occur with promoter-independent transcription, and how T7 pol mutations that either promote or suppress the elongation conformation (34,35) affect the promoter-independent transcription.

Our work also offers novel methods to quantitatively detect such dsRNA byproduct formation. While previous studies often utilized dsRNA-specific antibodies, such as J2, for detecting dsRNA (10,36), such methods cannot distinguish between secondary structures within ssRNA and long dsRNA byproducts formed by sense-antisense hybridization. Instead, we propose the MDA5 filament formation and ATPase assays as robust and convenient methods to quantitatively measure dsRNA contaminants, either in T7 pol transcripts or in any other RNA products. In particular, MDA5 filament analysis by EM allows one to examine the extent of the continuous duplex length, thereby unambiguously distinguishing between intrinsic secondary structures within ssRNA and fully duplexed RNA byproducts. Of note, our work also calls for re-evaluation of the conventional method of determining RNA purity, i.e. denaturing gel electrophoresis. Considering that the intended sense and antisense transcripts differ by ∼20 nt (the size of the T7 promoter, Figure 2), a denaturing gel would be inefficient in separating sense and antisense strands for long RNA transcripts. Instead, we used native gel electrophoresis, as proposed before (8), which allows separation of ssRNA and dsRNA products and their distinction using acridine orange staining. While native gel analysis alone can be complicated for RNAs with intrinsic secondary structures, we believe that a combination of native gel electrophoresis, MDA5 filament formation and ATPase assays would enable a comprehensive analysis of the RNA transcript of interest.

Finally, our study provides previously unappreciated methods to suppress the dsRNA byproduct formation.
These include using modified nucleotides and reducing the MgCl2 concentration, as well as altering the template sequence at the promoter-less terminus. In contrast to the conventional notion that RNAs containing modified nucleotides are less immunogenic, we found that modified nucleotides in fact suppress T7 pol's aberrant activity of generating immunogenic dsRNA byproduct, rather than decreasing the MDA5-stimulatory activity of dsRNA. We also found that Mg2+, among all variables tested, unexpectedly plays the most significant role in regulating dsRNA byproduct synthesis. Intriguingly, a previous study showed that HPLC purification can deplete immune-stimulatory dsRNA (10). Although the exact identity of the dsRNA byproduct in this study is not clear, an optimal transcription reaction that minimizes such byproduct formation would help in improving the purity of RNA, and could potentially alleviate the need for HPLC purification. Altogether, the current work provides novel insights into the previously enigmatic properties of T7 pol: the origin, detection and regulation of its immunogenic byproduct.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Cancer Research Institute (to S.A.); NIH [R01AI106912, R01AI111784, R21AI130791]; Burroughs Wellcome Fund (to S.H.). Funding for open access charge: NIH [R01AI111784].

Conflict of interest statement. None declared.

REFERENCES

1. Sahin U., Kariko K., Tureci O. mRNA-based therapeutics—developing a new class of drugs. Nat. Rev. Drug Discov. 2014; 13:759–780.
2. Wittrup A., Lieberman J. Knocking down disease: a progress report on siRNA therapeutics. Nat. Rev. Genet. 2015; 16:543–552.
3. Pardi N., Hogan M.J., Pelc R.S., Muramatsu H., Andersen H., DeMaso C.R., Dowd K.A., Sutherland L.L., Scearce R.M., Parks R. et al. Zika virus protection by a single low-dose nucleoside-modified mRNA vaccination. Nature. 2017; 543:248–251.
4. Krupp G. Unusual promoter-independent transcription reactions with bacteriophage RNA polymerases. Nucleic Acids Res. 1989; 17:3023–3036.
5. Triana-Alonso F.J., Dabrowski M., Wadzack J., Nierhaus K.H. Self-coded 3′-extension of run-off transcripts produces aberrant products during in vitro transcription with T7 RNA polymerase. J. Biol. Chem. 1995; 270:6298–6307.
6. Konarska M.M., Sharp P.A. Replication of RNA by the DNA-dependent RNA polymerase of phage T7. Cell. 1989; 57:423–431.
7. Cazenave C., Uhlenbeck O.C. RNA template-directed RNA synthesis by T7 RNA polymerase. Proc. Natl. Acad. Sci. U.S.A. 1994; 91:6972–6976.
8. Mellits K.H., Pe’ery T., Manche L., Robertson H.D., Mathews M.B. Removal of double-stranded contaminants from RNA transcripts: synthesis of adenovirus VA RNAI from a T7 vector. Nucleic Acids Res. 1990; 18:5401–5406.
9. Kariko K., Buckstein M., Ni H., Weissman D. Suppression of RNA Recognition by Toll-like Receptors: The Impact of Nucleoside Modification and the Evolutionary Origin of RNA. Immunity. 2005; 23:165–175.
10. Kariko K., Muramatsu H., Ludwig J., Weissman D. Generating the optimal mRNA for therapy: HPLC purification eliminates immune activation and improves translation of nucleoside-modified, protein-encoding mRNA. Nucleic Acids Res. 2011; 39:e142.
11. Zitvogel L., Galluzzi L., Kepp O., Smyth M.J., Kroemer G. Type I interferons in anticancer immunity. Nat. Rev. Immunol. 2015; 15:405–414.
12. Yoneyama M., Fujita T. Recognition of viral nucleic acids in innate immunity. Rev. Med. Virol. 2010; 20:4–22.
13. Loo M.Y., Fornek J., Crochet N., Bajwa G., Perwitasari O., Martinez-Sobrido L., Akira S., Gill M.A., Garcia-Sastre A. Distinct RIG-I and MDA5 signaling by RNA viruses in innate immunity. J. Virol. 2008; 82:335–345.
14. Yoneyama M., Kikuchi M., Matsumoto M., Imaizumi T., Miyagishi M., Taira K., Foy E., Loo M.Y., Gale M. Jr, Akira S. et al. Shared and Unique Functions of the DExD/H-Box Helicases RIG-I, MDA5, and LGP2 in Antiviral Innate Immunity. J. Immunol. 2005; 175:2851–2858.
15. Kato H., Takeuchi O., Sato S., Yoneyama M., Yamamoto M., Matsui K., Uematsu S., Jung A., Kawai T., Ishii K.J. et al. Differential roles of MDA5 and RIG-I helicases in the recognition of RNA viruses. Nature. 2006; 441:101–105.
16. Fitzgerald M.E., Rawling D.C., Vela A., Pyle A.M. An evolving arsenal: viral RNA detection by RIG-I-like receptors. Curr. Opin. Microbiol. 2014; 20:76–81.
17. Kato H., Takeuchi O., Mikamo-Satoh E., Hirai R., Kawai T., Matsushita K., Hiiragi A., Dermody T.S., Fujita T., Akira S. Length-dependent recognition of double-stranded ribonucleic acids by retinoic acid-inducible gene-I and melanoma differentiation-associated gene 5. J. Exp. Med. 2008; 205:1601–1610.
18. del Toro Duany Y., Wu B., Hur S. MDA5-filament, dynamics and disease. Curr. Opin. Virol. 2015; 12:20–25.
19. Sohn J., Hur S. Filament assemblies in foreign nucleic acid sensors. Curr. Opin. Struct. Biol. 2016; 37:134–144.
20. Schlee M., Roth A., Hornung V., Hagmann C.A., Wimmenauer V., Barchet W., Coch C., Janke M., Mihailovic A., Wardle G. et al. Recognition of 5′triphosphate by RIG-I helicase requires short blunt double-stranded RNA as contained in panhandle of negative-strand virus. Immunity. 2009; 31:25–34.
21. Hornung V., Ellegast J., Kim S., Brzozka K., Jung A., Kato A., Poeck H., Akira S., Conzelmann K.K., Schlee M. et al. 5′-Triphosphate RNA Is the Ligand for RIG-I. Science. 2006; 314:994–997.
22. Pichlmair A., Schultz D.E., Tan C.P., Naslund T.I., Liljestrom P., Weber F., Reis e Sousa C. RIG-I–mediated antiviral responses to single-stranded RNA bearing 5′-phosphates. Science. 2006; 314:997–1001.
23. Warren L., Manos P.D., Ahfeldt T., Loh Y.H., Li H., Lau F., Ebina W., Mandal P.K., Smith Z.D., Meissner A. et al. Highly efficient reprogramming to pluripotency and directed differentiation of human cells with synthetic modified mRNA. Cell Stem Cell. 2010; 7:618–630.
24. Peisley A., Lin C., Bin W., Orme-Johnson M., Liu M., Walz T., Hur S. Cooperative assembly and dynamic disassembly of MDA5 filaments for viral dsRNA recognition. Proc. Natl. Acad. Sci. U.S.A. 2011; 108:21010–21015.
25. Wu B., Peisley A., Richards C., Yao H., Zeng X., Lin C., Chu F., Walz T., Hur S. Structural basis for dsRNA recognition, filament formation, and antiviral signal activation by MDA5. Cell. 2013; 152:276–289.
26. Ohi M., Li Y., Cheng Y., Walz T. Negative staining and image classification—powerful tools in modern electron microscopy. Biol. Proced. Online. 2004; 6:23–34.
27. Milligan J.F., Groebe D.R., Witherell G.W., Uhlenbeck O.C. Oligoribonucleotide synthesis using T7 RNA polymerase and synthetic DNA templates. Nucleic Acids Res. 1987; 15:8783–8798.
28. Schenborn E.T., Mierendorf R.C. Jr. A novel transcription property of SP6 and T7 RNA polymerases: dependence on template structure. Nucleic Acids Res. 1985; 13:6223–6236.
29. Richner J.M., Himansu S., Dowd K.A., Butler S.L., Salazar V., Fox J.M., Julander J.G., Tang W.W., Shresta S., Pierson T.C. et al. Modified mRNA vaccines protect against Zika virus infection. Cell. 2017; 169:176.
30. Pardi N., Secreto A.J., Shan X., Debonera F., Glover J., Yi Y., Muramatsu H., Ni H., Mui B.L., Tam Y.K. et al. Administration of nucleoside-modified mRNA encoding broadly neutralizing antibody protects humanized mice from HIV-1 challenge. Nat. Commun. 2017; 8:14630.
31. Ahmad S., Mu X., Yang F., Greenwald E., Park J.W., Jacob E., Zhang C.Z., Hur S. Breaching self-tolerance to Alu duplex RNA underlies MDA5-mediated inflammation. Cell. 2018; 172:797–810.
32. Smeekens S.P., Romano L.J. Promoter and nonspecific DNA binding by the T7 RNA polymerase. Nucleic Acids Res. 1986; 14:2811–2827.
33. Steitz T.A. The structural changes of T7 RNA polymerase from transcription initiation to elongation. Curr. Opin. Struct. Biol. 2009; 19:683–690.
34. Guillerez J., Lopez P.J., Proux F., Launay H., Dreyfus M. A mutation in T7 RNA polymerase that facilitates promoter clearance. Proc. Natl. Acad. Sci. U.S.A. 2005; 102:5958–5963.
35. Martin C.T., Muller D.K., Coleman J.E. Processivity in early stages of transcription by T7 RNA polymerase. Biochemistry. 1988; 27:3966–3974.
36. Son K.-N., Liang Z., Lipton H.L. Double-stranded RNA is detected by immunofluorescence analysis in RNA and DNA virus infections, including those by negative-stranded RNA viruses. J. Virol. 2015; 89:9383–9392.

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

journal article

Open Access Collection

Complex repeat structure promotes hyper-amplification and amplicon evolution through rolling-circle replication

Watanabe, Takaaki;Tanaka, Hisashi;Horiuchi, Takashi

2018 Nucleic Acids Research

doi: 10.1093/nar/gky275 pmid: 29718479

Abstract Inverted repeats (IRs) are abundant in genomes and frequently serve as substrates for chromosomal aberrations, including gene amplification. In the early stage of amplification, repeated cycles of chromosome breakage and rearrangement, called breakage-fusion-bridge (BFB), generate a large inverted structure, which evolves into highly-amplified, complex end products. However, it remains to be determined how IRs mediate chromosome rearrangements and promote subsequent hyper-amplification and amplicon evolutions. To dissect the complex processes, we constructed repetitive structures in a yeast chromosome and selected amplified cells using genetic markers with limited expression. The genomic architecture was associated with replication stress and produced extra-/intra-chromosomal amplification. Genetic analysis revealed structure-specific endonucleases, Mus81 and Rad27, and post-replication DNA repair protein, Rad18, suppress the amplification processes. Following BFB cycles, the intra-chromosomal products undergo intensive rearrangements, such as frequent inversions and deletions, indicative of rolling-circle replication. This study presents an integrated view linking BFB cycles to hyper-amplification driven by rolling-circle replication. INTRODUCTION Inverted repeats (IRs), two arms of repeated DNA sequences with one arm being reverse complemented relative to the other, are common in eukaryotic genomes. IRs and their subgroup, palindromes, in which two arms of IRs are separated by less than a few base pairs, can cause gross chromosomal rearrangements (GCRs), in particular, large inverted duplications of chromosomal segments. Large inverted duplications are abundant in human cancer genomes and are considered to be initial chromosomal structures that lead to the increase in gene copies at very high level (genomic amplification; (1)). 
Therefore, understanding how IRs promote inverted chromosomal duplications and how inverted chromosomal duplications lead to high-copy genomic amplification has significant implications for both chromosome biology and disease etiology. Several experimental systems have been developed to study how IRs promote large inverted duplications in yeasts (2–5). Narayanan et al. showed that the insertion of a 320 bp inverted repeat into a chromosome in Saccharomyces cerevisiae induces GCRs, including inverted duplications of chromosomal segments (2). This duplication is likely initiated by the extrusion of the 320 bp inverted repeats to form a cruciform-like structure, followed by the resolution of the cruciform. Replication of the resulting hairpin-capped chromosome completes the inverted duplication. Mizuno et al. employed an inducible replication fork barrier next to an ectopic inverted repeat of a few kb and showed that restarted replication forks frequently make U-turns at the inverted repeat and initiate inverted duplication in S. pombe (3,5). Inverted duplication can also be generated when a DNA double-strand break occurs next to an IR. This process occurs during developmentally-programmed chromosome rearrangements in Tetrahymena, in which a 42 bp IR is required for the formation of an inverted dimer carrying ribosomal RNA genes. Following 5′ to 3′ resection, a broken chromosome end anneals intra-molecularly between inverted repeat sequences and initiates DNA synthesis (foldback priming) to generate a hairpin-capped chromosome, the replication of which results in inverted duplications. Although this is a developmentally-programmed inverted duplication in Tetrahymena, somatic inverted duplication following foldback priming from either a natural or experimentally-induced chromosome break has been demonstrated in several yeast and mammalian cell systems. Most recently, Deng et al.
have shown that small inverted repeats (5–9 bp) are sufficient to initiate foldback priming and inverted duplication (6). In mammalian cells, IRs with either 229 or 79 bp of homology between the arms promote inverted duplications more efficiently than small IRs (7). Inverted duplication of a hairpin-capped chromosome can result in either isochromosomes with two centromeres (dicentric chromosomes) or ones without centromeres (acentric chromosomes). Acentric chromosomes can accumulate in a cell by unequal segregation, as proposed for gene amplification by double minute chromosomes (DMs) in cancer cells. Dicentric chromosomes can lead to genomic amplification through breakage-fusion-bridge (BFB) cycles (8). During chromosome segregation, the two centromeres are pulled to opposite poles, resulting in a break somewhere between the centromeres. This de novo break can be processed for subsequent foldback priming, followed by the formation of a hairpin-capped chromosome and inverted duplication. In each cycle, a random break between the two centromeres creates an imbalance of genetic material between daughter cells: a segment close to the break will be duplicated in one daughter cell and deleted in the other. If the segment harbors an oncogene, duplication of the segment would be beneficial to tumor cell growth and promote clonal expansion. BFB cycles are a prevalent mechanism for the amplification of the ERBB2 oncogene, which can reach as many as 50 copies (9,10). However, detailed investigations would be required to determine whether BFB cycles alone can produce such high-copy amplification. Broken ends generated during BFB cycles can be stabilized by invading other chromosomes and initiating break-induced replication (BIR) toward the end of the chromosomes, where telomeres will be added to the broken ends. If a telomere is added to the broken end after a few cycles, BFB cycles will cease and copy number gains would be limited.
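The breakage-fusion-bridge cycle described above can be illustrated with a toy model. The Python sketch below is not taken from the study: the segment labels, the function names fuse_and_break and simulate_bfb, and the rule of following the larger daughter (a crude stand-in for selection) are all assumptions made for illustration. A chromosome arm is a list of segments read from the centromere outward; fusion of sister chromatids at the broken end gives a palindromic bridge, and a break between the two centromeres splits it into two daughters.

```python
import random
from collections import Counter


def fuse_and_break(arm, cut):
    """One breakage-fusion-bridge cycle on a single chromosome arm.

    arm: segment labels ordered from the centromere to the broken end.
    cut: break position within the bridge, counted from the first centromere.
    Returns the two daughter arms, each ordered from its own centromere.
    """
    if not arm:
        raise ValueError("arm must contain at least one segment")
    # Fusion: sister chromatids join at the broken end, giving a palindromic
    # bridge between two centromeres (the arm followed by its inverted copy).
    bridge = arm + arm[::-1]
    if not 0 < cut < len(bridge):
        raise ValueError("cut must fall between the two centromeres")
    # Breakage: each daughter keeps the part attached to its own centromere.
    daughter_a = bridge[:cut]
    daughter_b = bridge[cut:][::-1]  # re-read from its centromere outward
    return daughter_a, daughter_b


def simulate_bfb(arm, cycles, seed=0):
    """Repeat BFB cycles with random breaks, following the larger daughter.

    Following the larger daughter is a hypothetical simplification of
    selection for copy-number gain, not a rule stated in the text.
    """
    rng = random.Random(seed)
    for _ in range(cycles):
        cut = rng.randint(1, 2 * len(arm) - 1)
        arm = max(fuse_and_break(arm, cut), key=len)
    return arm


# A break near the second centromere: one daughter gains an inverted copy
# of the segments next to the fusion point, the other loses them.
gain, loss = fuse_and_break(list("abcd"), cut=6)
print(gain, loss)
print(Counter(simulate_bfb(list("abcd"), cycles=3)))
```

With cut=6, one daughter carries c and d twice in inverted orientation while the other has lost both, which mirrors the duplication in one daughter cell and deletion in the other described in the text.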
To address the questions, we have focused on a rapid gene amplification mode, rolling-circle replication (RCR), and have developed RCR-based model systems in yeast or mammalian cells (11,12). These RCR-inducible systems could produce extra-/intra-chromosomal amplifications as observed in cancer cells and drug-resistant cultured cells. Furthermore, the RCR processes involved intensive rearrangements, such as inversions and deletions, and we demonstrated the direct association between RCR and intensive rearrangements using yeast 2μ plasmid (13). The RCR-dependent rearrangement was the first mechanistic description of the evolution of intra-chromosomal amplicon, strongly suggesting that RCR play an important role in the late-stage of gene amplification in mammalian cells. These previous studies indicated that IRs play a critical role in both palindromic gene amplification and RCR-based amplification. Despite the role of IRs in both processes, little is known about the interaction between palindromic gene amplification and RCR. The synergy between the two processes would promote high-copy genomic amplification. To test the possibility, we placed complex inverted structures in a yeast chromosome and studied the natural amplification processes. We found that (i) the IRs are associated with replication stress; (ii) amplification is initiated by palindromic duplication via the IRs; (iii) the resulting repetitive structures cause RCR through tandem pairs of the repeat. Thus, IRs play multiple roles in genomic amplification. MATERIALS AND METHODS Yeast strains Yeast strain LS20 was used as the parental host strain, whose genotype and genomic are described previously (14). Plasmids for creating complex repeat structures were constructed as indicated in Supplementary Figures S6 and S7. The engineered genomic region was used for gene amplification systems in our previous works (11,12). 
The 3.1-kb sequence partially overlaps only with the IRC7 gene, which is not deleted or mutated. Although the amplicons can express Irc7p (intra-/extra-chromosomal products) and a putative protein of unknown function, Yfr057w (extra-chromosomal), there is no apparent effect on growth, viability, and morphology in this work and our previous works (11,12). To construct mutant strains, yeast knockout clones were purchased from Open Biosystems and each kanMX cassette was PCR-amplified with primers listed in Supplementary Table S1. Cells were transformed with the PCR fragments for targeting using Frozen-EZ Yeast Transformation II Kit (Zymo Research) and selected with 200 or 300 μg/ml of G418. Selection of yeast cells with gene amplification The FRFR/FFFR cells were grown in synthetic medium containing 2% glucose, lacking uracil and lysine (nonselective medium) to mid-log phase, harvested by centrifugation, washed twice with sterile distilled water, and plated at various dilutions onto equivalent medium lacking leucine and tryptophan (selective medium). The nonselective and selective media for the FRF/FFF/RRF cells contain lysine and tryptophan, respectively. Cells were also plated onto the nonselective medium to measure the number of viable cells. Cells were grown at 25°C. Colonies were counted after 5 or 6 days of growth. Gel electrophoresis and Southern analysis Pulsed-field gel electrophoresis was performed as described previously (12). The auto-algorithm mode of the CHEF Mapper XA system was used with the size range of 150–800 kb (Figures 1B–D and 5, Supplementary Figure S3B and C) and 10–60 kb (Figure 4A, Supplementary Figures S2A, C and S3A). Two-dimensional gel electrophoresis was carried out as described previously (15). Southern analyses were performed with DIG labeling and detection systems (Roche) according to the manufacturer's instructions. Probes were labeled using PCR DIG Probe Synthesis Kit (Roche). 
Hybridization, washing, and detection were performed using DIG Easy Hyb and DIG Wash and Block Buffer Set (Roche).

Figure 1. Intra-/extra-chromosomal amplification produced from a complex IR structure. (A) Schematic structures of FR and FRFR constructs. The leu2d and trp1d genes are amplification markers. A 3.1-kb genomic sequence (gray arrow) was PCR-amplified to construct IR structures. The frequency of Leu+ (for FR) or Leu+Trp+ (for FRFR) colony formation was plotted. (B–D) Southern blots of chromosomal DNA with the leu2d probe (B), telomeric probe (C), or centromeric probe (D). The samples marked in red and green indicate intra- and extra-chromosomal products, respectively. The gray lane showed no sign of amplification, suggesting Leu+ conversion between the leu2d marker and the mutated original leu2 allele on chr III. Black asterisks on the right side of panels indicate separation limit under the PFGE-condition. M: S. cerevisiae marker; P: the parental strain, LS20; NS: non-selective conditions; EC: extra-chromosomal products.

Array CGH analysis

Genomic DNA from colony #17 and the parental strain, LS20, was extracted using DNeasy Blood & Tissue Kit (Qiagen). The extracted DNA was fluorescently labeled, hybridized onto Nimblegen 385K array for S. cerevisiae, and scanned by Hokkaido System Science.

RESULTS

Inverted structures constructed in a yeast chromosome produced extra- and intra-chromosomal amplification

To examine the role of IRs in initiating gene amplification, we first constructed yeast strains that harbor IRs with amplification markers in chromosome VI (FR and FRFR haploid strains, Figure 1A, left). To effectively mimic inverted repeats formed during BFB cycles or found in eukaryotic genomes, we inferred that several kilobase-size fragments were required for the repeat structures. The repeat length of IRs is 3.1 kb, while the spacers between repeats are ∼3 kb in size. The amplification markers with truncated promoters, leu2d and trp1d, can complement leucine and tryptophan auxotrophy when amplified, respectively (16). Two strains were constructed. The FR strain has IRs with the leu2d marker between the repeats. The FRFR strain has two pairs of IRs; the pair on the centromeric side carries the leu2d marker, whereas the pair on the telomeric side carries the trp1d marker. There are no essential genes between the IRs and telomere (4.4 kb), so the genomic environment does not affect the rearrangement frequency. The FRFR strain formed 4.3 ± 2.0 colonies per 10^6 plated cells on leucine/tryptophan-omitted plates, while the FR construct produced ∼8.0-fold fewer colonies under Leu- selection (5.4 colonies per 10^7 plated cells, Figure 1A, right). Next, colonies from the FRFR strain were analyzed by Southern hybridization of uncut DNA separated by pulsed-field gel electrophoresis (PFGE, Figure 1B).
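The fold difference between the two strains follows directly from the colony counts quoted above once both are expressed per plated cell. The short Python check below is only a worked calculation using the mean values from the text; the helper name colony_frequency is introduced here for illustration, and the ± 2.0 spread reported for FRFR is not propagated.

```python
def colony_frequency(colonies, plated_cells):
    """Colonies formed per plated cell."""
    return colonies / plated_cells


# Mean values quoted in the text (Figure 1A, right).
frfr = colony_frequency(4.3, 1e6)  # FRFR strain, Leu-/Trp- selection
fr = colony_frequency(5.4, 1e7)    # FR strain, Leu- selection

# Ratio of the two frequencies; the text reports this as roughly 8.0-fold.
fold_difference = frfr / fr
print(f"FRFR: {frfr:.1e}  FR: {fr:.1e}  fold: {fold_difference:.1f}")
```

Writing both values on the same per-cell scale makes the comparison explicit: 4.3 per 10^6 cells is 43 per 10^7 cells, against 5.4 per 10^7 cells for the FR construct.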
Under non-selective conditions (NS, Leu+/Trp+ plates), the leu2d probe detects the modified chromosome VI (∼290-kb) in addition to chromosome III (∼360-kb) carrying leu2 fragments in the parental host strain, LS20 (14). In contrast, two types of products with increased signal intensities were seen under Leu-/Trp- selection. The samples with 30- to 40-kb leu2d products (green lanes) retained the modified chromosome VI (∼290-kb), suggesting that additional small chromosomes harboring leu2d accumulate extra-chromosomally in a cell (extra-chromosomal amplification). The samples marked in red showed strong leu2d signals above the separation limit and at the well position and lost the signals from chromosome VI (∼290-kb), indicating extensive amplification within chromosome VI (intra-chromosomal amplification). The repeat structure also produced some variation among survivors. We assume that clone 22 underwent gene conversion between the original leu2 on chromosome III and the leu2d gene in the constructed repeats. For clone 24, we speculate that the extra-chromosomal amplification was accompanied by a subsequent deletion of the leu2d gene from chromosome VI.

Intra-chromosomal amplification by BFB cycles

To precisely map the amplified genomic segments (amplicons), we designed probes on the telomeric (YFR057W) and centromeric (RET2) side of the IRs to determine the co-amplification of the flanking genomic segments (Figure 1C and D). The extra-chromosomal products (green) exhibited high-intensity signals with the telomeric probe, indicating that the telomeric segment is co-amplified. In contrast, intra-chromosomal amplicons carried neither probe at high intensity; the telomeric probe did not hybridize to any fragments, whereas weak signals were detected both at the wells and above the resolution limit with the centromeric probe. This suggests that the high-copy amplification in intra-chromosomal products is limited to the segment flanked by two IRs.
To understand the mechanism underlying intra-chromosomal amplification, we used array comparative genome hybridization (CGH) and examined genome-wide copy number alterations (Figure 2A). From the genome-wide view, we noticed very high-level amplification of the markers that are ectopically inserted between two IRs, leu2 (chr III), URA3 (chr V), LYS5 (chr VII) and the ADH1 terminator (chr XV), along with a copy number gain of chromosome VI. On closer inspection of chromosome VI, we noticed a terminal deletion between the IRs and the telomere, which is consistent with the results from Southern blotting. (Please note that none of the probes for chromosome VI hybridize with the ectopically inserted metabolic markers.) Furthermore, the centromere-proximal regions showed copy number gains. The 350-kb region immediately centromeric to the IRs is amplified (green in Figure 2A, bottom), while the 500-kb region further centromeric to the amplified region shows a copy number gain that suggests amplification in a small sub-population of cells (light green). A terminal deletion accompanied by amplification of centromere-proximal segments has been attributed to palindromic duplication initiated by a chromosome break. Palindromic duplication of the rest of chromosome VI would generate a chromosome with two centromeres. A dicentric chromosome will break during chromosome segregation, which could initiate another rearrangement. The centromeric boundary of the 350-kb region contains short palindromic sequences that could form a hairpin-capped end following a chromosome break under near-physiological conditions (Figure 2B). A chromosome with a hairpin-capped end becomes a palindromic duplication after DNA replication. The repeated palindromic duplication within chromosome VI strongly suggests a history of breakage-fusion-bridge (BFB) cycles. The lower copy number gain may result from an alternative or additional BFB event in a small subpopulation.

Figure 2. An array CGH analysis of an intra-chromosomal product. (A) DNA from colony #17 was co-hybridized with DNA from the parental strain, LS20, to an S. cerevisiae CGH microarray. Genome-wide overview (top) and the enlarged chrVI (bottom) are shown. The FRFR structure is flanked by copy number gains and a telomeric deletion. Note that the TRP1 gene is deleted in the parental strain, LS20, so no peak for the trp1d gene was detected. (B) A potential hairpin structure for chromatid fusion in a BFB cycle. The dicentric chromosome is illustrated based on the array CGH profile. The 200-bp region around the boundary (red dotted line) contains a palindromic pair of sequences, which could form a hairpin structure (Tm = 32.2°C) after resection of the DSB end. The hairpin-capped end and the next round of DNA replication form a dicentric chromosome with copy number gain.

Very frequent recombination within amplicons indicates the involvement of rolling-circle replication

The markers located between the IRs were amplified to a much higher level than the amplicons on chromosome VI. 
To understand the underlying mechanisms of this additional amplification, we next analyzed the structure of the amplified products in detail using Southern hybridization of BamHI-digested DNA (Figure 3A). Only one BamHI site is located within the IRs (Figure 3B, top), providing clear structural information.

Figure 3. Characterization of amplified chromosomal structures. (A) Southern blot of BamHI-digested DNA with the leu2d probe. The samples in Figure 1B–D were digested with BamHI. The samples marked in red and green indicate intra- and extra-chromosomal products, respectively. The sample marked in light gray showed no sign of amplification, suggesting Leu+ recombination between the leu2d marker and the mutated original leu2 allele on chr III. M: S. cerevisiae marker; P: the parental strain, LS20; NS: non-selective conditions. (B) A predicted process for intra-chromosomal amplification. The replication-based event forms a dicentric chromosome followed by BFB cycles, producing a chromosomal break that can initiate RCR. (C) Predicted structure of RCR-amplification products associated with intensive rearrangements. RCR-amplification could form an original repeat array that produces 24.4-kb BamHI fragments. RCR-associated inversions and deletions produce rearranged BamHI fragments (12.2, 15.3, 27.6, 36.8, 39.8 kb). IRs engaged in the inversions are marked with red bars. (D) A predicted process for extra-chromosomal amplification. The replication-based rearrangement event between the IRs marked with a bracket forms a 43-kb extra-chromosome. The BamHI map indicates a leu2d-containing fragment (black) and a non-hybridizing fragment (gray).

Unexpectedly, intra-chromosomal products showed several amplified BamHI fragments of distinct sizes (12, 15, 24, 28, 37 and 40 kb), among which the 24-kb fragment is the most commonly shared among clones (11/14 clones). Such variation in band pattern was observed in our previous amplification system based on rolling-circle replication (RCR) (11,12), and we demonstrated that RCR frequently induces recombination on yeast 2μ plasmids (13). From these facts, we hypothesize that a broken end generated during BFB cycles can initiate RCR by the invasion of repeat (a) into repeat (b) (Figure 3B). The resulting 24-kb repeat units contain a variety of direct or inverted repeats (Figure 3C). Recombination between repeats can cause deletion, duplication, or inversion, and thus the multimer of the 24-kb unit could be the ancestor of the other amplified BamHI fragments. 
As an example, an inversion between two distantly located repeats can create a 40-kb unit (Figure 3C, top right), and a deletion between two direct repeats crossing a BamHI site can create a 37-kb unit (top left). A 28-kb band can be efficiently produced by an inversion through nearby long homologous sequences in an inverted orientation (bottom left). Furthermore, another inversion between two distantly located repeats can create a 15-kb unit (bottom center), and a deletion occurring within the primary 24-kb unit can create a 12-kb unit (bottom right). This structural variation of amplicons among clones is also observed in RCR-amplified clones (12) and cancer cells (17), supporting the notion that hyper-amplification and the associated amplicon evolution are achieved by RCR following BFB cycles. In clones 19 and 28, smaller BamHI fragments (15 and 12 kb) are more prominently amplified than the 24-kb fragment (Figure 3A). These clones underwent two independent events for leu2d and trp1d amplification. As shown in Supplementary Figure S1B and C, the invasion of repeat (a) into repeat (b) could generate 12-kb BamHI repeat units, and inversions across the units can generate 15-kb fragments. Independently of the leu2d amplification, linear (clone 19, described later) or circular (clone 28) extra-chromosomes were formed (Supplementary Figure S1B–D). Most extra-chromosomal amplicons (green) produce ∼23-kb BamHI fragments (Figure 3A). This product can arise through a U-turn of a replication fork originating from the telomeric side (ARS610), generating 43-kb acentric mini-chromosomes (Figure 3D and Supplementary Figure S1E). The smaller (10-kb) BamHI fragment can arise from the 43-kb mini-chromosomes by a deletion between direct repeats (clone 24). Alternatively, U-turn replication via the most-centromeric and most-telomeric repeats can produce the 30-kb extra-chromosomes (Supplementary Figure S1E, clone 3). 
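The link between repeat-unit length and BamHI fragment size described for the RCR products can be illustrated with a toy in silico digest. The following Python sketch is only an illustration under stated assumptions: the unit lengths (24.4 and 12.2 kb) are taken from the Figure 3C legend, whereas the position of the BamHI site within the unit (`SITE_OFFSET_KB`), the number of copies and the function names are hypothetical. It models a simple head-to-tail array only and does not attempt to reproduce the inversions discussed in the text.

```python
def digest_linear(length_kb, sites_kb):
    """Return fragment lengths (kb) after cutting a linear molecule at the given positions."""
    bounds = [0.0] + sorted(sites_kb) + [length_kb]
    return [round(b - a, 1) for a, b in zip(bounds, bounds[1:])]


def tandem_array_sites(unit_kb, site_offset_kb, copies):
    """Positions (kb) of the single restriction site carried by each unit of a tandem array."""
    return [i * unit_kb + site_offset_kb for i in range(copies)]


SITE_OFFSET_KB = 5.0  # hypothetical position of the BamHI site within one unit
COPIES = 4            # hypothetical number of tandem copies


def internal_fragments(unit_kb):
    """Fragments lying between two BamHI sites in a tandem array of identical units."""
    sites = tandem_array_sites(unit_kb, SITE_OFFSET_KB, COPIES)
    fragments = digest_linear(unit_kb * COPIES, sites)
    return fragments[1:-1]  # drop the two end fragments of the array


primary = internal_fragments(24.4)  # original repeat array (Figure 3C legend)
deleted = internal_fragments(12.2)  # array of units shortened by an internal deletion

print(primary)
print(deleted)
```

With one site per unit, every internal fragment has the length of the repeat unit regardless of where the site lies, which is why a change in unit length through deletion shows up directly as a new band size on the Southern blot.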
Complex inverted structures promote hyper-amplification process and shape amplicon structures in response to genetic selection

IRs are a potent trigger for the formation of dicentric chromosomes that can initiate BFB cycles, and thereafter hyper-amplification by RCR requires a direct repeat on the dicentric chromosome. Hence, the form of a complex repeat structure could determine whether and how hyper-amplification events proceed. To clarify the structural requirement for hyper-amplification, we constructed four additional strains (Figure 4A, left). The FFFR strain has both leu2d and trp1d markers, but leu2d is flanked by direct repeats. The other three strains, FRF, FFF and RRF, have only the leu2d marker and three repeats. While FRF and RRF retain one pair of IRs, the three repeats are aligned as direct repeats in FFF.

Figure 4. Analysis of amplification products from variants of repeat structure. (A) Schematic structures of FFFR, FRF, FFF and FRFR constructs. The leu2d and trp1d genes are amplification markers. A 3.1-kb genomic sequence (gray arrow) was PCR-amplified to construct IR structures. The frequency of Leu+ (for FR) or Leu+Trp+ (for FRFR) colony formation was plotted. (B–E) Southern blots of chromosomal DNA from FFFR (B), FRF (C), FFF (D) and RRF (E). The samples marked in red and green indicate intra- and extra-chromosomal products, respectively. The gray lanes showed no sign of amplification, suggesting Leu+ recombination between the leu2d marker and the mutated original leu2 allele on chr III. The blue samples suggest a moderate copy number increase of the leu2d gene, likely through unequal sister chromatid exchange. The yellow samples possibly contain a fusion between chromosomes VI and III, which causes Leu+ recombination between the leu2d marker and the mutated original leu2 allele on chr III. Black asterisks on the right side of the panels indicate the separation limit under the PFGE conditions. M: S. cerevisiae marker; P: the parental strain, LS20; NS: non-selective conditions; EC: extra-chromosomal products.

These strains were selected for amplification of leu2d, and the structures of the amplified leu2d and surrounding genomic regions were analyzed by PFGE-Southern analysis. First, the FFF strain produced few colonies on plates lacking leucine, suggesting that at least one pair of IRs is needed for amplification to occur efficiently (Figure 4A, right). Second, extra-chromosomal amplification was predominant and intra-chromosomal amplification became a minor class. 
In contrast to 54% (14/26) of colonies in the FRFR strain, less than 23% of colonies harbored intra-chromosomal amplification: 5/26 colonies in FFFR, 4/35 colonies in FRF and 8/35 colonies in RRF (Figure 4B–E). The extra-chromosomal amplicons are small (<50 kb) and have telomeric sequences, consistent with a linear dimer structure with telomeres at both ends (Supplementary Figures S2 and S3A). Intra-chromosomal amplicons are classified into three categories. First, the FRF strain showed hyper-amplification that generates 12- and 15-kb BamHI fragments, strongly suggesting the involvement of RCR (Supplementary Figure S3B). Second, the FFF and RRF constructs produced intra-chromosomal amplification as light smear signals, indicating lower stability of the amplified regions. The 6-kb BamHI fragment suggests that RCR formed the tandem repeats (Supplementary Figures S2A, S3C and D). FFF would require a spontaneous chromosome break to initiate RCR, while the palindromic duplication by IRs in RRF promotes RCR initiation (Supplementary Figure S3C and D). Third, a moderate level of intra-chromosomal amplification in FFFR showed heterogeneous patterns of BamHI fragments and copy number gains of the centromeric region (RET2, Supplementary Figure S2C). This observation suggests that BFB cycles involved fold-back priming at endogenous IRs on the centromeric side of the inserted IRs. The BamHI-digestion patterns of intra-chromosomal FFFR products were distinct from those of RRF (Supplementary Figure S2A), although these strains are structurally similar (direct repeats with a single IR). A key difference between these strains is the selection method; the FFFR strain was selected on Leu-/Trp- plates and the RRF strain on Leu- plates. Thus, complex repeat arrangements and genetic selection can generate a variety of amplicons. Furthermore, these results support the notion that multiple IRs facilitate efficient RCR and stable maintenance of amplified regions. 
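The proportions of intra-chromosomal amplification quoted for each construct follow directly from the colony counts. As a quick arithmetic check, the short Python sketch below recomputes them; the counts are those reported in the text, while the variable and function names are ours and purely illustrative.

```python
# Colonies with intra-chromosomal amplification / colonies analyzed,
# as reported in the text for each construct.
intra_counts = {
    "FRFR": (14, 26),
    "FFFR": (5, 26),
    "FRF": (4, 35),
    "RRF": (8, 35),
}


def intra_percent(hits, total):
    """Percentage of analyzed colonies that carry intra-chromosomal amplification."""
    return 100.0 * hits / total


percentages = {strain: intra_percent(*counts) for strain, counts in intra_counts.items()}

for strain, pct in percentages.items():
    print(f"{strain}: {pct:.1f}%")
```

The FRFR value rounds to 54%, and the three other constructs all fall below 23%, in line with the statement that intra-chromosomal amplification became a minor class in these strains.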
These repeat structures also produced other types of surviving colonies (Supplementary Figure S2D–G), probably through Leu+ conversion (gray lanes), moderate amplification via unequal sister chromatid exchange (blue lanes), and Leu+ recombination forming a fusion chromosome III-VI (yellow lanes). These strains, containing different sets of amplification markers, were selected in different ways: the RRF strain on Leu- plates and the FFFR strain on Leu-/Trp- plates. Thus, in response to the difference in genetic selection, complex inverted structures diversely shape amplicon structure.

Chromosomal rearrangements in the constructed repeats were independent of homologous recombination pathways

Similar types of amplification were produced in the systems we previously constructed, which used inducible chromosome breaks within repeats (11) or an association between replication and inducible recombination (12). Additionally, other groups showed that fusions of nearby inverted repeats formed acentric and dicentric chromosomes, either depending on homologous recombination proteins (3,5,18) or, in an alternative model, through faulty switching of replication templates (4,19,20). Here we carried out genetic analysis to better understand the molecular mechanisms of the amplification processes. We first examined the role of homologous recombination, which engages in a wide range of genomic rearrangements between direct repeats, inverted repeats and sister chromatids (21–23). Deletion of the core homologous recombination (HR) factors, RAD52 or RAD51, had little impact on the frequency of colony formation (Figure 5A, green), but the products in the rad52 mutant were exclusively extra-chromosomal (Figure 5B). The lack of intra-chromosomal products suggests that Rad52 is required at some step in intra-chromosomal amplification, such as BFB cycles.

Figure 5. Genetic analysis of gene amplification from the FRFR construct. (A) Frequencies of colony formation in the indicated mutants with the FRFR construct. Mutations were introduced into the FRFR strain, and colony formation was evaluated relative to the FRFR strain. Means and medians from three independent experiments are shown as box plots. A paired t-test was used to evaluate the effect of the mutations on the amplification events (*P< 0.05, **P< 0.01). (B) Types of amplification products in mutants. Representative colonies were analyzed by Southern hybridization with the leu2d probe as in Figure 1B. The numbers of identified products are indicated. (C) Repeat-associated replication stress assessed by 2D analysis. The analyzed region is derived from the original chromosome VI. The late-firing origin ARS609 is more efficient than the telomere-proximal origin ARS610. (D) A possible model generating acentric/dicentric chromosomes. (Top) Replication forks from origins on the telomeric (left) and centromeric (right) side. Inverted structures are shown above the replication forks. Replication fork slowdown could potentially cause fork regression, which can be resolved by the Mus81 endonuclease. The regressed forks could be resected by an exonuclease and invade an ectopic inverted sequence. The intermediates of ectopic invasion may be resolved by Rad27 (yeast FEN1). Steps linking dicentric chromosomes to intra-chromosomal amplification require a recombinational pathway involving Mus81 and Rad52. (E) Colony formation and product type from a Δpol32 mutant. Means and standard deviations are indicated.

Complex inverted repeats were associated with replication stress

We next focused on another cause of chromosomal rearrangement, DNA replication stress. Repetitive sequences, such as triplet repeats and palindromic inverted Alu sequences, have been shown to stall replication forks (24–29). To monitor replication fork movement across the dispersed repeats with kilobase-sized spacers, we used two-dimensional (2D) agarose gel electrophoresis (15). There are two replication origins on each side of this region. The late-firing origin ARS609 is more efficient than the telomere-proximal origin ARS610 (30). The hybridization probe specifically detected the identical XhoI fragments from the FRFR and parental strains. 
The parental strain showed an even Y-arc signal, indicating that replication forks passed this segment without a problem (Figure 5C). In contrast, the large Y-arc was more intense than the small one in the FRFR strain, indicating replication stress within the FRFR repeats. Furthermore, the FFF strain also showed an accumulation of large-Y molecules (Supplementary Figure S4A), suggesting that complex repeat structures, regardless of their orientation, could be associated with replication stress, leading to genome rearrangement or gene amplification. Stalled forks are generally protected or stabilized by checkpoint proteins, mutations of which can cause collapsed forks and deleterious chromosomal rearrangements (4,19,31). Consistently, in the strain lacking the checkpoint kinase Chk1, we observed a 4-fold increase in colony formation (Figure 5A, blue). The distribution of intra- and extra-chromosomal amplification was not affected by the CHK1 deletion (Figure 5B), suggesting that collapsed forks could become substrates for both types of amplification.

Repeat-mediated chromosomal rearrangements were suppressed by structure-specific nucleases, Mus81 and Rad27

Next, we tested the genes for processing stalled forks. One of the key molecules that can process stalled forks is the structure-specific nuclease Mus81, which produces one-ended DNA breaks for a restart of productive replication (32–34). Surprisingly, the deletion of MUS81 dramatically enhanced colony formation (∼70-fold), indicating that Mus81p prevents the chromosomal rearrangement leading to amplification (Figure 5A, purple). Mus81 interacts with another structure-specific endonuclease, Rad27 (yeast FEN1) or Slx1-Slx4 (35–37). In our system, mutation of the RAD27 flap endonuclease dramatically increased Leu+/Trp+ colonies (∼60-fold), while the slx1 mutation had no effect. An E3 ubiquitin ligase, Rad18, is recruited to stalled forks and is involved in other types of processing. 
Rad18 mono-ubiquitinates PCNA in response to fork stalling (38,39) and loads translesion synthesis (TLS) DNA polymerase η onto stalled forks (40). Rad18 can also facilitate hemicatenane formation and promote gap filling in the lagging strand behind the replication fork (39,41). The ∼20-fold increase in colony formation in a rad18 mutant suggests that the TLS pathway and/or hemicatenane formation contribute to suppression of the chromosomal rearrangements (Figure 5A, orange). Based on these genetic analyses and earlier studies, we propose that reversed forks could be involved in the chromosomal rearrangements (Figure 5D). Replication stress, manifested as fork slowdown or stalling, could accumulate positive supercoils ahead of the forks, and this can induce fork regression (42). Oncogene-induced replication stress can cause replication fork slowdown and induce fork reversal, and reversed forks accumulate in MUS81-depleted cells (43). MUS81 depletion slows replication forks in human cells (44), and FEN1 is required for efficient re-initiation of stalled replication forks at telomeres in human cells (45). In our system, the deletion of MUS81 did not affect replication stress in the FRFR repeats (Supplementary Figure S4A), implying that Mus81p could act mainly in the processing of stalled forks and/or in restarting forks in concert with Rad27p. Previously, Paek et al. proposed a model in which regressed forks cause chromosomal rearrangement mediated by endogenous long terminal repeat (LTR) sequences. Our model proposes that ectopic invasion from regressed forks generates acentric and dicentric chromosomes (Figure 5D and Supplementary Figure S4B).

The intra-chromosomal amplification relies on Rad52 and Mus81

The chromosome breaks formed in dicentric chromosomes would play important roles in initiating RCR. 
We proposed here that a resected broken end invades a directly oriented intra-chromosomal repeat and thereby initiates RCR, and we tested the involvement of break-induced replication (BIR) in the RCR amplification. Unexpectedly, however, a mutant of POL32, a subunit of DNA polymerase δ that is required for BIR, could still produce intra-chromosomal amplification, although the frequency of colony formation decreased moderately (Figure 5E and Supplementary Figure S5). The smear signals and small copy number gains in some Δpol32 colonies suggest that efficient RCR-amplification requires Pol32 (Supplementary Figure S6, gray lanes). Consistently, previous studies showed that Pol32-independent BIR can cause chromosomal rearrangements and that Pol32 affects the efficiency (46,47). The distribution of amplification forms in this study indicates the dependency of RCR on Rad52 and Mus81 (Figure 5B and Supplementary Figure S5). Previously, the Pol32-independent rearrangement was shown to depend on recombination proteins including Rad52 (46,47). Mus81 is proposed to cleave the D-loop after strand invasion to establish replication forks, and was shown to promote BIR (48). Collectively, it remains to be characterized how the broken end initiates RCR and which proteins are required for the process.

DISCUSSION

Here we constructed complex inverted structures and thereby traced gene amplification events seen in tumorigenesis or drug resistance: a palindromic duplication followed by BFB cycles, extra-/intra-chromosomal amplification, and hyper-amplification accompanied by intensive rearrangements (Figure 6).

Figure 6. A schematic model for BFB–RCR amplification. Replication stress slows fork movement so that a fraction of replication forks can regress. If regressed forks are not rescued appropriately, ectopic DNA strand invasions can potentially result in dicentric chromosome formation. Dicentric chromosomes initiate BFB cycles, during which a chromosome break can trigger RCR hyper-amplification. RCR and its associated intensive rearrangements can generate a variety of amplicons according to the combination of repeat pairs and genetic selection.

The RCR process can highly amplify a genomic segment even in a single cell cycle, whereas BFB cycles require cell divisions and thus take a long time to achieve high-level amplification. This rapid and dynamic process can arise in a small population, providing heterogeneous hyper-amplified cells with amplicon variations. Genomic variation is the basis of intra-tumor heterogeneity, a key phenotype underlying tumor malignancy including metastasis, recurrence, and drug resistance (49,50). It has been proposed that rolling-circle replication (RCR) possibly participates in amplification events (2,51). However, RCR is a difficult process to study because of its transient nature. To overcome this difficulty, we have developed inducible systems for RCR-type amplification in yeast or mammalian cells (11,12), and determined a characteristic feature of RCR, intensive rearrangement in amplified regions (13). 
In this study, the semi-natural inverted structure (not depending on inducible replication fork blocking or DNA double-strand breaks) was used as an effective approach to trace RCR-amplification and associated rearrangements, instead of analyzing complex end products formed endogenously. The semi-natural setting could cause replication stress, seen as fork slowdown in the 2D analysis. An elegant system of inducible fork blocking in S. pombe has been reported to produce acentric or dicentric chromosomes in concert with inverted repeats, depending on the recombination proteins Rhp51 (Rad51 in S. cerevisiae), Rad22 (Rad52), and Rhp54 (Rad54) (3,18). A tumor suppressor, BRCA1, is also engaged in homologous recombination at inducible fork stalling on plasmid DNA in mammalian cells (52). In our genetic analysis, however, rad51/52 deletions did not affect the frequency of amplification, suggesting that the processes observed here differ from those in the inducible fork-block systems. Aside from the acentric/dicentric chromosome formation, Rad52p and Mus81 would be required for processing chromosomal ends during BFB cycles or initiating RCR (Figure 5B). Another line of research genetically analyzed chromosomal rearrangements between endogenous long terminal repeat (LTR) sequences, presenting a replication-based model, faulty template switching (4,19). The LTR-mediated rearrangements were enhanced in rad9, rad51 and rad52 mutants, whereas the frequency of amplification in this study was reduced in the rad9 mutant (Figure 5A and Supplementary Figure S4C) and not affected by rad51/52 mutations (Figure 5A). The frequency of LTR-mediated rearrangements increased moderately with the mus81 mutation, which dramatically enhanced colony formation (∼70-fold) in our system. The differences in genetic response suggest a diversity of molecular mechanisms mediating the chromosome rearrangements. 
Here we proposed a fork regression model for the chromosomal rearrangements (Figure 5D). Replication stress, manifested as fork slowdown or stalling, could accumulate positive supercoils ahead of the forks, inducing fork regression (42). Oncogene-induced replication stress can cause replication fork slowdown and induce fork reversal, and reversed forks accumulate upon MUS81 depletion (43). A point to be resolved is the RAD51 dependency of the process. Recently, RAD51 was shown to mediate fork regression under sub-lethal genotoxic drug treatments, whereas RAD51 depletion did not reduce fork regression under untreated conditions (53,54). Consistently, in our genetic analysis, the rad51 mutation did not affect the frequency of colony formation. These findings indicate that the RAD51 dependency of natural fork regression in untreated cells still has to be determined. We also presented an alternative model associated with single-strand DNA (ssDNA) gaps behind replication forks (Supplementary Figure S4B, bottom left). This process at uncoupled forks may support the independence from Rad51 (55). Alternatively, stochastic DNA replication errors may generate the initial recombinogenic lesion. The genomic region into which the complex repeats were inserted is reported to be a very late-replicating region, prone to generating chromosomal breaks. If the breaks can be repaired by a RAD51-independent pathway, the orientation of the repeats would affect recombination outcomes, leading to BFB cycles in the case of inverted repeats. Mus81 is a structure-specific endonuclease (56) with diverse functions, including replication fork restart (32,57,58) and the resolution of recombination intermediates (59,60). Mus81 plays important roles in processing stalled or slowed forks, and MUS81 depletion slows replication forks and causes regressed forks to accumulate in human cells (43,44). Mus81 can cleave 3′-flap structures formed by the invading DNA strands of D-loop structures (61). 
We propose that Mus81 resolves regressed or uncoupled forks to facilitate fork restarts free from ectopic DNA-strand invasions (Figure 5D and Supplementary Figure S4B, red arrowheads). The Mus81 complex physically and functionally interacts with Rad27 (yeast FEN1); mutations in MUS81 and RAD27 are synthetically lethal (62) and the two nucleases stimulate each other (35). FEN1 is a structure-specific endonuclease that cleaves 5′-flap structures, which could be produced during DNA strand invasion (Figure 5D and Supplementary Figure S4B, red arrowheads). FEN1 not only participates in Okazaki fragment processing during lagging-strand DNA replication (63) but also has gap endonuclease activity (64), which may act similarly to Mus81. These endonucleases would prevent complex inverted structures from causing regressed forks or ectopic template switching. Surprisingly, mutation of Rad9, a DNA damage checkpoint protein, reduced colony formation (Supplementary Figure S4C). Rad9 has previously been shown to suppress the activities of some exonucleases, including Exo1, particularly at telomeres (65). Thus, Rad9 may suppress a pathway (e.g. Exo1) that faithfully resumes regressed or uncoupled forks (Supplementary Figure S4B), so that such unrepaired forks are instead processed harmfully by Rad9-independent pathways. This hypothesis may explain the negative impact of Rad9 loss on the chromosome rearrangements presented here. In this study, the array CGH data indicated that the palindromic duplications were followed by BFB cycles. We previously demonstrated that the fusion step of BFB is mediated by hairpin-capped ends, which are key intermediates for palindromic chromosomal rearrangements. The centromeric boundary of the higher copy number gain in Figure 2 contains short palindromic sequences that can potentially form hairpin-capped ends under near-physiological conditions (Figure 2B). BFB cycles provide an ideal structural feature for RCR initiation (Figure 6). 
The RCR process can be induced by repetitive elements, including long/short interspersed nuclear elements (LINEs/SINEs), which are highly abundant in the human genome (66). Through a combination of repeats and genetic selection, RCR can create a variety of amplicons, as shown in Figure 4. Repetitive elements could further mediate amplicon evolution through the associated rearrangements. Among the resulting amplicon variants, some may confer much higher resistance to chemotherapy, whereas others may enhance the metastatic ability of cancer cells. Thus, RCR can be a central process that links BFB cycles to complex end products and diversifies amplification products for cancer evolution. SUPPLEMENTARY DATA Supplementary Data are available at NAR Online. ACKNOWLEDGEMENTS We thank D. Butler (Montana State University-Billings) for providing a yeast strain, NIBB Core Research Facilities for technical support, and Dr Horiuchi's lab members for critical discussions. FUNDING Japan Society for the Promotion of Science (JSPS), Grant-in-Aid for Young Scientists [21770194 to T.W.]; National Cancer Institute [R01CA149385 to H.T.]; Grant for Research Projects from Hayama Center for Advanced Studies (to T.H. and T.W.); Kanzawa Medical Research Foundation (to T.W.); Cedars-Sinai Medical Center (to H.T.). Funding for open access charge: Tokai University. Conflict of interest statement. None declared. REFERENCES 1. Debatisse M., Malfoy B. Gene amplification mechanisms. Adv. Exp. Med. Biol. 2005; 570:343–361. 2. Narayanan V., Mieczkowski P.A., Kim H.M., Petes T.D., Lobachev K.S. The pattern of gene amplification is determined by the chromosomal location of hairpin-capped breaks. Cell. 2006; 125:1283–1296. 3. Mizuno K., Lambert S., Baldacci G., Murray J.M., Carr A.M. 
Nearby inverted repeats fuse to generate acentric and dicentric palindromic chromosomes by a replication template exchange mechanism. Genes Dev. 2009; 23:2876–2886. 4. Paek A.L., Kaochar S., Jones H., Elezaby A., Shanks L., Weinert T. Fusion of nearby inverted repeats by a replication-based mechanism leads to formation of dicentric and acentric chromosomes that cause genome instability in budding yeast. Genes Dev. 2009; 23:2861–2875. 5. Mizuno K., Miyabe I., Schalbetter S.A., Carr A.M., Murray J.M. Recombination-restarted replication makes inverted chromosome fusions at inverted repeats. Nature. 2013; 493:246–249. 6. Deng S.K., Yin Y., Petes T.D., Symington L.S. Mre11-Sae2 and RPA collaborate to prevent palindromic gene amplification. Mol. Cell. 2015; 60:500–508. 7. Tanaka H., Tapscott S.J., Trask B.J., Yao M.C. Short inverted repeats initiate gene amplification through the formation of a large DNA palindrome in mammalian cells. Proc. Natl. Acad. Sci. U.S.A. 2002; 99:8772–8777. 8. McClintock B. The stability of broken ends of chromosomes in Zea Mays. Genetics. 1941; 26:234–282. 9. Marotta M., Chen X., Watanabe T., Faber P.W., Diede S.J., Tapscott S., Tubbs R., Kondratova A., Stephens R., Tanaka H. Homology-mediated end-capping as a primary step of sister chromatid fusion in the breakage-fusion-bridge cycles. Nucleic Acids Res. 2013; 41:9732–9740. 10. Marotta M., Onodera T., Johnson J., Budd G.T., Watanabe T., Cui X., Giuliano A.E., Niida A., Tanaka H. Palindromic amplification of the ERBB2 oncogene in primary HER2-positive breast tumors. Sci. Rep. 2017; 7:41921. 11. 
Watanabe T., Horiuchi T. A novel gene amplification system in yeast based on double rolling-circle replication. EMBO J. 2005; 24:190–198. 12. Watanabe T., Tanabe H., Horiuchi T. Gene amplification system based on double rolling-circle replication as a model for oncogene-type amplification. Nucleic Acids Res. 2011; 39:e106. 13. Okamoto H., Watanabe T.A., Horiuchi T. Double rolling circle replication (DRCR) is recombinogenic. Genes Cells. 2011; 16:503–513. 14. Butler D.K., Yasuda L.E., Yao M.C. Induction of large DNA palindrome formation in yeast: implications for gene amplification and genome stability in eukaryotes. Cell. 1996; 87:1115–1122. 15. Friedman K.L., Brewer B.J. Analysis of replication intermediates by two-dimensional agarose gel electrophoresis. Methods Enzymol. 1995; 262:613–627. 16. Erhart E., Hollenberg C.P. The presence of a defective LEU2 gene on 2 mu DNA recombinant plasmids of Saccharomyces cerevisiae is responsible for curing and high copy number. J. Bacteriol. 1983; 156:625–635. 17. Kawagoe H., Kandilci A., Kranenburg T.A., Grosveld G.C. Overexpression of N-Myc rapidly causes acute myeloid leukemia in mice. Cancer Res. 2007; 67:10677–10685. 18. Lambert S., Mizuno K., Blaisonneau J., Martineau S., Chanet R., Fréon K., Murray J.M., Carr A.M., Baldacci G. Homologous recombination restarts blocked replication forks at the expense of genome rearrangements by template exchange. Mol. Cell. 2010; 39:346–359. 19. Kaochar S., Shanks L., Weinert T. 
Checkpoint genes and Exo1 regulate nearby inverted repeat fusions that form dicentric chromosomes in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. U.S.A. 2010; 107:21605–21610. 20. Paek A.L., Jones H., Kaochar S., Weinert T. The role of replication bypass pathways in dicentric chromosome formation in budding yeast. Genetics. 2010; 186:1161–1173. 21. Thomas B.J., Rothstein R. The genetic control of direct-repeat recombination in Saccharomyces: the effect of rad52 and rad1 on mitotic recombination at GAL10, a transcriptionally regulated gene. Genetics. 1989; 123:725–738. 22. Bai Y., Symington L.S. A Rad52 homolog is required for RAD51-independent mitotic recombination in Saccharomyces cerevisiae. Genes Dev. 1996; 10:2025–2037. 23. Fasullo M., Giallanza P., Dong Z., Cera C., Bennett T. Saccharomyces cerevisiae rad51 mutants are defective in DNA damage-associated sister chromatid exchanges but exhibit increased rates of homology-directed translocations. Genetics. 2001; 158:959–972. 24. Voineagu I., Narayanan V., Lobachev K.S., Mirkin S.M. Replication stalling at unstable inverted repeats: interplay between DNA hairpins and fork stabilizing proteins. Proc. Natl. Acad. Sci. U.S.A. 2008; 105:9936–9941. 25. Zhang Y., Saini N., Sheng Z., Lobachev K.S. Genome-wide screen reveals replication pathway for quasi-palindrome fragility dependent on homologous recombination. PLoS Genet. 2013; 9:e1003979. 26. Viterbo D., Michoud G., Mosbach V., Dujon B., Richard G.F. Replication stalling and heteroduplex formation within CAG/CTG trinucleotide repeats by mismatch repair. DNA Repair (Amst.). 2016; 42:94–106. 27. 
Shishkin A.A., Voineagu I., Matera R., Cherng N., Chernet B.T., Krasilnikova M.M., Narayanan V., Lobachev K.S., Mirkin S.M. Large-scale expansions of Friedreich's ataxia GAA repeats in yeast. Mol. Cell. 2009; 35:82–92. 28. Gerhardt J., Tomishima M.J., Zaninovic N., Colak D., Yan Z., Zhan Q., Rosenwaks Z., Jaffrey S.R., Schildkraut C.L. The DNA replication program is altered at the FMR1 locus in fragile X embryonic stem cells. Mol. Cell. 2014; 53:19–31. 29. Voineagu I., Surka C.F., Shishkin A.A., Krasilnikova M.M., Mirkin S.M. Replisome stalling and stabilization at CGG repeats, which are responsible for chromosomal fragility. Nat. Struct. Mol. Biol. 2009; 16:226–228. 30. Siow C.C., Nieduszynska S.R., Muller C.A., Nieduszynski C.A. OriDB, the DNA replication origin database updated and extended. Nucleic Acids Res. 2012; 40:D682–D686. 31. Lambert S., Carr A.M. Impediments to replication fork movement: stabilisation, reactivation and genome instability. Chromosoma. 2013; 122:33–45. 32. Hanada K., Budzowska M., Davies S.L., van Drunen E., Onizawa H., Beverloo H.B., Maas A., Essers J., Hickson I.D., Kanaar R. The structure-specific endonuclease Mus81 contributes to replication restart by generating double-strand DNA breaks. Nat. Struct. Mol. Biol. 2007; 14:1096–1104. 33. Rass U. Resolving branched DNA intermediates with structure-specific nucleases during replication in eukaryotes. Chromosoma. 2013; 122:499–515. 34. Osman F., Whitby M.C. Exploring the roles of Mus81-Eme1/Mms4 at perturbed replication forks. DNA Repair (Amst.). 2007; 6:1004–1017. 
35. Kang M.J., Lee C.H., Kang Y.H., Cho I.T., Nguyen T.A., Seo Y.S. Genetic and functional interactions between Mus81-Mms4 and Rad27. Nucleic Acids Res. 2010; 38:7611–7625. 36. Thu H.P., Nguyen T.A., Munashingha P.R., Kwon B., Dao Van Q., Seo Y.S. A physiological significance of the functional interaction between Mus81 and Rad27 in homologous recombination repair. Nucleic Acids Res. 2015; 43:1684–1699. 37. Wyatt H.D., Sarbajna S., Matos J., West S.C. Coordinated actions of SLX1-SLX4 and MUS81-EME1 for Holliday junction resolution in human cells. Mol. Cell. 2013; 52:234–247. 38. Kannouche P.L., Wing J., Lehmann A.R. Interaction of human DNA polymerase eta with monoubiquitinated PCNA: a possible mechanism for the polymerase switch in response to DNA damage. Mol. Cell. 2004; 14:491–500. 39. Branzei D., Szakal B. DNA damage tolerance by recombination: Molecular pathways and DNA structures. DNA Repair (Amst.). 2016; 44:68–75. 40. Watanabe K., Tateishi S., Kawasuji M., Tsurimoto T., Inoue H., Yamaizumi M. Rad18 guides poleta to replication stalling sites through physical interaction and PCNA monoubiquitination. EMBO J. 2004; 23:3886–3896. 41. Branzei D., Vanoli F., Foiani M. SUMOylation regulates Rad18-mediated template switch. Nature. 2008; 456:915–920. 42. Olavarrieta L., Martínez-Robles M.L., Sogo J.M., Stasiak A., Hernández P., Krimer D.B., Schvartzman J.B. Supercoiling, knotting and replication fork reversal in partially replicated plasmids. Nucleic Acids Res. 2002; 30:656–666. 43. Neelsen K.J., Zanini I.M. 
, Herrador R., Lopes M. Oncogenes induce genotoxic stress by mitotic processing of unusual replication intermediates. J. Cell Biol. 2013; 200:699–708. 44. Fu H., Martin M.M., Regairaz M., Huang L., You Y., Lin C.M., Ryan M., Kim R., Shimura T., Pommier Y. et al. The DNA repair endonuclease Mus81 facilitates fast DNA replication in the absence of exogenous damage. Nat. Commun. 2015; 6:6746. 45. Saharia A., Teasley D.C., Duxin J.P., Dao B., Chiappinelli K.B., Stewart S.A. FEN1 ensures telomere stability by facilitating replication fork re-initiation. J. Biol. Chem. 2010; 285:27057–27066. 46. Ruiz J.F., Gomez-Gonzalez B., Aguilera A. Chromosomal translocations caused by either pol32-dependent or pol32-independent triparental break-induced replication. Mol. Cell. Biol. 2009; 29:5441–5454. 47. Smith C.E., Lam A.F., Symington L.S. Aberrant double-strand break repair resulting in half crossovers in mutants defective for Rad51 or the DNA polymerase delta complex. Mol. Cell. Biol. 2009; 29:1432–1441. 48. Pardo B., Aguilera A. Complex chromosomal rearrangements mediated by break-induced replication involve structure-selective endonucleases. PLoS Genet. 2012; 8:e1002979. 49. Burrell R.A., McGranahan N., Bartek J., Swanton C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature. 2013; 501:338–345. 50. Yates L.R., Campbell P.J. Evolution of the cancer genome. Nat. Rev. Genet. 2012; 13:795–806. 51. Harada S., Sekiguchi N., Shimizu N. 
Amplification of a plasmid bearing a mammalian replication initiation region in chromosomal and extrachromosomal contexts. Nucleic Acids Res. 2011; 39:958–969. 52. Willis N.A., Chandramouly G., Huang B., Kwok A., Follonier C., Deng C., Scully R. BRCA1 controls homologous recombination at Tus/Ter-stalled mammalian replication forks. Nature. 2014; 510:556–559. 53. Neelsen K.J., Lopes M. Replication fork reversal in eukaryotes: from dead end to dynamic response. Nat. Rev. Mol. Cell Biol. 2015; 16:207–220. 54. Zellweger R., Dalcher D., Mutreja K., Berti M., Schmid J.A., Herrador R., Vindigni A., Lopes M. Rad51-mediated replication fork reversal is a global response to genotoxic treatments in human cells. J. Cell Biol. 2015; 208:563–579. 55. Mott C., Symington L.S. RAD51-independent inverted-repeat recombination by a strand-annealing mechanism. DNA Repair (Amst.). 2011; 10:408–415. 56. Ciccia A., McDonald N., West S.C. Structural and functional relationships of the XPF/MUS81 family of proteins. Annu. Rev. Biochem. 2008; 77:259–287. 57. Fugger K., Chu W.K., Haahr P., Kousholt A.N., Beck H., Payne M.J., Hanada K., Hickson I.D., Sørensen C.S. FBH1 co-operates with MUS81 in inducing DNA double-strand breaks and cell death following replication stress. Nat. Commun. 2013; 4:1423. 58. Pepe A., West S.C. MUS81-EME2 promotes replication fork restart. Cell Rep. 2014; 7:1048–1055. 59. Matos J., Blanco M.G., Maslen S., Skehel J.M., West S.C. Regulatory control of the resolution of DNA recombination intermediates during meiosis and mitosis. Cell. 
2011; 147:158–172. 60. Szakal B., Branzei D. Premature Cdk1/Cdc5/Mus81 pathway activation induces aberrant replication and deleterious crossover. EMBO J. 2013; 32:1155–1167. 61. Pepe A., West S.C. Substrate specificity of the MUS81-EME2 structure selective endonuclease. Nucleic Acids Res. 2014; 42:3833–3845. 62. Tong A.H., Evangelista M., Parsons A.B., Xu H., Bader G.D., Pagé N., Robinson M., Raghibizadeh S., Hogue C.W., Bussey H. et al. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science. 2001; 294:2364–2368. 63. Balakrishnan L., Bambara R.A. Flap endonuclease 1. Annu. Rev. Biochem. 2013; 82:119–138. 64. Zheng L., Zhou M., Chai Q., Parrish J., Xue D., Patrick S.M., Turchi J.J., Yannone S.M., Chen D., Shen B. Novel function of the flap endonuclease 1 complex in processing stalled DNA replication forks. EMBO Rep. 2005; 6:83–89. 65. Zubko M.K., Guillard S., Lydall D. Exo1 and Rad24 differentially regulate generation of ssDNA at telomeres of Saccharomyces cerevisiae cdc13-1 mutants. Genetics. 2004; 168:103–115. 66. Cordaux R., Batzer M.A. The impact of retrotransposons on human genome evolution. Nat. Rev. Genet. 2009; 10:691–703. © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. 
For commercial re-use, please contact [email protected]

journal article

Open Access Collection

High throughput discovery of protein variants using proteomics informed by transcriptomics

Saha, Shyamasree;Matthews, David A;Bessant, Conrad

2018 Nucleic Acids Research

doi: 10.1093/nar/gky295 pmid: 29718325

Abstract Proteomics informed by transcriptomics (PIT), in which proteomic MS/MS spectra are searched against open reading frames derived from de novo assembled transcripts, can reveal previously unknown translated genomic elements (TGEs). However, determining which TGEs are truly novel, which are variants of known proteins, and which are simply artefacts of poor sequence assembly, is challenging. We have designed and implemented an automated solution that classifies putative TGEs by comparing to reference proteome sequences. This allows large-scale identification of sequence polymorphisms, splice isoforms and novel TGEs supported by presence or absence of variant-specific peptide evidence. Unlike previously reported methods, ours does not require a catalogue of known variants, making it more applicable to non-model organisms. The method was validated on human PIT data, then applied to Mus musculus, Pteropus alecto and Aedes aegypti. Novel discoveries included 60 human protein isoforms, 32 392 polymorphisms in P. alecto, and TGEs with non-methionine start sites including tyrosine. INTRODUCTION RNA sequencing (RNA-Seq) followed by de novo transcript assembly provides unprecedented insight into gene expression in a given sample, even if the species under study has a poorly annotated genome (1). However, an assembled transcript might not correspond to a functional protein, either for biological reasons or because of sequencing or assembly errors. To resolve this ambiguity we developed the PIT (proteomics informed by transcriptomics) methodology in which spectra acquired from liquid chromatography tandem mass spectrometry (LC–MS/MS) proteomics are searched against open reading frames (ORFs) derived from de novo assembled transcripts acquired from the same sample (2). Using sample-specific ORFs allows unbiased identification of translated genomic elements (TGEs), unlike traditional proteomics where spectra are searched against reference protein sequences. 
PIT therefore allows discovery of new TGEs, including variants of known proteins, and can provide confirmation of transcriptomic observations. While we have previously published software pipelines for PIT analysis (3), their output is essentially a list of identified TGE sequences (i.e. ORFs supported by peptide evidence). Further post-processing is needed to confidently classify each TGE identification and the sample-specific events that underpin them, such as single amino acid polymorphisms (SAPs), insertions and deletions (INDELs) and alternative splicing. Such events have been associated with disease phenotypes (4–8) and gene regulation (9–11). The significance of single nucleotide polymorphisms (SNPs) in disease phenotypes has prompted several studies to confirm SAPs using mass-spectrometry data (12–14). Many alternative splice isoforms have been observed using RNA-Seq (13,15) but it is unclear how many are translated. A common method for confirming variations at protein level has been to search spectra against a reference proteome augmented with an existing database of known variations or variations identified from RNA-Seq data, although not necessarily from the same sample (13,16–20). The disadvantage of relying on existing databases is that novel protein variants cannot be found, a particular limitation for non-model organisms where databases are incomplete or unavailable. Here, we present a TGE classification pipeline that generates variation information directly from RNA-Seq data for each sample, and seeks to confirm this at peptide level using proteomics data from the same sample (Figure 1A and B). The result is a molecular survey of unprecedented detail, with TGEs simultaneously classified into groups including novel proteins, known proteins, protein isoforms and proteins with SAPs and other polymorphisms.
Figure 1. PIT pipeline, now including classification of TGEs. (A) RNA-Seq assembly begins with the Trinity de novo transcript assembler. PASA is then used to assemble spliced alignments and identify alternative splicing events at transcript level. Transdecoder is used to predict ORFs from PASA’s transcripts. (B) TGEs are classified, based on sequence similarity to an existing proteome, into four main classes, (i) known protein or isoform, (ii) known protein or isoform with polymorphisms, (iii) novel isoforms and (iv) novel TGE. Within these main classes there are four polymorphism categories and sixteen novel isoform classes. We look for supporting peptides from the mass spectrometry (MS) data to verify these events at protein level. Variation information and peptide evidence for all the identified TGEs is output and deposited in a database called PITDB. (C) Nomenclature of polymorphism types derived from BLAST alignments used in this paper; see main text for details. 
MATERIALS AND METHODS Data The pipeline was evaluated on data acquired from a human (HeLa) cell line infected with adenovirus (2), then applied to data from three other experiments: Pteropus alecto kidney cell line PaKiT03 exposed to Nelson Bay orthoreovirus (NBV) (21), Mus musculus fibroblast L929 cell line (22), and immortalised Aedes aegypti cell line Aag2 (23). Details of the proteomics and RNA-Seq data acquisition, and information about where to find these data, are summarised in Supplementary Table S1. The results generated by applying our pipeline to the data are available in the specially created database PITDB (21) [http://pitdb.org] (experiment accession numbers EXP000001, EXP000003, EXP000004 and EXP000008). RNA-Seq transcript assembly and protein identification RNA-Seq reads were initially assembled de novo using Trinity (24). Default Trinity read trimming was used along with ‘trimmomatic’ and ‘normalize_reads’ for quality control. Clusters of overlapping Trinity transcripts were assembled into maximal alignment assemblies using the Program to Assemble Spliced Alignments (PASA) (25). PASA runs the seqclean (26; https://sourceforge.net/projects/seqclean/) tool to discard low-quality sequences, find evidence of polyadenylation, strip poly-A tails and trim vectors. It then maps the remaining transcripts to a reference genome using a spliced alignment process that infers the intron-exon structure of the parent gene. Reference genomes used were hg38 (human), mm10 (M. musculus), ASM32557v1 (P. alecto) and aedes-aegypti-liverpoolscaffoldsaaegl3 (A. aegypti). Any transcripts that do not map to the selected genome assembly (e.g. from viruses) are discarded at this stage. Applying PASA reduces the number of incomplete ORFs and duplicate transcripts, minimising the search space in subsequent peptide identification (Supplementary Figure S1). 
Transdecoder (27) was then used for six-frame translation of the transcripts, using the default universal genetic code in which the start codon encodes methionine. Transdecoder assigns ORFs to one of four classes: complete, 5prime_partial, 3prime_partial and internal, based on the presence of start and stop codons in the transcript (Supplementary Figure S2). Missing start or stop codons may be due to poor sequence assembly or to alternative start/stop codons. One transcript may produce multiple ORFs, and transcripts with identical protein coding regions (but different untranslated regions) can produce identical ORFs. Duplicate ORFs are retained prior to protein identification to preserve transcript relationships, but are merged into a single TGE when reporting results. MS-GF+ (28) was used for peptide spectrum matching, followed by mzidentML-lib (29) for FDR calculation, thresholding, and protein grouping. MS-GF+ computes PSM-level and peptide-level q-values using a target-decoy (30) approach. The search database for each sample contained the ORFs obtained from the RNA-Seq data for that sample, plus contaminant sequences from the common Repository of Adventitious Proteins (http://www.thegpm.org/crap). We used a 1% global PSM-level FDR, and only ORFs with at least two identified peptides were retained as TGEs. Classification of observed TGEs by ORF homology TGEs were classified according to their sequence similarity to a reference proteome using BLAST. The BLOSUM80 substitution matrix was used as it is best suited for comparing closely related sequences (31). UniProt complete proteomes were used as the reference for all species except A. aegypti, for which a superior proteome was taken from VectorBase. TGEs with 100% sequence identity to a reference protein are classed as known protein, or known protein isoform if the sequence is flagged as an isoform in the reference database. A TGE that does not map to a reference protein with a BLAST e-value below 1 × 10⁻³⁰ is classified as novel. 
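The top-level decision rule described above can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the pipeline's actual code: the `BlastHit` record, its field names and the `classify_tge` function are hypothetical, parsing of BLAST output is omitted, and applying the e-value test before the identity test is an assumption about ordering. Only the 1 × 10⁻³⁰ cut-off and the class names are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

E_VALUE_CUTOFF = 1e-30  # threshold quoted in the text


@dataclass
class BlastHit:
    """Best BLAST hit of one TGE against the reference proteome (hypothetical record)."""
    e_value: float
    identical_to_reference: bool  # TGE matches the reference with 100% identity
    reference_is_isoform: bool    # reference entry is flagged as an isoform


def classify_tge(hit: Optional[BlastHit]) -> str:
    """Return the top-level class of a TGE from its best hit (None means no hit)."""
    # No hit, or no hit with an e-value below the cut-off: novel TGE.
    if hit is None or hit.e_value >= E_VALUE_CUTOFF:
        return "novel"
    # Exact match to a reference entry: known protein or known isoform.
    if hit.identical_to_reference:
        return "known protein isoform" if hit.reference_is_isoform else "known protein"
    # Everything else needs the finer polymorphism / isoform classification.
    return "variant"


if __name__ == "__main__":
    print(classify_tge(BlastHit(1e-80, True, False)))
    print(classify_tge(BlastHit(1e-45, False, False)))
    print(classify_tge(None))
```

TGEs returned as `variant` correspond to those that the following sections subdivide into polymorphic proteins and putative novel isoforms.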
Polymorphic proteins TGEs are classed as known protein with polymorphism when the BLAST alignments are not identical but have an e-value below 1 × 10⁻³⁰ and cover the full length of the TGE and the reference protein by introducing polymorphisms such as SAPs, alterations (polymorphisms involving multiple AAs, labeled ALT), insertions and deletions. Similarity of polymorphisms is identified from the BLAST alignment midline string, as shown in Figure 1C. If an AA is replaced by a chemically similar AA, we call it an SSAP (similar SAP). The same applies to alteration events: an ALT where all AAs have similar chemical properties to their reference sequence counterparts is assigned to a separate category called similar alteration (SALT). Isoform classification Polymorphisms involving more than nine AAs are assigned to a separate internal alternative splice variant group (labeled SV), accounting for alternative splicing events such as exon skipping, intron retention and mutually exclusive exons. The nine-AA threshold reflects the generally accepted minimum length of an exon (∼99% of protein coding exons are longer than 27 bp). Some TGEs do not map to the full length of the reference protein, or they extend beyond the reference protein. They may also contain polymorphisms. These TGEs are putative novel isoforms, which we categorize into fifteen different classes depending on the nature and location of the variation, as shown in Figure 2A. These classes describe variations at the N-terminus (5-prime end), the C-terminus (3-prime end), or both ends, of the TGE.
Figure 2. (A) Classification of TGEs based on BLAST alignment to the UniProt proteome (including isoforms) of the species under study. TGEs that map with 100% sequence identity to UniProt proteins are labelled as known proteins or known isoforms. TGEs with BLAST e-value above 1 × 10⁻³⁰ or that do not map to a UniProt protein are classified as novel. The remaining TGEs are classified into one of 17 types based on location, length and type of variation. (B) Example of scoring a putative novel isoform based on its mapping to the most homologous protein found by BLAST (ODB2_HUMAN). Only peptides not shared between the TGE and reference are used to compute the scores of the TGE and the reference protein, using Equation (1). Q-value scores and sample-specific detectability scores (SS) are used for identified and unidentified peptides respectively. In this case, the TGE score exceeds that of the reference sequence, suggesting we have a novel variant of ODB2. (C) Identified TGEs of each type supported by peptide evidence, and unique peptide evidence in parentheses. The proportion of novel findings is higher in species with less well annotated genomes. 
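To make the polymorphism nomenclature concrete, the sketch below labels events from the three strings of a pairwise protein alignment (query, midline, subject). It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the query is taken to be the TGE and the subject the reference, a `+` in the midline is treated as a chemically similar substitution and a space as a dissimilar one, and adjacent columns of the same kind are merged into one event. Reported positions are 0-based alignment columns.

```python
from itertools import groupby

SV_MIN_LENGTH = 10  # "more than nine AAs" are assigned to the splice variant group


def column_kind(q: str, m: str, s: str) -> str:
    """Label one alignment column (q = TGE residue, m = midline, s = reference residue)."""
    if q == "-":
        return "deletion"   # residue present in the reference, absent from the TGE
    if s == "-":
        return "insertion"  # residue present in the TGE, absent from the reference
    if q == s:
        return "match"
    return "similar" if m == "+" else "different"


def call_polymorphisms(query: str, midline: str, subject: str):
    """Return (label, start_column, length) for each run of non-matching columns."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("alignment strings must have equal length")
    kinds = [column_kind(q, m, s) for q, m, s in zip(query, midline, subject)]

    def run_key(kind: str) -> str:
        # Similar and dissimilar substitutions belong to the same run.
        return "substitution" if kind in ("similar", "different") else kind

    events = []
    pos = 0
    for group, cols in groupby(kinds, key=run_key):
        cols = list(cols)
        start, length = pos, len(cols)
        pos += length
        if group == "match":
            continue
        if length >= SV_MIN_LENGTH:
            label = "SV"
        elif group == "substitution":
            all_similar = all(k == "similar" for k in cols)
            if length == 1:
                label = "SSAP" if all_similar else "SAP"
            else:
                label = "SALT" if all_similar else "ALT"
        else:
            label = "INS" if group == "insertion" else "DEL"
        events.append((label, start, length))
    return events


if __name__ == "__main__":
    # Toy alignment: T/S is shown as a similar change, W/G as a dissimilar one.
    print(call_polymorphisms("MKTAYWAKQR", "MK+AY AKQR", "MKSAYGAKQR"))
```

How mixed runs (for example a gap directly next to a substitution) should be merged is not specified in the text, so the sketch simply reports them as separate events.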
Confirmation of protein variants using peptide evidence

Applying the aforementioned classification strategy to TGEs in this study resulted in the majority being classified as putative novel isoforms and variant proteins. Each TGE observation was supported by at least two peptide observations mapped to the ORF, but not necessarily to variations within the ORF. Many supposedly novel variants could therefore be due to RNA-Seq errors or poor transcript assembly affecting regions of the ORF not covered by peptide identifications. To separate these from true TGE observations, two methods were applied: a simple approach that demands variant-specific peptides, and a probabilistic scoring approach in which the likely presence of a variant is computed with respect to the reference alternative.

Variant-specific peptides

Unlike previous studies that rely on prior knowledge of SNPs and splice sites, we identify variations within our TGE classification pipeline and check for peptides mapping specifically to the variant areas of the TGEs on a sample-by-sample basis. The majority of these peptides are shared with other TGEs in the sample, so we report peptides uniquely mapping to the variation region separately, as this is stronger evidence of the variation.

Scoring of variants using predicted peptide detectability

Given that the sequence coverage of LC-MS/MS proteomics is generally low (e.g. ∼17% for known human proteins in this study), it is arguably too conservative to demand peptide evidence for every variant. Some variants are covered by a single peptide, which may not be detectable by MS. A more advanced strategy was therefore implemented, in which predicted peptide detectability, together with peptide identification confidence (represented by q-value), is used to determine the probability that a novel protein variant is more likely to be present in the sample than its corresponding reference protein.
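Both approaches start from the peptides that distinguish a TGE from its reference. The sketch below shows one way such peptide sets could be derived by in-silico tryptic digestion. It is an illustration under stated assumptions, not the published implementation: the cleavage rule (after K or R, but not before P), the absence of missed cleavages, the minimum peptide length of six residues, and the function names are all hypothetical choices.

```python
import re


def tryptic_peptides(sequence, min_len=6):
    """Return fully tryptic peptides (no missed cleavages) of a protein sequence.

    Assumed rule: cleave after K or R unless the next residue is P.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_len}


def distinguishing_peptides(tge_seq, ref_seq, other_seqs=()):
    """Split peptides into variant-specific, reference-specific and unique sets.

    variant_specific   : peptides of the TGE absent from the reference
    reference_specific : peptides of the reference absent from the TGE
    unique_variant     : variant-specific peptides found in no other sequence
    """
    tge = tryptic_peptides(tge_seq)
    ref = tryptic_peptides(ref_seq)
    variant_specific = tge - ref
    reference_specific = ref - tge
    others = set().union(*(tryptic_peptides(s) for s in other_seqs))
    unique_variant = variant_specific - others
    return variant_specific, reference_specific, unique_variant


# Hypothetical sequences differing by a single residue in the middle peptide.
tge = "MAAAAAAKGGGGGGRLLLLLLK"
ref = "MAAAAAAKGGGGSGRLLLLLLK"
v, r, u = distinguishing_peptides(tge, ref, other_seqs=["TTTTTTKGGGGGGR"])
```

Here the variant-specific peptide is also produced by another sequence in the sample, so it counts as variant-specific evidence but not as unique evidence, which mirrors the distinction drawn in the text.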
We used an enhanced version of CONSeQuence (32) to calculate a sequence-based detectability score for every tryptic peptide that could be identified in the sample (as trypsin was used for proteolysis in all samples), then calibrated these to a sample-specific detection score (s) using a transform function built using empirical peptide detectability information from ORFs in the sample that had already been identified as known proteins. By comparing the combined probability of detection of the set of peptides, R = {r1, r2 … rn}, that uniquely describe the reference protein against the set of peptides, V = {v1, v2 … vm}, that describe the protein variant, it is possible to predict which is most likely to be present in the sample. The details of this calculation are shown in Equation (1), and an example of its use is shown in Figure 2B.

\begin{equation*}
\mathrm{score}_{\mathrm{variant}} = \frac{1}{\left| V \right| + \left| R \right|}\left( \sum_{a \in A} (1 - q_a) - \sum_{b \in B} \frac{1 - q_b}{4} + \sum_{b \in B'} \frac{s_b}{8} - \sum_{a \in A'} \frac{s_a}{8} \right)
\end{equation*} (1)

where A is the set of identified peptides from V, B is the set of identified peptides from R, A′ is the set of unidentified peptides in V and B′ is the set of unidentified peptides in R. An equivalent equation is used to compute score_reference:

\begin{equation*}
\mathrm{score}_{\mathrm{reference}} = \frac{1}{\left| R \right| + \left| V \right|}\left( \sum_{b \in B} (1 - q_b) - \sum_{a \in A} \frac{1 - q_a}{4} + \sum_{a \in A'} \frac{s_a}{8} - \sum_{b \in B'} \frac{s_b}{8} \right)
\end{equation*} (2)

The score, score_variant, for the TGE is calculated by considering only peptides that cover variant regions of the protein. Scores are assigned to each of these peptides as follows. Peptides from the TGE are given a score of 1 – q (where q is the lowest q-value for that peptide) if they are identified in the sample or –s/8 if they are not identified. Peptides from the reference sequence are given a score of (1 – q)/4 if they are identified in the sample or –(s/8) if they are not identified. The sum of peptide scores for the reference sequence is then subtracted from the sum of peptide scores for the TGE and normalised for peptide count to give the final TGE score. A similar equation is used to calculate score_reference (Equation 2). The denominators of unidentified peptides were set to 8 to compensate for an anticipated LC-MS/MS peptide coverage of 12.5%. The denominator of 4 is used to ensure that the difference between score_variant and score_reference is small when both reference and variant-specific peptides are observed, indicating that both sequences are likely to exist in the sample. The score_variant and score_reference are calculated separately so that the magnitude of the difference between them can be used to accommodate situations where both versions of the sequence may be present. Applying a threshold to this difference can separate confidently classified variants from reference proteins. Unless otherwise stated, we report TGEs as variants when score_variant > score_reference. More detail regarding the scoring pipeline can be found in Supplementary Figure S3.

Validation of variant scoring method using human data

PIT data was processed in the absence of prior protein variation information (i.e.
TGE classification BLASTed against the UniProt canonical proteome only) such that all observed isoforms would be classified as novel isoforms (Supplementary Figure S4). Separately, TGE classification was performed by BLASTing against the UniProt human proteome including known isoforms. Comparing the list of novel isoforms from the first classification with the list of identified known isoforms from the second classification indicated the ability of the classification pipeline to identify isoforms in the absence of prior knowledge.

Rating TGEs by available evidence

The overall confidence in the presence of an individual TGE can be assessed by considering all the aforementioned evidence collectively. For example, a list of observed TGEs can be ranked using a rating system such as that shown in Supplementary Table S2, where higher ratings are awarded to TGEs with more rigorous forms of evidence such as a unique PSM covering a variant region. This allows identified TGEs to be prioritised for further evaluation or validation.

RESULTS AND DISCUSSION

TGE classification

Results for all four species are summarised in Table 1, with sample-specific breakdowns provided in Supplementary Tables S3-S5. Prior to considering variation-specific peptide evidence, the majority of putative TGEs are classified as novel isoforms, or known proteins with sequence polymorphism. Only 39% of putative TGEs from the human dataset have 100% sequence similarity to a reference protein, a proportion that is lower still for A. aegypti (36%), M. musculus (10%) and P. alecto (∼3%). For M. musculus and P. alecto, PIT identifies significantly more sequences compared to the standard database search, probably due to PIT’s ability to account for sample-specific variations. However, the number of protein variations is likely to be a significant overestimate as the TGEs are not necessarily supported by variant-specific peptide evidence at this stage.

Table 1.
Overview of PIT TGE classification results

Dataset (number of samples in parentheses) | Homo sapiens (1) | Mus musculus (8) | Pteropus alecto (9) | Aedes aegypti (1)
Total spectra | 210,560 | 293,894 | 350,890 | 829,093
Standard search: Peptides | 24,187 | 23,151 | 22,554 | 58,336
Standard search: PAGs (protein ambiguity groups) | 3,011 | 3,536 | 3,270 | 4,743
Standard search: Total proteins | 12,589 | 14,107 | 3,522 | 5,692
Standard search: SwissProt, Canonical | 3,302 | 3,534 | 2 | 71
Standard search: SwissProt, Isoform | 3,365 | 1,344 | 0 | 79
Standard search: TrEMBL | 5,922 | 9,229 | 3,520 | 5,542
PIT search: Peptides | 21,612 | 24,297 | 23,875 | 52,221
PIT search: PAGs | 2,646 | 2,814 | 2,701 | 4,394
PIT search: Total TGEs | 3,504 | 24,602 | 28,311 | 5,488
TGEs mapping to SwissProt, Canonical: Total | 1,134 | 1,270 | 0 | 77
TGEs mapping to SwissProt, Canonical: Complete ORF | 1,134 | 1,268 | 0 | 77
TGEs mapping to SwissProt, Isoform: Total | 197 | 195 | 0 | 1
TGEs mapping to SwissProt, Isoform: Complete ORF | 197 | 193 | 0 | 1
TGEs mapping to TrEMBL: Total | 38 | 925 | 765 | 1,939
TGEs mapping to TrEMBL: Complete ORF | 38 | 915 | 756 | 1,930
Putative novel isoform, SwissProt: Total | 1,815 | 12,351 | 0 | 57
Putative novel isoform, SwissProt: Complete ORF | 174 | 1,864 | 0 | 20
Putative novel isoform, SwissProt: Score | 363 | 707 | 0 | 9
Putative novel isoform, SwissProt: With specific peptide evidence | 50 | 357 | 0 | 11
Putative novel isoform, SwissProt: With unique specific peptide ev. | 24 | 76 | 0 | 7
Putative novel isoform, TrEMBL: Total | 233 | 9,194 | 26,328 | 3,080
Putative novel isoform, TrEMBL: Complete ORF | 30 | 1,643 | 5,700 | 1,077
Putative novel isoform, TrEMBL: Score | 92 | 488 | 5,092 | 891
Putative novel isoform, TrEMBL: With specific peptide evidence | 10 | 390 | 4,735 | 903
Putative novel isoform, TrEMBL: With unique specific peptide ev. | 3 | 82 | 1,452 | 428
Known protein with polymorphism, SwissProt: Total | 47 | 278 | 0 | 4
Known protein with polymorphism, SwissProt: Complete ORF | 21 | 92 | 0 | 4
Known protein with polymorphism, SwissProt: Score | 7 | 14 | 0 | 0
Known protein with polymorphism, SwissProt: With specific peptide evidence | 1 | 6 | 0 | 0
Known protein with polymorphism, SwissProt: With unique specific peptide ev. | 1 | 3 | 0 | 0
Known protein with polymorphism, TrEMBL: Total | 8 | 187 | 251 | 97
Known protein with polymorphism, TrEMBL: Complete ORF | 5 | 86 | 95 | 85
Known protein with polymorphism, TrEMBL: Score | 0 | 31 | 25 | 32
Known protein with polymorphism, TrEMBL: With specific peptide evidence | 0 | 16 | 21 | 24
Known protein with polymorphism, TrEMBL: With unique specific peptide ev. | 0 | 6 | 13 | 12
Novel TGE: Total | 32 | 202 | 967 | 233
Novel TGE: Complete ORF | 3 | 38 | 236 | 61
Novel TGE: With unique peptide evidence | 0 | 18 | 283 | 131

To allow comparison with standard proteomics methods, peptide and protein identification was also performed for each species by searching directly against the reference proteome; the results of this are shown in the top (standard search) portion of the table. Throughout the table, identified proteins are shown based on the source reference sequence: Swiss-Prot or TrEMBL. Swiss-Prot proteins are further divided into two groups, canonical and isoform. TGEs whose sequence maps exactly to a reference protein are classed as known proteins. TGEs not mapping to any reference proteins or with e-value above the threshold are classified as novel TGEs. The remaining TGEs are classified as known proteins with polymorphism, or novel isoforms of known proteins. The novel isoform TGEs are further separated into 16 classes and the reliability of this annotation is verified by isoform-specific peptide evidence (see Supplementary Table S5 for details). Peptide and protein counts reported in the table are unique sequences across all the samples for datasets with multiple samples, and average PAG (protein ambiguity group) counts are reported for these cases.

Variations supported by simple peptide evidence

The proportion of variations with variant-specific peptide evidence varied greatly among species. Only a small minority of those in Table 1 have isoform-specific peptides (∼2% for human and M. musculus; ∼20% for P. alecto and A. aegypti, reflecting the relative annotation quality of these species). The distribution of identified isoforms with variant-specific evidence among the various types is summarised in Figure 2C (numbers in parentheses indicate TGEs that meet the more conservative criteria of having unique peptide evidence).
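As a cross-check on how the counts in Table 1 relate to figures quoted in the text, the short tally below recomputes two of them from the table rows: the share of TGEs that map exactly to a reference protein, and the share of putative variants with a favourable score. It is an illustrative calculation only. The dictionary keys and the helper `col_sum` are hypothetical names, and the choice of which rows to add together is an assumption about how the quoted figures were derived.

```python
# Counts transcribed from Table 1, in the order:
# Homo sapiens, Mus musculus, Pteropus alecto, Aedes aegypti.
TABLE1 = {
    "total_tges":       (3504, 24602, 28311, 5488),
    "known_canonical":  (1134, 1270, 0, 77),
    "known_isoform":    (197, 195, 0, 1),
    "known_trembl":     (38, 925, 765, 1939),
    "isoform_sp_total": (1815, 12351, 0, 57),
    "isoform_tr_total": (233, 9194, 26328, 3080),
    "isoform_sp_score": (363, 707, 0, 9),
    "isoform_tr_score": (92, 488, 5092, 891),
    "poly_sp_total":    (47, 278, 0, 4),
    "poly_tr_total":    (8, 187, 251, 97),
    "poly_sp_score":    (7, 14, 0, 0),
    "poly_tr_score":    (0, 31, 25, 32),
}


def col_sum(*rows):
    """Add the named rows together, species by species."""
    return [sum(values) for values in zip(*(TABLE1[r] for r in rows))]


# TGEs identical to a reference protein, as a percentage of all TGEs.
known = col_sum("known_canonical", "known_isoform", "known_trembl")
known_pct = [100 * k / t for k, t in zip(known, TABLE1["total_tges"])]

# Putative variants (novel isoforms plus known proteins with polymorphism)
# and the subset whose variant score exceeds the reference score.
variants = col_sum("isoform_sp_total", "isoform_tr_total",
                   "poly_sp_total", "poly_tr_total")
scored = col_sum("isoform_sp_score", "isoform_tr_score",
                 "poly_sp_score", "poly_tr_score")
promoted_pct = 100 * sum(scored) / sum(variants)

for name, pct in zip(("human", "mouse", "bat", "mosquito"), known_pct):
    print(f"{name}: {pct:.1f}% of TGEs match a reference protein exactly")
print(f"variants with a favourable score: {promoted_pct:.1f}% of the total")
```

Keeping the Swiss-Prot and TrEMBL rows separate in the dictionary makes it easy to see which table cells contribute to each total.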
Besides identifying peptides from the variant regions, we found junction peptides for TGEs from alternative and extended isoform classes. The majority of these junction peptides support alternative sequence variations. Many of these junction peptides are also unique peptides (33–40% for the non-human datasets). This demonstrates that the PIT pipeline is capable of high-throughput discovery of novel isoforms in the absence of prior information about gene structure. Regarding polymorphisms, human and M. musculus have the lowest percentage of peptide-supported polymorphisms, only 5% and 7% respectively, whereas A. aegypti and P. alecto have 15% and 23%. These were found in all variant TGE classes, and in known proteins (counted separately in Table 1). Peptide-supported polymorphisms are shown for each species in Figure 3A. The total number of polymorphisms ranges from just 60 for human to 32,392 for P. alecto, reflecting the relative quality of the reference proteomes for these species. This suggests significant scope for improving the P. alecto reference proteome, by using the polymorphisms identified by PIT to correct existing protein sequences predicted from an imperfect reference genome.

Figure 3. (A) Overview of polymorphisms with variation-specific peptide evidence, for each species. (B) Distribution of alternative start codons confirmed by peptide evidence for different species. For context, 172 and 166 Swiss-Prot proteins have alternative starts for human and M. musculus respectively. Swiss-Prot contains only two proteins for P. alecto, and both start with valine.

Alternative start codons

The majority of TGEs classified as known protein were from ORFs classified as complete by Transdecoder (see Table 1), suggesting that many incomplete ORFs are due to poor sequence assembly. However, some proteins do not start with methionine, so Transdecoder incompleteness does not necessarily indicate an erroneous ORF. Alternative start codons are found in Swiss-Prot for all the species in this study except A. aegypti. Our results include several non-methionine starts with peptide evidence (often unique peptide evidence) for all the species in this study (Figure 3B). To avoid the possibility that an alternative start is called because the N-terminus of a truncated ORF coincides with a tryptic cleavage site, we discounted all TGEs with alternative starts where the reference protein has lysine or arginine at the preceding position. The highest number of non-methionine start codons supported by peptide evidence is observed for P. alecto, most of which are valine or alanine. We also identified TGEs with the N-terminal methionine removed, a modification that is significant for function and stability (33).

Validation of variant scoring method using human data

Our scoring-based classification method identified 76 known isoforms of the 197 found to be present in the sample during the evaluation process. These known isoforms are supported by at least two peptides but, as in any proteomics experiment, their presence in the sample cannot be proven definitively without laboratory validation. The scoring method identifies 42 known isoforms that the simple peptide evidence approach missed, but fails to classify three known isoforms reported by the simple peptide evidence approach (Figure 4A). These missed isoforms were due to peptide identifications from the reference having higher confidence than those from the variant, and failure to observe highly detectable variant-specific peptides.
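The scoring rule being validated here, Equations (1) and (2) with a zero threshold, can be sketched in a few lines. This is an illustrative reading of the equations rather than the released pipeline code: the `Peptide` container, the function names and the example values are hypothetical, and the detectability scores are assumed to be already calibrated to the sample.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Peptide:
    s: float                   # sample-specific detectability score
    q: Optional[float] = None  # lowest q-value if identified, else None


def _one_sided_score(own: List[Peptide], other: List[Peptide]) -> float:
    """Score one sequence against the other, following Equation (1).

    `own` holds the peptides specific to the sequence being scored and
    `other` holds those specific to the competing sequence.
    """
    n = len(own) + len(other)
    if n == 0:
        return 0.0
    total = 0.0
    for p in own:
        # Identified: reward 1 - q. Unidentified: penalise s / 8.
        total += (1 - p.q) if p.q is not None else -p.s / 8
    for p in other:
        # Competing peptide identified: penalise (1 - q) / 4.
        # Competing peptide unidentified: reward s / 8.
        total += -(1 - p.q) / 4 if p.q is not None else p.s / 8
    return total / n


def score_variant(v: List[Peptide], r: List[Peptide]) -> float:
    return _one_sided_score(v, r)


def score_reference(v: List[Peptide], r: List[Peptide]) -> float:
    return _one_sided_score(r, v)


def is_variant(v: List[Peptide], r: List[Peptide], threshold: float = 0.0) -> bool:
    """Report a variant when score_variant exceeds score_reference by more than `threshold`."""
    return score_variant(v, r) - score_reference(v, r) > threshold


# Hypothetical case: one variant peptide identified, one not; reference peptide unseen.
V = [Peptide(s=0.9, q=0.01), Peptide(s=0.4)]
R = [Peptide(s=0.8)]
sv, sr = score_variant(V, R), score_reference(V, R)
```

With these values the variant scores about 0.35 and the reference about -0.10, so the TGE would be reported as a variant. If one peptide from each sequence were identified with equal confidence, the two scores would be equal and the strict inequality would leave the TGE unclassified, which matches the intent that both sequences may then be present.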
A ROC curve (Figure 4B) shows the performance of the scoring method using known isoforms identified from the PIT search. Confirmation by variant-specific peptides is the best way to establish the presence of novel protein variants in the sample without separate laboratory validation. We therefore used known isoforms confirmed by variant-specific peptide evidence as the gold standard for the validation process and observed an area under the curve (AUC) of 0.90. Figure 4C shows how increasing the score difference threshold reduces the number of TGEs classified, but increases the proportion of those confirmed as isoforms present in the sample. Protein ambiguity group analysis shows that known isoforms usually share peptides with other variants of the protein, making their presence ambiguous. This evaluation exercise shows that the scoring method can significantly increase the number of confident novel isoform identifications compared to the simple variation-specific peptide evidence approach.

Figure 4. TGE scoring and validation. (A) Comparison of methods for confirming the identification of novel protein isoforms, applied to human PIT data. A set of 197 known isoforms found in the sample using PIT was used for validation. The scoring method can identify 76 of these as isoforms, which is an improvement over the 37 confirmed by the simple peptide evidence approach. (B) ROC curve showing performance of the scoring method in comparison to the traditional variant-specific peptide evidence based method for known isoforms. (C) The number of TGEs classified as isoform rapidly decreases as the threshold between TGE and reference score is increased, while the proportion of those that have been confirmed as isoforms present in the sample increases. (D) Comparison of isoform classification techniques applied to novel isoforms from all species. The scoring method predicts higher numbers of variant isoforms in the sample compared to the peptide evidence method, but misses some TGEs confirmed by peptide evidence. (E) Class distribution of TGEs confirmed by the scoring method for novel isoforms in human.

Scoring variants using predicted peptide detectability

For the human dataset, application of the scoring method with a zero threshold suggests that 455 out of 2048 putative novel isoforms are indeed novel isoforms, and seven out of 55 putative known proteins with polymorphisms are also confirmed (Supplementary Figure S5). As in the evaluation, the scoring method classifies more TGEs than the simple variant-specific peptide evidence method, and there is a significant overlap between the two methods (Figure 4D). Except for one TGE each from P.
alecto and human, the remaining TGEs supported by junction peptides were classified as variants using the scoring method. Most TGEs confirmed exclusively by the scoring method come from the N-terminus truncated class (see Figure 4E), due to non-identification of highly detectable peptides from the truncated region. In summary, applying the TGE scoring method has allowed us to promote several thousand putative protein variants (14% of the total) to a higher level of confidence.

Shared TGEs among species

Only one TGE, histone H3 protein, is observed in all four species: it is a Swiss-Prot protein for human and M. musculus and is reported in TrEMBL for P. alecto and A. aegypti. However, there are many TGEs in common between pairs of species (Figure 5A), most of which are known proteins. Some shared TGEs are classified as known in one species but as novel isoform in another, for example three TGEs that are known M. musculus proteins but have been classified as N-terminus truncated (two TGEs) and known protein with polymorphisms for human. The known protein with polymorphism (Vesicle-trafficking protein SEC22b) has unique peptide evidence for one of the polymorphisms. Human also shares 116 identified TGEs with P. alecto, although none of these shared TGEs is a novel variant supported by peptide evidence. One novel isoform of Heterogeneous nuclear ribonucleoprotein D0 protein with peptide evidence is shared between P. alecto and M. musculus (with unique peptide evidence in mouse). The distribution of known M. musculus proteins shared with P. alecto is shown in Supplementary Figure S6. Such identifications of the same TGE in multiple species can increase confidence in the biological validity of that identification, and are also relevant to cross-species studies.

Figure 5. (A) Overlap of all identified TGEs across organisms (to be classified as an overlap the TGEs were required to have identical sequences). Overlapping TGEs are often known in one species but novel variants in the others. Some of the overlapping novel variants have variant-specific peptide evidence. (B) Length distribution of different TGE classes identified from human, P. alecto, M. musculus and A. aegypti datasets. Novel TGEs are significantly shorter than the rest of the TGE types for human and mouse, while A. aegypti and P. alecto have novel TGEs with lengths similar to those of the other TGE classes. (C) Distribution of peptide coverage per TGE for different TGE classes in each species.

Novel TGEs

We identified novel TGEs in each dataset, from 32 in human to 967 in P. alecto, but the majority are not supported by unique peptide evidence (Table 1). In human and M. musculus, most putative novel TGEs are significantly shorter than other TGE classes (Figure 5B). A large portion of novel P. alecto TGEs are short, but overall they have a median length close to that of known proteins. Novel A. aegypti TGEs also have a similar median length to their known counterparts but are not skewed towards shorter TGEs. Peptide coverage is similar or higher for novel TGEs (Figure 5C), giving confidence in these identifications. Collectively, this suggests that most of the novel TGEs from P. alecto and A.
aegypti are likely to be newly discovered proteins, whereas those from M. musculus and human may be too short to be functional proteins. This is confirmed by the fact that most of the supposedly novel short human TGEs were found to map directly to subsections of multiple existing proteins. They exceed the BLAST e-value threshold because the significance of individual matches decreases when there are multiple matches, but they are very likely to be ORFs predicted from partially assembled transcripts.

CONCLUSION

The TGE classification pipeline presented here has been shown to be a significant improvement in PIT methodology, providing deeper insight into human samples, and finding large numbers of confidently identified polymorphisms and novel splice variants in non-model species that can be used to rapidly improve their reference proteomes. For example, strong evidence has been found for hundreds of novel TGEs and protein isoforms in P. alecto and A. aegypti, including many with alternative start codons. The significant reduction in putative TGEs seen when peptide evidence is considered demonstrates the benefit of using PIT rather than extrapolating translated products from RNA-seq data alone. By developing this pipeline and making it publicly available we give the research community the opportunity to adopt this alternative approach.

DATA AVAILABILITY

The software pipeline and its documentation are available via GitHub [https://github.com/bezzlab/TGEClassification]. The results generated, including novel protein sequences, are available in PITDB [http://pitdb.org] with experiment accession numbers EXP000001, EXP000003, EXP000004 and EXP000008.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Biotechnology and Biological Sciences Research Council (BBSRC) [BB/K016075/1]; Queen Mary University of London (QMUL) Life Sciences Initiative; Queen Mary's MidPlus computational facilities, supported by QMUL Research-IT; EPSRC [EP/K000128/1].
Funding for open access charge: Block grant awarded to Queen Mary University of London.

Conflict of interest statement. None declared.

REFERENCES

1. da Fonseca R.R., Albrechtsen A., Themudo G.E., Ramos-Madrigal J., Sibbesen J.A., Maretty L., Zepeda-Mendoza M.L., Campos P.F., Heller R., Pereira R.J. Next-generation biology: Sequencing and data analysis approaches for non-model organisms. Mar. Genomics. 2016; 30:3–13.
2. Evans V.C., Barker G., Heesom K.J., Fan J., Bessant C., Matthews D.A. De novo derivation of proteomes from transcriptomes for transcript and protein identification. Nat. Methods. 2012; 9:1207–1211.
3. Fan J., Saha S., Barker G., Heesom K.J., Ghali F., Jones A.R., Matthews D.A., Bessant C. Galaxy integrated Omics: Web-based Standards-Compliant workflows for proteomics informed by transcriptomics. Mol. Cell. Proteomics. 2015; 14:3087–3093.
4. Di Fede G., Catania M., Morbin M., Rossi G., Suardi S., Mazzoleni G., Merlin M., Giovagnoli A.R., Prioni S., Erbetta A. et al. A recessive mutation in the APP gene with dominant-negative effect on amyloidogenesis. Science (New York, N.Y.). 2009; 323:1473–1477.
5. Skotheim R.I., Nees M. Alternative splicing in cancer: noise, functional, or systematic? Int. J. Biochem. Cell Biol. 2007; 39:1432–1449.
6. Andrews S.J., Rothnagel J.A. Emerging evidence for functional peptides encoded by short open reading frames. Nat. Rev. Genet. 2014; 15:193–204.
7. Ciriello G., Miller M.L., Aksoy B.A., Senbabaoglu Y., Schultz N., Sander C. Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 2013; 45:1127–1133.
8. Sui Z., Wen B., Gao Z., Chen Q. Fusion-Related host proteins are actively regulated by NA during influenza infection as revealed by quantitative proteomics analysis. PLoS One. 2014; 9:e105947.
9. Francesconi M., Lehner B. The effects of genetic variation on gene expression dynamics during development. Nature. 2013; 505:208–211.
10. Banfai B., Jia H., Khatun J., Wood E., Risk B., Gundling W.E. Jr, Kundaje A., Gunawardena H.P., Yu Y., Xie L. et al. Long noncoding RNAs are rarely translated in two human cell lines. Genome Res. 2012; 22:1646–1657.
11. Cheetham S.W., Gruhl F., Mattick J.S., Dinger M.E. Long noncoding RNAs and the genetics of cancer. Br. J. Cancer. 2013; 108:2419–2425.
12. Cao R., Shi Y., Chen S., Ma Y., Chen J., Yang J., Chen G., Shi T. dbSAP: single amino-acid polymorphism database for protein variation detection. Nucleic Acids Res. 2017; 45:D827–D832.
13. Sheynkman G.M., Johnson J.E., Jagtap P.D., Shortreed M.R., Onsongo G., Frey B.L., Griffin T.J., Smith L.M. Using galaxy-P to leverage RNA-Seq for the discovery of novel protein variations. BMC Genomics. 2014; 15:703.
14. Pang C.N., Tay A.P., Aya C., Twine N.A., Harkness L., Hart-Smith G., Chia S.Z., Chen Z., Deshpande N.P., Kaakoush N.O. et al. Tools to covisualize and coanalyze proteomic data with genomes and transcriptomes: validation of genes and alternative mRNA splicing. J. Proteome Res. 2014; 13:84–98.
15. Ruggles K.V., Tang Z., Wang X., Grover H., Askenazi M., Teubl J., Cao S., McLellan M.D., Clauser K.R., Tabb D.L. et al. An analysis of the sensitivity of proteogenomic mapping of somatic mutations and novel splicing events in cancer. Mol. Cell. Proteomics. 2016; 15:1060–1071.
16. Woo S., Cha S.W., Na S., Guest C., Liu T., Smith R.D., Rodland K.D., Payne S., Bafna V. Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data. Proteomics. 2014; 14:2719–2730.
17. Krasnov G.S., Dmitriev A.A., Kudryavtseva A.V., Shargunov A.V., Karpov D.S., Uroshlev L.A., Melnikova N.V., Blinov V.M., Poverennaya E.V., Archakov A.I. et al. PPLine: an automated pipeline for SNP, SAP, and splice variant detection in the context of proteogenomics. J. Proteome Res. 2015; 14:3729–3737.
18. Wang X., Slebos R.J.C., Wang D., Halvey P.J., Tabb D.L., Liebler D.C., Zhang B. Protein identification using customized protein sequence databases derived from RNA-Seq data. J. Proteome Res. 2011; 11:1009–1017.
19. Wang X., Zhang B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics. 2013; 29:3235–3237.
20. Ning K., Nesvizhskii A.I. The utility of mass spectrometry-based proteomic data for validation of novel alternative splice forms reconstructed from RNA-Seq data: a preliminary assessment. BMC Bioinformatics. 2010; 11(Suppl. 11):S14.
21. Wynne J.W., Shiell B.J., Marsh G.A., Boyd V., Harper J.A., Heesom K., Monaghan P., Zhou P., Payne J., Klein R. et al. Proteomics informed by transcriptomics reveals Hendra virus sensitizes bat cells to TRAIL-mediated apoptosis. Genome Biol. 2014; 15:532.
22. Mok L., Wynne J.W., Grimley S., Shiell B.
, Green D. , Monaghan P. , Pallister J. , Bacic A. , Michalski W.P. Mouse fibroblast L929 cells are less permissive to infection by Nelson Bay orthoreovirus compared to other mammalian cell lines . J. Gen. Virol. 2015 ; 96 : 1787 – 1794 . Google Scholar CrossRef Search ADS PubMed 23. Maringer K. , Yousuf A. , Heesom K.J. , Fan J. , Lee D. , Fernandez-Sesma A. , Bessant C. , Matthews D.A. , Davidson A.D. Proteomics informed by transcriptomics for characterising active transposable elements and genome annotation in Aedes aegypti . BMC Genomics . 2017 ; 18 : 101 . Google Scholar CrossRef Search ADS PubMed 24. Grabherr M.G. , Haas B.J. , Yassour M. , Levin J.Z. , Thompson D.A. , Amit I. , Adiconis X. , Fan L. , Raychowdhury R. , Zeng Q. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome . Nat. Biotechnol. 2011 ; 29 : 644 – 652 . Google Scholar CrossRef Search ADS PubMed 25. Haas B.J. , Delcher A.L. , Mount S.M. , Wortman J.R. , Smith R.K. , Hannick L.I. , Maiti R. , Ronning C.M. , Rusch D.B. , Town C.D. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies . Nucleic Acids Res. 2003 ; 31 : 5654 – 5666 . Google Scholar CrossRef Search ADS PubMed 26. TIGR Gene Index group. Seqclean. Accessed on 09 Aug 2016 . 27. Haas B.J. , Papanicolaou A. , Yassour M. , Grabherr M. , Blood P.D. , Bowden J. , Couger M.B. , Eccles D. , Li B. , Lieber M. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis . Nat. Protoc. 2013 ; 8 : 1494 – 1512 . Google Scholar CrossRef Search ADS PubMed 28. Kim S. , Pevzner P.A. MS-GF+ makes progress towards a universal database search tool for proteomics . Nat. Commun. 2014 ; 5 : 5277 . Google Scholar CrossRef Search ADS PubMed 29. Ghali F. , Krishna R. , Lukasse P. , Martínez-Bartolomé S. , Reisinger F. , Hermjakob H. , Vizcaíno J.A. , Jones A.R. 
Tools (Viewer, Library and Validator) that facilitate use of the peptide and protein identification standard format, termed mzIdentML . Mol. Cell. Proteomics . 2013 ; 12 : 3026 – 3035 . Google Scholar CrossRef Search ADS PubMed 30. Elias J.E. , Gygi S.P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry . Nat. Methods . 2007 ; 4 : 207 – 214 . Google Scholar CrossRef Search ADS PubMed 31. Henikoff S. , Henikoff J.G. Amino acid substitution matrices from protein blocks . PNAS . 1992 ; 89 : 10915 – 10919 . Google Scholar CrossRef Search ADS PubMed 32. Eyers C.E. , Lawless C. , Wedge D.C. , Lau K.W. , Gaskell S.J. , Hubbard S.J. CONSeQuence: prediction of reference peptides for absolute quantitative proteomics using consensus machine learning approaches . Mol. Cell. Proteomics . 2011 ; 10 : M110.003384 . Google Scholar CrossRef Search ADS PubMed 33. Liao Y.D. , Jeng J.C. , Wang C.F. , Wang S.C. , Chang S.T. Removal of N-terminal methionine from recombinant proteins by engineered E. coli methionine aminopeptidase . Protein Sci. 2004 ; 13 : 1802 – 1810 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

journal article

Open Access Collection

A metastable rRNA junction essential for bacterial 30S biogenesis

Sharma, Indra Mani;Rappé, Mollie C;Addepalli, Balasubrahmanyam;Grabow, Wade W;Zhuang, Zhuoyun;Abeysirigunawardena, Sanjaya C;Limbach, Patrick A;Jaeger, Luc;Woodson, Sarah A

2018 Nucleic Acids Research

doi: 10.1093/nar/gky120 pmid: 29850893

Abstract Tertiary sequence motifs encode interactions between RNA helices that create the three-dimensional structures of ribosomal subunits. A Right Angle motif at the junction between 16S helices 5 and 6 (J5/6) is universally conserved amongst small subunit rRNAs and forms a stable right angle in minimal RNAs. J5/6 does not form a right angle in the mature ribosome, suggesting that this motif encodes a metastable structure needed for ribosome biogenesis. In this study, J5/6 mutations block 30S ribosome assembly and 16S maturation in Escherichia coli. Folding assays and in-cell X-ray footprinting showed that J5/6 mutations favor an assembly intermediate of the 16S 5′ domain and prevent formation of the central pseudoknot. Quantitative mass spectrometry revealed that mutant pre-30S ribosomes lack protein uS12 and are depleted in proteins uS5 and uS2. Together, these results show that impaired folding of the J5/6 right angle prevents the establishment of inter-domain interactions, resulting in global collapse of the 30S structure observed in electron micrographs of mutant pre-30S ribosomes. We propose that the J5/6 motif is part of a spine of RNA helices that switch conformation at distinct stages of assembly, linking peripheral domains with the 30S active site to ensure the integrity of 30S biogenesis. INTRODUCTION The core of the ribosome is largely composed of rRNA (1,2) and adopts a similar three-dimensional structure in ribosomes from all kingdoms of life (3). Conserved sequence motifs in the rRNA encode for tertiary structural motifs (or modules) that contribute to the formation of the tertiary architecture of the ribosome and create the active sites for tRNA binding and peptide synthesis (4–7). Although most RNA tertiary motifs are needed to stabilize the mature structure of the ribosome, some motifs may exchange interaction partners or refold during the assembly process. 
An outstanding question is how RNA motifs in disparate regions of the ribosome communicate with each other to ensure complete assembly. The Right Angle (RA) is a complex RNA sequence motif identified at several locations in small subunit (SSU) and large subunit (LSU) rRNAs (Supplementary Figure S1) (8–10). The RA motif comprises an along-groove stacking interaction between neighboring helices (11) that is stabilized by two GA-minor motifs (Figure 1A, left panel, Supplementary Figure S2). The RA sequence at the junction (J) of helix (h) 5 and h6 (J5/6) in the SSU rRNA is universally conserved (Supplementary Figure S1). Surprisingly, J5/6 does not form a Right Angle in mature ribosomes (Figures 1A, right panel and 1B, inset, Supplementary Figure S2A). Instead, the tip of h15 interacts with the along-groove stacking surface of h5, forcing h6 into the splayed (obtuse) angle that defines the spur of the small subunit of the ribosome (Figure 1B, Supplementary Figure S2B).

Figure 1. An essential helix junction motif in the 16S rRNA. (A) Consensus sequence for the Right Angle (RA) motif (9) comprised of the Along-Groove stacking submotif (pink box) and two GA-minor submotifs (blue boxes). The RA motif between 16S h5 and h6 (J5/6) is destabilized by mutations used in this study at positions 1, 6, 7 and 12 (see also Figure 2B). Helices 5 and 6 form a right angle in isolation but are splayed apart to interact with h15 in the 30S ribosome. (B) Secondary and tertiary interactions between the J5/6 RA motif (h5, h6 in wheat; J5/6 RA motif in red) and its docking helix (h15 in pale green) in the three-dimensional structure of 30S ribosome (2I2P; (56)). The central pseudoknot (CP) is in bright red. Symbols and abbreviations: R, purine; Y, pyrimidine; N, any nucleotide. For base pair symbols, see legend of Supplementary Figure S2: Watson–Crick (WC), Hoogsteen (HG), and shallow groove (SG) edges are indicated by circles, squares and triangles, respectively.

Earlier footprinting data suggested that J5/6 in the 16S 5′ domain undergoes specific conformational changes during 30S assembly. It is one of the slowest regions of the 16S 5′ domain to fold in 20 mM Mg2+, indicating that proteins are needed to guide this region of the rRNA to its final structure (12). Residues in h6 are buried when the RNA is folded in ≤2.5 mM MgCl2, while higher MgCl2 or ribosomal proteins are needed to bury h15, suggesting that h6 packs with other helices before it interacts with h15 (Supplementary Figure S2B). Tethered Fe(II) hydroxyl radical cleavage of rRNA residues near the N-terminal alpha helix of bS20, which lies on one side of J5/6, revealed different intermediate structures in the presence of protein bS20 or bS20 plus uS17, compared with the mature 30S ribosome (13). 
Protein bS16, which interacts with 16S h15 on the other side of J5/6, holds h15 against h6 when h6 is in its proper orientation and produces a footprinting pattern similar to that in the complete 30S ribosome (13–15). Importantly, protein bS20, which contacts J5/6, switches the 16S rRNA to a structure that is able to productively add protein bS16 (16), increasing both the probability of bS16 binding and the lifetime of bS16 complexes (17). J5/6 is structurally coupled to a second conformational switch at 16S h3, which is coincidentally joined to h18 through another RA motif (9). Hydroxyl radical footprinting and single-molecule (sm)FRET showed that the 16S 5′ domain passes through an assembly intermediate in which 16S h3 is flipped out of the structure rather than docked against protein uS4, as it is in the mature 30S ribosome (15,18). Binding of protein bS16 greatly favors h3 docking, which in turn connects the 5′ domain to the central and 3′ domains via the central pseudoknot (14,15,19). Thus, J5/6 participates in a chain of RNA and protein interactions that connect h6 (30S spur) with the central pseudoknot of the 30S ribosome (Figures 1B and 8). This evidence that the RNA helices around h5 and h6 change conformation during 30S assembly, together with the conservation of the J5/6 RA motif among SSU rRNAs, motivated us to ask whether this motif is required for ribosome assembly. Here, we show that mutations in J5/6 block 30S maturation and impair formation of the central pseudoknot, resulting in a total collapse of interactions between the major 30S domains. The results demonstrate that RNA motifs far from the active site of the ribosome are functionally important and may encode metastable conformations that are needed at specific stages of assembly, thus accounting for their conservation during molecular evolution. 
MATERIALS AND METHODS TectoRNA design, synthesis and assembly The tectoRNA system (Supplementary Table S1) used in the study was designed as previously reported (9). The equilibrium constant of dissociation (Kd) of transcribed tectoRNAs was measured by mixing equimolar amounts of each tectoRNA at various concentrations (typically 10 nM to 20 μM) in water. Samples were denatured for 2 min at 95°C, snap-cooled for 3 min at 4°C, and incubated for 20 min at 30°C in association buffer [89 mM Tris–borate pH 8.2, 50 mM KCl and 15 mM Mg(OAc)2]. The probe (containing the GGAA tetraloop and 11-nt receptor) contained a fixed amount of 3′-[32P]-pCp-labeled RNA (∼1 nM final). Samples were cooled on ice before addition of blue loading buffer (magnesium buffer, 0.01% bromophenol blue, 0.01% xylene cyanol, 50% glycerol). tectoRNA assembly was monitored by native 10% (29:1) PAGE at a maximum temperature of 10°C for 3 h in [89 mM Tris–borate, pH 8.3, and 15 mM Mg(OAc)2]. Kd values were derived from the titration experiments performed at 10°C (Supplementary Table S2). Monomers [Probe (Ph15) and RA attenuator constructs (MJ5/6)] and heterodimers [Ph15×MJ5/6] were quantified using ImageQuant. Kd values for the equilibrium reaction Ph15 + MJ5/6 ⇌ Ph15×MJ5/6 were determined using a non-linear fit of the experimental data to the equation:

$$f = \frac{2\beta M_0 + K_d - \sqrt{4 M_0 \beta K_d + K_d^2}}{2 M_0}$$

where $f$ is the fraction of the RNA heterodimer, defined as the ratio of the dimer (Ph15×MJ5/6) to the total RNA species (Ph15 + MJ5/6 + Ph15×MJ5/6), $M_0$ is the total concentration of the probe, and $\beta$ is the maximum fraction of RNA able to dimerize (20). In the case where $\beta$ is equal to 1, the equation simplifies to $K_d = M_0 (1 - f)^2 / f$, so that Kd equals $M_0/2$ at the concentration where 50% of the heterodimer is formed. Each reported Kd represents the average of a minimum of three independent experiments. 
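The binding model in the paragraph above can be explored numerically. The following is a minimal Python sketch, not the authors' analysis code: the Methods state only that a non-linear fit was used, so the function names (`fraction_dimer`, `kd_from_fraction`, `fit_kd`), the choice to hold β fixed, the search bounds and the golden-section search over log10(Kd) are all illustrative assumptions. Concentrations and Kd must share the same unit.

```python
import math


def fraction_dimer(m0, kd, beta=1.0):
    """Heterodimer fraction f from the Methods equation.

    m0   : total probe concentration M0
    kd   : dissociation constant Kd (same unit as m0)
    beta : maximum fraction of RNA able to dimerize
    """
    root = math.sqrt(4.0 * m0 * beta * kd + kd ** 2)
    return (2.0 * beta * m0 + kd - root) / (2.0 * m0)


def kd_from_fraction(m0, f):
    """Closed form for beta = 1: Kd = M0 * (1 - f)^2 / f."""
    return m0 * (1.0 - f) ** 2 / f


def fit_kd(concs, fractions, beta=1.0, lo=1e-3, hi=1e6, iters=200):
    """Least-squares Kd estimate (hypothetical helper, beta held fixed).

    Minimizes the sum of squared residuals by golden-section search
    over log10(Kd) in [log10(lo), log10(hi)]. This assumes the error
    surface has a single minimum inside the bounds.
    """
    def sse(log_kd):
        kd = 10.0 ** log_kd
        return sum((fraction_dimer(m, kd, beta) - f) ** 2
                   for m, f in zip(concs, fractions))

    a, b = math.log10(lo), math.log10(hi)
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if sse(c) < sse(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return 10.0 ** ((a + b) / 2.0)


if __name__ == "__main__":
    # Synthetic, noise-free titration (nM) generated from a made-up Kd of 500 nM.
    concs = [10.0, 100.0, 1000.0, 10000.0]
    fractions = [fraction_dimer(m, 500.0) for m in concs]
    print(fit_kd(concs, fractions))
```

With β = 1, `fraction_dimer(1000.0, 500.0)` evaluates to 0.5, matching the statement that Kd equals M0/2 at half-maximal dimer formation. For real gel data, fitting β and Kd jointly with an established non-linear least-squares routine would be the more usual choice.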
Bacterial strains and plasmids Bacterial strains, plasmids, primers and oligonucleotides used in the study are listed in Supplementary Tables S3–S5. The J5/6 mutations were introduced into pLK45 expressing the Escherichia coli rrnB operon from λ pL (21,22), and into pSpur, a pLK45 derivative with an MS2-hairpin at the tip of helix 6 (23). Bacterial growth assays For plating assays, DH1/pCI857 cells with pLK45 derivatives containing J5/6 mutations were grown in LB containing 25 μg/ml kanamycin and 25 μg/ml carbenicillin at 30°C until mid-log (OD600 = 0.45–0.6). The cultures were diluted to OD = 0.05 and 5 μl of eight serial 10-fold dilutions was spotted onto LB agar containing 25 μg/ml kanamycin and 25 μg/ml carbenicillin or 25 μg/ml carbenicillin and 10 μg/ml spectinomycin. Plates were incubated at 32°C (repressive) or 42°C (permissive) as previously described (22). For growth in liquid media, Δ7rrn/pTRNA67/pHK-rrnC+sacB cells (24,25) transformed with pLK45 or pLK45-Triple were grown at 37°C in LB (100 μg/ml ampicillin and 50 μg/ml kanamycin or 100 μg/ml ampicillin only). After 120 min, 3% sucrose was added to the ampicillin-only cultures to select for loss of pHK-rrnC+sacB. The cell density (OD600) was recorded every 30 min. Analytical sucrose gradients and primer extension Analytical 10–40% sucrose gradients (20 mM Tris–HCl pH 7.8, 10 mM MgCl2, 100 mM NH4Cl, 2 mM DTT) were performed as previously described (26). Gradients were analyzed with a BioComp piston fractionator, and UV absorbance traces at 254 nm were recorded with WINDAQ software (DataQ). Fractions (400 μl) from peaks of interest were precipitated with ethanol overnight, extracted four times with phenol, twice with chloroform, and precipitated with ethanol prior to primer extension analysis. 
To map the 16S 5′ ends by primer extension, either 2 μg total RNA or 500 ng purified 16S rRNA (1 pmol) was annealed to 1 pmol 32P-labeled primer 161 (Supplementary Table S4) and extended by SuperScript III reverse transcriptase (Invitrogen) at 52.5°C for 30 min (14,27). Samples were analyzed by denaturing 8% PAGE. For total RNA from pSpur-transformed cells, the counts in the major cDNA products corresponding to chromosomally-derived 16S and 17S rRNA and plasmid-derived MS2–16S and MS2–17S rRNA were normalized to the total amount of mature (16S) rRNA, as E. coli regulates rRNA expression levels to control for gene dosage (28,29). SHAPE chemical probing and ensemble FRET of 16S 5′ domain complexes SHAPE experiments were performed in vitro on the 16S 5′ domain RNA (with a 3′ 1199 extension), with or without J5/6 mutations, and with or without proteins uS4, bS16, uS17 and bS20, as described previously (17,30) and in Supplementary Methods. Ensemble FRET experiments were performed in 80 mM K-Hepes pH 7.5, 330 mM KCl, 20 mM MgCl2 at 37°C as previously described (19). In vivo X-ray footprinting MRE600 (RNase I−) cells transformed with pSpur and pSpur J5/6 derivatives were grown in LB at 37°C to mid-log (OD600 0.4–0.6), frozen in 5 μl samples, and exposed for 25–100 ms to a synchrotron X-ray beam in a pre-chilled multi-sample holder on a motorized stage (X28C, National Synchrotron Light Source at Brookhaven National Laboratory) (31). Cell pellets were resuspended in 500 μl RNAprotect bacteria reagent (Qiagen) and total RNA was extracted (RNeasy mini prep, Qiagen). The cleavage pattern was assayed by extension of a 32P-labeled SpcR allele-specific primer (Supplementary Table S4). Dideoxynucleotide sequencing ladders were generated on un-irradiated RNA templates. Gels were quantified using SAFA (32) and normalized to a strong band with minimal variation between lanes. After normalization, the nucleotide intensities for the three technical replicates were averaged. 
The error bars indicate the standard deviation of the triplicates. For a few nucleotides (<5%), the band intensity of one replicate was quite different from the other two, and such outliers were discarded. Quantitative mass spectrometry of MS2-tagged ribosomes Affinity purification of pSpur-WT or pSpur-Triple ribosomes was carried out as previously described (23,33) with modifications described in Supplemental Methods. Purified ribosomal protein (∼3 μg) was digested with trypsin (0.30 μg) (34) before LC-MS/MS analysis on an Orbitrap Fusion Lumos™ mass spectrometer equipped with an electrospray ionization (ESI) source (see Supplemental Methods for details). The Orbitrap raw mass spectral data files were analyzed and matched by Thermo Proteome Discoverer (version 2.1) featuring the SEQUEST™ protein search algorithm and annotated E. coli proteome database for protein identification. Accessibility of central pseudoknot Oligonucleotide-directed RNase H cleavage of residues in the central pseudoknot was performed as previously described (35) using DNA oligomers anti-CP and anti-h21 (Supplementary Table S4) that base pair with 16S rRNA regions 906–920 and 589–603, respectively. MS2-tagged wild type and triple mutant complexes were purified by affinity as described in Supplemental Methods. pSpur-WT ribosomes were split at low MgCl2 and the 30S complex re-purified from a sucrose gradient before hybridization with anti-sense oligonucleotide. pSpur-Triple complexes were used without further purification. RNase H reactions were performed in 20 mM Tris–HCl pH7.5, 10 mM MgCl2, 40 mM NH4Cl, 60 mM KCl, 3 mM DTT with 3 or 33 μM oligomer, 50 nM 30S and 5 U RNase H on ice for 16 hrs. The cleavage products were resolved on a 2% agarose gel and stained with ethidium bromide. Negative stain electron microscopy WT and triple mutant 30S ribosome samples were imaged by negative stain transmission electron microscopy (36), as described in Supplemental Methods. 
Particles resembling 30S complexes were counted manually (>100 for WT and >50 for triple mutant 30S). RESULTS 16S J5/6 junction forms a stable right angle We confirmed that the h5 and h6 region of the Escherichia coli 16S rRNA folds into a right angle, using a minimal tectoRNA folding model described previously (9). TectoRNAs contain specific RNA structural modules that can control self-assembly into predefined, larger structures (37–40). In this assay (9), the RA conformation of the J5/6 test RNA attenuates its association with a second probe RNA that mimics 16S h15 (Figure 2A). The binding equilibrium between the test and probe tectoRNAs allows the relative stability of the RA conformation, $\Delta\Delta G_{AT}$, to be determined from the degree of attenuation. We used this system to test the stability of the RA motif in 26 natural and synthetic variations of the GA minor submotifs (Figure 2B). The free energies obtained from the binding experiments showed that the J5/6 junction from E. coli 16S rRNA forms a particularly stable RA structure (‘AAAG’ in Figure 2C), compared to the other variants tested. Moreover, those variants that are the most prevalent among bacterial and eukaryotic SSU RNAs typically formed a stable RA structure in the tectoRNA system (blue and green bars; Figure 2C). By contrast, synthetic sequences designed to disrupt the GA minor motifs or alter the inter-helix stacking were 1.2–2.2 kcal/mol less stable than the E. coli 16S J5/6 motif (orange bars; Figure 2C).

Figure 2. Thermodynamics of minimal RA folding using tectoRNA assembly. (A) Schematic of the experimental strategy: each RNA mimic of 16S J5/6 (MJ5/6) contains an RA sequence variant (purple box) at the junction between a hairpin containing a GAAA tetraloop (blue) and a second hairpin with a GGAA R1 receptor (red). MJ5/6 molecules are evaluated for their ability to bind to a probe that mimics 16S h15 (Ph15) and that also contains a GAAA 11-nt receptor (blue) and a GGAA tetraloop (red). MJ5/6 can only dock with Ph15 by adopting the splayed conformation. (B) List of J5/6 variants tested in the tectoRNA system. The WT J5/6 sequence (AAAG) is used as a reference and sequence variations are indicated in red. Construct variants are named after GA-minor positions 1_6_7_12 (in blue) as well as sequence variations (in red) in the along groove submotif (in pink). Asterisks indicate constructs previously tested (9). (C) Apparent free energy of attenuation of tectoRNA assembly at 10°C: $\Delta\Delta G_{AT} = \Delta G_{J5/6} - \Delta G_{ref}$, where $\Delta G_{J5/6}$ is the free energy of MJ5/6 and Ph15 dimerization and $\Delta G_{ref}$ is the same for the MJ5/6 reference RNA, which was chosen to be the triple mutant ACCU. The letters below each column refer to the sequence variants in (B). The column color indicates whether the RA sequence motif is natural (J5/6, blue; other rRNA, green) or synthetic (orange). For J5/6 RA sequences, letters indicate the phyletic origin, as in the key.

Mutations in 16S J5/6 junction are recessive lethal in E. coli To study the importance of the J5/6 motif for 30S ribosome biogenesis, we designed mutations in E. coli 16S J5/6 (Figure 1A, middle panel) that were intended to destabilize the right angle between h5 and h6, without destabilizing tertiary interactions between J5/6 and h15 in the mature ribosome (Figure 1B, middle panel). The chosen mutations, G107U (Single), A59C/A60C (Double) and A59C/A60C/G107U (Triple), correspond to positions 12, 6 and 7 of the RA motif (Figure 1A), and were found to destabilize the right angle conformation in experiments with minimal tectoRNAs, as predicted (Figure 2C). These single, double or triple J5/6 mutations were introduced into pLK45, which expresses the rrnB operon from the λ pL promoter under the control of a temperature-sensitive λ repressor (cI857) (21,22). pLK45 also contains a 16S mutation that makes 30S ribosomes containing plasmid-encoded rRNA resistant to spectinomycin (spcR). E. coli cells (DH1/pCI857) containing plasmids with J5/6 mutations grew nearly as well as cells containing the parental (WT J5/6) pLK45 at 42°C in the absence of spectinomycin (Figure 3A, top panel). 
By contrast, cells expressing the J5/6 mutations were unable to grow in the presence of spectinomycin, indicating that ribosomes containing mutant 16S rRNA were not functional (Figure 3A, bottom panel).

Figure 3. Mutations in J5/6 are recessive lethal and inhibit 30S assembly and maturation. (A) Growth of DH1/pCI857 cells transformed with pLK45 or derivatives containing J5/6 mutations in Figure 1A (middle). Ten-fold serial dilutions were spotted on LB agar with or without 10 μg/ml spectinomycin at 42°C (pLK45 expressed). (B) Growth of Δ7rrn/pTRNA67/pHK-rrnC+sacB/pLK45 (black) or Δ7rrn/pTRNA67/pHK-rrnC+sacB/pLK45-Triple (blue) at 37°C. Cultures were continued (solid lines), or 3% sucrose was added after 120 min to select for loss of the rrnC helper plasmid (dashed lines). (C) Plasmid-encoded rRNA is marked with an MS2 hairpin (triangle) in 16S h6. The relative locations of J5/6 mutations, processing sites for RNase III (17S rRNA) and RNase G (mature 16S 5′ end), and the priming site for cDNA synthesis in panel E, are indicated. (D) Sucrose gradient profiles from MRE600 transformed with pSpur (WT, left panel) and pSpur-Triple (J5/6 Triple mutant, right panel). See Supplementary Figure S3A for G107U and double mutants. The 30S peak is broad and light when cells express J5/6 mutant 16S rRNA, suggesting many particles are incompletely assembled. The dotted gray line indicates the sedimentation of mature 30S complexes. (E) Primer extension to map the 5′ end of 16S rRNA extracted from peak fractions of the sucrose gradient in (D). The MS2 hairpin in h6 creates longer primer extension products, distinguishing pSpur-encoded MS2-tagged rRNA from chromosomally encoded rRNA. Products corresponding to mature (16S) and immature (17S) rRNA are indicated. See Supplementary Figure S3B for bar graph of each rRNA species (n ≥ 2). The J5/6 mutant pre-rRNA is not processed to MS2–16S and does not enter the 70S fraction.

To further test whether strains bearing J5/6 mutations are viable, J5/6 WT and J5/6 triple mutant pLK45 plasmids were transformed into an E. 
coli strain that lacks all seven chromosomal rRNA operons and contains a sucrose-sensitive plasmid expressing the rrnC operon (Δ7rrn/pTRNA67/pHK-rrnC+sacB) (24,25). In the presence of the rrnC+sacB helper plasmid, both strains were able to grow at 37°C (Figure 3B, solid lines). Upon selecting for the loss of the pHK-rrnC+sacB helper plasmid with sucrose, the cells transformed with WT J5/6 pLK45 recovered after a few generations. By contrast, cells transformed with mutant J5/6 pLK45 derivatives did not recover growth (Figure 3B, dashed lines). Thus, both assays showed that 30S ribosomes containing mutations in the J5/6 right angle motif do not support cell growth.

J5/6 mutant ribosomes cannot mature

To examine whether the J5/6 mutants fail to support growth because of a defect in 30S assembly, the J5/6 mutations were introduced into pSpur, a pLK45 derivative with an MS2 hairpin at the tip of helix 6 (23). The 36 nt MS2 tag allowed the expression and maturation of the plasmid-encoded 16S rRNA to be followed by primer extension against a background of chromosomally encoded 30S subunits (Figure 3C). A polysome profile from MRE600/pSpur (WT J5/6) cells showed a pronounced 70S peak, smaller 30S and 50S peaks, and detectable 2X and 3X polyribosome peaks (Figure 3D, left panel). Primer extension analysis of 5′ end processing of the 16S rRNA revealed 75% immature rRNA in the lightest fraction of the 30S peak and a tiny fraction of immature rRNA in the 70S peak, as expected (Figure 3E, left panel). Compared to the chromosomally encoded 16S rRNA, a higher proportion of plasmid-derived MS2–16S rRNA was found in the 30S peak fractions than in the 70S peak fractions. The MS2-tagged rRNA was processed normally, however, and was able to form 70S ribosomes, as observed previously (23). By contrast, MRE600/pSpur-Triple mutant cells contained more free 30S and 50S subunits as well as smaller 2X and 3X polyribosome peaks (Figure 3D, right panel). 
In addition, the 30S peak was substantially shifted toward lighter (21S–26S) fractions, indicating a defect in 30S ribosome assembly and a build-up of immature pre-30S particles. This was confirmed by primer extension analysis (Figure 3E, right panel), which showed that the 21–26S and 30S fractions mostly contained immature MS2-tagged 17S rRNA. Virtually no mature MS2-tagged 16S rRNA containing the J5/6 mutations was detectable above the background in any of the gradient fractions, demonstrating that these mutations impair 30S assembly and prevent normal maturation of the 16S 5′ end by RNase G. Heterogeneous primer extension products may reflect inaccurate processing of the triple mutant pre-rRNA (Figure 3E). We obtained similar results for the G107U single mutation and the A59C, A60C double mutation (Supplementary Figure S3), although ∼2% of G107U MS2–16S rRNA was able to form 70S ribosomes. Therefore, even a single base change in the J5/6 motif severely impairs 30S biogenesis in E. coli.

J5/6 mutations locally perturb the rRNA structure

To determine whether the J5/6 mutations prevent the 16S 5′ domain from folding normally, we probed the secondary structure of the 5′ domain rRNA using SHAPE chemical footprinting. SHAPE chemical footprinting is sensitive to the flexibility of the RNA backbone and the conformation of the 2′ OH group (41), which can be influenced by nearby proteins (42). With some exceptions, such as 16S h12, which requires bS16 to adopt the native secondary structure (14,43), and J5/6 itself, the SHAPE data for both the WT and J5/6 mutant rRNAs were consistent with the known secondary structure of the 16S 5′ domain (Supplementary Figure S4A). Although the addition of ribosomal proteins uS4, uS17, bS20 and bS16 that bind the 5′ domain stabilized the RNA overall, the J5/6 mutations increased the reactivity of h13, the J4/5 junction and an internal loop in 16S h17 that forms part of the bS16 interaction site (red dots in h17 in Supplementary Figure S4B). 
In contrast, 16S h15, the four-way junction between h8-h10 that binds bS20, and interactions between h16 and h18 that bind uS4, were more folded in the triple J5/6 mutant than in the WT RNA (blue dots in Supplementary Figure S4B). These changes in SHAPE modification suggested that the J5/6 mutations prevent native packing between h17 and h15 against h5, thereby trapping the 5′ domain in an unproductive conformation. As discussed below, these differences in SHAPE reactivity are consistent with stabilization of a non-native 5′ domain assembly intermediate by the J5/6 mutations.

J5/6 mutations stabilize an assembly intermediate

Previous hydroxyl radical footprinting (15) and single molecule FRET experiments (18) showed that assembly of the 16S 5′ domain passes through an intermediate in which h3 is flipped away from the rest of the domain and from protein uS4 (Figure 4A). Protein uS4 binds the 5′ domain RNA when h3 is in either its flipped (F) or native (N) conformation (18). The equilibrium between these conformations of h3 was measured by ensemble fluorescence experiments, in which 5′ domain RNA with Cy3 attached near the end of h3 was titrated with Cy5-labeled uS4 protein. The increase in FRET efficiency with uS4 concentration (Figure 4B) was fit to an equation for the four-state binding model in Figure 4A, yielding the equilibrium constant K2 between the flipped (low FRET) and native (high FRET) uS4-RNA complexes (19).

Figure 4. Conformation of 16S helix 3 by FRET. (A) Four-state model for protein uS4 (pink) binding to the 5′ domain of the 16S rRNA. 16S h3 (blue cylinder) can adopt either a native (N) high FRET or flipped (F) low FRET conformation. The equilibrium constant between these conformations, K2, can be determined from the FRET efficiency of the complexes. (B) Titration of 0.2 nM Cy3-labeled RNA with Cy5-S4 protein. The 16S 5′ domain was extended and hybridized with Cy3-SA5 oligomer. A higher FRET endpoint reflects a larger proportion of native uS4 complexes. Data were fit to a quadratic binding model (see Materials and Methods) to obtain K2. Circles and smooth line, Cy5-S4 only; diamonds and dashed line, Cy5-S4 plus bS16 and bS20. (C) Equilibrium between native (N•S4) and flipped (F•S4) complexes, K2, with and without bS16 and bS20. Single and triple J5/6 mutations raise the proportion of flipped complexes.

Although the G107U and triple J5/6 mutations did not change the overall affinity of uS4, these mutations shifted the conformational equilibrium toward the flipped intermediate complex, relative to uS4 complexes with the WT 5′ domain (compare black and blue plateaus in Figure 4B and left bars in Figure 4C). The addition of proteins bS16 and bS20 to the complex stabilized the native conformation of h3 in the uS4 binding site, as previously observed (19). Nevertheless, even bS16 and bS20 could not fully overcome the preference of J5/6 mutants for the flipped intermediate conformation of h3, compared to the WT 5′ domain (right bars in Figure 4C). 
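The relationship between K2 and the FRET plateau described above can be illustrated with a minimal sketch. It assumes that K2 is defined as [N•S4]/[F•S4], that RNA is saturated with uS4 at the plateau, and that the observed FRET efficiency is a population-weighted average of the flipped and native values. The function names and all numerical values are hypothetical and are not taken from the data in Figure 4.

```python
def native_fraction(k2: float) -> float:
    """Fraction of uS4-bound RNA in the native (N) state.

    Assumes K2 = [N.S4] / [F.S4] (the direction of K2 is an assumption).
    """
    if k2 < 0:
        raise ValueError("K2 must be non-negative")
    return k2 / (1.0 + k2)


def fret_endpoint(k2: float, e_flipped: float, e_native: float) -> float:
    """Plateau FRET efficiency as a population-weighted average of F and N."""
    f_native = native_fraction(k2)
    return (1.0 - f_native) * e_flipped + f_native * e_native


def k2_from_endpoint(e_obs: float, e_flipped: float, e_native: float) -> float:
    """Invert fret_endpoint: recover K2 from an observed plateau."""
    if not e_flipped <= e_obs < e_native:
        raise ValueError("plateau must lie between the F and N efficiencies")
    return (e_obs - e_flipped) / (e_native - e_obs)


if __name__ == "__main__":
    # Hypothetical limiting efficiencies for the flipped and native complexes.
    E_F, E_N = 0.2, 0.8
    for label, k2 in (("hypothetical WT", 3.0), ("hypothetical mutant", 0.5)):
        print(label, round(native_fraction(k2), 3), round(fret_endpoint(k2, E_F, E_N), 3))
```

Under these assumptions, a smaller K2 lowers both the native fraction and the plateau, which is the qualitative pattern reported for the G107U and triple mutants.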
Because the N-terminus of protein bS20 directly interacts with J5/6, and because bS20 also increases the stable addition of bS16 during 30S assembly (16,17), we compared binding of bS20 to WT and J5/6 mutant RNAs by native PAGE (Supplementary Figure S5). The KD for the WT 5′ domain RNA was ≤ 15 nM bS20, whereas it was above 30 nM bS20 for the G107U or triple J5/6 mutant RNA (Supplementary Figure S5B). This difference was not substantially rescued by the presence of uS4 (Supplementary Figure S5C). Thus, the J5/6 mutations weaken the bS20–5′ domain interactions. This suggests a specific folding defect that prevents restructuring of the J5/6 junction.

Altogether, the results of the SHAPE footprinting, uS4 and bS20 binding assays show that the J5/6 mutations favor a 5′ domain assembly intermediate, and disfavor native-like complexes in which 16S h3 is docked against protein uS4. The J5/6 mutations could exert this effect either by directly weakening interactions between J5/6 and surrounding elements such as the N-terminal helix of bS20, or by preventing the formation of an early metastable structure that facilitates later refolding of h3. Because h3 is directly connected to h1 at the 5′ end of the 16S rRNA and to the central pseudoknot, which links the 16S 5′, central and 3′ domains in the mature 30S ribosome, misdocking of h3 has the potential to inhibit later stages of 30S assembly, thereby explaining why the J5/6 mutations block 5′ processing of the 16S rRNA in E. coli.

J5/6 triple mutation causes incomplete assembly of 30S 3′ domain

To gain greater insight into how the J5/6 region affects 30S ribosome biogenesis overall, we probed the structure of the 16S 3′ domain (30S head) by X-ray hydroxyl radical footprinting. In vivo hydroxyl radical footprinting is a powerful method for determining the solvent accessibility of the RNA backbone of even heterogeneous and difficult-to-isolate species (44). We exposed MRE600 E. 
coli cells transformed with the pSpur and pSpur-Triple plasmids to a synchrotron X-ray beam, which generates hydroxyl radicals in the cytoplasm. Extension of a primer covering the spectinomycin resistance point mutation 16S 1193U was used to selectively analyze the footprinting pattern of the 3′ domain of plasmid-encoded 16S rRNA. The 16S 3′ domain is one of the last regions of the 30S ribosome to assemble and is bound by tertiary assembly proteins uS3 and uS2 (16,45). In order to identify structural differences in the triple mutant pre-30S ribosomes, hydroxyl radical cleavage patterns of pSpur and pSpur-Triple encoded rRNAs were compared to each other and to unirradiated controls (Figure 5A). Nucleotides with at least a 50% change in the solvent accessibility in the triple mutant versus the pSpur control (Figure 5B) were mapped onto the 16S secondary structure and the tertiary structure of the 16S rRNA in the 30S ribosome (Figure 5C and D).

Figure 5. Incomplete assembly of 16S 3′ domain revealed by in vivo X-ray footprinting. (A) X-ray hydroxyl radical footprinting data for the 16S nt 1065–1115 (h35-h37) using a primer specific for the plasmid-borne spcR allele. Average adjusted reactivity for each nucleotide, which correlates with relative exposure of the rRNA backbone (see Methods). Error bars, S.D. between three technical replicates. Black, pSpur-WT; blue, pSpur-Triple. Light colors, no X-ray exposure. (B) Histogram of relative backbone exposure for Triple mutant and WT 16S rRNA. Ratios 1.4–1.8 (pink) and >1.8 (red), nucleotides that are more exposed in pSpur-Triple ribosomes; ratios < 0.75 (blue), nucleotides that are more protected in pSpur-Triple ribosomes. (C) Nucleotides with altered backbone exposure mapped onto the 16S 3′ major domain. Colored as in (B). Purple residues exhibit a strong RT pause in pSpur-Triple RNA. Light grey residues were not detected by the allele-specific primer. (D) Three-dimensional structure of the 16S 3′ domain in the ribosome (solvent side) colored as in C with protein uS2 in yellow.

In the region covered by our primer extension assay, h35-h37 and h2, which form the ‘neck’ of the 30S ribosome, were strongly perturbed by the J5/6 mutations. Helix 36 extends down the solvent side of the 30S ribosome and interacts with the minor groove of nt 16–19 that form the central pseudoknot. Helix 36 also packs against h25, forming the binding site for protein uS2. These nucleotides were more exposed in the J5/6 mutant, suggesting that this region is unfolded and not recognized by protein uS2. Milder perturbations in h33 and h34 lie under the recognition site for protein uS3. 
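The color binning used in Figure 5B can be written out as a short sketch. It assumes the ratio is the pSpur-Triple reactivity divided by the pSpur-WT reactivity at the same nucleotide, and the treatment of values that fall exactly on a boundary is also an assumption. The function names and the example reactivities are hypothetical.

```python
def exposure_ratios(mutant: dict, wild_type: dict) -> dict:
    """Per-nucleotide ratio of mutant to WT backbone reactivity.

    Nucleotides missing from either data set, or with zero WT signal,
    are skipped rather than guessed at.
    """
    return {
        nt: mutant[nt] / wild_type[nt]
        for nt in mutant
        if nt in wild_type and wild_type[nt] > 0
    }


def classify_exposure(ratio: float) -> str:
    """Bin a Triple/WT ratio using the thresholds in the Figure 5B legend."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio > 1.8:
        return "red"        # much more exposed in pSpur-Triple
    if ratio >= 1.4:
        return "pink"       # more exposed in pSpur-Triple
    if ratio < 0.75:
        return "blue"       # more protected in pSpur-Triple
    return "unchanged"


if __name__ == "__main__":
    # Hypothetical adjusted reactivities keyed by 16S nucleotide number.
    wt = {1070: 1.0, 1085: 2.0, 1100: 1.0}
    triple = {1070: 2.1, 1085: 1.2, 1100: 1.1}
    for nt, ratio in exposure_ratios(triple, wt).items():
        print(nt, round(ratio, 2), classify_exposure(ratio))
```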
Exposure of h2, h33 and h36 has been observed in other in vivo probing experiments of immature ribosomes (44,46), suggesting a common barrier or ‘checkpoint’ to 30S maturation.

Central pseudoknot is not formed in J5/6 Triple mutant ribosomes

Formation of the central pseudoknot and binding of protein uS2 are among the final events in 30S ribosome assembly (16,44,45,47,48). The partial folding of the 3′ domain in our in vivo footprinting results motivated us to examine whether the mutation in J5/6 has an impact on formation of the central pseudoknot. Allele-specific primer extension showed that residue A918 of the central pseudoknot is cleaved in ∼39% of non-irradiated BW25113/Triple mutant ribosomes (Figure 6A, left panel), suggesting that incomplete folding may leave this region accessible to nucleases, as reported previously (35). RimP, a non-enzymatic chaperone of 30S biogenesis, interacts with the 16S rRNA near the central pseudoknot. Deletion of rimP results in decreased stability of the central pseudoknot and depletion of proteins uS5 and uS12 from the 30S particles purified from the ΔrimP strain. Therefore, we tested whether overexpression of RimP could reduce cleavage of A918 in the triple mutant, and found that it did. Induction of RimP reduced the cleavage of the central pseudoknot at A918 from 31% to 19% (Figure 6A, right panel).

Figure 6. Formation of 16S central pseudoknot. (A) Allele-specific primer extension of spcR plasmid-derived rRNA extracted from BW25113 cells. BW25113 is the parent of Keio collection strains used for assembly factor over-expression. Hydrolysis of pSpur-Triple rRNA results in a significant RT pause at 16S A918, which is reduced by IPTG induction of RimP. Fraction of cDNA paused at A918 (% CP pause) is indicated below the image. The pause site was assigned based on footprinting experiments and sequencing ladders in Figure 5. Lower panel shows unextended primer at the bottom of the sequencing gel. (B) Accessibility of 16S central pseudoknot was probed by hybridization of an Anti-CP oligomer and cleavage by RNase H (see Materials and Methods). Anti-h21 is a control oligomer complementary to a sequence in helix 21. A small amount of 23S rRNA (50S) co-purified with pSpur-Triple rRNA (pre-30S) (see Materials and Methods). Left lane, DNA MW standards. 2% agarose gel was stained with ethidium bromide. Cleavage of CP region indicates that J5/6 mutations impair formation of the central pseudoknot.

We further assayed the defect in central pseudoknot formation by testing whether this region is accessible to RNase H cleavage in the presence of an anti-central pseudoknot (anti-CP) oligomer (35). 
MS2–16S rRNA containing the J5/6 Triple mutation displayed two RNase H cleavage products in Figure 6B (lanes 3 and 4) whereas the WT rRNA was not cleaved (Figure 6B, lanes 9 and 10), suggesting that the central pseudoknot and hence the decoding active site is not formed in the Triple mutant. Perturbation of the central pseudoknot appears to be a hallmark of stalled biogenesis that prevents 30S maturation (35).

Mass spectrometry and electron microscopy of J5/6 Triple mutant

To more precisely pinpoint the stage of assembly that is blocked by the J5/6 mutations, we used quantitative mass spectrometry to determine which ribosomal proteins were missing in the J5/6 triple mutant pre-30S complexes. MS2 hairpin-tagged ribosomes with a wild type or triple mutant J5/6 sequence were purified by affinity with MS2 coat protein, which recovers all of the 30S proteins (Figure 7A). The total protein content of the complexes was analyzed using LC-MS/MS with excellent coverage of ribosomal proteins (Supplementary Figure S6), and the concentration of each protein was normalized to that of uS4 (Figure 7B). The immature triple mutant pre-30S complex lacked tertiary assembly proteins uS2 (∼70% reduction) and bS21 (∼99% reduction), which are commonly missing in pre-30S particles (35,44). This is consistent with in vivo footprinting results showing that the uS2 binding site is exposed to solvent. Partial depletion of uS3 further indicated that the head was not well formed.

Figure 7. Mass spectrometry and electron microscopy of isolated J5/6 Triple mutant ribosomes. (A) Profile of proteins from isolated MS2-tagged 30S ribosomes with WT and Triple mutant J5/6 sequence. 4–20% SDS PAGE with MW standards (left lane). (B) Relative abundance of r-proteins in Triple mutant 30S. Absolute concentration of each r-protein was quantified with high sequence coverage using LC MS/MS (∼50% coverage for > 60% r-proteins, Supplementary Figure S6) and normalized to the amount of protein uS4 (see Materials and Methods). Error bars represent the standard deviation of two technical replicates. Peptides mapping to protein uS14 were not detected in this analysis. (C) Negative stain electron micrographs of WT and Triple mutant 30S complexes. The three-domain architecture of the mature 30S (body, head and platform) can be seen in WT complexes, whereas this global structure has collapsed in triple mutant complexes.

Less expected was that the mutant pre-30S complexes entirely lacked uS12 and had a ∼50% abundance of uS5. The other 5′ domain proteins, uS4, bS16, uS17 and bS20, were present at normal levels. 
Thus, proteins bS16 and bS20 that bind near the J5/6 mutation still joined the complex, but proteins uS12 and uS5 that bind at the interface between the 5′, central and 3′ domains were completely or partially prevented from binding. The absence of these proteins is consistent with the exposure of the central pseudoknot, which connects the three major domains of the 30S ribosome. Protein uS12 interacts with 16S h3 near the central pseudoknot, and failure to properly dock h3 could hinder binding of uS12. Protein uS5 directly interacts with the central pseudoknot as well as with proteins uS4 and uS12 (49,50). Thus, we reasoned that defective recruitment of uS12 to the interface with the 5′ domain resulted in long-range perturbations in the 3′ domain ‘neck’ and head, preventing binding of protein uS2.

Since the in vivo footprinting and mass spectrometry results indicated specific defects in the interactions between the major domains of the 16S rRNA, we were motivated to visualize the overall structure of J5/6 mutant ribosomes. Negative stain electron micrographs in Figure 7C suggest that the structure of the triple mutant 30S ribosome is severely distorted and heterogeneous, compared to wild type MS2-tagged 30S ribosomes. For the WT 30S ribosomes, we observed the well-formed body, platform and head, as expected, whereas in the J5/6 triple mutant complexes, none of the normal connections between domains could be identified. This global collapse of the overall structure is reminiscent of pre-30S particles from a strain lacking RimP (35), an assembly factor that aids formation of the central pseudoknot and recruitment of uS5 (35,44,51), further suggesting a specific defect in inter-domain interactions.

DISCUSSION

Non-coding RNAs contain recurring sequence motifs that usually adopt similar three-dimensional structures in different contexts (7). 
Here, we describe a right angle (RA) motif at J5/6 that is conserved among SSU rRNAs (Supplementary Figure S1), yet is splayed apart in the mature ribosome (Figure 1, Supplementary Figure S2), suggesting it could form an early metastable structure that guides 30S ribosome assembly. The importance of this motif is reinforced by the observation that the most prevalent natural J5/6 sequences also form stable RA motifs (Figure 2). Although J5/6 lies far from the center of the ribosome, mutations that disrupt the RA fold between h5 and h6 completely block pre-16S rRNA processing and are recessive lethal in E. coli. Structural probes and electron microscopy show that these mutations impart massive structural deformities by blocking formation of the central pseudoknot and interactions between the 5′, central, and 3′ domains of the 16S rRNA. The deformities correlate with a failure to recruit protein uS12, which contacts the central pseudoknot and all of the major 16S domains.

Our results show that the J5/6 mutations do not prevent binding of proteins bS16 and bS20 that directly contact J5/6. Instead, the impact of the J5/6 mutations is felt at a later stage of 30S assembly around the central pseudoknot and in the 3′ domain where protein uS2 must bind. These observations raise the question of how mutations in the ‘foot’ of the 30S ribosome are communicated to its ‘head’. That J5/6 mutations act at a distance and at a later time of assembly suggests an allosteric mechanism in which a conformational switch at J5/6 favors an RNA conformation that is competent for the addition of tertiary assembly proteins. We propose that J5/6 and other structural motifs within the 16S rRNA are linked through a ‘spine’ of RNA helices that runs through the center of the 30S ribosome (Figure 8A). 
The connectivity of this spine involves conserved elements of the 16S rRNA and can be traced through the structures of the ribosome, from the h6 spur through J5/6 to h15, h4 and h3 within the 5′ body of the ribosome. Helix 3 is in turn connected to h28 in the 3′ domain via h1 and the central pseudoknot (h2), and to the central domain through h19 and h27. We propose that this central spine of RNA connects conformational switches in distal regions of the 16S rRNA that signal the correct assembly of the 5′ and 3′ domains. By linking these events with formation of the mRNA decoding site and processing of the 17S pre-rRNA, this allosteric model may ensure the quality of ribosome biogenesis. It also explains why a mutation in the normally stable body of the 16S rRNA has a catastrophic effect on overall assembly.

Figure 8. Allosteric communication of assembly status through an RNA spine. (A) A model depicting the path of a conformational switch originating from mutations in J5/6. Bacterial 30S showing the RNA spine (left panel); helices and interactions that form the RNA spine (middle panel); a mutation in J5/6 destabilizes the RNA spine, resulting in loss of interaction between the 5′, central, and 3′ domains (right panel). (B) J5/6 motif is conserved across kingdoms. Bacterial 30S (2I2P), yeast 40S (5TGM; (57)) and Tetrahymena 40S (4V5O; (58)).
The results of footprinting and FRET experiments indicate how the conformation of J5/6 is transmitted to other regions of the 16S rRNA. In vitro SHAPE experiments showed that the J5/6 mutations prevent bS16 from natively packing h17 and h15 against h5, trapping the 5′ domain in an unproductive conformation. Loose packing of h15 communicates its negative effect to h3 via h4, which we confirmed by detecting 16S h3 in its non-native flipped conformation in the J5/6 mutant RNA using FRET (Figure 4C). Three-color smFRET experiments showed that h3 normally fluctuates between its native docked and non-native flipped conformations, but after bS16 binding, the native h3 conformation persists for longer periods (17). The results here indicate that the J5/6 RA motif is needed for the normal effect of bS16 on h3 docking.

Proper folding of 16S h3 against protein uS4 has been recently established as a ‘checkpoint’ that guides 30S assembly (15,17,18). Protein uS12 contacts the opposite face of h3 from uS4, and poor h3 docking likely hinders uS12 recruitment. Moreover, the non-native conformation of h3 negatively influences refolding of h1, which can no longer participate in the central pseudoknot that connects the 5′, central and 3′ domains. Our footprinting, mass spectrometry and electron microscopy results show that the J5/6 mutations in the 16S 5′ domain do indeed impair assembly of the 3′ major domain and prevent formation of the central pseudoknot. The absence of the central pseudoknot prevents normal interactions between the major rRNA domains and with h44, thereby preventing formation of the decoding site.

Exposure of the central pseudoknot and the absence of uS2 and uS3 are hallmarks of assembly that has stalled at the pre-30S stage (35,44). Although similar types of pre-30S complexes accumulate in strains lacking assembly factors, such as ΔrimM and ΔyjeQ (52) or ΔrbfA (44,53), many of these pre-30S particles convert into mature 30S ribosomes. 
By contrast, pre-30S complexes with J5/6 mutations never mature (Figure 3). This observation suggests that failure to correctly restructure certain 16S rRNA motifs, such as J5/6, raises the energy barrier for processing of the 17S pre-rRNA to a point where it is completely blocked (Figure 3, Supplementary Figure S3). Alternatively, continued assembly may cement an early rRNA folding error, committing the particle to a dead end that cannot be easily reversed. Parallel assembly pathways (54) could bypass 16S misfolding in some cases (52). For example, ∼2% of 16S rRNA containing the single J5/6 mutation G107U is processed and forms a 70S complex (Supplementary Figure S3). Individual complexes may stumble at different stages, explaining why we see heterogeneously deformed pre-30S particles in the negative stain electron micrographs of the J5/6 mutant (Figure 7C).

The J5/6 sequence encodes a conformational switch that is communicated to the three domains of the 30S subunit. Locating such a crucial switch in the 5′ domain is advantageous because early transcription and folding can guide proper assembly of the rest of the ribosome. That the most prevalent J5/6 sequences form stable RA motifs within h5 and h6 suggests that the RA may act as a ‘timer’ to delay interactions with h15 until 5′ domain proteins are in place, although this motif may play some other role in stabilizing the folded SSU rRNA.

A comparison of J5/6 in SSUs from bacteria, yeast and Tetrahymena shows that its architecture and structural environment are conserved (Figure 8B). In bacterial ribosomes, J5/6 is sandwiched between bS16 and bS20 and packed against the tip of h15, which supplies the along-groove interactions with h5 that would normally be made by h6 in an RA fold. The ‘splayed’ conformation of J5/6 is stabilized by the N-terminal alpha helix of bS20. 
Interestingly, proteins eS4 and eS24 appear to fulfill similar roles in the yeast and Tetrahymena 40S ribosomes, in which an alpha helix from eS24 packs into the open groove of J5/6. Thus, the structure of J5/6 is evolutionarily conserved and may serve a similar switch function during the biogenesis and assembly of eukaryotic 40S ribosomes (55).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank Cathy Squires, U. Maivali, Rachel Green, Gloria Culver and Jie Xiao for gifts of plasmids and bacterial strains, and Michael Delannoy for technical support with electron microscopy at Johns Hopkins University. We also thank Prof. Joan-Emma Shea (UCSB) for discussions at an early stage of the project.

FUNDING

National Institutes of Health [R01GM60819 to S.W., R01GM058843, S10OD018485 to P.A.L., R01GM079604 to L.J.]; UCSB Academic Senate (to L.J.). The open access publication charge for this paper has been waived by Oxford University Press – NAR Editorial Board members are entitled to one free paper per year in recognition of their work on behalf of the journal.

Conflict of interest statement. None declared.

REFERENCES

1. Noller H.F. Ribosomal RNA and translation. Annu. Rev. Biochem. 1991; 60: 191–227.
2. Moore P.B., Steitz T.A. The ribosome revealed. Trends Biochem. Sci. 2005; 30: 281–283.
3. Melnikov S., Ben-Shem A., Garreau de Loubresse N., Jenner L., Yusupova G., Yusupov M. One core, two shells: bacterial and eukaryotic ribosomes. Nat. Struct. Mol. Biol. 2012; 19: 560–567.
4. Jaeger L., Verzemnieks E.J., Geary C. The UA_handle: a versatile submotif in stable RNA architectures. Nucleic Acids Res. 2009; 37: 215–230.
5. Noller H.F. RNA structure: reading the ribosome. Science. 2005; 309: 1508–1514.
Google Scholar CrossRef Search ADS PubMed 6. Stern S. , Weiser B. , Noller H.F. Model for the three-dimensional folding of 16 S ribosomal RNA . J. Mol. Biol. 1988 ; 204 : 447 – 481 . Google Scholar CrossRef Search ADS PubMed 7. Lescoute A. , Westhof E. Topology of three-way junctions in folded RNAs . RNA . 2006 ; 12 : 83 – 93 . Google Scholar CrossRef Search ADS PubMed 8. Chworos A. , Severcan I. , Koyfman A.Y. , Weinkam P. , Oroudjev E. , Hansma H.G. , Jaeger L. Building programmable jigsaw puzzles with RNA . Science . 2004 ; 306 : 2068 – 2072 . Google Scholar CrossRef Search ADS PubMed 9. Grabow W.W. , Zhuang Z. , Swank Z.N. , Shea J.-E. , Jaeger L. The right angle (RA) motif: a prevalent ribosomal RNA structural pattern found in Group I introns . J. Mol. Biol. 2012 ; 424 : 54 – 67 . Google Scholar CrossRef Search ADS PubMed 10. Grabow W.W. , Zhuang Z. , Shea J.E. , Jaeger L. The GA-minor submotif as a case study of RNA modularity, prediction, and design . Wiley Interdiscip. Rev. RNA . 2013 ; 4 : 181 – 203 . Google Scholar CrossRef Search ADS PubMed 11. Gagnon M.G. , Steinberg S.V. GU receptors of double helices mediate tRNA movement in the ribosome . RNA . 2002 ; 8 : 873 – 877 . Google Scholar CrossRef Search ADS PubMed 12. Adilakshmi T. , Ramaswamy P. , Woodson S.A. Protein-independent folding pathway of the 16S rRNA 5′ domain . J. Mol. Biol. 2005 ; 351 : 508 – 519 . Google Scholar CrossRef Search ADS PubMed 13. Dutca L.M. , Culver G.M. Assembly of the 5′ and 3′ minor domains of 16S ribosomal RNA as monitored by tethered probing from ribosomal protein S20 . J. Mol. Biol. 2008 ; 376 : 92 – 108 . Google Scholar CrossRef Search ADS PubMed 14. Stern S. , Changchien L.M. , Craven G.R. , Noller H.F. Interaction of proteins S16, S17 and S20 with 16 S ribosomal RNA . J. Mol. Biol. 1988 ; 200 : 291 – 299 . Google Scholar CrossRef Search ADS PubMed 15. Ramaswamy P. , Woodson S.A. S16 throws a conformational switch during assembly of 30S 5′ domain . Nat. Struct. Mol. 
Biol. 2009 ; 16 : 438 – 445 . Google Scholar CrossRef Search ADS PubMed 16. Held W.A. , Ballou B. , Mizushima S. , Nomura M. Assembly mapping of 30 S ribosomal proteins from Escherichia coli. Further studies . J. Biol. Chem. 1974 ; 249 : 3103 – 3111 . Google Scholar PubMed 17. Abeysirigunawardena S.C. , Kim H. , Lai J. , Ragunathan K. , Rappé M.C. , Luthey-Schulten Z. , Ha T. , Woodson S.A. Evolution of protein-coupled RNA dynamics during hierarchical assembly of ribosomal complexes . Nat. Commun. 2017 ; 8 : 492 . Google Scholar CrossRef Search ADS PubMed 18. Kim H. , Abeysirigunawarden S.C. , Chen K. , Mayerle M. , Ragunathan K. , Luthey-Schulten Z. , Ha T. , Woodson S.A. Protein-guided RNA dynamics during early ribosome assembly . Nature . 2014 ; 506 : 334 – 338 . Google Scholar CrossRef Search ADS PubMed 19. Abeysirigunawardena S.C. , Woodson S.A. Differential effects of ribosomal proteins and Mg 2+ ions on a conformational switch during 30S ribosome 5′-domain assembly . RNA . 2015 ; 21 : 1859 – 1865 . Google Scholar CrossRef Search ADS PubMed 20. Paillart J. , Skripkin E. , Ehresmann B. , Ehresmann C. , Marquet R. A loop-loop “kissing” complex is the essential part of the dimer linkage of genomic HIV-1 RNA . Proc. Natl. Acad. Sci. U.S.A. 1996 ; 93 : 5572 – 5577 . Google Scholar CrossRef Search ADS PubMed 21. Zhang F. , Ramsay E.S. , Woodson S.A. In vivo facilitation of Tetrahymena group I intron splicing in Escherichia coli pre-ribosomal RNA . RNA . 1995 ; 1 : 284 – 292 . Google Scholar PubMed 22. Powers T. , Noller H.F. Dominant lethal mutations in a conserved loop in 16S rRNA . Proc. Natl. Acad. Sci. U.S.A. 1990 ; 87 : 1042 – 1046 . Google Scholar CrossRef Search ADS PubMed 23. Youngman E.M. , Green R. Affinity purification of in vivo-assembled ribosomes for in vitro biochemical analysis . Methods . 2005 ; 36 : 305 – 312 . Google Scholar CrossRef Search ADS PubMed 24. Asai T. , Zaporojets D. , Squires C. , Squires C.L. 
An Escherichia coli strain with all chromosomal rRNA operons inactivated: complete exchange of rRNA genes between bacteria . Proc. Natl. Acad. Sci. U.S.A. 1999 ; 96 : 1971 – 1976 . Google Scholar CrossRef Search ADS PubMed 25. Zaporojets D. , French S. , Squires C.L. Products transcribed from rearranged rrn genes of Escherichia coli can assemble to form functional ribosomes . J. Bacteriol. 2003 ; 185 : 6921 – 6927 . Google Scholar CrossRef Search ADS PubMed 26. Spedding G. Ribosomes and Protein Synthesis: A Practical Approach . 1990 ; IRL Press at Oxford University Press . 27. Moazed D. , Van Stolk B.J. , Douthwaite S. , Noller H.F. Interconversion of active and inactive 30 S ribosomal subunits is accompanied by a conformational change in the decoding region of 16 S rRNA . J. Mol. Biol. 1986 ; 191 : 483 – 493 . Google Scholar CrossRef Search ADS PubMed 28. Jinks-Robertson S. , Gourse R.L. , Nomura M. Expression of rRNA and tRNA genes in Escherichia coli: evidence for feedback regulation by products of rRNA operons . Cell . 1983 ; 33 : 865 – 876 . Google Scholar CrossRef Search ADS PubMed 29. Condon C. , French S. , Squires C. , Squires C.L. Depletion of functional ribosomal RNA operons in Escherichia coli causes increased expression of the remaining intact copies . EMBO J. 1993 ; 12 : 4305 – 4315 . Google Scholar PubMed 30. Mayerle M. , Bellur D.L. , Woodson S.A. Slow formation of stable complexes during coincubation of minimal rRNA and ribosomal protein S4 . J. Mol. Biol. 2011 ; 412 : 453 – 465 . Google Scholar CrossRef Search ADS PubMed 31. Adilakshmi T. , Soper S.F.C. , Woodson S.A. Structural analysis of RNA in living cells by in vivo synchrotron X-ray footprinting . Methods Enzymol. 2009 ; 468 : 239 – 258 . Google Scholar CrossRef Search ADS PubMed 32. Das R. , Laederach A. , Pearlman S.M. , Herschlag D. , Altman R.B. SAFA: semi-automated footprinting analysis software for high-throughput quantification of nucleic acid footprinting experiments . RNA . 
2005 ; 11 : 344 – 354 . Google Scholar CrossRef Search ADS PubMed 33. Gupta N. , Culver G.M. Multiple in vivo pathways for Escherichia coli small ribosomal subunit assembly occur on one pre-rRNA . Nat. Struct. Mol. Biol. 2014 ; 21 : 937 – 943 . Google Scholar CrossRef Search ADS PubMed 34. Dator R.P. , Gaston K.W. , Limbach P.A. Multiple enzymatic digestions and ion mobility separation improve quantification of bacterial ribosomal proteins by data independent acquisition liquid chromatography−mass spectrometry . Anal. Chem. 2014 ; 86 : 4264 – 4270 . Google Scholar CrossRef Search ADS PubMed 35. Sashital D.G. , Greeman C.A. , Lyumkis D. , Potter C.S. , Carragher B. , Williamson J.R. A combined quantitative mass spectrometry and electron microscopy analysis of ribosomal 30S subunit assembly in E. coli . Elife . 2014 ; 3 : e04491 . Google Scholar CrossRef Search ADS 36. Rames M. , Yu Y. , Ren G. Optimized negative staining: a high-throughput protocol for examining small and asymmetric protein structure by electron microscopy . J. Vis. Exp. 2014 ; e51087 . 37. Jaeger L. , Leontis N.B. Tecto-RNA: One-dimensional self-assembly through tertiary interactions . Angew. Chem. Int. Ed. 2000 ; 39 : 2521 – 2524 . Google Scholar CrossRef Search ADS 38. Ishikawa J. , Furuta H. , Ikawa Y. RNA Tectonics (tectoRNA) for RNA nanostructure design and its application in synthetic biology . Wiley Interdiscip. Rev. RNA . 2013 ; 4 : 651 – 664 . Google Scholar CrossRef Search ADS PubMed 39. Grabow W.W. , Jaeger L. RNA self-assembly and RNA nanotechnology . Acc. Chem. Res. 2014 ; 47 : 1871 – 1880 . Google Scholar CrossRef Search ADS PubMed 40. Geary C. , Chworos A. , Verzemnieks E. , Voss N.R. , Jaeger L. Composing RNA Nanostructures from a Syntax of RNA Structural Modules . Nano Lett. 2017 ; 17 : 7095 – 7101 . Google Scholar CrossRef Search ADS PubMed 41. Steen K.-A. , Rice G.M. , Weeks K.M. Fingerprinting noncanonical and tertiary RNA structures by differential SHAPE reactivity . J. Am. 
Chem. Soc. 2012 ; 134 : 13160 – 13163 . Google Scholar CrossRef Search ADS PubMed 42. Peng Y. , Curtis J.E. , Fang X. , Woodson S.A. Structural model of an mRNA in complex with the bacterial chaperone Hfq . Proc. Natl. Acad. Sci. U.S.A. 2014 ; 111 : 17134 – 17139 . Google Scholar CrossRef Search ADS PubMed 43. Moazed D. , Stern S. , Noller H.F. Rapid chemical probing of conformation in 16 S ribosomal RNA and 30 S ribosomal subunits using primer extension . J. Mol. Biol. 1986 ; 187 : 399 – 416 . Google Scholar CrossRef Search ADS PubMed 44. Clatterbuck Soper S.F. , Dator R.P. , Limbach P.A. , Woodson S.A. In vivo X-ray footprinting of pre-30S ribosomes reveals chaperone-dependent remodeling of late assembly intermediates . Mol. Cell . 2013 ; 52 : 506 – 516 . Google Scholar CrossRef Search ADS PubMed 45. Talkington M.W.T. , Siuzdak G. , Williamson J.R. An assembly landscape for the 30S ribosomal subunit . Nature . 2005 ; 438 : 628 – 632 . Google Scholar CrossRef Search ADS PubMed 46. McGinnis J.L. , Liu Q. , Lavender C.A. , Devaraj A. , McClory S.P. , Fredrick K. , Weeks K.M. In-cell SHAPE reveals that free 30S ribosome subunits are in the inactive state . Proc. Natl. Acad. Sci. U.S.A. 2015 ; 112 : 2425 – 2430 . Google Scholar CrossRef Search ADS PubMed 47. Holmes K.L. , Culver G.M. Mapping structural differences between 30S ribosomal subunit assembly intermediates . Nat. Struct. Mol. Biol. 2004 ; 11 : 179 – 186 . Google Scholar CrossRef Search ADS PubMed 48. Powers T. , Daubresse G. , Noller H.F. Dynamics of in vitro assembly of 16 S rRNA into 30 S ribosomal subunits . J. Mol. Biol. 1993 ; 232 : 362 – 374 . Google Scholar CrossRef Search ADS PubMed 49. Xu Z. , Culver G.M. Differential assembly of 16S rRNA domains during 30S subunit formation . RNA . 2010 ; 16 : 1990 – 2001 . Google Scholar CrossRef Search ADS PubMed 50. Nord S. , Bhatt M.J. , Tükenmez H. , Farabaugh P.J. , Wikström P.M. 
Mutations of ribosomal protein S5 suppress a defect in late-30S ribosomal subunit biogenesis caused by lack of the RbfA biogenesis factor . RNA . 2015 ; 21 : 1454 – 1468 . Google Scholar CrossRef Search ADS PubMed 51. Bunner A.E. , Nord S. , Wikström P.M. , Williamson J.R. The effect of ribosome assembly cofactors on in vitro 30S subunit reconstitution . J. Mol. Biol. 2010 ; 398 : 1 – 7 . Google Scholar CrossRef Search ADS PubMed 52. Thurlow B. , Davis J.H. , Leong V. , F. Moraes T. , Williamson J.R. , Ortega J. Binding properties of YjeQ (RsgA), RbfA, RimM and Era to assembly intermediates of the 30S subunit . Nucleic Acids Res. 2016 ; 44 : 9918 – 9932 . Google Scholar PubMed 53. Inoue K. , Alsina J. , Chen J. , Inouye M. Suppression of defective ribosome assembly in a rbfA deletion mutant by overexpression of Era, an essential GTPase in Escherichia coli . Mol. Microbiol. 2003 ; 48 : 1005 – 1016 . Google Scholar CrossRef Search ADS PubMed 54. Mulder A.M. , Yoshioka C. , Beck A.H. , Bunner A.E. , Milligan R.A. , Potter C.S. , Carragher B. , Williamson J.R. Visualizing ribosome biogenesis: parallel assembly pathways for the 30S subunit . Science . 2010 ; 330 : 673 – 677 . Google Scholar CrossRef Search ADS PubMed 55. Kressler D. , Hurt E. , Baßler J. A puzzle of life: crafting ribosomal subunits . Trends Biochem. Sci. 2017 ; 42 : 640 – 654 . Google Scholar CrossRef Search ADS PubMed 56. Berk V. , Zhang W. , Pai R.D. , Cate J.H.D. Structural basis for mRNA and tRNA positioning on the ribosome . Proc. Natl. Acad. Sci. U.S.A. 2006 ; 103 : 15830 – 15834 . Google Scholar CrossRef Search ADS PubMed 57. Melnikov S. , Mailliot J. , Rigger L. , Neuner S. , Shin B. , Yusupova G. , Dever T.E. , Micura R. , Yusupov M. Molecular insights into protein synthesis with proline residues . EMBO Rep. 2016 ; 17 : 1776 – 1784 . Google Scholar CrossRef Search ADS PubMed 58. Rabl J. , Leibundgut M. , Ataide S.F. , Haag A. , Ban N. 
Crystal structure of the eukaryotic 40S ribosomal subunit in complex with initiation factor 1 . Science . 2011 ; 331 : 730 – 736 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

journal article

Open Access Collection

Engineering altered protein–DNA recognition specificity

Bogdanove, Adam J;Bohm, Andrew;Miller, Jeffrey C;Morgan, Richard D;Stoddard, Barry L

2018 Nucleic Acids Research

doi: 10.1093/nar/gky289 pmid: 29718463

Abstract

Protein engineering is used to generate novel protein folds and assemblages, to impart new properties and functions onto existing proteins, and to enhance our understanding of principles that govern protein structure. While such approaches can be employed to reprogram protein–protein interactions, modifying protein–DNA interactions is more difficult. This may be related to the structural features of protein–DNA interfaces, which display more charged groups, directional hydrogen bonds, ordered solvent molecules and counterions than comparable protein interfaces. Nevertheless, progress has been made in the redesign of protein–DNA specificity, much of it driven by the development of engineered enzymes for genome modification. Here, we summarize the creation of novel DNA specificities for zinc finger proteins, meganucleases, TAL effectors, recombinases and restriction endonucleases. The ease of re-engineering each system is related both to the modularity of the protein and to the extent to which the proteins have evolved to be capable of readily modifying their recognition specificities in response to natural selection. The development of engineered DNA binding proteins that display an ideal combination of activity, specificity, deliverability, and outcomes is not a fully solved problem; however, each of the current platforms offers unique advantages, offset by behaviors and properties requiring further study and development.

INTRODUCTION

Understanding the molecular mechanisms that dictate the affinity and specificity of protein–DNA recognition is an area of investigation that remains critical for many fields of research, including protein engineering. This is particularly important for the purpose of creating novel target specificities for enzymes that act upon DNA targets, including those that are used for targeted genome modification (such as recombinases and endonucleases).
The earliest illustrations of protein–DNA recognition mechanisms, provided in part via crystallographic structures of protein–DNA complexes (1–4), emphasized the formation of directional hydrogen bonds between protein side chains and complementary acceptor and donor atoms presented by nucleotide base pairs in the major groove of a DNA duplex. These observations, along with the obvious steric complementarity between protein α-helices and the DNA major groove, led to an appreciation of the importance of such contacts for sequence-specific DNA recognition (5). A broad collection of research literature surrounding protein–DNA binding and recognition refers to the exploitation of these features as ‘direct’ readout of DNA target sequences. Investigators have also recognized the importance of entropic changes (largely driven by the ordering of protein backbone and side chains during binding, as well as the release of ordered water molecules from the protein–DNA interface) and DNA bending as additional factors that further dictate DNA binding affinity and specificity. In particular, the ability of DNA sequences to adopt or prefer unique structural shapes and features (ranging from subtle alteration of duplex dimensions, to more significant short- and/or long-range deformations) provides additional strategies for sequence-specific readout by DNA binding proteins. These collective features have often been referred to in the literature as ‘indirect readout’ and/or as ‘shape-based’ recognition (see (6,7) for more comprehensive reviews).

Continuing studies have further enhanced our understanding of the complex balance of contacts and forces that lead to protein–DNA recognition. Examination of highly diverse DNA binding protein systems has demonstrated how recognition of the shape and structural features of a potential DNA target can augment the specificity imparted by contacts to the chemically distinct sequence of individual nucleotide base pairs.
This includes recognition of altered minor groove dimensions (and corresponding changes in the surrounding surface electrostatic potential) in response to DNA bending (8,9); recognition of altered DNA conformations as a result of base modifications (such as cytosine methylation) and other epigenetic modifications (10); recognition of the structural effects of non-canonical base pairs in the target (11); and the contribution of flanking DNA sequences to target shape and conformation (12). A relatively recent review article (13) focused on how transcription factors limit their interactions with potential targets in various cell types and tissues, and described how DNA recognition involves the presence and exploitation of many layers of unique structural features beyond DNA sequence (including shape, flexibility, accessibility and cooperativity between multiple DNA binding proteins). Overall, simple codes or correlations between protein and DNA sequences that might be predictive of protein–DNA recognition are largely absent (14), except for rare examples of extremely modular DNA-binding proteins (such as TAL effectors) (15,16).

Whereas considerable progress has been made in engineering novel protein folds (17) and protein–protein recognition (18), engineering of protein–nucleic acid recognition remains difficult, and engineering and redesign of the recognition specificity of a DNA-binding protein is currently a challenging area of research and development (19). This disparity is attributable in part to the differing composition of these two types of molecular interfaces, with protein–DNA interactions involving large numbers of directional hydrogen bonds, electrostatic contacts, ordered solvent molecules and bound counterions. In addition, the changes in DNA backbone conformation and its base pair geometries that are induced by protein binding are challenging to computationally sample and predict.
Many projects that involve the retargeting of protein–DNA specificity thereby require the development and use of selection and screening strategies, usually in concert with structure-based analyses and guidance. However, considerable progress has nonetheless been reported over the past several years on the combined use of structural modeling, structure-based computational engineering, and structurally informed selections and screens to alter the DNA recognition properties of a wide variety of DNA binding proteins and enzymes (20–23). Many of these advances have been driven by activities related to targeted genome engineering and targeted gene modification, which require the creation and use of sequence-specific endonucleases, recombinases and integrases. Here, we summarize recent approaches and successes in the creation of lab-generated DNA binding proteins with altered target specificity. The highlights of engineering studies results for each protein system also summarized in Table 1. These systems range from highly modular TAL effectors, to somewhat modular, but more challenging zinc finger proteins, to several types of distinctly non-modular DNA binding proteins and enzymes (recombinases, homing endonucleases, and restriction endonucleases). Each type of protein displays unique features of ‘evolvability’ (i.e. molecular structures and interactions that allow natural selection and the passage of time to efficiently modify DNA recognition) that clearly influence their corresponding ‘engineerability’ (leading to similar alterations of recognition specificity, executed in a laboratory). Summary of many of the significant attempts to engineer the DNA recognition properties of the protein systems discussed in this review. While not intended to be entirely comprehensive and complete, this table is intended to assist reviewers in following the details of the main text Table 1. 
Summary of many of the significant attempts to engineer the DNA recognition properties of the protein systems discussed in this review. While not intended to be entirely comprehensive and complete, this table is intended to assist reviewers in following the details of the main text Platform and year(s) Targets and development Engineering approach References Zinc Fingers 1992–1993 Novel DNA triplets Structure-based modeling (49–51) 1994–1995 Novel DNA triplets Phage Display (52–55) 1999 Novel sites with GNN triplets (56) 2000 Novel 9 basepair targets Bacterial two-hybrid selections (61) 2001 Novel triplets with ANN and CNN triplets Phage Display (57,58) 2001 Novel 9 basepair targets Phage Display; hybrid 3 finger library panning (60) 2001 Novel 12 to 18 basepair targets Assembly of two-finger ZFP subunits (64) 2002 - 2003 Drosophia yellow gene target Modular assembly (31,32) 2005 Human IL2Rg gene target Zinc finger selections and assembly (22) 2008 Novel 9 basepair targets Bacterial two-hybrid selections (62) 2011 Novel 9 basepair targets Informatics-driven, Context-dependent ZFP assembly (63) Meganucleases 2002 Single basepair target variants (I-CreI) Bacterial gene elimination assay/screen (97) 2002 Activity-based selection (I-SceI) Bacterial gene elimination assay/screen (98) 2002 Hybrid nuclease generation (I-CreI/I-DmoI –> H-DreI Structure-based computational redesign (110) 2003 Single basepair target variants (PI-SceI) Bacterial two-hybrid selections (96) 2006 Single basepair target variant (I-MsoI Structure-based computational redesign (99) 2006 Multiple base pair target variants (I-CreI) Bacterial ene elmination assay/screen (100) 2006 Multiple basepair target variants (I-CreI) Eukaryotic gene recombination assay/screen (101–103) 2009 Individual and multiple base pair target variants (I-AniI) Structure-based computational redesign and bacterial selections (89) 2009–2010 Monomerization of homodimeric meganuclease (I-CreI) Structure-based modeling and 
activity-based selections (113,114) 2009 Activity-based selections (I-AniI) Yeast surface display (130) 2010 Maize liguless gene target (I-CreI) Structure-based modeling and activity-based selections (113) 2010 Multiple basepair target variants (I-MsoI) Structure-based computational redesign (105) 2014 Human Brutons Tyrosine Kinase (Btk) get target (I-AniI) Structure-based computational redesign and Yeast Surface Display (107) 2007–2013 Various eukaryotic gene targets (I-CreI) Structure-based modeling; bacterial selections; eukaryotic selections (117–129) 2014–2015 Human TCRa and CCR5 gene targets (I-OnuI) Yeast surface display meganuclease selections and MegaTAL (115,116) 2014–2015 Human CFTR gene target (I-OnuI) In vitro compartmentalization (133,134) 2017 Various eukaryotic gene targets (I-OnuI) Yeast surface display and bacterial selections (23) TAL effectors 2009 TAL effector code determination and first designer TALs Tandem repeat assembly (16,136) 2010–2011 TAL Nuclease Creation and Initial refinement Tandem repeat assembly and FokI fusion (171–173) 2012 Improved design using additional RVDs; G-specific RVDs Tandem repeat assembly using new specificity determinants and data (141–143) 2012–2013 Increasing mismatch tolerance of C-terminal repeats Tand repeat assembly and DNA substrate sequence variation (151,152) 2013–2014 Altered specificity at base 0 Modification of cryptic repeat sequences; RVD at position1, and context (132,149,150) 2014 Aberrant repeats that allow frameshift binding Incorporation of natural repeat variants with small insertions or deletions (191) 2014–2015 Expanded repertoire of RVDs for fine-tuned targeting Characterization of specificities and affinities of 400 RVDs (185,186) 2016–2017 Modularion of TAL effector binding strength and TALEN efficiency Varying the backbone (non-RVD) sequecnes of the repeats (188,189) 2017 Optimized length for maximum specificity Varying the number of repeats (153) Site specific recombinases 1999 Circumvent 
need for accessory factors (Tn3 resolvase) Error prone PCR, galK-based colored colony selection (203) 2009 Increased efficiency/selectivity (PhiC31 Integrase) Error prone PCR, lacZ selection & GFP expression (204) 1988 Enhanced and altered activity (gin) Chemical mutagenesis and bacterial selection (205) 2000 Circumvent need for accessory factors (lambda integrase) GFP based fluorescence (206) 2015 Targeting CCR5 and AAVS1 safe harbor locus (Bin and Tn21 recombinases) Site specific sequence randomization and error prone PCR, antibotic selection (207) 2003 Altered loxP sequence (Cre) site specific sequence randomization, GFP expression and FACS (218,219) 2001–2011 Altered loxP sequence,HIV LTR sequences (Cre) Substrate linked protein evolution (207,221,222, 224,226) 2013 HIV LTR (improved activity) (Cre) Molecular modeling and dynamics (227) 2017 HIV LTR (improved activity) (Cre) Observations based on the crystal structure (228) 2008 Mutants that promote heterotetramers (Cre) Structure-based selection of interfactial residues to be randomized (232) 2015 Mutants that promote heterotetramers (Cre) Protein design via molecular modeling (233) 2013 Weakened protein–protein interactions to enhance specificity (Cre) Random mutagenesis and bacterial selection (234) 1988 Enhanced activity (Flp) Substrate linked protein evolution (235) 2004 Mutants that promote heterotetramers (Flp) Error prone PCR, blue/white selection (236) 2003–2006 Altered FRT sequence, interleukin 10 target (Flp) Error prone PCR and randomization of specific sites, LacZ and RFP reporters (237,238) 2016 Enhanced activity (R and TD recombinases) Sequence truncation, random mutagenesis (240) 1995 Relaxed specificity (lambda integrase) Analysis of chimeric integrases (245) 2015 Human genome target (lambda integrase) Beta-lactamase inhibitor based screen (246) 2001 Human chromosome 8 target (PhiC31) Blue/white selection (248) 2003–2011 Hybrid reslovase/ZFN targets (serine recombinases) Truncated resolvases 
with zinc finger fusion (varied linkers) (249–251) 2011–2014 Mutants to promote heterodimers (resolvases) Rational design and directed evolution (255,256) 2011–2014 Altered specificity of catalytic domains (serine recombinases) Random mutagenesis of selected residues and directed evolution (257,258,261) Restriction endonucleases 1987–1999 1st Attempts to alter specificity (EcoRI, EcoRV, BamHI) Structure-based modeling (270–273) 2002–2006 Additional attempts to alter specificity (BstYI, NotI) Directed evolution and selection (274,275) 2003 Alteration of specificity of bifunctional RM enzyme (Eco57I) Directed evolution and selection for altered methylation specificity (278) 2009 Alteration of specificity of type IIG enzyme (MmeI) Informatics covariation analysis and structure-based modeling (21) Platform and year(s) Targets and development Engineering approach References Zinc Fingers 1992–1993 Novel DNA triplets Structure-based modeling (49–51) 1994–1995 Novel DNA triplets Phage Display (52–55) 1999 Novel sites with GNN triplets (56) 2000 Novel 9 basepair targets Bacterial two-hybrid selections (61) 2001 Novel triplets with ANN and CNN triplets Phage Display (57,58) 2001 Novel 9 basepair targets Phage Display; hybrid 3 finger library panning (60) 2001 Novel 12 to 18 basepair targets Assembly of two-finger ZFP subunits (64) 2002 - 2003 Drosophia yellow gene target Modular assembly (31,32) 2005 Human IL2Rg gene target Zinc finger selections and assembly (22) 2008 Novel 9 basepair targets Bacterial two-hybrid selections (62) 2011 Novel 9 basepair targets Informatics-driven, Context-dependent ZFP assembly (63) Meganucleases 2002 Single basepair target variants (I-CreI) Bacterial gene elimination assay/screen (97) 2002 Activity-based selection (I-SceI) Bacterial gene elimination assay/screen (98) 2002 Hybrid nuclease generation (I-CreI/I-DmoI –> H-DreI Structure-based computational redesign (110) 2003 Single basepair target variants (PI-SceI) Bacterial two-hybrid 
selections (96) 2006 Single basepair target variant (I-MsoI Structure-based computational redesign (99) 2006 Multiple base pair target variants (I-CreI) Bacterial ene elmination assay/screen (100) 2006 Multiple basepair target variants (I-CreI) Eukaryotic gene recombination assay/screen (101–103) 2009 Individual and multiple base pair target variants (I-AniI) Structure-based computational redesign and bacterial selections (89) 2009–2010 Monomerization of homodimeric meganuclease (I-CreI) Structure-based modeling and activity-based selections (113,114) 2009 Activity-based selections (I-AniI) Yeast surface display (130) 2010 Maize liguless gene target (I-CreI) Structure-based modeling and activity-based selections (113) 2010 Multiple basepair target variants (I-MsoI) Structure-based computational redesign (105) 2014 Human Brutons Tyrosine Kinase (Btk) get target (I-AniI) Structure-based computational redesign and Yeast Surface Display (107) 2007–2013 Various eukaryotic gene targets (I-CreI) Structure-based modeling; bacterial selections; eukaryotic selections (117–129) 2014–2015 Human TCRa and CCR5 gene targets (I-OnuI) Yeast surface display meganuclease selections and MegaTAL (115,116) 2014–2015 Human CFTR gene target (I-OnuI) In vitro compartmentalization (133,134) 2017 Various eukaryotic gene targets (I-OnuI) Yeast surface display and bacterial selections (23) TAL effectors 2009 TAL effector code determination and first designer TALs Tandem repeat assembly (16,136) 2010–2011 TAL Nuclease Creation and Initial refinement Tandem repeat assembly and FokI fusion (171–173) 2012 Improved design using additional RVDs; G-specific RVDs Tandem repeat assembly using new specificity determinants and data (141–143) 2012–2013 Increasing mismatch tolerance of C-terminal repeats Tand repeat assembly and DNA substrate sequence variation (151,152) 2013–2014 Altered specificity at base 0 Modification of cryptic repeat sequences; RVD at position1, and context (132,149,150) 2014 
Table 1. Summary of many of the significant attempts to engineer the DNA recognition properties of the protein systems discussed in this review.
While not intended to be entirely comprehensive and complete, this table is intended to assist reviewers in following the details of the main text.

Platform and year(s) | Targets and development | Engineering approach | References

Zinc fingers
1992–1993 | Novel DNA triplets | Structure-based modeling | (49–51)
1994–1995 | Novel DNA triplets | Phage display | (52–55)
1999 | Novel sites with GNN triplets | | (56)
2000 | Novel 9 basepair targets | Bacterial two-hybrid selections | (61)
2001 | Novel triplets with ANN and CNN triplets | Phage display | (57,58)
2001 | Novel 9 basepair targets | Phage display; hybrid 3 finger library panning | (60)
2001 | Novel 12 to 18 basepair targets | Assembly of two-finger ZFP subunits | (64)
2002–2003 | Drosophila yellow gene target | Modular assembly | (31,32)
2005 | Human IL2Rg gene target | Zinc finger selections and assembly | (22)
2008 | Novel 9 basepair targets | Bacterial two-hybrid selections | (62)
2011 | Novel 9 basepair targets | Informatics-driven, context-dependent ZFP assembly | (63)

Meganucleases
2002 | Single basepair target variants (I-CreI) | Bacterial gene elimination assay/screen | (97)
2002 | Activity-based selection (I-SceI) | Bacterial gene elimination assay/screen | (98)
2002 | Hybrid nuclease generation (I-CreI/I-DmoI –> H-DreI) | Structure-based computational redesign | (110)
2003 | Single basepair target variants (PI-SceI) | Bacterial two-hybrid selections | (96)
2006 | Single basepair target variant (I-MsoI) | Structure-based computational redesign | (99)
2006 | Multiple basepair target variants (I-CreI) | Bacterial gene elimination assay/screen | (100)
2006 | Multiple basepair target variants (I-CreI) | Eukaryotic gene recombination assay/screen | (101–103)
2009 | Individual and multiple basepair target variants (I-AniI) | Structure-based computational redesign and bacterial selections | (89)
2009–2010 | Monomerization of homodimeric meganuclease (I-CreI) | Structure-based modeling and activity-based selections | (113,114)
2009 | Activity-based selections (I-AniI) | Yeast surface display | (130)
2010 | Maize liguleless gene target (I-CreI) | Structure-based modeling and activity-based selections | (113)
2010 | Multiple basepair target variants (I-MsoI) | Structure-based computational redesign | (105)
2014 | Human Bruton's tyrosine kinase (Btk) gene target (I-AniI) | Structure-based computational redesign and yeast surface display | (107)
2007–2013 | Various eukaryotic gene targets (I-CreI) | Structure-based modeling; bacterial selections; eukaryotic selections | (117–129)
2014–2015 | Human TCRa and CCR5 gene targets (I-OnuI) | Yeast surface display meganuclease selections and MegaTAL | (115,116)
2014–2015 | Human CFTR gene target (I-OnuI) | In vitro compartmentalization | (133,134)
2017 | Various eukaryotic gene targets (I-OnuI) | Yeast surface display and bacterial selections | (23)

TAL effectors
2009 | TAL effector code determination and first designer TALs | Tandem repeat assembly | (16,136)
2010–2011 | TAL nuclease creation and initial refinement | Tandem repeat assembly and FokI fusion | (171–173)
2012 | Improved design using additional RVDs; G-specific RVDs | Tandem repeat assembly using new specificity determinants and data | (141–143)
2012–2013 | Increasing mismatch tolerance of C-terminal repeats | Tandem repeat assembly and DNA substrate sequence variation | (151,152)
2013–2014 | Altered specificity at base 0 | Modification of cryptic repeat sequences; RVD at position 1, and context | (132,149,150)
2014 | Aberrant repeats that allow frameshift binding | Incorporation of natural repeat variants with small insertions or deletions | (191)
2014–2015 | Expanded repertoire of RVDs for fine-tuned targeting | Characterization of specificities and affinities of 400 RVDs | (185,186)
2016–2017 | Modulation of TAL effector binding strength and TALEN efficiency | Varying the backbone (non-RVD) sequences of the repeats | (188,189)
2017 | Optimized length for maximum specificity | Varying the number of repeats | (153)

Site-specific recombinases
1999 | Circumvent need for accessory factors (Tn3 resolvase) | Error-prone PCR, galK-based colored colony selection | (203)
2009 | Increased efficiency/selectivity (PhiC31 integrase) | Error-prone PCR, lacZ selection and GFP expression | (204)
1988 | Enhanced and altered activity (Gin) | Chemical mutagenesis and bacterial selection | (205)
2000 | Circumvent need for accessory factors (lambda integrase) | GFP-based fluorescence | (206)
2015 | Targeting CCR5 and AAVS1 safe harbor locus (Bin and Tn21 recombinases) | Site-specific sequence randomization and error-prone PCR, antibiotic selection | (207)
2003 | Altered loxP sequence (Cre) | Site-specific sequence randomization, GFP expression and FACS | (218,219)
2001–2011 | Altered loxP sequence, HIV LTR sequences (Cre) | Substrate-linked protein evolution | (207,221,222,224,226)
2013 | HIV LTR (improved activity) (Cre) | Molecular modeling and dynamics | (227)
2017 | HIV LTR (improved activity) (Cre) | Observations based on the crystal structure | (228)
2008 | Mutants that promote heterotetramers (Cre) | Structure-based selection of interfacial residues to be randomized | (232)
2015 | Mutants that promote heterotetramers (Cre) | Protein design via molecular modeling | (233)
2013 | Weakened protein–protein interactions to enhance specificity (Cre) | Random mutagenesis and bacterial selection | (234)
1988 | Enhanced activity (Flp) | Substrate-linked protein evolution | (235)
2004 | Mutants that promote heterotetramers (Flp) | Error-prone PCR, blue/white selection | (236)
2003–2006 | Altered FRT sequence, interleukin 10 target (Flp) | Error-prone PCR and randomization of specific sites, LacZ and RFP reporters | (237,238)
2016 | Enhanced activity (R and TD recombinases) | Sequence truncation, random mutagenesis | (240)
1995 | Relaxed specificity (lambda integrase) | Analysis of chimeric integrases | (245)
2015 | Human genome target (lambda integrase) | Beta-lactamase inhibitor based screen | (246)
2001 | Human chromosome 8 target (PhiC31) | Blue/white selection | (248)
2003–2011 | Hybrid resolvase/ZFN targets (serine recombinases) | Truncated resolvases with zinc finger fusion (varied linkers) | (249–251)
2011–2014 | Mutants to promote heterodimers (resolvases) | Rational design and directed evolution | (255,256)
2011–2014 | Altered specificity of catalytic domains (serine recombinases) | Random mutagenesis of selected residues and directed evolution | (257,258,261)

Restriction endonucleases
1987–1999 | First attempts to alter specificity (EcoRI, EcoRV, BamHI) | Structure-based modeling | (270–273)
2002–2006 | Additional attempts to alter specificity (BstYI, NotI) | Directed evolution and selection | (274,275)
2003 | Alteration of specificity of bifunctional RM enzyme (Eco57I) | Directed evolution and selection for altered methylation specificity | (278)
2009 | Alteration of specificity of type IIG enzyme (MmeI) | Informatics covariation analysis and structure-based modeling | (21)

ZINC FINGER AND ZINC FINGER NUCLEASE ENGINEERING

Overview

The C2H2 zinc finger is one of the most common DNA binding motifs in multicellular organisms (24). Individual fingers contain about 30 amino acids and these units typically occur as tandem repeats of two or more fingers (Figure 1A) (25). When the structure of this motif bound to DNA was first solved, its modularity immediately suggested that a ‘mix and match’ strategy could be used to bind essentially any desired sequence in a complex genome (Figure 1B) (26). Fusing the nonspecific DNA cleavage domain of the Type II restriction enzyme, FokI, to a zinc finger protein (ZFP) allows the resulting zinc finger nuclease (ZFN) to cleave DNA at a sequence determined by the ZFP (Figure 1C) (27,28).
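As a rough illustration of the ‘mix and match’ idea, the sketch below splits a target into 3-bp subsites and looks up one pre-characterized finger per subsite. It is a minimal sketch under stated assumptions: the function names and the finger labels in the lookup table are hypothetical placeholders (not real recognition helices), and the model deliberately ignores context dependence between neighboring fingers. The reversed ordering reflects the arrangement seen in Zif268, where the N-terminal finger contacts the 3′-most triplet.

```python
from typing import Dict, List


def split_into_triplets(target: str) -> List[str]:
    """Split a DNA target (written 5'->3') into consecutive 3-bp subsites."""
    target = target.upper()
    if len(target) % 3 != 0:
        raise ValueError("target length must be a multiple of 3")
    if set(target) - set("ACGT"):
        raise ValueError("target must contain only A, C, G and T")
    return [target[i:i + 3] for i in range(0, len(target), 3)]


def assemble_zfp(target: str, finger_library: Dict[str, str]) -> List[str]:
    """Return finger modules ordered from N-terminus to C-terminus.

    Assumes a Zif268-like arrangement in which the N-terminal finger
    binds the 3'-most triplet, so the triplet order is reversed.
    """
    triplets = split_into_triplets(target)
    missing = [t for t in triplets if t not in finger_library]
    if missing:
        raise KeyError(f"no finger available for triplet(s): {missing}")
    return [finger_library[t] for t in reversed(triplets)]


if __name__ == "__main__":
    # Hypothetical library: the values are labels only, not sequences.
    library = {"GGG": "finger_GGG", "GGT": "finger_GGT", "GAC": "finger_GAC"}
    print(assemble_zfp("GGGGGTGAC", library))
```

In practice a lookup of this kind is only a starting point, because individual fingers do not behave in a fully modular fashion.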
A previous demonstration that mammalian genomes, engineered to contain a target site for the homing endonuclease I-SceI, could be manipulated at positions near a nuclease-induced double-strand break (29,30) raised the possibility that ZFNs could also be used to manipulate the genomes of complex organisms at will. Work from multiple laboratories on both ZFP engineering and additional aspects of ZFN function and specificity eventually led to the targeting of endogenous loci in Drosophila melanogaster (31,32) and in human cells (22). Successful targeting of endogenous loci soon followed in a wide variety of other model organisms, including Arabidopsis (33,34), C. elegans (35), zebrafish (36–38), mice (39), rabbits (40), and rats (41,42). Important crop species such as corn (43) and soybean (44) and economically important animals such as pigs (45) and cattle (46) have also been targeted with ZFNs. The California-based biotechnology company Sangamo Therapeutics is also testing zinc finger nucleases in human clinical trials as potential treatments for HIV/AIDS (47), hemophilia B (https://clinicaltrials.gov/ct2/show/NCT02695160), and lysosomal storage disorders (https://clinicaltrials.gov/ct2/show/NCT02702115 and https://clinicaltrials.gov/ct2/show/NCT03041324).

Figure 1. Structure and mode of action of zinc fingers and zinc finger nucleases. Panel A: Structure of the Zif268-DNA complex showing the three zinc fingers of Zif268 bound in the major groove of the DNA. Fingers are spaced at 3-bp intervals. The DNA is grey; the zinc ions are dark teal spheres. The structure and primary DNA contacting residues of zinc finger #2 (ZF2) are indicated to the right. A sequence alignment of the 3 fingers of Zif268 is shown below. The zinc binding Cys2-His2 motif is indicated with blue bold font; the canonical DNA-contacting residues are indicated by arrows. Panel B: Modular assembly of a three-finger protein from individual fingers. To generate a zinc finger protein (ZFP) with specificity for the sequence GGGGGTGAC, three fingers are identified that each bind a component triplet. These fingers are then linked. Panel C: Sketch of a pair of zinc finger nuclease (ZFN) subunits bound to two halves of a DNA target. Each ZFN contains the cleavage domain of FokI linked to an array of three to six zinc fingers (four are shown here) that have been designed to specifically recognize sequences (blue and red boxes) that flank the cleavage site. A small number of bases separate the ZFN targets. The FokI nuclease domains transiently dimerize across those central bases and cleave each DNA strand to generate a double strand break with 5′ overhangs averaging 4 bases in length.

Re-targeting zinc finger proteins

The structure of Zif268 bound to DNA (Figure 1A) provided the first detailed view of how Cys2His2 zinc fingers interact with DNA (26). Each finger consists of a simple ββα fold, wherein the α helix fits into the major groove of the DNA target. Adjacent fingers are spaced at 3 bp intervals along the DNA, and the residues at four key positions of the α-helix make base-specific contacts to each finger's portion of the DNA target site. In principle, altering these four amino acid residues in each finger should allow targeting of any desired sequence. But in practice, altering three additional residues interspersed between the four key residues usually gives the best results and allows protein–DNA contacts that do not match the ‘canonical’ pattern observed in Zif268 (48). An additional complexity of engineering zinc finger proteins is that individual zinc fingers do not behave in a completely modular fashion, and strategies to deal with this ‘context dependence’ are critical to achieving optimal results (25). Initial attempts to retarget zinc finger proteins relied upon a small panel of alternative amino acid residues at key DNA base contacting positions, based upon a bioinformatic search of naturally occurring ZFPs. This yielded some ZFP variants with altered specificity (49–51), but it was not clear that this approach could be extended to recognize all possible DNA sequences. Around the same time, multiple groups began testing a more powerful approach to engineer zinc fingers that involved using phage display to simultaneously test up to 10⁹ variant zinc finger sequences for binding to a desired sequence (25).
These initial efforts involved randomizing the four key residues of a single finger of Zif268 and yielded zinc finger variants specific for some, but not all, of the targeted sites (52–55). But even library sizes of 10⁹ variants are not large enough to properly sample all possible variants in multiple zinc fingers simultaneously, so other strategies had to be employed to target a zinc finger protein to a completely novel site. Choo and Klug were the first to target a completely novel sequence of interest by combining individual zinc fingers that had been selected separately (52), but the resulting protein had somewhat modest affinity and was only used to target expression of a reporter gene on a plasmid with multiple copies of the targeted binding site (25). The Barbas lab achieved greater success by combining fingers from separate selections to target sites of the form GNNGNNGNN (56), but a pair of such sites separated by 6 bp was required for the first generation of ZFNs; such pairs occur only about once per 4096 bp. They attempted to extend this approach to additional types of sites (57,58), but the results were less promising (59). Another strategy involved separate selections corresponding to the N-terminal or C-terminal half of a three-finger protein (one and a half fingers per selection), followed by combining the results of two such selections to create a completely novel three-finger protein (60). A hybrid approach that took advantage of a newly developed bacterial selection system (61) was more tractable (62), but was still too labor-intensive for widespread adoption. A refined version of this system used a computational approach to determine viable module-module pairings (63). But these methods were primarily geared towards generation of 3-finger ZFPs with 9 bp binding sites.
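The two numerical claims in this passage can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on simplifying assumptions that are not from the text: all 20 amino acids are allowed at every randomized position, and genomic sequence is treated as random with equal base frequencies. Under those assumptions, a pair of GNNGNNGNN sites fixes six bases in total, which is one way to arrive at the quoted figure of roughly one site per 4096 bp.

```python
AMINO_ACIDS = 20
BASES = 4
PHAGE_LIBRARY_LIMIT = 10**9  # upper bound on library size quoted in the text


def protein_library_size(randomized_positions: int) -> int:
    """Number of protein variants when each position can be any amino acid."""
    return AMINO_ACIDS ** randomized_positions


def expected_spacing_bp(fixed_bases: int) -> int:
    """Average spacing (bp) between occurrences of a motif with this many
    fixed bases, assuming random sequence with equal base frequencies."""
    return BASES ** fixed_bases


if __name__ == "__main__":
    one_finger = protein_library_size(4)          # four key residues, one finger
    three_fingers = protein_library_size(3 * 4)   # four key residues, three fingers
    print(f"one finger:    {one_finger:,} variants")
    print(f"three fingers: {three_fingers:,} variants")
    print("three fingers fit in one library:", three_fingers <= PHAGE_LIBRARY_LIMIT)
    # Two GNNGNNGNN half-sites fix 3 + 3 = 6 bases.
    print(f"paired GNN sites: about 1 per {expected_spacing_bp(6):,} bp")
```

A single finger (160,000 variants) fits comfortably within a library of 10⁹, whereas randomizing the same four positions in three fingers at once does not, which is why fingers were selected separately and then combined.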
Other groups pursued an approach that involved mixing and matching pre-selected two-finger units (64); this approach yields ZFPs with between four and six zinc fingers that have been used to successfully target a wide variety of endogenous genes (65,66). Based mainly on these improvements in retargeting ZFPs, the precision of targeting desired regions of the genome has gradually increased from one in 4096 bp to coding sequences of genes of interest (22,62,67) to the ability to generally target individual point mutations or short regulatory regions (68,69). However, even the most successful version of this strategy still relies on assembling and testing multiple pairs of ZFNs to generate optimal activity and results (65).

Cleaving DNA

While the application of zinc finger nucleases (ZFNs) for genome engineering and gene editing is not the focus of this review, examination of ZFNs has provided the most detailed analyses of the specificity of engineered ZFPs, relying upon genome-wide DNA cleavage assays as a reporter of zinc finger DNA recognition specificity. Because ZFPs and corresponding ZFNs were among the very first engineered protein systems of this type, they stand out as some of the most rigorously characterized of all designed DNA binding proteins. Creation of ZFNs involves fusing ZFPs possessing desired DNA binding properties to a DNA cleavage domain (typically the non-specific cleavage domain of FokI (27)). Although it was not understood initially, FokI must dimerize in order to cleave DNA (70,71) and thus a pair of ZFNs that bind their target sequences with the appropriate orientation and spacing is required to cleave DNA.
It was not originally known if the prokaryotic cleavage domain from FokI would function on DNA in eukaryotic cells, but experiments in Xenopus oocytes using an extrachromosomal substrate (72) and experiments in human cells using an integrated reporter construct (73) both demonstrated that engineered ZFNs could indeed target reporter constructs in higher eukaryotes. Two initial questions were exactly how to connect the FokI cleavage domain to the engineered ZFP and how much of a gap was required between the binding sites for the left and right ZFNs. Initial studies showed that a short linker and half-sites spaced by 6 bp worked well in Xenopus oocytes (72). Additional work explored some variant linker sequences and demonstrated that gaps of 5–7 bp can be targeted (74,75). A FokI cleavage domain variant with increased catalytic activity has also been generated (76). However, the majority of the FokI engineering work has focused on reducing off-target DNA cleavage. One approach has been to engineer obligate heterodimer variants of FokI (77–79). The basic concept of obligate heterodimer ZFNs is to build two variants of the FokI cleavage domain that can cut DNA when paired with each other, but can’t cut DNA when paired with a second copy of themselves. This approach has shown a substantial reduction of off-target cleavage for a ZFN pair that targets the human CCR5 gene (67). Another approach to engineer FokI domains to reduce off-target cleavage is to create a ZFN that can only nick DNA rather than cleave both strands (80,81). This is desirable in the presence of a homologous donor DNA construct because nicked DNA can still potentiate homology directed repair of DNA without leading to DNA double-strand breaks. However, published zinc finger nickase constructs tend to exhibit lower levels of the desired homology directed repair activity than comparable zinc finger nucleases. 
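The orientation, spacing and heterodimer requirements described above can be sketched as a simple sequence scan. This is a hedged illustration, not a published tool: the function names are hypothetical, only exact matches are reported, and the geometry is an assumption (each half-site is written 5′→3′ as its ZFP reads it, the upstream monomer binds the bottom strand and the downstream monomer binds the top strand, so that the two FokI domains face each other across the spacer). With `obligate_heterodimer=False`, sites that could be cut by two copies of the same monomer are also reported.

```python
from typing import Iterable, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_zfn_pair_sites(
    sequence: str,
    left_site: str,
    right_site: str,
    gaps: Iterable[int] = (5, 6, 7),
    obligate_heterodimer: bool = True,
) -> List[Tuple[int, int, str]]:
    """Return (start, gap, pair) for every exact ZFN pair site on the top strand.

    pair is "LR" or "RL" for heterodimer sites; "LL" and "RR" homodimer
    sites are included only when obligate_heterodimer is False.
    """
    sequence = sequence.upper()
    monomers = {"L": left_site.upper(), "R": right_site.upper()}
    pairs = [("L", "R"), ("R", "L")]
    if not obligate_heterodimer:
        pairs += [("L", "L"), ("R", "R")]

    hits = []
    for first, second in pairs:
        upstream = reverse_complement(monomers[first])  # bound on bottom strand
        downstream = monomers[second]                   # bound on top strand
        for gap in gaps:
            offset = len(upstream) + gap
            span = offset + len(downstream)
            for start in range(len(sequence) - span + 1):
                if sequence.startswith(upstream, start) and sequence.startswith(
                    downstream, start + offset
                ):
                    hits.append((start, gap, first + second))
    return sorted(hits)


if __name__ == "__main__":
    left, right = "GGGGGTGAC", "GCGTGGGCG"  # illustrative 9-bp half-sites
    seq = "AT" + reverse_complement(left) + "AAAAAA" + right + "AT"
    print(find_zfn_pair_sites(seq, left, right))  # [(2, 6, 'LR')]
```

A realistic off-target search would also need to tolerate mismatches within each half-site, which this exact-match sketch does not attempt.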
Specificity

Genome engineering with artificial nucleases is premised on the fact that DNA cleavage is focused mainly at the desired target site. High levels of off-target DNA cleavage could be toxic to the cells and even low levels of off-target cleavage at certain genomic loci could be problematic for applications such as human therapeutics. Some early ZFN work monitored off-target cleavage by measuring various effects of overall DNA cleavage in cells. This included staining cells for foci of DNA repair proteins that are indicative of a DNA double-strand break (67) and monitoring the amount of phosphorylated H2AX histone in a cell by flow cytometry, since H2AX is phosphorylated in response to DNA damage (78). However, these methods became less effective for comparing different nucleases as the specificity of ZFNs improved to the point where ZFN-induced breaks are not detectable over the background level of DNA breaks in the cells of interest (67). Thus, many groups started developing assays to monitor double-strand breaks (DSBs) at specific off-target loci. Initial attempts to identify individual ZFN off-target loci used either a bioinformatics approach to search the relevant genome for sites homologous to the intended target or a combination of bioinformatics and a biochemical DNA specificity assay (67). However, in practice these approaches still missed numerous off-target sites that were identified by more sophisticated methods developed later. The first method capable of directly monitoring ZFN cleavage of a complex mixture of potential target sites used a large library of potential ZFN cleavage sites. This library of potential sites was digested in a cell-free system and the results were obtained by high throughput DNA sequencing (82). Later, a method was developed that could identify sites of ZFN cleavage genome-wide in human cells.
This method relies on the observation that DSBs can capture exogenous DNA with a mechanism that doesn’t rely on sequence homology (83). Briefly, integrase deficient lentivirus (IDLV) and ZFNs are co-transfected into human cells, IDLV is captured at sites of DSBs, and then Linear Amplification Mediated PCR (LAM-PCR) followed by high-throughput sequencing is used to identify genomic loci where IDLV integration has occurred. This analysis was applied to variants of ZFN pairs targeted to the human CCR5 and IL2Rg genes. For CCR5-targeted ZFN variants with heterodimer FokI variants, off-target sites were identified, but even in aggregate cleavage at these off-target sites was less than the cleavage at the intended target. This type of assay probably won’t be able to detect weak off-target sites and it is formally possible that a ‘long tail’ of weak off-target sites could still cause a substantial amount of off-target DSBs in each cell. But sequencing the genome of a ZFN-treated C. elegans did not identify any ZFN-induced indels other than at the intended target (84) and exome sequencing of ZFN-treated induced pluripotent stem (iPS) cells did not identify any ZFN-induced changes other than the intended change (69).

HOMING ENDONUCLEASE (MEGANUCLEASE) ENGINEERING

Overview

Homing endonucleases (now usually termed ‘meganucleases’) are primarily associated with mobile self-splicing elements (introns and inteins) that display genetic mobility and evolutionary persistence within all known forms of microbial life. Like zinc finger nucleases, meganucleases have been studied since the mid-1990s for use in genome editing applications (29,30,85). As the drivers of DNA invasion events, meganucleases face strong evolutionary pressure to continuously alter their DNA recognition specificity. By doing so, they increase their ability to invade new DNA sequences and targets, while also persisting in their current host genes.
This (along with various mechanisms to control the timing and level of their expression) provides the endonuclease with sufficient specificity to avoid overt toxicity to their host organism, while still accommodating individual polymorphisms within their targets that naturally occur over the course of host genetic drift (86). At least five distinct structural families of meganucleases have been visualized and extensively characterized. Of these proteins, those from the ‘LAGLIDADG’ family (Figure 2) have been extensively developed and engineered for genome engineering. LAGLIDADG endonucleases correspond both to homodimeric and single-chain monomeric proteins (for the latter, the N- and C-terminal domains display considerable structural similarity). In each case, residues corresponding to the interface between the protein domains, as well as metal-binding active site residues, form a 10-residue sequence motif represented by the consensus ‘LAGLIDADG’ nomenclature. Figure 2. View largeDownload slide Structure and reprogramming of a meganuclease. Panel A:Structure and original target site of the I-OnuI meganuclease. The protein is comprised of a single protein chain of 290 residues and is bound to a 22 base pair DNA target site. The N- and C-terminal domains of the endonuclease, which possess the same overall protein fold related by a pseudo two-fold symmetry axis, recognize and interact with the 5′ and 3′ half-sites of the DNA target site, respectively. The interface between the target 5′ half-site and the protein N-terminal domain is indicated by the oval. Panel B:Schematic of immediate contacts between the DNA 5′ half-site and the meganuclease N-terminal domain (corresponding to oval in panel a above). Bases and protein residues in blue boxes correspond to elements shown in panel C. 
Panel C: Region corresponding to the contacts between two consecutive base pairs in the DNA target site (indicated by the blue box in panel B) and the six nearest-neighboring protein side chains (also indicated in panel B with blue boxes). In a typical selection experiment, a cluster of at least six such residues is simultaneously randomized and incorporated into a combinatorial protein library for subsequent screening against a DNA substrate containing the desired base pairs at the corresponding nucleotide positions. Panel D: DNA-bound structure of a fully reprogrammed variant of the I-OnuI enzyme, harboring selected point mutations at 50 residues in the protein–DNA interface (corresponding to ∼17% of the total protein sequence; indicated with red spheres spanning the side chains of each altered residue). The engineered protein, which recognizes a DNA sequence that differs from the original target at over half of its base pair positions (12 out of 22; the altered base pairs are indicated by lower case letters), displays an rmsd across all backbone atoms of only 0.6 Å. A structural superposition of the wild-type enzyme and its fully redesigned variant is shown to the right (engineered enzyme is colored blue). DNA recognition by this meganuclease family (Figure 2) is noteworthy with respect to the length of its target sites (usually 22 base pairs) and the size and complexity of its DNA-contacting surface (involving upwards of 50 amino acids). 
The mechanism of DNA recognition by these meganucleases, like that of many other DNA-binding proteins, involves a mixture of (i) contacts between protein side chains and nucleotide bases (largely concentrated within the major groove of the target site), (ii) significant DNA bending that results in a distortion of both major and minor groove dimensions, as well as alteration of molecular surface electrostatic distribution near the center of the site (87,88) and (iii) a variety of additional contacts within and near the minor groove, particularly at the bent target site center (87,88). The latter two features impose considerable specificity across the central 4 base pairs of the target site, in the absence of direct contacts to the protein. Specificity of recognition is strongly enforced during catalysis (i.e. at the DNA cleavage transition state) as well as through reduction of binding affinity (88). In some cases, the contributions of binding affinity and cleavage activity to overall nuclease specificity are strongly segregated between the two protein domains and corresponding DNA half-sites (89). Overall, meganucleases display non-uniform recognition at individual positions across the DNA target site (ranging from nearly exclusive recognition at some nucleotide positions to considerable promiscuity at nearby or adjacent positions) (86,89). The mechanism by which overall recognition specificity is enforced ranges from considerable indirect ‘shape readout’ near the center of the DNA target to a stronger reliance on direct readout of DNA base chemistries across the more distal ends of the DNA target sequence (88). Recent studies have demonstrated that even relatively moderate divergence of meganuclease sequences allows them to establish significantly altered DNA specificities, thereby enabling recognition and cleavage of new genomic targets (88). 
The ability of these proteins to efficiently generate new DNA target specificities, with relatively minor resculpting of their structures, is probably a consequence of their function as the catalysts of gene invasion, mobility and genetic persistence. Some of the initial demonstrations that the action of a site-specific nuclease at a unique target within a mammalian genome could increase targeted gene modification events involved the use of the I-SceI LAGLIDADG endonuclease (29,30,90). In those studies, the natural target site of that enzyme was first introduced into a desired chromosomal allele, prior to the subsequent expression and action of the meganuclease. Additional experiments using an integrated I-SceI target site and introduction of the wild-type enzyme have demonstrated correction of an exon disruption in the Artemis gene in mouse hematopoietic stem cells (91) and in vivo targeted recombination in mouse liver (92). Subsequent to those experiments, it became clear that use of meganucleases for targeted genome modification would require substantial alteration of their recognition specificity. The first crystallographic structures of meganucleases (I-PpoI and I-CreI in 1998 (93,94) followed by I-MsoI, I-AniI and I-SceI in 2003 (87,95)) allowed identification of the amino acids in each system that were found within contact distance of base pairs in their DNA targets. With such information now available, it is possible to extensively and routinely retarget a meganuclease for the modification of unique genomic targets. The history of experiments that have led to this capability is summarized below. Alteration of meganuclease target specificity at individual base pairs Initial studies focused on the systematic alteration of individual residues within a meganuclease DNA-binding surface that might cause a change in specificity at a corresponding single base pair, coupled to in vitro or cellular assays of cleavage activity (96,97). 
These early investigations generally used either DNA binding reporter systems (such as bacterial two-hybrid screens (96)) or methods that coupled site-specific DNA cleavage to the elimination of a reporter gene (97,98). The results of these experiments indicated that at a limited number of individual DNA target positions and corresponding endonuclease contact residues, point mutants of the meganuclease could be identified that displayed a strong shift in specificity without a significant decrease in recognition fidelity at that position. Such positions in the protein–DNA interface typically corresponded to the most distal (outer-most) positions in the target site, where individual amino acid side chains (extending from protein loops at the periphery of the folded protein) often contact single nucleotide base pairs. At the same time, a purely computational approach to accomplish the same purpose was also reported, using the Rosetta computational protein design algorithm. Similar to the results above, the altered enzyme cleaved its corresponding DNA target site (containing a single altered base pair) several orders of magnitude more effectively than did the wild-type enzyme, along with wild-type ability to discriminate between the two targets (99). Combined alteration of specificity at multiple, adjacent base pairs By 2006 it was clear that mutation of individual DNA-contacting residues (while otherwise maintaining an unchanged protein sequence) at certain positions in the DNA target site might sometimes result in desired changes in specificity at unique base pairs in the DNA target sequence (100). However, it was not obvious whether such changes in endonuclease sequence and function might be readily combined. 
To address these questions, a selection method to screen meganuclease libraries for altered DNA cleavage specificity was developed, in which endonuclease activity was coupled to reconstitution of a reporter gene via DSB-induced homologous recombination (101–103). This approach was used to systematically screen semi-randomized libraries of the I-CreI meganuclease. In this way, investigators were able to identify protein variants containing multiple alterations in their amino acid sequences that allowed recognition of a genomic DNA target site containing multiple adjacent base pair substitutions (102,103). These experiments indicated that individual protein mutations that reduce activity or specificity on their own might function well in more extensively altered protein variants; conversely, some mutations of DNA-contacting residues that functioned well on their own were incompatible with protein mutations at nearby positions (reviewed in (104)). A similar effort to alter specificity across multiple consecutive base pairs, using a structure-based computational approach, further illustrated the importance of the context dependence of protein–DNA interactions (105). A series of studies to further improve structure-based computational redesign approaches was subsequently reported from 2009 through 2014. In those studies, the overall contribution of contacts to each base pair position on binding and/or DNA cleavage was determined (89), followed by improvements in the computational prediction of side-chain base interactions and conformations in the protein–DNA interface (106). This work led to the eventual redesign of the I-AniI meganuclease and its use in genome modification experiments both in mammalian cells (107) and in mosquitos (108). Details of strategies that combine computation (using the Rosetta program suite) with selection experiments for nuclease activity are documented in (19). 
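To give a sense of scale for the semi-randomized libraries screened in such selections, the following is a minimal sketch of how library diversity grows with the number of simultaneously randomized residues. It assumes that each randomized position may take any of the 20 standard amino acids, and, as a purely hypothetical library design not taken from the studies cited above, that each position is encoded by an NNK codon. The function names are illustrative only.

```python
# Assumption: every randomized position can take any of the 20 standard amino acids.
N_AMINO_ACIDS = 20
# Assumption (hypothetical library design): each position is encoded by an NNK codon,
# where N = A/C/G/T and K = G/T, giving 4 * 4 * 2 = 32 codons per position.
N_NNK_CODONS = 4 * 4 * 2


def protein_diversity(n_positions: int) -> int:
    """Number of distinct amino acid combinations over n randomized positions."""
    if n_positions < 0:
        raise ValueError("n_positions must be non-negative")
    return N_AMINO_ACIDS ** n_positions


def nnk_diversity(n_positions: int) -> int:
    """Number of distinct DNA sequences when each position is an NNK codon."""
    if n_positions < 0:
        raise ValueError("n_positions must be non-negative")
    return N_NNK_CODONS ** n_positions


# Print how quickly the search space grows with the number of randomized residues.
for n in (1, 2, 4, 6):
    print(f"{n} positions: {protein_diversity(n):,} protein variants, "
          f"{nnk_diversity(n):,} NNK-encoded DNA variants")
```

Under these assumptions, a cluster of six randomized residues already corresponds to 64 million protein-level combinations, which is one way to see why such libraries are interrogated by selection or screening rather than by testing variants one at a time.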
Hybrid meganucleases A series of studies also demonstrated that while the DNA contacting elements and surfaces of meganucleases are distinctly non-modular, their N- and C-terminal domains can be structurally separated and recombined to form ‘hybrid’ meganuclease scaffolds that recognize chimeric DNA target sites (109–112). In addition, a homodimeric meganuclease (I-CreI) was turned into a functionally equivalent ‘single chain’ monomeric protein via introduction of a peptide linker between the two subunits of the dimeric enzyme (113,114). These experiments further enabled the development of novel meganuclease recognition, both by creating new starting protein scaffolds and specificities, and by reducing the process of engineering new specificity into two separate experimental tasks, for which the output could be fused into a final nuclease construct. Complete retargeting of meganuclease specificity and application to genome editing Multiple groups (in both industry and in academia) have exploited the results and observations summarized above to create extensively retargeted meganucleases for genome engineering and targeted gene modification. In all cases, the use of direct structure-based redesign and structure-based selection methods have each found a significant role in the engineering process, but the need for selection experiments as a fundamental requirement for high activity and requisite specificity has not been eliminated. In certain cases, the incorporation of such engineered meganuclease constructs into chimeric ‘MegaTAL’ architectures (comprised of N-terminal TAL effector domains tethered to C-terminal engineered nucleases; (115)) has facilitated the use of such enzymes for highly demanding applications in primary human cells as part of various therapeutic approaches (116). Two separate biotechnology companies (Cellectis Inc. and Precision Biosciences Inc.) 
have described the generation of fully redesigned meganucleases, based on single chain versions of the I-CreI enzyme, and their subsequent use for targeted gene editing applications. Engineering and selection steps were separately focused on the N- and C-terminal protein domains (each targeting a half-site within the final genomic target); the resulting domains were then combined into a single polypeptide that was further refined for best in vivo performance. These two approaches largely converged on mutations of the same DNA-contacting protein side chains for various alterations of DNA recognition specificity. The variants of single-chain I-CreI endonuclease created by these groups include engineered meganucleases used for correction of the human XPC gene for the treatment of Xeroderma Pigmentosum (117–119), generation of cell lines harboring precisely generated genetic insertions and alterations (120,121), creation of genetically modified maize containing heritable disruptions of the liguleless-1 and MS26 loci (113,122), modification of defined genomic regions in Arabidopsis (123), insertion and stacking of multiple trait genes in cotton (124), generation of Rag1 gene knockouts in human cell lines (125,126) and in transgenic rodents (127), disruption of integrated viral genomic targets in human cell lines (128), and targeted exon deletions in the human DMD gene associated with Duchenne Muscular Dystrophy (129). Crystallographic structures of two of these fully reengineered variants (against the human Rag1 and XPC targets) have been solved and described (119,126). Yet another biotechnology company (Pregenen, Inc.), in concert with several academic research labs, developed a high-throughput flow cytometric approach to screen semi-randomized endonuclease libraries for altered binding and cleavage specificity (130). Using this strategy, gene targeting nucleases have been created that cleave unrelated targets in human, viral or insect host genes. 
The resulting meganucleases have again been shown to be highly active in transfected primary human cells and transgenic insects, and display specificity profiles that rival or exceed that of the parental meganuclease. These enzymes drive the disruption of fertility-related genes as part of a gene drive strategy for the control of insect disease vectors (131), disrupt the gene encoding the T-cell receptor α-chain (as part of a broader strategy to create engineered T-cells that can be used as anticancer immunotherapeutic reagents) or disrupt the human CCR5 gene, which encodes a co-receptor for HIV (116). The methods used by these latter investigators have been described in detail previously (109,132). A complementary strategy for retargeting meganuclease specificity utilizes a technique known as in vitro compartmentalization (‘IVC’) (133,134). In this approach, the meganuclease is redesigned via activity selections within compartmentalized aqueous droplets. The method was illustrated by engineering several different meganucleases to cleave multiple human genomic sites, as well as enzymes that discriminate between single nucleotide polymorphic (SNP) variants. Structural and functional outcomes of meganuclease engineering Crystallographic and biophysical analyses of five different extensively retargeted variants of a single meganuclease that have been shown to function efficiently in ex vivo and in vivo applications have been reported more recently (23). The redesigned proteins harbor mutations at up to 53 residues (18% of their amino acid sequence), primarily distributed across the DNA binding surface, making them among the most significantly reengineered ligand-binding proteins to date (Figure 2D). 
Other than maintaining their original specificities across the central four base pairs of each target site (a constraint that is related to bending of the DNA), the base pair identities are changed liberally throughout the remainder of the DNA target, and many base pairs are present at least once at each position. The reorganization and structural changes in these proteins that facilitate recognition of alternate DNA targets can be described as the sum of: (i) small protein backbone motions involving DNA-contacting β-sheets (that contribute the largest share of contacts to nucleotide bases throughout the major groove); (ii) much larger reorganization of flanking protein loops at both ends of the β-sheets, and (iii) extensive role-swapping throughout the entirety of the protein–DNA interface. ‘Role-swapping’ refers to protein residues that, after mutation, have switched from interacting with DNA to instead interacting solely with surrounding protein side chains, or vice-versa. Changes in overall DNA recognition specificity are facilitated by the ability of residues in or near the protein–DNA interface to readily exchange both form and function in this manner. The fidelity of recognition is not precisely correlated with the fraction or total number of residues in the protein–DNA interface that are actually involved in DNA contacts, including directional hydrogen bonds. The plasticity of the DNA-recognition surface of this protein, which allows substantial retargeting of recognition specificity without requiring significant alteration of the surrounding protein architecture, reflects the ability of the corresponding genetic elements to maintain mobility and persistence in the face of genetic drift within potential host target sites. This demonstrates the extent to which a single meganuclease protein can be substantially reprogrammed for recognition of multiple unique genomic target sites, without the need for significant alteration of the surrounding protein scaffold. 
TRANSCRIPTION ACTIVATOR-LIKE (TAL) EFFECTOR ENGINEERING Overview Similar to meganucleases, transcription activator-like (TAL) effectors evolved under a unique set of selective forces that shaped their unique DNA recognition properties. Despite the ‘-like’ in their name, TAL effectors are indeed transcription activators. Made by plant pathogenic bacteria in the genera Xanthomonas and Ralstonia, they enter host cells via the bacterial type III secretion system, translocate to the nucleus by virtue of C-terminal nuclear localization signals, bind to individual sequences in the host genome determined by a DNA recognition domain distinct for each effector, and directly upregulate downstream genes by virtue of a C-terminal acidic activation domain. TAL effectors have been selected that activate host genes whose expression facilitates bacterial multiplication and spread. Such genes are referred to as disease susceptibility or ‘S’ genes. The presence of an effector binding element (EBE) in the promoter of a so-called plant ‘executor resistance (R) gene’, however, will result in TAL effector-triggered host immunity, and exert negative selection pressure on the corresponding TAL effector. At the same time, sequence variation in the EBE of a major S gene can render a plant effectively resistant by preventing binding and activation by the TAL effector, resulting in loss of susceptibility and selection for TAL effectors that can accommodate the sequence variation. Several examples of both such host adaptations have been characterized in diverse plant species (135). Thus, TAL effectors can be presumed as a group to have been selected for the ability to rapidly evolve new specificities in order to probe the host genome for beneficial targets, and individually to have been subject to contrasting selective pressures - for stringent specificity to discriminate between potential EBE sequences in S vs. 
R genes, and for lax specificity to accommodate minor sequence polymorphism at S gene EBEs across different host genotypes. The result of this selection (and/or the result of selection on ancestral proteins in other functional contexts not yet discovered) is a DNA recognition domain that functions via a modular mechanism, which allows evolution of new specificities by recombination-based shuffling and by point mutations within modules, and variation in the specificity profiles and affinity contributions of individual modules that confers plasticity in targeting stringency (16,136–138). The TAL effector DNA binding domain forms a superhelical, monomeric protein chain that wraps around B form DNA in a right-handed manner, tracking the major groove without inducing any bend or other substantial structural distortion (Figure 3A and B). Its repeated modules form contiguous, two-helix bundles (Figure 3C and D), each of which comprises a highly conserved sequence of typically 33–35 amino acids and interacts with a single base, contiguously, on one strand of the DNA. The contribution of each module to specificity and affinity is determined predominantly and predictably by two residues within the repeat that vary, at positions 12 and 13, together referred to as the ‘repeat variable di-residue’ (RVD). The RVD resides in the loop connecting the two helices of a module. Residue 12 interacts with the backbone at position 8 to stabilize and position the loop, and the side chain at position 13, which has been called the base-specifying residue (BSR), projects into the major groove to interact with the base at that position. RVDs HD, NG, NI and NN are the most common in native TAL effectors and the most commonly used in engineering. HD specifies cytosine through van der Waals interaction and hydrogen bonding between the aspartic acid side chain and the base, resulting in high specificity and high affinity. 
NG specifies thymine via nonpolar van der Waals interaction between the backbone α carbon of the glycine residue and the methyl group of the thymine, again with high specificity, but lower affinity. NI specifies adenine through nonpolar van der Waals contacts of the isoleucine side chain to the purine ring, which appear to desolvate at least one polar atom in that ring; though specific, this likely makes the interaction decidedly low-affinity. NN has dual specificity, making high affinity contact with guanine or adenine by hydrogen bonding between the BSR and N7 of either purine. Figure 3. Structure of the TAL effector–DNA association and the basis of specificity. Panels A and B: The structure of the PthXo1 binding region (comprised of 22 TAL effector repeats distributed along a single monomeric protein chain) bound to its DNA target site is shown from the side of the DNA duplex and looking down the axis of the DNA. The effector contains 22.5 repeat modules, each colored separately. In the side view, the N-terminal end of the protein is leftmost. The structure also contains two cryptic N-terminal repeats that engage the DNA backbone via a series of basic residues, and that contact a strongly conserved thymine at the 5′ position of the binding site. Panel C illustrates the contacts made by the HD RVD (residues 12 and 13) in repeat number 14. The histidine at position 12 in the repeat forms a hydrogen bond to the backbone carbonyl oxygen of residue 8 in the first α-helix, while the aspartate at position 13 forms a hydrogen bond to the extracyclic amino nitrogen of the cytosine base. Panel D shows repeats 14, 15 and 16 interacting with the DNA, illustrating that consecutive RVDs (HD, NG and NN, respectively, in these repeats) contact consecutive bases (in this case cytosine, thymine, and guanine) on the same DNA strand. Figure adapted with permission from Figure 1 in Doyle et al. (2013) Trends in Cell Biology 23(8):390–398. Several other, less common RVDs are found in native TAL effectors (15,16,139,140). Of particular note for engineering are NK, NH, N* and NS. NK and NH provide better specificity for guanine than the dual-specificity NN, though NK, and to a lesser extent NH, weaken overall interaction relative to NN (141–143). N*, in which the asterisk designates a missing amino acid at position 13 resulting in a slightly retracted interhelical loop (137), is found most often associated with thymine or cytosine in nature (16) and was shown to be a suitable alternative for HD when cytosine at the target might be methylated (144). 
In native TAL effector–target alignments, NS can be found in association with any of the four bases (16), and can be considered a ‘wildcard’ RVD for engineering. The array of modules that determines DNA-binding specificity constitutes a domain called the central repeat region (CRR). Immediately N-terminal to the CRR, four additional two-helix bundles are present that do not match the repeat consensus sequence (137,138,145,146). Through lysine and arginine contacts with the DNA backbone, these ‘cryptic repeats’ can bind DNA independently, and are thought to nucleate interaction with the DNA for sequence-specific binding mediated by the CRR (146–148). The cryptic repeat closest to the CRR plays an important role in the general requirement of Xanthomonas TAL effectors for a thymine at position ‘0’ of the EBE, immediately 5′ of the first RVD-specified base: a tryptophan residue (W232) in the cryptic repeat coordinates with the thymine (137,149). Ralstonia TAL effectors (called ‘RipTALs’), which present an arginine instead of tryptophan at that position, require a guanine at the corresponding position (140,149). The structural and biochemical basis for these specificities is, however, poorly understood. Influences of the CRR composition, particularly the RVD of the first repeat, and of the experimental context have been observed (149,150). Efforts to engineer altered specificities for base 0 have, nonetheless, met with some success (132). The cryptic repeats may exert an effect on another property of TAL effector DNA recognition, increasing mismatch tolerance closer to the C-terminal end of the CRR (151,152). TAL effectors acquire their targets via a rotationally decoupled, linear search mechanism in which the cryptic repeats provide the major contribution to non-specific association with the DNA (147,148). 
Given this anchoring role of the cryptic repeats, it seems likely that the interaction transitions to specific binding via BSR-nucleotide contacts initiating from the end closest to the cryptic repeats, and that once the specific binding state is initiated by a certain number of these contacts, subsequent RVD-nucleotide pairings diminish in their influence on overall binding energy (153). Notably, proteins closely related to TAL effectors but found in the fungal endosymbiotic bacterium Burkholderia rhizoxinica lack two of the cryptic repeats, but make additional, non-specific contacts with the DNA backbone via non-RVD residues, often arginines and lysines, throughout the CRR (154). This observation suggests that it may be possible to engineer patterns of mismatch tolerance by modifying, moving, or replacing the cryptic repeats, or by modifying backbone residues of the CRR. Assembly of custom TAL effector DNA recognition domains The modularity of TAL effectors makes them easy to engineer for specificities of choice: coding sequences for the necessary modules are simply assembled in the correct order into a genetic backbone construct, which may include translational fusions to any of a variety of other protein domains. Many cloning kits and protocols are publicly available for such assembly. The earliest and among the most widely used of these take the Golden Gate approach (155), in which type IIS restriction enzymes (which cut at a distance from their recognition sites) are used to release cloned, module-spanning fragments each containing an RVD and all staggered such that cleavage results in sequentially matching 5′ overhangs; the overhangs drive ordered and oriented assembly of the fragments into a module array in a single tube ligation (156–159). PCR-based and other ligation-independent methods have also been developed (e.g., 160), as have sequential ligation but scalable assembly strategies for high throughput (161,162). 
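The design step that precedes such an assembly can be sketched in a few lines. The following is a minimal illustration, not any published kit's protocol: it assumes the four most common RVDs (HD for cytosine, NG for thymine, NI for adenine, NN for guanine) and the Xanthomonas requirement for a thymine at position 0, both as described above. The function name `design_rvd_array` and the choice of NN (rather than NH or NK) for guanine are assumptions made for this example.

```python
# One-RVD-per-base lookup using the four most common RVDs described in the text.
# Assumption: NN is chosen for guanine even though it also accommodates adenine;
# NH or NK could be substituted where stricter guanine specificity is wanted.
RVD_FOR_BASE = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def design_rvd_array(site: str) -> list[str]:
    """Return the ordered RVDs for a binding site written 5' to 3'.

    The first character of `site` is position 0 and must be a thymine
    (the requirement described for Xanthomonas TAL effectors); each
    following base is matched by one repeat module.
    """
    site = site.upper()
    if len(site) < 2:
        raise ValueError("site must contain position 0 plus at least one base")
    unknown = set(site) - set(RVD_FOR_BASE)
    if unknown:
        raise ValueError(f"unsupported characters in site: {sorted(unknown)}")
    if site[0] != "T":
        raise ValueError("position 0 must be T for this (Xanthomonas-type) scaffold")
    # Modules are assembled in the same order as the bases they specify.
    return [RVD_FOR_BASE[base] for base in site[1:]]


# Example: the bases C, T, G are matched by HD, NG, NN, mirroring the
# consecutive repeats shown in Figure 3D.
print(design_rvd_array("TCTG"))
```

The returned list gives the order in which module-encoding fragments would need to be joined; how the fragments are physically assembled (Golden Gate or otherwise) is independent of this lookup.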
Sakuma and Yamamoto (163) provide a comprehensive review. Several web-based tools for design and off-target prediction have also been made available (see 164,165 for reviews). The earliest of these scores potential binding sites using a position weight matrix based on observed association frequencies (166). Others add a parameter to reflect the increasing mismatch tolerance of RVDs close to the C-terminal end of the array (167). The best performing tool, SIFTED, derives from extensive protein binding microarray data for 20 custom TAL effector proteins, and factors in observed, minor effects of neighbors, position in the array, and length of the array on individual RVD specificities (168). However, the TAL effectors assayed to develop SIFTED contained only the four most common RVDs, so the utility of the tool is limited to such proteins. TAL effector-based DNA targeting applications The first artificial TAL effectors were generated as part of the study that determined experimentally the RVD-nucleotide relationship ‘code’ that governs TAL effector DNA recognition (15). Apart from the customized CRR, these retained their native features as transcription activators and were assayed using a reporter gene assay in Nicotiana benthamiana leaves. Since then, such ‘ArtTALs’ (141), alternatively called ‘dTALEs’ (designer TAL effectors) (159), have been used extensively for functional validation of targets and candidate targets of native TAL effectors in the context of plant disease: if a dTALE activating a gene from a promoter binding site distinct from the EBE of the native TAL effector phenocopies that TAL effector, one can conclude that the gene is the relevant target of the native TAL effector (e.g. 169,170). Another early and widespread application of customized TAL effector DNA recognition domains is in TAL effector nucleases (TALENs) for genome editing. 
Originally developed by replacing the C-terminal TAL effector activation domain with a monomer of the catalytic domain of the type IIS restriction enzyme FokI, TALENs cleave in pairs, targeted to sequences on opposing DNA strands across a spacer (171–173). FokI fusions to full-length TAL effectors, N- or C-terminal, and C-terminal fusions to TAL effectors missing the first 152 aa and all but the first 63 aa of the C-terminus (after the CRR) were also shown to function (172,173). The latter, compact configuration, referred to as the ‘Miller architecture’, has been widely adopted. Indeed, TAL effectors, the Miller architecture in particular, have proven amenable to a variety of fusions, including alternative activation domains, repressor domains, affinity and fluorescent tags, epigenetic modifiers, and others (149,174). They have also been tested as fusions to the restriction enzyme PvuII, the catalytic domain of the meganuclease TevI, and to the CRISPR-associated nuclease Cas9 toward developing monomeric TALENs for genome editing (175–177). As already mentioned, fusions of TAL effector DNA recognition domains to meganucleases have shown great promise as highly specific tools for therapeutic genome editing (115). Future prospects: engineering targeting stringency Most TAL effector assembly platforms use the Xanthomonas TAL effector backbone and repeat consensus sequence with the four most common RVDs, and in some cases one or more of the other RVDs discussed above. While the ease with which such platforms yield TAL effectors that bind sequences of choice has led to their widespread adoption and transformative impact in basic research, agriculture, and medicine (e.g. 178–184), several discoveries suggest important additional research directions and engineering approaches that could further enhance the utility of TAL effectors by taking advantage of underexploited properties to fine-tune binding specificity. 
This fine-tuning has the potential to be not only qualitative, i.e., to match a given target sequence, but quantitative, to modulate the stringency of that match, even non-uniformly across the target site. First, two studies profiled the specificity and functionality of all 400 possible RVDs (185,186). The results revealed a large number of functional RVDs representing a striking diversity of specificity profiles that could be used in engineering. Protein-binding microarray analysis of dTALEs incorporating RVDs of interest, of the sort used to develop the SIFTED software for design and target prediction, would be an important further step toward precisely defining the behavior of these RVDs in different contexts. Second, polymorphism in the backbone repeat sequence relative to Xanthomonas TAL effectors has been observed in RipTALEs (140) as well as the Burkholderia TAL-like proteins (BTLs) mentioned earlier (174), and in TAL effector-like sequences fished out of metagenomic data from marine samples (187). Engineering based on these polymorphisms achieved variation in strength of dTALE-DNA interactions (188). Separately, modification of backbone residues at positions 4 and 32 resulted in higher efficiency TALENs, ostensibly due to greater mobility along the superhelical axis that allowed better positioning of BSRs to interact with corresponding nucleotides throughout the array and limit the spatial distribution of the fused FokI domains for higher efficiency dimerization (189). Systematic characterization of the influence of backbone variation on the individual behaviors of different RVDs and on the overall dynamics of the TAL effector-DNA interaction could further expand the capacity to engineer specificity quantitatively. Incorporating variation in repeat backbone sequences could also guard against recombination of TAL effector constructs, which can be problematic in the context of viral systems for delivery (190). 
Third, the earlier noted discovery of the relationship between structural variation outside the CRR and specificity for the base at position 0, as well as the degree and pattern of mismatch within the CRR, suggests that base 0 specificity and mismatch tolerance could be robustly engineered. Structure-function studies to better understand those relationships are needed however. Fourth, so-called ‘aberrant repeats’ observed in some TAL effectors of the species Xanthomonas oryzae afford the opportunity, to our knowledge unique among characterized DNA binding proteins, to accommodate single base indels in a target (or set of targets). These aberrant repeats have short insertions or deletions of amino acids in the region that connects one repeat to the next. Though the repeats are functional, the particular indels observed render them capable of disengaging if a single base is missing at that position in a way that allows in-register binding of the remainder of the CRR to the DNA (191). Understanding the mechanistic basis for that capacity to disengage would inform use of such aberrant repeats in combination with other engineering-based modifications. Finally, a recently characterized feature of TAL effectors that could be exploited in engineering is the influence of their length (number of RVDs) on overall specificity. With increasing length, TAL effectors exhibit exponentially decreasing gain in affinity for target DNA, yet less rapid deterioration of gain in affinity for non-target DNA (153). Using experimental data and simulations, plotting specificity as the affinity for target DNA relative to the affinity for non-target DNA across varying lengths results in a Gaussian curve with a peak centered between 14 and 22 RVDs depending on the overall RVD composition. Thus, length variation could be used to modulate specificity quantitatively. 
Further experimentation to better understand how RVD composition determines the optimum length for maximum specificity would benefit this goal. Though not broadly distributed in nature, TAL effectors have been shaped by a unique set of selection pressures and provide a versatile platform for engineering protein–DNA interactions with precision and flexibility. As described above, to fully realize their potential, further characterization of the mechanistic basis for their unique properties and the influences of sequence variations on those properties is important. Not to be overlooked, however, is the importance of continuing to identify and characterize related proteins, not only to understand the evolutionary origin of TAL effectors but to gain further insight into useful structural variation. RECOMBINASE ENGINEERING Overview Members of the Int recombinase/topoisomerase family, also known as site-specific recombinases (SSRs), have been used to edit DNA both in vitro and in vivo for decades, and some have been commercially developed to simplify cloning tasks (e.g. the Invitrogen Flp-In™ and Gateway® systems, based on Flp and λ-integrase, respectively). These proteins recognize, cleave and ligate double-stranded DNA during a multi-step reaction wherein the intermediate is covalently bound to the enzyme (Figure 4A and B). The details of this reaction differ significantly between tyrosine recombinases (such as Cre, Flp and λ-integrase, Figure 4C) and serine recombinases (such as φC31 integrase and Gin, Figure 4D) (reviewed in (192)). Unlike tyrosine recombinases, where the determinants of DNA specificity generally lie within the same domains required for catalysis, serine recombinases are significantly modular in their form and function; their DNA binding domains can be replaced without affecting catalytic function (reviewed in (193)). 
In recent years, a growing number of naturally-occurring recombinases with differing sequence specificities have been identified. Some of these will likely also prove useful in genome engineering applications (reviewed in (194)). Figure 4. Site specific recombinase modes of action. Panel A: SSRs are capable of catalyzing excision, insertion, or inversion reactions. Panel B: Tyrosine recombinases such as Cre, Flp and λ integrase proceed through a Holliday junction intermediate. The catalytic domains have pseudo four-fold symmetry and engage palindromic sequences (arrows). Panel C: The topology of the λ integrase catalytic domains is similar to that of simple tyrosine recombinases like Cre. However, λ integrase also contains N-terminal DNA-binding domains (top set with arrows) that are critical for site-specific DNA recognition. Panel D: The catalytic domains (center) of serine integrases break both DNA strands before rotating relative to one another and religation of the DNA. DNA binding domains (top and bottom) of the wild-type enzymes can be replaced with zinc fingers to customize specificity. 
In contrast to nuclease-based DNA editing schemes, Cre, Flp and some other SSRs do not require cellular machinery or ancillary proteins for efficient recombination (195–198), and it has been shown with λ integrase and various serine recombinases, that even SSRs that require additional proteins and host factors can often be engineered to function in a cofactor-independent manner (199–203). Thus, many SSRs are well suited for use in heterologous organisms. Perhaps most importantly, SSRs generally act with single-nucleotide resolution, and they maintain their hold on the cut DNA ends throughout the reaction cycle. For this reason, they may prove safer in clinical applications than nuclease-based editing schemes which often result in unpredictable indels at the edited locus. Recombinase catalysis can result in insertion, deletion or inversion of large DNA fragments (>100 kb). SSR technology has, for instance, been used to exchange large segments of the mouse genome with the equivalent human region (204). With two different SSRs, cassette exchange reactions and translocations can also be efficiently catalyzed (205,206). The primary limitation of recombinase-based genome editing arises from the requirement that enzyme-specific sequences be present in the DNA to be edited. To overcome this, a variety of recombinases, some with dramatically altered target specificities, have been engineered using genetic selection schemes and structure-based modeling. These enzymes (also called integrases, resolvases and invertases depending on their primary natural function) are broadly separated into two classes; those with DNA intermediates covalently bound to an active site tyrosine and those where the intermediate is bound to serine. Both classes have been the subject of engineering. 
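As a rough illustration of the excision and inversion outcomes summarized in Figure 4A, the following Python sketch models a linear DNA molecule carrying two copies of one recombination site. It relies on the well-established rule, not spelled out in the text, that two sites in the same orientation (direct repeats) lead to excision of the intervening segment as a circle, whereas sites in opposite orientation (inverted repeats) lead to inversion. The function names, the toy site sequence and the string-based representation are assumptions made for illustration only; the exact crossover position within the site and the reversibility of the reaction are ignored.

```python
def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def recombine(dna: str, site: str):
    """Toy model of recombination between two copies of `site` on linear DNA.

    Returns (outcome, linear_product, excised_circle_or_None).
    Hypothetical helper; it does not model any enzyme's actual chemistry.
    """
    if site == revcomp(site):
        raise ValueError("site must be asymmetric to have an orientation")
    # Locate forward ("+") and reverse-complemented ("-") copies of the site.
    fwd = [(i, "+") for i in range(len(dna)) if dna.startswith(site, i)]
    rev = [(i, "-") for i in range(len(dna)) if dna.startswith(revcomp(site), i)]
    hits = sorted(fwd + rev)
    if len(hits) != 2:
        raise ValueError("expected exactly two recombination sites")
    (a, strand_a), (b, strand_b) = hits
    n = len(site)
    if strand_a == strand_b:
        # Direct repeats: the intervening DNA plus one site copy leaves as a
        # circle; a single site copy remains in the linear product.
        return "excision", dna[:a + n] + dna[b + n:], dna[a + n:b + n]
    # Inverted repeats: the segment between the two sites is flipped in place.
    flipped = revcomp(dna[a + n:b])
    return "inversion", dna[:a + n] + flipped + dna[b:], None
```

Calling `recombine` on a molecule with two directly repeated toy sites returns the shortened linear product together with the sequence of the excised circle; with the second site reverse-complemented, the same call returns the molecule with the intervening segment inverted.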
Tyrosine recombinase engineering P1 bacteriophage Cre recombinase, so named because it catalyzes recombination, is the prototypical member of the tyrosine recombinase family, and it is arguably the enzyme that has been the subject of the most engineering. For this reason, it will be discussed in somewhat more detail than the others. Like most other SSRs discussed here, wild-type Cre forms a homotetrameric complex with its DNA substrate (Figure 4C) (reviewed in (207)). The preferred substrate for wild-type Cre, called loxP, contains palindromic sequences 13 nucleotides long separated by a central, non-palindromic 8 bp ‘spacer’ region which imparts directionality to the recombination site. Each 13 bp ‘half-site’ is engaged by one copy of the recombinase. For recombination to occur, two loxP sites (hence four half-sites and four Cre monomers) must come together. Cleavage and ligation occurs after the first base of the spacer region on the top strand and after the seventh base on the bottom strand. Most of the DNA-protein interactions involve the palindromic segments. Having 13 bp palindromic regions and an 8 bp spacer is not universal. A number of tyrosine recombinases have 7 bp spacers, and the length of the palindromic regions varies from 12 to 18 bp (reviewed in (194)). Some tyrosine recombinases exhibit a degree of flexibility. For instance Flp, but not Cre, can act on substrates with spacers that are one nucleotide longer or shorter (208). Cre and Flp are both composed of two globular domains that are connected by an extended linker segment. These domains form a C-shaped clamp that completely encircles their respective DNA targets. In both proteins, the globular domains make extensive interactions, mostly in the major groove of the DNA, though the C-terminal domain makes additional minor groove interactions. 
The specific sites of DNA interaction are widely distributed across the protein sequence and there are at least a dozen sequence-specific DNA contacts made by each recombinase monomer (reviewed in (194)). Many more residues make non-specific interactions with the DNA backbone, and there are water-mediated interactions with some of the DNA bases. In the case of Cre, it has been shown that disruption of these indirect interactions also alters sequence-specificity (192,209,210). loxP has been subject to extensive mutation and analysis. Many mutations in this DNA sequence, particularly those in the linker region, do not abrogate Cre's activity (211,212). This sequence promiscuity could be the reason that chromosomal abnormalities have sometimes been attributed to Cre-based genome editing ((213) and references therein). Both site-directed and random mutagenesis approaches have been successfully used to alter Cre's specificity such that sites other than loxP are targeted for recombination. Most, but not all of the successful studies involved screening large pools of mutant recombinases. Notably, Santoro and Schultz succeeded in changing Cre's specificity by generating a library of mutated enzymes where the amino acid sequence diversity was limited to just a handful of amino acids involved in sequence-specific binding to DNA (214). A reporter plasmid encoding fluorescent proteins flanked by altered loxP sites allowed fluorescence activated cell sorting of recombined cells and discovery of two Cre variants that were able to recombine loxM7, a sequence that differs at three positions in the loxP half-site. The loxM7 sequence is not acted on efficiently by the wild-type enzyme (215). Crystal structures of the mutated Cre with loxM7 revealed a complex network of interactions; water molecules and molecular flexibility played important roles in the DNA recognition (209). 
Santoro and Schultz also showed that both positive and negative selection were important; without the latter, the most likely result was relaxed specificity rather than altered specificity. Rüfer and Sauer reached a similar conclusion when they used a selection scheme wherein kanamycin resistance was triggered by recombination of an altered site, loxK2, which has 13 changes relative to loxP (210). In the latter study mutations throughout the Cre gene were combined through DNA shuffling (216). The results highlight the importance of Glu 262, a critical residue which was independently identified in subsequent work (194). Buchholz and Stewart used a different approach to identifying mutations that alter Cre specificity. They used error-prone PCR and DNA shuffling to generate random mutations throughout Cre, and they developed a method known as substrate-linked protein evolution (SLiPE) to separate active variants from inactive ones. In this scheme, the mutant enzymes are encoded by the same plasmid (pEVO) as the substrate sequence. pEVO recombination after induction with arabinose results in the loss of an NdeI restriction site. Successful recombinants remain circular upon NdeI treatment, while those that have not excised the restriction site are linearized. PCR primers that only yield products from the circularized plasmids are then used to recover the mutated recombination-competent Cre sequences. This process can be repeated, and still more mutations can be introduced into the pool of sequences that yield recombined products (217). This approach was initially used to develop a recombinase named Fre22, which has minimal activity against loxP. Fre22 contains 15 mutations relative to Cre and targets a sequence named loxH, with 13 changes relative to loxP (6 are within the symmetric, palindromic regions). 
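Tallies of this kind (changes relative to loxP, and how many fall within the palindromic arms) can be computed mechanically. The sketch below is a minimal Python illustration, assuming the canonical 34 bp loxP sequence and the 13-8-13 layout described above. The variant `TOY_SITE` is invented for the example and is not loxH, loxK2 or any published target, and all function names are hypothetical.

```python
# Canonical loxP written as left arm + spacer + right arm (13 + 8 + 13 bp).
LOXP = "ATAACTTCGTATA" + "GCATACAT" + "TATACGAAGTTAT"

# Invented variant for illustration only: two changes in the left arm,
# one in the spacer, none in the right arm.
TOY_SITE = "GTAACTTCGTCTA" + "GCATTCAT" + "TATACGAAGTTAT"


def revcomp(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def split_site(site, arm=13, spacer=8):
    """Split a lox-like site into (left arm, spacer, right arm)."""
    if len(site) != 2 * arm + spacer:
        raise ValueError("site length does not match the arm/spacer layout")
    return site[:arm], site[arm:arm + spacer], site[arm + spacer:]


def is_palindromic(site):
    """True if the two arms are perfect inverted repeats of each other."""
    left, _, right = split_site(site)
    return left == revcomp(right)


def count_changes(site, reference=LOXP):
    """Count mismatches against the reference, reported per region."""
    names = ("left_arm", "spacer", "right_arm")
    pairs = zip(names, split_site(site), split_site(reference))
    return {name: sum(a != b for a, b in zip(s, r)) for name, s, r in pairs}


def half_site_differences(site):
    """Number of positions at which the two half-sites differ from each
    other when both are read outward-in (a measure of site asymmetry)."""
    left, _, right = split_site(site)
    return sum(a != b for a, b in zip(left, revcomp(right)))
```

For loxP itself, `is_palindromic(LOXP)` is true and `count_changes(LOXP)` reports zero changes in every region. The same `half_site_differences` measure corresponds to the "differ at N out of 13 positions" description used for asymmetric targets.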
The SLiPE approach was later used to generate Tre and Brec1, engineered recombinases that target a 34 bp sequence within the LTRs that flank the genes of HIV after genome integration. Tre, which recognizes a sequence that is 50% identical to that of loxP, has 19 mutations relative to Cre. Brec1, which recognizes a sequence that is only 32% identical, has 45 mutations. Sequencing the pools of successful variants has provided significant insight into the regions of Cre most important for substrate specificity (Figure 5) (20,218–222). Tre recombinase was further engineered using a structure-guided approach. Molecular dynamics simulations suggested that changing the lysines at positions 43 and 86 to glutamates should improve specificity for the HIV-derived target DNA, and this was confirmed experimentally, highlighting the power of combining selection-based and structural approaches (223). In addition, the crystal structure of Tre in complex with its LTR-derived target allowed Meinke et al. to correctly predict that reverting Val 30 back to its original amino acid, Met, would make Tre more active (224). Using a mouse model system, both Tre and Brec1 have been shown to efficiently excise HIV provirus from human cells. No deleterious effects or chromosomal abnormalities were observed, even when the Brec1 recombinase was constitutively expressed for a period of 18 months (20,225). In contrast to the earlier, evolved recombinases, Tre and Brec1 were designed to target sequences that are highly asymmetric. Tre's target half-sites differ at 8 out of 13 positions, and are thus no longer palindromic. The sequence targeted by Brec1 differs in 6 out of 13 half-site positions. The crystal structure of Tre in complex with its LTR-derived target has clarified the structural basis for much of this dual specificity (224). Figure 5. Regions of Cre recombinase particularly important for recognition. The DNA is shown in grey, and key regions of the protein involved in DNA recognition as described in the main text are highlighted with labels and side chain atom spheres. 
Targeting asymmetrical sequences with Cre has been a longstanding goal of the recombinase engineering field. One approach, as described above, evolves a single recombinase that acts on two different half-sites. A second approach is to engineer heterodimeric recombinase complexes. The XerC/XerD recombinase from Escherichia coli is a natural example of such a heterodimeric complex (reviewed in (226)). It was shown in 2006 that mixtures of two Cre-like molecules, each with a different half-site specificity, can recombine asymmetrical sites that neither molecule alone acts on efficiently (227). To facilitate catalysis on asymmetrical substrates, obligate Cre heterodimers have been engineered. Gelato et al. accomplished this in three steps: (i) They created a library wherein all 20 amino acids were present at three key positions within the Cre protein–protein interface. The positions chosen (299, 304 and 334) are all hydrophobic in Cre, and they form a contiguous cluster. (ii) They designed a screen wherein bacterial survival is contingent upon recombination. This revealed an alternative, functional interface wherein the three key residues were mutated. (iii) They identified obligate heterodimers via visual analysis of the modeled protein–protein interface. The resulting, engineered interfaces are still hydrophobic, but the relative sizes of the interfacial sidechains are changed such that heterodimers are favored (228). Zhang et al. redesigned a different region of the Cre interface (229). 
They used the Rosetta program suite to model changes at 10 positions and, based on the computational results, made a series of mutations that selectively favor heterodimer formation. The mutated residues were in Cre's N-terminal helix and the interacting amino acids across the interface. No selection scheme was used in the Zhang study; the mutated proteins were purified and assayed one at a time. Eroshenko et al. also altered the N-terminal helix of Cre, but for a different reason. They showed that recombinase accuracy can be increased by decreasing cooperative binding (230). An antibiotic selection scheme allowed them to identify mutants of R32, which makes a salt bridge across the protein–protein interface in the wild-type, as important for increased accuracy. Flp, a recombinase encoded by the 2μ plasmid of Saccharomyces cerevisiae, has also been the subject of significant protein engineering efforts. Its thermostability has been improved using random mutagenesis and DNA shuffling (231), and as with Cre, engineering of the inter-protomer interface has allowed heterodimers to be favored over homodimers, facilitating the recognition of asymmetric DNA targets (232). The specificity of Flp has been altered through a bacterial selection scheme wherein lacZ or RFP reporters were flanked by mutated versions of FRT, the 34 bp Flp target (233). In this study, both PCR-based mutagenesis and randomization of specific codons provided the necessary genetic variability to recognize target sequences with single-base changes. Using a similar approach, specificity for two sequences close to the human interleukin 10 (IL-10) gene has also been engineered (234). As is the case with the HIV sequences targeted by the evolved Cre variants, these IL-10 targets are both asymmetrical and significantly different from the natural Flp target. Mutated Flp recombinases have also been shown to act in the context of mammalian genomes and to efficiently catalyze integration reactions. 
Furthermore, when paired with another recombinase, the Flp variant targeting IL-10 was able to perform a recombinase mediated cassette exchange (RMCE) (235). Two other Flp-like recombinases have also been the subject of protein engineering efforts. The activity of the yeast R and TD recombinases has been enhanced, largely through truncating the sequences so they match Flp and other, shorter homologs and also through random mutagenesis (236). Engineering of λ integrase and serine recombinases The architectures of the tyrosine recombinase λ integrase and the serine integrases discussed below are distinct from each other and from the Cre-like enzymes. As in Cre, the λ integrase catalytic domains form tetramers and have Holliday junction intermediates (237). In contrast, serine integrases have intermediates with double-stranded breaks and are thought to undergo large swiveling motions during catalysis (238). More importantly from the standpoint of protein engineering, the integrases also contain separate DNA-binding domains that are largely responsible for sequence specificity. In λ integrase the catalytic domain is near the C-terminal end; in the serine integrases, the catalytic domain is at or near the N-terminal end (239,240). In both cases, the wild-type integrases typically join plasmid-encoded attP sites with bacterial attB sites creating distinct attL and attR sites that flank the integrated plasmid DNA. In contrast to those of the Cre-type reactions, the pre- and post-recombination sites after integrase action are distinct. Additional factors are typically required to excise the plasmid after integration. This makes integrases ideal for gene insertion. A variety of integrases have also been the subject of protein engineering. Notably, using a series of chimeras between the related integrases from λ and HK022 phages, Yagil et al. 
showed that just five key residues were largely responsible for a substrate specificity difference, and that subsets of the key amino acids identified from screening yielded more relaxed specificity (241). Recently, the specificity of λ integrase was altered to target a sequence from the human genome, attH, which differs from the original attB site at four positions, and the λ integrase activity was also enhanced using a novel selection scheme involving β-lactamase and a fused β-lactamase inhibitor protein that was removed upon recombination (242). As in most of the recombinase work cited here, genetic variability was introduced by error-prone PCR, with staggered extension process (StEP) PCR (243) (rather than DNA shuffling) used to combine the mutations within the surviving clones. The activity and specificity of the serine recombinase φC31 integrase, which naturally catalyzes unidirectional integration of phage DNA, have also been enhanced. This was accomplished through random PCR mutagenesis combined with alanine scanning of charged amino acids in the N-terminal domain (200). Mutations that allow φC31 integrase to target a human sequence on chromosome 8 were identified using a blue/white selection scheme (244). Other serine integrases, including Tn3, Bin and Tn21, have also been engineered for genome editing. To make them better suited for this task, they have had their activity enhanced and their requirement for accessory factors has been overcome (199,201,203). These molecules differ in their target sequences, and as with the Cre and Flp homologs mentioned above, this work expands the repertoire of well-characterized starting points for future recombinase design studies. A separate, important series of protein engineering studies has capitalized on the modular nature of serine recombinases to alter their specificity. 
Notably, the specificity of Tn3 resolvase was changed by fusing an engineered version of its catalytic domain to Zif268, a mouse transcription factor with a zinc finger fold (245). This recombinase has been studied in some detail to better understand the linker lengths and DNA sequence requirements for efficient catalysis (246). Barbas and coworkers developed their own versions of zinc-finger fusions with Gin and Tn3, and they showed that these molecules are useful for transferring genes into mouse and human genomes (247). Based on the crystal structure of the γδ resolvase with DNA (248), and the structures of Sin and Gin recombinases without DNA (249,250), they also identified residues that, when mutated, allow heterodimer formation (251,252). In addition, using overlap extension PCR to randomize the sidechains of five key amino acids and an assay that relies on recombinase-based assembly of an antibiotic resistance gene, Barbas and coworkers developed a collection of Gin recombinase catalytic domains capable of recognizing and recombining millions of 20 bp ‘core sequences’ (253). To extend the utility of these fused recombinase molecules even further, the same investigators used directed evolution to identify variants of Sin and β recombinases that target core sequences that the Gin variants cannot act on. In this case, variants of the catalytic domain were generated through error-prone PCR with approximately three mutations per selection cycle. After four rounds of selection, sequencing revealed that some key mutations were present in over 70% of the recombination-competent plasmids (254). Procedures for developing novel zinc-finger recombinases as well as expressing, purifying, and assaying them have been clearly summarized in two detailed methods papers (255,256). 
This class of recombinases has been designed to target safe harbor sites in the human genome (203), the bovine β-casein gene (for transgenic protein production in milk) (257), and a variety of other human sequences (254). Many other applications are sure to follow. Collectively, the recombinase engineering efforts described here clearly demonstrate that a broad range of selection schemes and approaches can be effectively employed to identify specificity-altering recombinase mutations. The work to date also shows that it is possible to engineer enzymes that act on sequences that bear little, if any, resemblance to the natural targets of the wild-type enzymes. It is notable that despite the availability of high resolution structures and/or detailed molecular models, with few exceptions, the engineering efforts thus far have relied primarily on screens rather than carefully designed mutations to alter DNA recognition. Given the wealth of data regarding the effects of specific mutations on DNA specificity, it is likely that a hybrid approach will prove most effective going forward. Such an approach might use existing, unbiased screening data to identify the specific residues that should be mutated given the target sequence and then use experimental structures and computational models to determine what subsets of residues are most likely to yield results at each mutated site. A second important development involves the large and growing number of enzymes that are suitable for use in heterologous systems. The diversity of natural targets provides numerous starting points for protein engineering and synthetic evolution efforts. Some recombinases (e.g. Cre) are naturally better suited to excising DNA fragments than introducing new ones. Wild-type versions of others (e.g. the serine integrases) have the opposite activity. 
Thus, the desired genome modification, along with the specific sequences to be targeted, will help decide the optimal structural platform and starting point. Recombinase engineering is not as easy as CRISPR-based gene targeting, but the efficiency of gene integration, combined with the precision of the molecular editing, will likely continue to make recombinase-based genome engineering highly attractive for many applications. RESTRICTION ENDONUCLEASE ENGINEERING Overview Restriction endonucleases (REases) are one component of genomic defense systems encoded within bacteria and archaea. REases provide a form of ‘innate immunity’ for their bacterial hosts, by recognizing and cleaving short DNA target sequences that are found randomly within phage genomes and other forms of mobilized invasive DNA. When coupled with the activity of corresponding methyltransferases (MTases), such restriction-modification (‘R-M’) systems confer resistance to phage infection and transformation by foreign DNA, while protecting the host genome from similar enzymatic degradation. Since their discovery (258), restriction endonucleases have been developed and used as workhorse tools for molecular and cell biology research. Their ability to recognize and cleave defined target sequences with exceptional fidelity, and to generate a wide variety of DNA products (including 5′ or 3′ ‘sticky ends’ of defined sequence) allows various REases to be employed for routine cloning purposes, analyses of methylation status (259), SNP detection (260,261), serial analyses of gene expression (‘SAGE’) (262,263), and preparation of DNA for high-throughput DNA sequencing (264). REases vary greatly with respect to their DNA target specificity, catalytic mechanism, structural organization, protein sequence and size. 
These differences are the basis for their classification into four major groups, or ‘Types’, each with multiple sub-classes (defined in (265) and organized into the restriction endonuclease database (‘REBASE’) as described in (266)). Types I and III R-M enzymes (reviewed in (267,268)) are multi-subunit assemblages that combine cleavage and DNA modification in large multifunctional molecular machines. Type II systems (reviewed in (269)) are generally simpler, and for the most part comprise separate endonuclease and methyltransferase enzymes, each with all the elements needed for independent sequence recognition and catalysis acting at the same DNA target. Despite their simplicity, Type II endonucleases are highly diverse, having many different folds for DNA recognition integrated with several different folds and catalytic motifs that employ distinct DNA-hydrolysis mechanisms. They display a wide variety of structural organizations and are often embellished with additional structural domains. They assemble into various quaternary arrangements that can lead to complex cooperative and allosteric behaviors. Whereas the specificity of the nucleic acid binding proteins and enzymes described above has become amenable (with various and often significant investments of time and effort) to reprogramming, REases have generally proven to be highly recalcitrant to such efforts. This difference may be in large part attributable to the underlying biological function and purpose of these various DNA binding protein systems. 
The protein families described in the prior sections have biological functions that might reasonably lead one to expect that they can be reprogrammed during evolution: two (meganucleases, recombinases) are largely associated with the mobilization and transfer of their own coding sequences, another (TAL effectors) is responsible for the hostile takeover of a gene's expression and activity, and one (zinc fingers) is employed in a highly combinatorial and ubiquitous manner to dictate DNA binding specificity for a wide variety of factors involved in disparate biological and genetic processes. In contrast, many restriction endonucleases (particularly the classic Type II endonucleases which operate as ‘stand-alone’ enzymes) cannot readily alter their recognition and cleavage specificity, because cleavage at the new, unprotected target sites throughout the host bacterial genome would likely be toxic absent a simultaneous alteration of their companion MTase to the same new specificity. Thus REases from R-M systems having separate DNA recognition moieties for modification and restriction are under evolutionary pressure to avoid changing DNA target specificity, as a change in either REase or MTase is likely to be lethal to the host.

Initial engineering efforts

A number of attempts have been reported to change the specificity of type II REases, including in particular work conducted using the EcoRI and EcoRV enzymes (which were two of the first REases to be visualized bound to their DNA target sites). Those experiments (270–272) involved both the substitution of individual amino acids observed to form interactions with individual nucleotide bases in the enzyme's target site, and the addition of structural elements and residues to attempt to increase the length of target site read-out. 
The amino acid substitutions were largely generated based on suggestions of ‘canonical’ complementarity between certain combinations of residues and bases (for example, asparagine or glutamine versus adenine, or arginine versus guanine). Investigators now know that such preferences have very little predictive power for unique enzyme-target site combinations, due to the complexities associated with protein–DNA interfaces and contacts. Not surprisingly (in retrospect), these experiments largely resulted in enzyme variants with greatly reduced catalytic power, and little to no shift in target specificity. Similar attempts to alter BamHI specificity have also been described, in which an in vivo selection for binding by an inactive BamHI construct was employed in an attempt to change the BamHI recognition sequence (273). No variants that could cleave a new target sequence were created, although one that requires a methylated adenine base was identified. A different approach, using directed evolution, was subsequently described to attempt the alteration of BstYI specificity (5′-RGATCY-3′) to recognize only 5′-AGATCT-3′. An REase construct that no longer cut GGATCC, and that displayed moderate fidelity of recognition (preferring the target site AGATCT 12-fold over AGATCC), was obtained, but a complete change in specificity was not accomplished (274). In another study, random mutagenesis coupled with a genetic screen was used in an attempt to alter the specificity of NotI (GCGGCCGC). Similar to the prior studies with BstYI, constructs that cleaved the wild-type sequence plus several miscognate sequences (that differ at one base) were identified; however, a specific new DNA target was not generated (275). Like the meganucleases, certain types of ‘unorthodox’ type II REases that recognize split DNA target sequences have been amenable to the structural recombination of their half-site recognition domains into new combinations. 
This had previously been exploited to generate new specificities for certain type I R-M systems (276), and was then extended to type II REases that recognize split sequences (277). A different strategy was used to alter the recognition specificity of the Eco57I REase (which recognizes 5′-CTGAAG-3′) (278). Eco57I is a variant of the ‘type IIG’ enzyme subtype, wherein a single polypeptide harbors both an endonuclease and a DNA methyltransferase, each targeted to the same site by a common target recognition domain (TRD). Here, a nuclease-deficient construct was randomly mutated and the resulting library was selected for altered methylase activity (indicated by protection against cleavage by an unrelated endonuclease). Because both activities share the same TRD, this selection also yielded a corresponding altered endonuclease specificity, with relaxed recognition at the fourth position: 5′-CTGRAG-3′. These moderate successes were made in R-M systems that use a common DNA recognition domain to target both protective modification and REase activities. Finally, rational engineering of new Type II REase enzyme variants that recognize and cut at predictable new DNA sequences, while maintaining activity and fidelity comparable to the wild-type enzymes, was achieved in a large family of Type IIG REases related to MmeI (Figure 6) (21). MmeI is an unusual type II endonuclease that cuts DNA two turns of the helix away from its asymmetric recognition sequence and possesses both DNA methyltransferase and endonuclease activities in the same polypeptide (279–282). The enzyme was found to have many homologs that share considerable protein sequence and structural similarity yet recognize different DNA target sites. 
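The recognition sequences quoted in this section (for example 5′-RGATCY-3′ for BstYI and 5′-CTGRAG-3′ for the altered Eco57I variant) use the standard IUPAC degenerate nucleotide codes, in which R denotes A or G and Y denotes C or T. As a minimal sketch of what such a degenerate site means in practice, the Python below expands a site into the concrete sequences it covers and locates matches on one strand of a DNA string. The function names and the short test sequence are hypothetical and chosen only for illustration; they are not taken from the studies cited above.

```python
import re
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_site(site):
    """Return the set of concrete sequences covered by a degenerate site."""
    return {"".join(bases) for bases in product(*(IUPAC[c] for c in site))}

def find_sites(site, dna):
    """Return 0-based start positions of matches on the given strand only.

    A lookahead is used so that overlapping matches are also reported.
    """
    classes = "".join(f"[{IUPAC[c]}]" for c in site)
    return [m.start() for m in re.finditer(f"(?=({classes}))", dna)]

# Sequences that a BstYI variant restricted to AGATCT must stop cutting.
lost_by_narrowing = sorted(expand_site("RGATCY") - expand_site("AGATCT"))

# Sequence gained when CTGAAG recognition is relaxed to CTGRAG.
gained_by_relaxing = sorted(expand_site("CTGRAG") - expand_site("CTGAAG"))

# Hypothetical test sequence containing one GGATCC and one AGATCT site.
positions = find_sites("RGATCY", "TTGGATCCAAAGATCTT")
```

Viewed this way, narrowing RGATCY to AGATCT means excluding three of four hexamers, whereas relaxing CTGAAG to CTGRAG adds a single hexamer. The sketch scans only the strand it is given; an asymmetric site such as CTGRAG would also need to be searched on the reverse complement.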
Investigators realized that the strong overall conservation of sequence and function in this REase subgroup, considered jointly with their highly diverse substrate recognition (Figure 6A), suggested that DNA specificity in this family is undergoing rapid evolution, promoted by limited numbers of protein substitutions. The protein positions that contact and determine DNA recognition for each base pair within the DNA targets recognized were identified through covariation analysis between the aligned DNA target sequences and the aligned protein sequences. Generally, the amino acid residues at a pair of positions were found to make direct contact to a base pair within the DNA target to specify recognition. By identifying the amino acid combinations specifying recognition for each base pair, the residue positions correlated with each base pair could be reliably mutated to produce a desired new recognition specificity. The subsequent determination of the crystal structure of MmeI in complex with its DNA target site (Figure 6B) (283) demonstrated the interactions that underlie DNA recognition (Figure 6C) and explained the basis for the results of the engineering study described above. The same covariation analysis approach has been successfully applied to rationally alter specificity in other families of Type IIG REases having TRDs that differ from that of MmeI, as well as to classic Type I and Type ISP systems (Morgan, R.D. unpublished observations, (284)).

Figure 6. Engineering altered DNA specificity of the MmeI restriction endonuclease via a bioinformatics-driven approach. Panel A: Sequence alignment of target sites and key specificity determining region of C-terminal domain of MmeI and 19 homologues that have known specificities. The positions of base pair 6 in each enzyme's target site, and the residues in the enzyme that display significant covariation against that target position, are highlighted. Panel B: Ribbon diagram of the crystal structure of MmeI bound to its DNA target, demonstrating the distribution and separation of the enzyme's endonuclease catalytic site (‘REase’), methyltransferase active site (‘MTase’) and target recognition domain (‘TRD’). Panel C: Close-ups illustrating (left) the experimentally observed positions of residues E806 and R808 that were found to display direct contacts to base pair 6 in the wild-type DNA-bound crystal structure of MmeI, and (right) a corresponding model of the same two residues, after introduction of mutations (to K and D, respectively) that were predicted and later found to alter specificity from a G:C to a C:G base pair. The ability to systematically alter the specificity of this enzyme is facilitated by the availability of a large number of sequenced enzyme homologues with corresponding known target sites and by a protein architecture in which endonuclease catalytic activity is less intimately coupled to target recognition and binding.

The REase engineering efforts described here demonstrate that those Type II REase enzymes that utilize a single DNA recognition domain to direct both their protective MTase and restrictive REase activities have proved amenable to specificity engineering. However, the more familiar Type II REases, which have separate REase and MTase proteins, are quite resistant to specificity alteration for clear functional and evolutionary reasons. RM systems that recognize split DNA target sequences, such as Type I or Type IIB enzymes, can be altered by exchanging one or the other half-site TRD, with new target specificities limited to combinations of the naturally occurring half-site targets. REases that recognize contiguous DNA targets using a single TRD, such as the Type IIG or Type ISP enzymes, appear to have evolved the ability to readily alter their recognition specificity, often through subtle mutation involving just one or two residues. When these REases can be grouped into families having highly similar protein sequences yet diverged DNA targets, analysis of the correlation between amino acid position and residues with varying DNA target recognition can be used to direct rational engineering of REases with new specificity. 
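The covariation analysis summarized above can be illustrated with a short sketch. The text does not state which statistic was used, so the Python below assumes mutual information between each protein alignment column and the base found at one target-site position; this is one common choice and not necessarily the authors' method. The four-sequence alignment is hypothetical toy data, and the names `mutual_information` and `covariation_scores` are introduced only for this example.

```python
from collections import Counter
from math import log2

def mutual_information(column_a, column_b):
    """Mutual information, in bits, between two equally long symbol columns."""
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), summed over observed pairs.
    return sum(
        (c / n) * log2((c * n) / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )

def covariation_scores(protein_alignment, target_sites, base_position):
    """Score every protein alignment column against one target-site position."""
    base_column = [site[base_position] for site in target_sites]
    width = len(protein_alignment[0])
    return [
        mutual_information([seq[i] for seq in protein_alignment], base_column)
        for i in range(width)
    ]

# Hypothetical toy data: aligned protein fragments from four homologues and
# the target site each one recognizes (not real enzyme sequences).
proteins = ["AEGR", "AEGR", "AKGD", "AKGD"]
targets = ["TCG", "ACG", "TCC", "ACC"]

# Which protein columns track the base at the third target position?
scores = covariation_scores(proteins, targets, base_position=2)
```

In this toy case the second and fourth protein columns change together with the third base of the target and score one bit each, while the two invariant columns score zero. With real alignments, small sample sizes and shared ancestry inflate such scores, so candidate positions would still need to be checked against structures or by mutagenesis.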
This approach generally allows the alteration of one or two base pair positions within the native 5 to 8 base recognition site. Having a diversity of natural targets provides numerous starting points for such protein engineering efforts and expands the number of DNA targets that can be recognized. While it is not currently possible to engineer an REase to recognize all possible DNA target sequences, the engineering described can be used to expand the number of available Type II REase DNA targets by several orders of magnitude to provide an expanded toolkit for these molecular biology workhorses.

CONCLUSIONS

As mentioned at the beginning of this review, the field of protein engineering is experiencing a rapid increase in its ability to create new types of protein folds, topologies and assemblages. This discipline is now poised to address the critical issue of incorporating novel functions, such as ligand recognition, catalytic behaviors, and predictable structural and functional responses to changes in environmental conditions or to the addition of effector molecules. The notable accomplishments within this field over the past 15 years can be attributed to (i) the development of increasingly powerful and accurate computational algorithms to create and refine ab initio atomic models of protein folds and interactions, and (ii) the simultaneous development of reliable methods to generate and screen extremely large and complex protein libraries. This latter capability has benefitted from new and improved in cellulo and cell-free protein expression systems, enhanced methods for combining surface display platforms with robust flow cytometric screening approaches, and the advent of high-throughput sequencing strategies (which can be used to accurately determine patterns of sequence co-variation and context dependence that dictate form and function during the course of protein selection experiments). 
The fact that each of these technologies has matured over the same relative time-frame has led to a current state of the art for protein engineering that might reasonably be described as almost limitless in its potential for creating novel biomolecules. The capabilities of protein engineering have been further enhanced by the rapid accumulation of genomic sequence information for the types of protein folds and scaffolds summarized above. Protein engineering for each of these systems has been greatly facilitated by the identification of large collections of homologous proteins, in numbers sufficient to derive significant predictive understanding of their structure-function relationships. Such analyses were important for the rapid development of gene targeting proteins using both zinc fingers and TAL effectors. Similar types of analyses are now becoming possible for more complex DNA binding protein families such as meganucleases (for which many hundreds of proteins can be found in microbial sequence databases) and restriction endonucleases (which provided the information that enabled investigators to engineer new specificities onto the MmeI family of REase enzymes). While it is difficult to predict when and how the engineering of novel protein–DNA recognition properties might become significantly automated and reliable across multiple protein folds and families, given the pace of discovery and development in these fields it is not unreasonable to believe that such abilities will arrive in the not-too-distant future. While the development of improved methods and approaches for protein engineering is an important and useful area of research in general, the explosion of activity and results involving CRISPR-based, RNA-guided DNA targeting technologies might very well raise the question of whether further research and development in engineering recognition specificity of DNA-binding proteins in particular is justifiable. 
In addressing this question, it is useful to consider the example of DNA targeting for genome editing. The development of therapeutic targeted nucleases that display an ideal combination of activity, specificity, deliverability, and gene modification outcomes is not a fully solved problem, and each of the current platforms (certainly including CRISPR) offers unique advantages for such applications, offset by behaviors and properties requiring further study and development. While CRISPR offers the advantages of speed and scale for experimentation, important questions for its therapeutic use (such as its specificity, packaging, delivery, and controllability of repair outcomes) remain outstanding. The same questions exist for each of the protein-based DNA targeting systems described above, but in each case those questions appear to yield different answers, reflecting the distinct properties and unique advantages of the different platforms. The most salient of these are highlighted below:

Zinc finger nucleases have been subjected to an extensive program of research to optimize their activity and specificity, and currently have the longest track record of therapeutic use in patients, both ex vivo and in vivo.

Meganucleases and MegaTALs are the most difficult of gene targeting nucleases to engineer. However, they exhibit small size, single-chain structures, generation of uniquely reactive 3′ DNA product overhangs, and specificity profiles that appear highly desirable for certain genome editing applications.

TAL effectors offer remarkable potential for fine-tuned targeting specificity at individual DNA base pairs, even non-uniformly across the target site, afforded by the variation in specificity profiles and affinity contributions of individual RVDs, the influences of the repeat backbone residues, and the effects of the anchoring cryptic repeats. This promises the ability to engineer the proteins to recruit enzymatic activities to a range of targets, from large sets of related sequences that vary across the genome, to a single specified target within such a set.

Site specific recombinases offer the potential for genome editing activities and outcomes that are solely the result of enzymatic function, without the need to invoke and control cellular DSB repair processes.

As a result, different biotechnology and gene therapy companies and their partners are aggressively pursuing different or multiple platforms for DNA targeting in medicine, agriculture, and industry. Given the importance and breadth of these applications, and the unique properties and advantages of the different platforms, continued research and development in engineering altered protein–DNA recognition specificity seems likely and well justified.

ACKNOWLEDGEMENTS

The authors of this review are supported for research in this area by the National Institutes of Health (B.L.S., A.B.), the National Science Foundation (A.J.B.), the Bill and Melinda Gates Foundation (B.L.S.), bluebird bio inc. (B.L.S.), Sangamo Inc. (J.C.M.) and New England Biolabs (R.D.M.). The authors thank their many colleagues, past and present, for the work and insights that are presented in this review article.

FUNDING

National Institute of General Medical Sciences [R01 GM105691]. Funding for open access charge: Discretionary and Endowment funds provided by the Fred Hutchinson Cancer Research Center.

Conflict of interest statement. J.C.M. and R.D.M. are employees of Sangamo, Inc. and New England Biolabs, Inc., respectively; those companies create engineered nucleases for potential commercial and therapeutic applications. The B.L.S. laboratory is funded in part by bluebird bio Inc., which also creates engineered gene targeting nucleases for commercial use; B.L.S. receives royalties from them as an inventor on licensed meganuclease patents. A.J.B. 
receives royalties from Cellectis, Inc., as an inventor on TALEN patents licensed to that company. B.L.S. is Senior Executive Editor of Nucleic Acids Research.

REFERENCES

1. Anderson J.E., Ptashne M., Harrison S.C. Structure of the repressor-operator complex of bacteriophage 434. Nature. 1987; 326:846–852.
2. Jordan S.R., Whitcombe T.V., Berg J.M., Pabo C.O. Systematic variation in DNA length yields highly ordered repressor-operator cocrystals. Science. 1985; 230:1383–1385.
3. Otwinowski Z., Schevitz R.W., Zhang R.G., Lawson C.L., Joachimiak A., Marmorstein R.Q., Luisi B.F., Sigler P.B. Crystal structure of trp repressor/operator complex at atomic resolution. Nature. 1988; 335:321–329.
4. Schultz S.C., Shields G.C., Steitz T.A. Crystallization of Escherichia coli catabolite gene activator protein with its DNA binding site. The use of modular DNA. J. Mol. Biol. 1990; 213:159–166.
5. Luscombe N.M., Laskowski R.A., Thornton J.M. Amino acid-base interactions: a three-dimensional analysis of protein–DNA interactions at an atomic level. Nucleic Acids Res. 2001; 29:2860–2874.
6. Rohs R., Jin X., West S.M., Joshi R., Honig B., Mann R.S. Origins of specificity in protein–DNA recognition. Annu. Rev. Biochem. 2010; 79:233–269.
7. Smith N.C., Matthews J.M. Mechanisms of DNA-binding specificity and functional gene regulation by transcription factors. Curr. Opin. Struct. Biol. 2016; 38:68–74.
8. Joshi R., Passner J.M., Rohs R., Jain R., Sosinsky A., Crickmore M.A., Jacob V., Aggarwal A.K., Honig B., Mann R.S. Functional specificity of a Hox protein mediated by the recognition of minor groove structure. Cell. 2007; 131:530–543.
9. Rohs R., West S.M., Sosinsky A., Liu P., Mann R.S., Honig B. The role of DNA shape in protein–DNA recognition. Nature. 2009; 461:1248–1253.
10. Lazarovici A., Zhou T., Shafer A., Dantas Machado A.C., Riley T.R., Sandstrom R., Sabo P.J., Lu Y., Rohs R., Stamatoyannopoulos J.A. et al. Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. U.S.A. 2013; 110:6376–6381.
11. Kitayner M., Rozenberg H., Rohs R., Suad O., Rabinovich D., Honig B., Shakked Z. Diversity in DNA recognition by p53 revealed by crystal structures with Hoogsteen base pairs. Nat. Struct. Mol. Biol. 2010; 17:423–429.
12. Gordan R., Shen N., Dror I., Zhou T., Horton J., Rohs R., Bulyk M.L. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep. 2013; 3:1093–1104.
13. Slattery M., Zhou T., Yang L., Dantas Machado A.C., Gordan R., Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem. Sci. 2014; 39:381–399.
14. Lavery R. Recognizing DNA. Q. Rev. Biophys. 2005; 38:339–344.
15. Boch J., Bonas U. Xanthomonas AvrBs3 family-type III effectors: discovery and function. Annu. Rev. Phytopathol. 2010; 48:419–436.
16. Moscou M.J., Bogdanove A.J. A simple cipher governs DNA recognition by TAL effectors. Science. 2009; 326:1501.
17. Kuhlman B., Dantas G., Ireton G.C., Varani G., Stoddard B.L., Baker D. Design of a novel globular protein fold with atomic-level accuracy. Science. 2003; 302:1364–1368.
18. Procko E., Berguig G.Y., Shen B.W., Song Y., Frayo S., Convertine A.J., Margineantu D., Booth G., Correia B.E., Cheng Y. et al. A computationally designed inhibitor of an Epstein-Barr viral Bcl-2 protein induces apoptosis in infected cells. Cell. 2014; 157:1644–1656.
19. Thyme S., Baker D. Redesigning the specificity of protein–DNA interactions with Rosetta. Methods Mol. Biol. 2014; 1123:265–282.
20. Karpinski J., Hauber I., Chemnitz J., Schafer C., Paszkowski-Rogacz M., Chakraborty D., Beschorner N., Hofmann-Sieber H., Lange U.C., Grundhoff A. et al. Directed evolution of a recombinase that excises the provirus of most HIV-1 primary isolates with high specificity. Nat. Biotechnol. 2016; 34:401–409.
21. Morgan R.D., Luyten Y.A. Rational engineering of type II restriction endonuclease DNA binding and cleavage specificity. Nucleic Acids Res. 2009; 37:5222–5233.
22. Urnov F.D., Miller J.C., Lee Y.-L., Beausejour C.M., Rock J.M., Augustus S., Jamieson A.C., Porteus M.H., Gregory P.D., Holmes M.C. Highly efficient endogenous human gene correction using designed zinc-finger nucleases. Nature. 2005; 435:646–651.
23. Werther R., Hallinan J.P., Lambert A.R., Havens K., Pogson M., Jarjour J., Galizi R., Windbichler N., Crisanti A., Nolan T. et al. Crystallographic analyses illustrate significant plasticity and efficient recoding of meganuclease target specificity. Nucleic Acids Res. 2017; 45:8621–8634.
24. Tupler R., Perini G., Green M.R. Expressing the human genome. Nature. 2001; 409:832–833.
25. Pabo C.O., Peisach E., Grant R.A. Design and selection of novel Cys2His2 zinc finger proteins. Annu. Rev. Biochem. 2001; 70:313–340.
26. Pavletich N.P., Pabo C.O. Zinc finger-DNA recognition: crystal structure of a Zif268-DNA complex at 2.1 A. Science. 1991; 252:809–817.
27. Kim Y.G., Cha J., Chandrasegaran S. Hybrid restriction enzymes: zinc finger fusions to Fok I cleavage domain. Proc. Natl. Acad. Sci. U.S.A. 1996; 93:1156–1160.
28. Smith J., Berg J.M., Chandrasegaran S. A detailed study of the substrate specificity of a chimeric restriction enzyme. Nucleic Acids Res. 1999; 27:674–681.
29. Rouet P., Smih F., Jasin M. Introduction of double-strand breaks into the genome of mouse cells by expression of a rare-cutting endonuclease. Mol. Cell. Biol. 1994; 14:8096–8106.
30. Rouet P., Smih F., Jasin M. Expression of a site-specific endonuclease stimulates homologous recombination in mammalian cells. Proc. Natl. Acad. Sci. U.S.A. 1994; 91:6064–6068.
31. Bibikova M., Beumer K., Trautman J.K., Carroll D. Enhancing gene targeting with designed zinc finger nucleases. Science. 2003; 300:764.
32. Bibikova M., Golic M., Golic K.G., Carroll D. Targeted chromosomal cleavage and mutagenesis in Drosophila using zinc-finger nucleases. Genetics. 2002; 161:1169–1175.
33. Osakabe K., Osakabe Y., Toki S. Site-directed mutagenesis in Arabidopsis using custom-designed zinc finger nucleases. Proc. Natl. Acad. Sci. U.S.A. 2010; 107:12034–12039.
34. Zhang F., Maeder M.L., Unger-Wallace E., Hoshaw J.P., Reyon D., Christian M., Li X., Pierick C.J., Dobbs D., Peterson T. et al. High frequency targeted mutagenesis in Arabidopsis thaliana using zinc finger nucleases. Proc. Natl. Acad. Sci. U.S.A. 2010; 107:12028–12033.
35. Morton J., Davis M.W., Jorgensen E.M., Carroll D. Induction and repair of zinc-finger nuclease-targeted double-strand breaks in Caenorhabditis elegans somatic cells. Proc. Natl. Acad. Sci. U.S.A. 2006; 103:16370–16375.
36. Doyon Y., McCammon J.M., Miller J.C., Faraji F., Ngo C., Katibah G.E., Amora R., Hocking T.D., Zhang L., Rebar E.J. et al. Heritable targeted gene disruption in zebrafish using designed zinc-finger nucleases. Nat. Biotechnol. 2008; 26:702–708.
37. Foley J.E., Yeh J.R., Maeder M.L., Reyon D., Sander J.D., Peterson R.T., Joung J.K. Rapid mutation of endogenous zebrafish genes using zinc finger nucleases made by Oligomerized Pool ENgineering (OPEN). PLoS One. 2009; 4:e4348.
38. Meng X., Noyes M.B., Zhu L.J., Lawson N.D., Wolfe S.A. Targeted gene inactivation in zebrafish using engineered zinc-finger nucleases. Nat. Biotechnol. 2008; 26:695–701.
39. Goldberg A.D., Banaszynski L.A., Noh K.M., Lewis P.W., Elsaesser S.J., Stadler S., Dewell S., Law M., Guo X., Li X. et al. Distinct factors control histone variant H3.3 localization at specific genomic regions. Cell. 2010; 140:678–691.
40. Flisikowska T., Thorey I.S., Offner S., Ros F., Lifke V., Zeitler B., Rottmann O., Vincent A., Zhang L., Jenkins S. et al. Efficient immunoglobulin gene disruption and targeted replacement in rabbit using zinc finger nucleases. PLoS One. 2011; 6:e21045.
41. Geurts A.M., Cost G.J., Freyvert Y., Zeitler B., Miller J.C., Choi V.M., Jenkins S.S., Wood A., Cui X., Meng X. et al. Knockout rats via embryo microinjection of zinc-finger nucleases. Science. 2009; 325:433.
42. Mashimo T., Takizawa A., Voigt B., Yoshimi K., Hiai H., Kuramoto T., Serikawa T. Generation of knockout rats with X-linked severe combined immunodeficiency (X-SCID) using zinc-finger nucleases. PLoS One. 2010; 5:e8870.
43. Shukla V.K., Doyon Y., Miller J.C., DeKelver R.C., Moehle E.A., Worden S.E., Mitchell J.C., Arnold N.L., Gopalan S., Meng X. et al. Precise genome modification in the crop species Zea mays using zinc-finger nucleases. Nature. 2009; 459:437–441.
44. Curtin S.J., Zhang F., Sander J.D., Haun W.J., Starker C., Baltes N.J., Reyon D., Dahlborg E.J., Goodwin M.J., Coffman A.P. et al. Targeted mutagenesis of duplicated genes in soybean with zinc-finger nucleases. Plant Physiol. 2011; 156:466–473.
45. Hauschild J., Petersen B., Santiago Y., Queisser A.L., Carnwath J.W., Lucas-Hahn A., Zhang L., Meng X., Gregory P.D., Schwinzer R. et al. Efficient generation of a biallelic knockout in pigs using zinc-finger nucleases. Proc. Natl. Acad. Sci. U.S.A. 2011; 108:12013–12017.
46. Yu S., Luo J., Song Z., Ding F., Dai Y., Li N. Highly efficient modification of beta-lactoglobulin (BLG) gene via zinc-finger nucleases in cattle. Cell Res. 2011; 21:1638–1640.
47. Tebas P., Stein D., Tang W.W., Frank I., Wang S.Q., Lee G., Spratt S.K., Surosky R.T., Giedlin M.A., Nichol G. et al. Gene editing of CCR5 in autologous CD4 T cells of persons infected with HIV. N. Engl. J. Med. 2014; 370:901–910.
48. Wolfe S.A., Grant R.A., Elrod-Erickson M., Pabo C.O. Beyond the “recognition code”: structures of two Cys2His2 zinc finger/TATA box complexes. Structure. 2001; 9:717–723.
49. Desjarlais J.R., Berg J.M. Redesigning the DNA-binding specificity of a zinc finger protein: a data base-guided approach. Proteins. 1992; 13:272.
50. Desjarlais J.R., Berg J.M. Toward rules relating zinc finger protein sequences and DNA binding site preferences. Proc. Natl. Acad. Sci. U.S.A. 1992; 89:7345–7349.
51. Desjarlais J.R., Berg J.M. Use of a zinc-finger consensus sequence framework and specificity rules to design specific DNA binding proteins. Proc. Natl. Acad. Sci. U.S.A. 1993; 90:2256–2260.
52. Choo Y., Klug A. Selection of DNA binding sites for zinc fingers using rationally randomized DNA reveals coded interactions. Proc. Natl. Acad. Sci. U.S.A. 1994; 91:11168–11172.
53. Jamieson A.C., Kim S.H., Wells J.A. In vitro selection of zinc fingers with altered DNA-binding specificity. Biochemistry. 1994; 33:5689–5695.
54. Rebar E.J., Pabo C.O. Zinc finger phage: affinity selection of fingers with new DNA-binding specificities. Science. 1994; 263:671–673.
55. Wu H., Yang W.P., Barbas C.F. 3rd Building zinc fingers by selection: toward a therapeutic application. Proc. Natl. Acad. Sci. U.S.A. 1995; 92:344–348.
56. Segal D.J., Dreier B., Beerli R.R., Barbas C.F. 3rd Toward controlling gene expression at will: selection and design of zinc finger domains recognizing each of the 5′-GNN-3′ DNA target sequences. Proc. Natl. Acad. Sci. U.S.A. 1999; 96:2758–2763.
57. Dreier B., Beerli R.R., Segal D.J., Flippin J.D., Barbas C.F. Development of zinc finger domains for recognition of the 5′-ANN-3′ family of DNA sequences and their use in the construction of artificial transcription factors. J. Biol. Chem. 2001; 276:29466–29478.
58. Dreier B., Fuller R.P., Segal D.J., Lund C.V., Blancafort P., Huber A., Koksch B., Barbas C.F. 3rd Development of zinc finger domains for recognition of the 5′-CNN-3′ family DNA sequences and their use in the construction of artificial transcription factors. J. Biol. Chem. 2005; 280:34488–35597.
59. Ramirez C.L., Foley J.E., Wright D.A., Muller-Lerch F., Rahman S.H., Cornu T.I., Winfrey R.J., Sander J.D., Fu F., Townsend J.A. et al. Unexpected failure rates for modular assembly of engineered zinc fingers. Nat. Methods. 2008; 5:374–375.
60. Isalan M., Klug A., Choo Y. A rapid, generally applicable method to engineer zinc fingers illustrated by targeting the HIV-1 promoter. Nat. Biotechnol. 2001; 19:656–660.
61. Joung J.K., Ramm E.I., Pabo C.O. A bacterial two-hybrid selection system for studying protein–DNA and protein–protein interactions. Proc. Natl. Acad. Sci. U.S.A. 2000; 97:7382–7387.
62. Maeder M.L., Thibodeau-Beganny S., Osiak A., Wright D.A., Anthony R.M., Eichtinger M., Jiang T., Foley J.E., Winfrey R.J., Townsend J.A. et al. Rapid “open-source” engineering of customized zinc-finger nucleases for highly efficient gene modification. Mol. Cell. 2008; 31:294–301.
63. Sander J.D., Dahlborg E.J., Goodwin M.J., Cade L., Zhang F., Cifuentes D., Curtin S.J., Blackburn J.S. 
, Thibodeau-Beganny S. , Qi Y. et al. Selection-free zinc-finger-nuclease engineering by context-dependent assembly (CoDA) . Nat. Methods . 2011 ; 8 : 67 – 69 . Google Scholar CrossRef Search ADS PubMed 64. Moore M. , Klug A. , Choo Y. Improved DNA binding specificity from polyzinc finger peptides by using strings of two-finger units . Proc. Natl. Acad. Sci. U.S.A. 2001 ; 98 : 1437 – 1441 . Google Scholar CrossRef Search ADS PubMed 65. Carroll D. Genome engineering with zinc-finger nucleases . Genetics . 2011 ; 188 : 773 – 782 . Google Scholar CrossRef Search ADS PubMed 66. Urnov F.D. , Rebar E.J. , Holmes M.C. , Zhang H.S. , Gregory P.D. Genome editing with engineered zinc finger nucleases . Nat. Rev. Genet. 2010 ; 11 : 636 – 646 . Google Scholar CrossRef Search ADS PubMed 67. Perez E.E. , Wang J. , Miller J.C. , Jouvenot Y. , Kim K.A. , Liu O. , Wang N. , Lee G. , Bartsevich V.V. , Lee Y.L. et al. Establishment of HIV-1 resistance in CD4+ T cells by genome editing using zinc-finger nucleases . Nat. Biotechnol. 2008 ; 26 : 808 – 816 . Google Scholar CrossRef Search ADS PubMed 68. Chang K.H. , Smith S.E. , Sullivan T. , Chen K. , Zhou Q. , West J.A. , Liu M. , Liu Y. , Vieira B.F. , Sun C. et al. Long-Term engraftment and fetal globin induction upon BCL11A gene editing in Bone-Marrow-Derived CD34(+) hematopoietic stem and progenitor cells . Mol. Ther. Methods Clin. Dev. 2017 ; 4 : 137 – 148 . Google Scholar CrossRef Search ADS PubMed 69. Yusa K. , Rashid S.T. , Strick-Marchand H. , Varela I. , Liu P.Q. , Paschon D.E. , Miranda E. , Ordonez A. , Hannan N.R. , Rouhani F.J. et al. Targeted gene correction of alpha1-antitrypsin deficiency in induced pluripotent stem cells . Nature . 2011 ; 478 : 391 – 394 . Google Scholar CrossRef Search ADS PubMed 70. Bitinaite J. , Wah D.A. , Aggarwal A.K. , Schildkraut I. FokI dimerization is required for DNA cleavage . Proc. Natl. Acad. Sci. U.S.A. 1998 ; 95 : 10570 – 10575 . Google Scholar CrossRef Search ADS PubMed 71. Smith J. 
, Bibikova M. , Whitby F.G. , Reddy A.R. , Chandrasegaran S. , Carroll D. Requirements for double-strand cleavage by chimeric restriction enzymes with zinc finger DNA-recognition domains . Nucleic Acids Res. 2000 ; 28 : 3361 – 3369 . Google Scholar CrossRef Search ADS PubMed 72. Bibikova M. , Carroll D. , Segal D.J. , Trautman J.K. , Smith J. , Kim Y.G. , Chandrasegaran S. Stimulation of homologous recombination through targeted cleavage by chimeric nucleases . Mol. Cell. Biol. 2001 ; 21 : 289 – 297 . Google Scholar CrossRef Search ADS PubMed 73. Porteus M.H. , Baltimore D. Chimeric nucleases stimulate gene targeting in human cells . Science . 2003 ; 300 : 763 . Google Scholar CrossRef Search ADS PubMed 74. Handel E.M. , Alwin S. , Cathomen T. Expanding or restricting the target site repertoire of zinc-finger nucleases: the inter-domain linker as a major determinant of target site selectivity . Mol. Ther. 2009 ; 17 : 104 – 111 . Google Scholar CrossRef Search ADS PubMed 75. Shimizu Y. , Bhakta M.S. , Segal D.J. Restricted spacer tolerance of a zinc finger nuclease with a six amino acid linker . Bioorg. Med. Chem. Lett. 2009 ; 19 : 3970 – 3972 . Google Scholar CrossRef Search ADS PubMed 76. Guo J. , Gaj T. , Barbas C.F. 3rd Directed evolution of an enhanced and highly efficient FokI cleavage domain for zinc finger nucleases . J. Mol. Biol. 2010 ; 400 : 96 – 107 . Google Scholar CrossRef Search ADS PubMed 77. Doyon Y. , Vo T.D. , Mendel M.C. , Greenberg S.G. , Wang J. , Xia D.F. , Miller J.C. , Urnov F.D. , Gregory P.D. , Holmes M.C. Enhancing zinc-finger-nuclease activity with improved obligate heterodimeric architectures . Nat. Methods . 2011 ; 8 : 74 – 79 . Google Scholar CrossRef Search ADS PubMed 78. Miller J.C. , Holmes M.C. , Wang J. , Guschin D.Y. , Lee Y.L. , Rupniewski I. , Beausejour C.M. , Waite A.J. , Wang N.S. , Kim K.A. et al. An improved zinc-finger nuclease architecture for highly specific genome editing . Nat. Biotechnol. 2007 ; 25 : 778 – 785 . 
Google Scholar CrossRef Search ADS PubMed 79. Szczepek M. , Brondani V. , Buchel J. , Serrano L. , Segal D.J. , Cathomen T. Structure-based redesign of the dimerization interface reduces the toxicity of zinc-finger nucleases . Nat. Biotechnol. 2007 ; 25 : 786 – 793 . Google Scholar CrossRef Search ADS PubMed 80. Ramirez C.L. , Certo M.T. , Mussolino C. , Goodwin M.J. , Cradick T.J. , McCaffrey A.P. , Cathomen T. , Scharenberg A.M. , Joung J.K. Engineered zinc finger nickases induce homology-directed repair with reduced mutagenic effects . Nucleic Acids Res. 2012 ; 40 : 5560 – 5568 . Google Scholar CrossRef Search ADS PubMed 81. Wang J. , Friedman G. , Doyon Y. , Wang N.S. , Li C.J. , Miller J.C. , Hua K.L. , Yan J.J. , Babiarz J.E. , Gregory P.D. et al. Targeted gene addition to a predetermined site in the human genome using a ZFN-based nicking enzyme . Genome Res. 2012 ; 22 : 1316 – 1326 . Google Scholar CrossRef Search ADS PubMed 82. Pattanayak V. , Ramirez C.L. , Joung J.K. , Liu D.R. Revealing off-target cleavage specificities of zinc-finger nucleases by in vitro selection . Nat. Methods . 2011 ; 8 : 765 – 770 . Google Scholar CrossRef Search ADS PubMed 83. Gabriel R. , Lombardo A. , Arens A. , Miller J.C. , Genovese P. , Kaeppel C. , Nowrouzi A. , Bartholomae C.C. , Wang J. , Friedman G. et al. An unbiased genome-wide analysis of zinc-finger nuclease specificity . Nat. Biotechnol. 2011 ; 29 : 816 – 823 . Google Scholar CrossRef Search ADS PubMed 84. Wood A.J. , Lo T.W. , Zeitler B. , Pickle C.S. , Ralston E.J. , Lee A.H. , Amora R. , Miller J.C. , Leung E. , Meng X. et al. Targeted genome editing across species using ZFNs and TALENs . Science . 2011 ; 333 : 307 . Google Scholar CrossRef Search ADS PubMed 85. Choulika A. , Perrin A. , Dujon B. , Nicolas J.F. Induction of homologous recombination in mammalian chromosomes by using the I-SceI system of Saccharomyces cerevisiae . Mol. Cell. Biol. 1995 ; 15 : 1968 – 1973 . 
Google Scholar CrossRef Search ADS PubMed 86. Scalley-Kim M. , McConnell-Smith A. , Stoddard B.L. Coevolution of homing endonuclease specificity and its host target sequence . J. Mol. Biol. 2007 ; 372 : 1305 – 1319 . Google Scholar CrossRef Search ADS PubMed 87. Chevalier B. , Turmel M. , Lemieux C. , Monnat R.J. Jr , Stoddard B.L. Flexible DNA target site recognition by divergent homing endonuclease isoschizomers I-CreI and I-MsoI . J. Mol. Biol. 2003 ; 329 : 253 – 269 . Google Scholar CrossRef Search ADS PubMed 88. Lambert A.R. , Hallinan J.P. , Shen B.W. , Chik J.K. , Bolduc J.M. , Kulshina N. , Robins L.I. , Kaiser B.K. , Jarjour J. , Havens K. et al. Indirect DNA sequence recognition and its impact on nuclease cleavage activity . Structure . 2016 ; 24 : 862 – 873 . Google Scholar CrossRef Search ADS PubMed 89. Thyme S.B. , Jarjour J. , Takeuchi R. , Havranek J.J. , Ashworth J. , Scharenberg A.M. , Stoddard B.L. , Baker D. Exploitation of binding energy for catalysis and design . Nature . 2009 ; 461 : 1300 – 1304 . Google Scholar CrossRef Search ADS PubMed 90. Choulika A. , Perrin A. , Dujon B. , Nicolas J.F. The yeast I-Sce I meganuclease induces site-directed chromosomal recombination in mammalian cells . C. R. Acad. Sci. III . 1994 ; 317 : 1013 – 1019 . Google Scholar PubMed 91. Riviere J. , Hauer J. , Poirot L. , Brochet J. , Souque P. , Mollier K. , Gouble A. , Charneau P. , Fischer A. , Paques F. et al. Variable correction of Artemis deficiency by I-Sce1-meganuclease-assisted homologous recombination in murine hematopoietic stem cells . Gene Ther. 2014 ; 21 : 529 – 532 . Google Scholar CrossRef Search ADS PubMed 92. Gouble A. , Smith J. , Bruneau S. , Perez C. , Guyot V. , Cabaniols J.-P. , Leduc S. , Fiette L. , Ave P. , Micheau B. et al. Efficient in toto targeted recombination in mouse liver by meganuclease-induced double-strand break . J. Gene Med. 2006 ; 8 : 616 – 622 . Google Scholar CrossRef Search ADS PubMed 93. Flick K.E. , Jurica M.S. 
, Monnat R.J. Jr , Stoddard B.L. DNA binding and cleavage by the nuclear intron-encoded homing endonuclease I-PpoI . Nature . 1998 ; 394 : 96 – 101 . Google Scholar CrossRef Search ADS PubMed 94. Jurica M.S. , Monnat R.J. Jr , Stoddard B.L. DNA recognition and cleavage by the LAGLIDADG homing endonuclease I-CreI . Mol. Cell . 1998 ; 2 : 469 – 476 . Google Scholar CrossRef Search ADS PubMed 95. Moure C.M. , Gimble F.S. , Quiocho F.A. The crystal structure of the gene targeting homing endonuclease I-SceI reveals the origins of its target site specificity . J. Mol. Biol. 2003 ; 334 : 685 – 695 . Google Scholar CrossRef Search ADS PubMed 96. Gimble F.S. , Moure C.M. , Posey K.L. Assessing the plasticity of DNA target site recognition of the PI-SceI homing endonuclease using a bacterial two-hybrid selection system . J. Mol. Biol. 2003 ; 334 : 993 – 1008 . Google Scholar CrossRef Search ADS PubMed 97. Seligman L. , Chisholm K.M. , Chevalier B.S. , Chadsey M.S. , Edwards S.T. , Savage J.H. , Veillet A.L. Mutations altering the cleavage specificity of a homing endonuclease . Nucleic Acids Res. 2002 ; 30 : 3870 – 3879 . Google Scholar CrossRef Search ADS PubMed 98. Gruen M. , Chang K. , Serbanescu I. , Liu D.R. An in vivo selection system for homing endonuclease activity . Nucleic Acids Res. 2002 ; 30 : e29 . Google Scholar CrossRef Search ADS PubMed 99. Ashworth J. , Havranek J.J. , Duarte C.M. , Sussman D. , Monnat R.J. Jr , Stoddard B.L. , Baker D. Computational redesign of endonuclease DNA binding and cleavage specificity . Nature . 2006 ; 441 : 656 – 659 . Google Scholar CrossRef Search ADS PubMed 100. Rosen L.E. , Morrison H.A. , Masri S. , Brown M.J. , Springstubb B. , Sussman D. , Stoddard B.L. , Seligman L.M. Homing endonuclease I-CreI derivatives with novel DNA target specificities . Nucleic Acids Res. 2006 ; 34 : 4791 – 4800 . Google Scholar CrossRef Search ADS PubMed 101. Chames P. , Epinat J.C. , Guillier S. , Patin A. , Lacroix E. , Paques F. 
In vivo selection of engineered homing endonucleases using double-strand break induced homologous recombination . Nucleic Acids Res. 2005 ; 33 : e178 . Google Scholar CrossRef Search ADS PubMed 102. Arnould S. , Chames P. , Perez C. , Lacroix E. , Duclert A. , Epinat J.C. , Stricher F. , Petit A.S. , Patin A. , Guillier S. et al. Engineering of large numbers of highly specific homing endonucleases that induce recombination on novel DNA targets . J. Mol. Biol. 2006 ; 355 : 443 – 458 . Google Scholar CrossRef Search ADS PubMed 103. Smith J. , Grizot S. , Arnould S. , Duclert A. , Epinat J.C. , Chames P. , Prieto J. , Redondo P. , Blanco F.J. , Bravo J. et al. A combinatorial approach to create artificial homing endonucleases cleaving chosen sequences . Nucleic Acids Res. 2006 ; 34 : e149 . Google Scholar CrossRef Search ADS PubMed 104. Paques F. , Duchateau P. Meganucleases and DNA double-strand break-induced recombination: perspectives for gene therapy . Curr. Gene Ther. 2007 ; 7 : 49 – 66 . Google Scholar CrossRef Search ADS PubMed 105. Ashworth J. , Taylor G.K. , Havranek J.J. , Quadri S.A. , Stoddard B.L. , Baker D. Computational reprogramming of homing endonuclease specificity at multiple adjacent base pairs . Nucleic Acids Res. 2010 ; 38 : 5601 – 5608 . Google Scholar CrossRef Search ADS PubMed 106. Thyme S.B. , Baker D. , Bradley P. Improved modeling of side-chain . base interactions and plasticity in protein–DNA . interface design . J. Mol. Biol. 2012 ; 419 : 255 – 274 . Google Scholar CrossRef Search ADS PubMed 107. Wang Y. , Khan I.F. , Boissel S. , Jarjour J. , Pangallo J. , Thyme S. , Baker D. , Scharenberg A.M. , Rawlings D.J. Progressive engineering of a homing endonuclease genome editing reagent for the murine X-linked immunodeficiency locus . Nucleic Acids Res. 2014 ; 42 : 6463 – 6475 . Google Scholar CrossRef Search ADS PubMed 108. Windbichler N. , Menichelli M. , Papathanos P.A. , Thyme S.B. , Li H. , Ulge U.Y. , Hovde B.T. , Baker D. , Monnat R.J. 
Jr , Burt A. et al. A synthetic homing endonuclease-based gene drive system in the human malaria mosquito . Nature . 2011 ; 473 : 212 – 215 . Google Scholar CrossRef Search ADS PubMed 109. Baxter S.K. , Scharenberg A.M. , Lambert A.R. Engineering and flow-cytometric analysis of chimeric LAGLIDADG homing endonucleases from homologous I-OnuI-family enzymes . Methods Mol Biol . 2014 ; 1123 : 191 – 221 . Google Scholar CrossRef Search ADS PubMed 110. Chevalier B.S. , Kortemme T. , Chadsey M.S. , Baker D. , Monnat R.J. , Stoddard B.L. Design, activity and structure of a highly specific artificial endonuclease . Mol. Cell . 2002 ; 10 : 895 – 905 . Google Scholar CrossRef Search ADS PubMed 111. Epinat J.C. , Arnould S. , Chames P. , Rochaix P. , Desfontaines D. , Puzin C. , Patin A. , Zanghellini A. , Paques F. , Lacroix E. A novel engineered meganuclease induces homologous recombination in yeast and mammalian cells . Nucleic Acids Res. 2003 ; 31 : 2952 – 2962 . Google Scholar CrossRef Search ADS PubMed 112. Silva G.H. , Belfort M. , Wende W. , Pingoud A. From monomeric to homodimeric endonucleases and back: engineering novel specificity of LAGLIDADG enzymes . J. Mol. Biol. 2006 ; 361 : 744 – 754 . Google Scholar CrossRef Search ADS PubMed 113. Gao H. , Smith J. , Yang M. , Jones S. , Djukanovic V. , Nicholson M.G. , West A. , Bidney D. , Falco S.C. , Jantz D. et al. Heritable targeted mutagenesis in maize using a designed endonuclease . Plant J. 2010 ; 61 : 176 – 187 . Google Scholar CrossRef Search ADS PubMed 114. Li H. , Pellenz S. , Ulge U. , Stoddard B.L. , Monnat R.J. Jr Generation of single-chain LAGLIDADG homing endonucleases from native homodimeric precursor proteins . Nucleic Acids Res. 2009 ; 37 : 1650 – 1662 . Google Scholar CrossRef Search ADS PubMed 115. Boissel S. , Jarjour J. , Astrakhan A. , Adey A. , Gouble A. , Duchateau P. , Shendure J. , Stoddard B.L. , Certo M.T. , Baker D. et al. 
megaTALs: a rare-cleaving nuclease architecture for therapeutic genome engineering . Nucleic Acids Res. 2014 ; 42 : 2591 – 2601 . Google Scholar CrossRef Search ADS PubMed 116. Sather B.D. , Romano Ibarra G.S. , Sommer K. , Curinga G. , Hale M. , Khan I.F. , Singh S. , Song Y. , Gwiazda K. , Sahni J. et al. Efficient modification of CCR5 in primary human hematopoietic cells using a megaTAL nuclease and AAV donor template . Sci. Transl. Med. 2015 ; 7 : 307ra156 . Google Scholar CrossRef Search ADS PubMed 117. Arnould S. , Perez C. , Cabaniols J.-P. , Smith J. , Gouble A. , Grizot S. , Epinat J.-C. , Duclert A. , Duchateau P. , Paques F. Engineered I-CreI derivatives cleaving sequences from the human XPC gene can induce highly efficient gene correction in mammalian cells . J. Mol. Biol. 2007 ; 371 : 49 – 65 . Google Scholar CrossRef Search ADS PubMed 118. Dupuy A. , Valton J. , Leduc S. , Armier J. , Galetto R. , Gouble A. , Lebuhotel C. , Stary A. , Paques F. , Duchateau P. et al. Targeted gene therapy of xeroderma pigmentosum cells using meganuclease and TALEN . PLoS One . 2013 ; 8 : e78678 . Google Scholar CrossRef Search ADS PubMed 119. Redondo P. , Prieto J. , Munoz I.G. , Alibes A. , Stricher F. , Serrano L. , Cabaniols J.P. , Daboussi F. , Arnould S. , Perez C. et al. Molecular basis of xeroderma pigmentosum group C DNA recognition by engineered meganucleases . Nature . 2008 ; 456 : 107 – 111 . Google Scholar CrossRef Search ADS PubMed 120. Cabaniols J.P. , Ouvry C. , Lamamy V. , Fery I. , Craplet M.L. , Moulharat N. , Guenin S.P. , Bedut S. , Nosjean O. , Ferry G. et al. Meganuclease-driven targeted integration in CHO-K1 cells for the fast generation of HTS-compatible cell-based assays . J. Biomol. Screen. 2010 ; 15 : 956 – 967 . Google Scholar CrossRef Search ADS PubMed 121. Cabaniols J.P. , Paques F. Robust cell line development using meganucleases . Methods Mol. Biol. 2008 ; 435 : 31 – 45 . Google Scholar CrossRef Search ADS PubMed 122. Djukanovic V. 
, Smith J. , Lowe K. , Yang M. , Gao H. , Jones S. , Nicholson M.G. , West A. , Lape J. , Bidney D. et al. Male-sterile maize plants produced by targeted mutagenesis of the cytochrome P450-like gene (MS26) using a re-designed I-CreI homing endonuclease . Plant J. 2013 ; 76 : 888 – 899 . Google Scholar CrossRef Search ADS PubMed 123. Antunes M.S. , Smith J.J. , Jantz D. , Medford J.I. Targeted DNA excision in Arabidopsis by a re-engineered homing endonuclease . BMC Biotechnol. 2012 ; 12 : 86 . Google Scholar CrossRef Search ADS PubMed 124. D’Halluin K. , Vanderstraeten C. , Van Hulle J. , Rosolowska J. , Van Den Brande I. , Pennewaert A. , D’Hont K. , Bossut M. , Jantz D. , Ruiter R. et al. Targeted molecular trait stacking in cotton through targeted double-strand break induction . Plant Biotechnol. J. 2013 ; 11 : 933 – 941 . Google Scholar CrossRef Search ADS PubMed 125. Grizot S. , Smith J. , Daboussi F. , Prieto J. , Redondo P. , Merino N. , Villate M. , Thomas S. , Lemaire L. , Montoya G. et al. Efficient targeting of a SCID gene by an engineered single-chain homing endonuclease . Nucleic Acids Res. 2009 ; 37 : 5405 – 5419 . Google Scholar CrossRef Search ADS PubMed 126. Munoz I.G. , Prieto J. , Subramanian S. , Coloma J. , Redondo P. , Villate M. , Merino N. , Marenchino M. , D’Abramo M. , Gervasio F.L. et al. Molecular basis of engineered meganuclease targeting of the endogenous human RAG1 locus . Nucleic Acids Res. 2011 ; 39 : 729 – 743 . Google Scholar CrossRef Search ADS PubMed 127. Menoret S. , Fontaniere S. , Jantz D. , Tesson L. , Thinard R. , Remy S. , Usal C. , Ouisse L.H. , Fraichard A. , Anegon I. Generation of Rag1-knockout immunodeficient rats and mice using engineered meganucleases . FASEB J. 2013 ; 27 : 703 – 711 . Google Scholar CrossRef Search ADS PubMed 128. Grosse S. , Huot N. , Mahiet C. , Arnould S. , Barradeau S. , Clerre D.L. , Chion-Sotinel I. , Jacqmarcq C. , Chapellier B. , Ergani A. et al. 
Meganuclease-mediated Inhibition of HSV1 infection in cultured cells . Mol. Ther. 2011 ; 19 : 694 – 702 . Google Scholar CrossRef Search ADS PubMed 129. Popplewell L. , Koo T. , Leclerc X. , Duclert A. , Mamchaoui K. , Gouble A. , Mouly V. , Voit T. , Paques F. , Cedrone F. et al. Gene correction of a duchenne muscular dystrophy mutation by meganuclease-enhanced exon knock-in . Hum. Gene Ther. 2013 ; 24 : 692 – 701 . Google Scholar CrossRef Search ADS PubMed 130. Jarjour J. , West-Foyle H. , Certo M.T. , Hubert C.G. , Doyle L. , Getz M.M. , Stoddard B.L. , Scharenberg A.M. High-resolution profiling of homing endonuclease binding and catalytic specificity using yeast surface display . Nucleic Acids Res. 2009 ; 37 : 6871 – 6880 . Google Scholar CrossRef Search ADS PubMed 131. Chan Y.S. , Takeuchi R. , Jarjour J. , Huen D.S. , Stoddard B.L. , Russell S. The design and in vivo evaluation of engineered I-OnuI-Based enzymes for HEG gene drive . PLoS One . 2013 ; 8 : e74254 . Google Scholar CrossRef Search ADS PubMed 132. Baxter S.K. , Lambert A.R. , Scharenberg A.M. , Jarjour J. Flow cytometric assays for interrogating LAGLIDADG homing endonuclease DNA-binding and cleavage properties . Methods Mol. Biol . 2013 ; 978 : 45 – 61 . Google Scholar CrossRef Search ADS PubMed 133. Takeuchi R. , Choi M. , Stoddard B.L. Redesign of extensive protein–DNA interfaces of meganucleases using iterative cycles of in vitro compartmentalization . Proc. Natl. Acad. Sci. U.S.A. 2014 ; 111 : 4061 – 4066 . Google Scholar CrossRef Search ADS PubMed 134. Takeuchi R. , Choi M. , Stoddard B.L. Engineering of customized meganucleases via in vitro compartmentalization and in cellulo optimization . Methods Mol. Biol. 2015 ; 1239 : 105 – 132 . Google Scholar CrossRef Search ADS PubMed 135. Hutin M. , Perez-Quintero A.L. , Lopez C. , Szurek B. MorTAL Kombat: the story of defense against TAL effectors through loss-of-susceptibility . Front. Plant Sci. 2015 ; 6 : 535 . Google Scholar PubMed 136. Boch J. 
, Scholze H. , Schornack S. , Landgraf A. , Hahn S. , Kay S. , Lahaye T. , Nickstadt A. , Bonas U. Breaking the code of DNA binding specificity of TAL-type III effectors . Science . 2009 ; 326 : 1509 – 1512 . Google Scholar CrossRef Search ADS PubMed 137. Mak A.N. , Bradley P. , Cernadas R.A. , Bogdanove A.J. , Stoddard B.L. The crystal structure of TAL effector PthXo1 bound to its DNA target . Science . 2012 ; 335 : 716 – 719 . Google Scholar CrossRef Search ADS PubMed 138. Deng D. , Yan C. , Pan X. , Mahfouz M. , Wang J. , Zhu J.K. , Shi Y. , Yan N. Structural basis for sequence-specific recognition of DNA by TAL effectors . Science . 2012 ; 335 : 720 – 723 . Google Scholar CrossRef Search ADS PubMed 139. Falahi Charkhabi N. , Booher N.J. , Peng Z. , Wang L. , Rahimian H. , Shams-Bakhsh M. , Liu Z. , Liu S. , White F.F. , Bogdanove A.J. Complete genome sequencing and targeted mutagenesis reveal virulence contributions of Tal2 and Tal4b of Xanthomonas translucens pv. undulosa ICMP11055 in bacterial leaf streak of wheat . Front. Microbiol. 2017 ; 8 : 1488 . Google Scholar CrossRef Search ADS PubMed 140. de Lange O. , Schreiber T. , Schandry N. , Radeck J. , Braun K.H. , Koszinowski J. , Heuer H. , Strauss A. , Lahaye T. Breaking the DNA-binding code of Ralstonia solanacearum TAL effectors provides new possibilities to generate plant resistance genes against bacterial wilt disease . New Phytol. 2013 ; 199 : 773 – 786 . Google Scholar CrossRef Search ADS PubMed 141. Streubel J. , Blucher C. , Landgraf A. , Boch J. TAL effector RVD specificities and efficiencies . Nat. Biotechnol. 2012 ; 30 : 593 – 595 . Google Scholar CrossRef Search ADS PubMed 142. Christian M.L. , Demorest Z.L. , Starker C.G. , Osborn M.J. , Nyquist M.D. , Zhang Y. , Carlson D.F. , Bradley P. , Bogdanove A.J. , Voytas D.F. Targeting G with TAL effectors: a comparison of activities of TALENs constructed with NN and NK repeat variable di-residues . PLoS One . 2012 ; 7 : e45383 . 
Google Scholar CrossRef Search ADS PubMed 143. Cong L. , Zhou R. , Kuo Y.C. , Cunniff M. , Zhang F. Comprehensive interrogation of natural TALE DNA-binding modules and transcriptional repressor domains . Nat. Commun. 2012 ; 3 : 968 . Google Scholar CrossRef Search ADS PubMed 144. Valton J. , Dupuy A. , Daboussi F. , Thomas S. , Marechal A. , Macmaster R. , Melliand K. , Juillerat A. , Duchateau P. Overcoming transcription activator-like effector (TALE) DNA binding domain sensitivity to cytosine methylation . J. Biol. Chem. 2012 ; 287 : 38427 – 38432 . Google Scholar CrossRef Search ADS PubMed 145. Nakagawa S. , Gisselbrecht S.S. , Rogers J.M. , Hartl D.L. , Bulyk M.L. DNA-binding specificity changes in the evolution of forkhead transcription factors . Proc. Natl. Acad. Sci. U.S.A. 2013 ; 110 : 12349 – 12354 . Google Scholar CrossRef Search ADS PubMed 146. Gao H. , Wu X. , Chai J. , Han Z. Crystal structure of a TALE protein reveals an extended N-terminal DNA binding region . Cell Res. 2012 ; 22 : 1716 – 1720 . Google Scholar CrossRef Search ADS PubMed 147. Cuculis L. , Abil Z. , Zhao H. , Schroeder C.M. Direct observation of TALE protein dynamics reveals a two-state search mechanism . Nat. Commun. 2015 ; 6 : 7277 . Google Scholar CrossRef Search ADS PubMed 148. Cuculis L. , Abil Z. , Zhao H. , Schroeder C.M. TALE proteins search DNA using a rotationally decoupled mechanism . Nat. Chem. Biol. 2016 ; 12 : 831 – 837 . Google Scholar CrossRef Search ADS PubMed 149. Doyle E.L. , Hummel A.W. , Demorest Z.L. , Starker C.G. , Voytas D.F. , Bradley P. , Bogdanove A.J. TAL effector specificity for base 0 of the DNA target is altered in a complex, effector- and assay-dependent manner by substitutions for the tryptophan in cryptic repeat -1 . PLoS One . 2013 ; 8 : e82120 . Google Scholar CrossRef Search ADS PubMed 150. Schreiber T. , Bonas U. Repeat 1 of TAL effectors affects target specificity for the base at position zero . Nucleic Acids Res. 2014 ; 42 : 7160 – 7169 . 
Google Scholar CrossRef Search ADS PubMed 151. Garg A. , Lohmueller J.J. , Silver P.A. , Armel T.Z. Engineering synthetic TAL effectors with orthogonal target sites . Nucleic Acids Res. 2012 ; 40 : 7584 – 7595 . Google Scholar CrossRef Search ADS PubMed 152. Meckler J.F. , Bhakta M.S. , Kim M.S. , Ovadia R. , Habrian C.H. , Zykovich A. , Yu A. , Lockwood S.H. , Morbitzer R. , Elsaesser J. et al. Quantitative analysis of TALE-DNA interactions suggests polarity effects . Nucleic Acids Res. 2013 ; 41 : 4118 – 4128 . Google Scholar CrossRef Search ADS PubMed 153. Rinaldi F.C. , Doyle L.A. , Stoddard B.L. , Bogdanove A.J. The effect of increasing numbers of repeats on TAL effector DNA binding specificity . Nucleic Acids Res. 2017 ; 45 : 6960 – 6970 . Google Scholar CrossRef Search ADS PubMed 154. Stella S. , Molina R. , Bertonatti C. , Juillerrat A. , Montoya G. Expression, purification, crystallization and preliminary X-ray diffraction analysis of the novel modular DNA-binding protein BurrH in its apo form and in complex with its target DNA . Acta Crystallogr. F Struct. Biol. Commun. 2014 ; 70 : 87 – 91 . Google Scholar CrossRef Search ADS PubMed 155. Engler C. , Gruetzner R. , Kandzia R. , Marillonnet S. Golden gate shuffling: a one-pot DNA shuffling method based on type IIs restriction enzymes . PLoS One . 2009 ; 4 : e5553 . Google Scholar CrossRef Search ADS PubMed 156. Cermak T. , Doyle E.L. , Christian M. , Wang L. , Zhang Y. , Schmidt C. , Baller J.A. , Somia N.V. , Bogdanove A.J. , Voytas D.F. Efficient design and assembly of custom TALEN and other TAL effector-based constructs for DNA targeting . Nucleic Acids Res. 2011 ; 39 : e82 . Google Scholar CrossRef Search ADS PubMed 157. Li T. , Huang S. , Zhao X. , Wright D.A. , Carpenter S. , Spalding M.H. , Weeks D.P. , Yang B. Modularly assembled designer TAL effector nucleases for targeted gene knockout and gene replacement in eukaryotes . Nucleic Acids Res. 2011 ; 39 : 6315 – 6325 . 

journal article

Open Access Collection

De novo annotation and characterization of the translatome with ribosome profiling data

Xiao, Zhengtao;Huang, Rongyao;Xing, Xudong;Chen, Yuling;Deng, Haiteng;Yang, Xuerui

2018 Nucleic Acids Research

doi: 10.1093/nar/gky179 pmid: 29538776

Abstract By capturing and sequencing the RNA fragments protected by translating ribosomes, ribosome profiling provides snapshots of translation at subcodon resolution. The growing need for comprehensive annotation and characterization of context-dependent translatomes calls for an efficient and unbiased method to accurately recover the signal of active translation from ribosome profiling data. Here we present our new method, RiboCode, for this purpose. Tested with simulated and real ribosome profiling data, and validated with cell type-specific QTI-seq and mass spectrometry data, RiboCode exhibits superior efficiency, sensitivity, and accuracy for de novo annotation of the translatome, which covers various types of ORFs in the previously annotated coding and non-coding regions. As an example, RiboCode was applied to assemble the context-specific translatomes of yeast under normal and stress conditions. Comparisons among these translatomes revealed stress-activated novel upstream and downstream ORFs, some of which are associated with translational dysregulation of the annotated main ORFs under the stress conditions. INTRODUCTION Ribosome profiling, also called Ribo-seq, generates genome-wide allocations and quantifications of the ribosome-protected RNA fragments (RPF) (1), which provide real-time snapshots of translation (the translatome) across the whole transcriptome. Many studies have exploited this powerful technique to systematically characterize multiple features of translation, including translational rates (2–4), pausing upon stress signals (5–7), stop codon read-through (8), the translation potential of non-coding sequences (9–12), and alternative reading frames (10,13). Many previously unannotated open reading frames (ORFs) have been identified from published ribosome profiling data and indexed by specialized databases (14,15). 
However, it has also been frequently shown that ribosome occupancy itself, as indicated by the RPF reads mapped on the transcriptome, is not sufficient for calling active translation, given the possible noise from data processing and experimental procedures, regulatory RNAs that bind the ribosome, and ribosome engagement without translation (16,17). This necessitates a specially designed methodology to recover active translation events from the usually distorted and ambiguous signals in ribosome profiling data. Such a method should fully account for the complexity of translation itself, such as alternative initiation sites and overlapping ORFs. Owing to its subcodon resolution, ribosome profiling reveals the precise locations of the peptidyl-site (P-site) of the 80S ribosome in the RPF reads, provided that the experiment was properly performed and the RPF reads were correctly filtered. Aligned by their P-site positions, the RPF reads resulting from translating ribosomes should therefore exhibit 3-nt periodicity along the ORF, which is the strongest evidence of active translation. Only recently have different strategies been developed to assess translation by testing the distribution of ribosome engagement at subcodon resolution (11,12,18–23). These methods have been comprehensively reviewed in (24). Some of these methods use machine learning, which requires prior annotation of known coding transcripts for training the model (12,21). As with supervised methods in general, their results rely heavily on the pre-annotated training set, a potential source of intrinsic bias. On the other hand, only a few methods were designed for de novo translatome annotation by directly assessing the 3-nt periodicity; these include ORFscore (11), RiboTaper (18) and RP-BP (22). 
In the present study, we have developed a new statistically rigorous method, RiboCode, for the de novo annotation of the full translatome by quantitatively assessing the 3-nt periodicity (Figure 1). Tested with both simulated and real data, and further benchmarked with cell type-specific QTI-seq and mass spectrometry data, RiboCode exhibited superior efficiency, sensitivity and accuracy compared with the existing de novo and supervised methods. We then performed detailed comparisons between RiboCode and the existing methods for discovery of non-canonical ORFs such as upstream ORFs (uORFs), and provide several representative examples. Furthermore, to showcase the application of RiboCode in reconstructing context-dependent translatomes, we applied RiboCode to a published ribosome profiling dataset to assemble the translatomes of yeast under normal conditions, heat shock, and oxidative stress (25). Comparisons among these translatomes revealed novel ORFs in the canonically non-coding regions that were activated in response to heat shock and oxidative stress. Quantitative analysis of the ORFs further showed that some of the uORFs and downstream ORFs (dORFs) were indeed associated with potential translational dysregulation of the previously annotated main coding regions of the mRNA transcripts. Figure 1. The methodology design of RiboCode. Schematic description of RiboCode. Further details are provided in the Materials and Methods section. 
MATERIALS AND METHODS Pre-processing of the ribosome profiling and RNA-seq data The five sets of ribosome profiling data, comprising two in HEK293 cells (18,26) and one each in Zebrafish (11), mouse liver cells (26), and the cancer cell line PC3 (3), were downloaded from the NCBI Sequence Read Archive and the Gene Expression Omnibus (GEO) database. The accession IDs are SRA160745 for HEK293 (Gao et al.) and mouse liver cells, GSE73136 for HEK293 (Calviello et al.), GSE35469 for PC3 and GSE53693 for Zebrafish. The ribosome profiling data of yeast under normal, oxidative stress, and heat shock conditions were also downloaded from GEO (GSE59573). The pre-processing procedure for the ribosome profiling data has been described previously (27). Specifically, the cutadapt program (28) was used to trim the 3′ adaptor from the raw reads of both mRNA and RPF. Low-quality reads, in which more than 50% of the bases had Phred quality scores lower than 20, were removed using the fastx quality filter (http://hannonlab.cshl.edu/fastx_toolkit/). Next, sequencing reads originating from rRNAs were identified and discarded by aligning the reads to the rRNA sequences of the particular species using Bowtie (version 1.1.2) with no mismatch allowed. The remaining reads were then mapped to the genome and spliced transcripts using STAR with the following parameters: --outFilterType BySJout --outFilterMismatchNmax 2 --outSAMtype BAM --quantMode TranscriptomeSAM --outFilterMultimapNmax 1 --outFilterMatchNmin 16. To control the noise from multiple alignments, reads mapped to multiple genomic positions were discarded. RiboCode step 1: preparation of the transcriptome annotation This step defines the annotated transcripts, from which the candidate ORFs will be identified. This is done by the prepare_transcripts command in the RiboCode package, with inputs of a GTF file and a genome FASTA file. 
The GTF and FASTA files (release 74 for human, and release 87 for Zebrafish) were downloaded from the Ensembl FTP repository (http://www.ensembl.org/info/data/ftp/index.html). Each transcript was assembled by merging the exons according to the structures defined in the GTF file. The transcript sequences were then retrieved from the genome FASTA file. The yeast genome (version R61-1-1) was retrieved from the SGD database (http://www.yeastgenome.org) and the transcriptome annotation was obtained from (29). Note that RiboCode requires the GTF file in the standard format, which includes the three-level hierarchy of annotations (genes, transcripts and exons). Such standard GTF files can be obtained from the ENSEMBL/GENCODE databases. GTF files from other sources, or custom GTF files, may lack the gene and transcript annotation information. The RiboCode package therefore provides a command, GTFupdate, which adds the missing information to a non-standard GTF file and converts it into the standard format. Please refer to the software instruction page at https://pypi.python.org/pypi/RiboCode for more information. RiboCode step 2: filtering of the RPF reads and identification of the P-site locations The purpose of this step, performed with the metaplots command in the RiboCode package, is to (i) select the length range of the RPF reads that are most likely to have originated from translating ribosomes and (ii) identify the P-site locations for the different RPF lengths. This was done with a meta-gene analysis of the RPF reads mapped on the previously annotated coding genes (Figure 1). Specifically, for each set of RPF reads with a particular length, the distances from their 5′ ends to the annotated start and stop codons were calculated and summarized as histograms (Supplementary Figure S14 as an example). 
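As a rough illustration of the meta-gene summary just described, the sketch below tallies, for each read length, the distance from a read's 5′ end to the annotated start codon. It is not the implementation of the metaplots command; the function name, the (transcript_id, five_prime_pos, length) read tuples, the start_positions dictionary and the window parameter are hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def start_codon_histograms(reads, start_positions, window=30):
    """Tally 5'-end-to-start-codon distances separately for each read length.

    reads: iterable of (transcript_id, five_prime_pos, length) tuples, with
           five_prime_pos given in 0-based transcript coordinates (assumed layout).
    start_positions: dict mapping transcript_id to the 0-based position of the
           first nucleotide of the annotated start codon (assumed layout).
    Returns {read_length: Counter({distance: number_of_reads})}.
    """
    hist = defaultdict(Counter)
    for transcript_id, five_prime_pos, length in reads:
        start = start_positions.get(transcript_id)
        if start is None:
            continue  # transcript without an annotated start codon
        distance = five_prime_pos - start  # negative: 5' end lies upstream
        if -window <= distance <= window:
            hist[length][distance] += 1
    return hist


if __name__ == "__main__":
    toy_reads = [("t1", 88, 28), ("t1", 88, 28), ("t1", 91, 28), ("t2", 5, 29)]
    toy_starts = {"t1": 100, "t2": 17}
    for length, counts in sorted(start_codon_histograms(toy_reads, toy_starts).items()):
        print(length, sorted(counts.items()))
```

For a given read length, a histogram whose counts recur every third position indicates 3-nt periodicity, and the most frequent distance near the start codon is the kind of signal from which a P-site offset can be read off.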
The length range in which the pooled RPF reads show strong 3-nt periodicity from their 5′ ends to the start and stop codons should then be determined by the user for the subsequent RiboCode analysis. In the examples shown in Supplementary Figure S14, the RPF length range was deemed to be 26–29 nt for HEK293 and 28–29 nt for Zebrafish. Also from the histograms for each of the RPF lengths selected above, the P-site locations were inferred according to the offsets of the 5′ ends of the RPF reads mapped on the start codons. In the examples shown in Supplementary Figure S14, the P-sites were identified as the +12th nt of all the RPF reads within the selected length range for the HEK293 and Zebrafish data. Supplementary Table S8 presents the selected read lengths and the P-site positions for the different ribosome profiling datasets used in the present study. In our experience, selecting RPF reads of around 28–30 nt is appropriate in most cases, and their P-site positions are usually at +12. However, we believe that it is critical to run this step of RiboCode to extract the RPFs that are most likely from translating ribosomes and to precisely determine their P-site positions. Alternatively, users have the option to skip this step and directly provide the read lengths and P-site positions based on their experience, although this is not recommended, especially when the experimental conditions (species, culturing condition, stress) or the ribosome profiling procedure (nuclease, buffer, library preparation) have been changed. RiboCode step 3: identification of the candidate ORFs and assessment of the 3-nt periodicity As the primary analysis procedure of RiboCode, this step is executed with a single command, RiboCode (Figure 1). It starts with a transcriptome-wide search for the candidate ORFs from a canonical start codon (AUG) to the next stop codon. 
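The candidate ORF search can be sketched as follows. This is a minimal illustration, not RiboCode's code: the function name and the coordinate convention (0-based start, exclusive end that includes the stop codon) are assumptions, and the sketch reports every in-frame start codon paired with its stop codon without choosing among them.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_candidate_orfs(transcript, start_codons=("ATG",)):
    """Return (start, end) pairs for every start codon that reaches an in-frame stop.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    RNA input is accepted (U is converted to T); start_codons use DNA letters.
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in start_codons:
            continue
        # Walk codon by codon in the reading frame set by this start codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3))
                break  # the first in-frame stop codon ends the candidate ORF
    return orfs


if __name__ == "__main__":
    print(find_candidate_orfs("CCAUGAAAUGGUAACC"))
```

A start codon with no in-frame stop codon downstream yields no candidate here. Passing extra codons through start_codons widens the search everywhere; restricting alternative start codons to regions outside AUG-initiated ORFs would need an additional filter that this sketch omits.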
Optionally, alternative start codons provided by the user, for example CUG and GUG, can also be included in the search for candidate ORFs in the regions outside of the ORFs with the canonical start codon AUG. Next, based on the mapping results of the RPF reads within the length range identified in the second step, for each nucleotide of the candidate ORF, RiboCode counts the number of reads whose P-sites were assigned to that nucleotide. Eventually, RiboCode generates a spectrum of the P-site densities at each nucleotide along each candidate ORF. Mathematically, the spectrum of the P-site densities along each candidate ORF is a numerical vector with the length of the ORF. From this vector, we derived three shorter vectors, each with one-third of the length of the ORF. As shown in Figure 1, one of these three vectors, F0, represents the P-site density at the first nucleotide of each codon, from the start to the stop codon. Similarly, the other two vectors, F1 and F2, represent the P-site densities at the second and the third nucleotide, respectively, of each codon. To assess the 3-nt periodicity, a modified Wilcoxon signed rank test was used to evaluate whether F0 is generally greater than F1 and F2 at the non-zero positions. This yields two P-values, indicating the significance levels of F0 > F1 and F0 > F2. Finally, an integrated P-value was derived with Stouffer's method, which represents the overall statistical significance of the 3-nt periodicity. Many transcripts have multiple start codons upstream of the stop codon, and we followed two simple principles to identify the translation initiation sites for the candidate ORFs. (i) We used the same modified Wilcoxon signed rank test, as described above, to assess the 3-nt periodicity of the RPF reads mapped between the most upstream (first) start codon and the next one (second) downstream. 
This was done only if more than 10 codons in this region had in-frame RPF counts larger than zero. If this test resulted in a statistically significant 3-nt periodicity (P-value smaller than the cutoff provided by the user, e.g. 0.05), we defined the first start codon as the translation initiation site. Otherwise, we disregarded it and repeated the same procedure for the region between the second start codon and the subsequent one. Note that the capability of RiboCode in dealing with short sequences, as shown in Figure 2B, makes it possible to assess the 3-nt periodicity between two neighboring start codons, which are usually close. (ii) If the two start codons are too close or if there are limited RPF reads (fewer than 10 codons with non-zero in-frame RPF counts) between two neighboring start codons, we chose the upstream start codon of the region in which the codons that have more in-frame than off-frame RPF reads (frame0 > frame1 and frame0 > frame2) outnumber the ones that do not (frame0 ≤ frame1 and frame0 ≤ frame2). Figure 2. Performance of RiboCode compared with the de novo methods RiboTaper and ORFscore. (A) The numbers of CCDS exons identified by different methods with the RPF data in HEK293 cells. The cutoffs used for three methods, RiboCode, RiboTaper and ORFscore, were calibrated so that they produced the same numbers of false positives with the RNA-seq data. (B) Distributions of the lengths, total read counts, and coverage of the CCDS exons identified by RiboCode, RiboTaper and ORFscore. (C) ROC and precision curves generated with the results of RiboCode, RiboTaper and ORFscore with two simulation datasets, one generated from the HEK293 cell data in Gao et al. (left) and the other one from the Zebrafish data in Bazzini et al. (right). The P-values of the ROC curve differences between RiboCode and the second best method were provided in Supplementary Figure S2B.
(D) A representative ROC curve generated with the results of RiboCode on a simulation dataset specifically for the overlapping ORFs. The simulation for overlapping ORFs was performed 20 times, and the box plot inside summarizes the AUC of the 20 ROC curves from the results of RiboCode applied on these 20 datasets. Generation of the simulation datasets The exon-level simulation datasets used in Figure 2A–C and Supplementary Figure S2 were generated from the five datasets of ribosome profiling with RNA-seq in parallel, in HEK293 (Gao et al. and Calviello et al.), Zebrafish, mouse liver and PC3.
The P-site data track for each CCDS exon from the Ensembl annotation was created using the RiboTaper package (P_sites_all_tracks_ccds and Centered_RNA_tracks_ccds files in data_tracks generated by RiboTaper). The read lengths and P-site locations used in these data are provided in Supplementary Table S8. For the RNA-seq data used as true negatives, the 25th position was arbitrarily defined as the P-site position. Exons shorter than 10 nt were discarded. The RiboTaper package was used to calculate the ORFscore and P-value of RiboTaper (results_ccds generated by RiboTaper). The ribosome profiling datasets with different levels of noise, used in Supplementary Figure S3A, were generated by subsampling different fractions of the RPF reads of the HEK293 data (26) and shuffling their P-site positions among –1, 0, and +1 relative to the original position (+12 nt). For the datasets with reduced sequencing depth, used in Supplementary Figure S3B, we randomly discarded different percentages of the RPF reads in the HEK293 data (26). The gene-level simulation datasets used in Figure 3 and Supplementary Figure S4 were also generated from the five datasets in HEK293, Zebrafish, mouse liver, and PC3. Specifically, from the original ribosome profiling data, the RPF reads uniquely mapped on 1000 randomly selected annotated protein-coding genes (with RPF read count > 5) were collected. The protein-coding transcripts of these genes were considered as true positives of translation. Next, true negatives were also defined from these 1000 genes, of which the RNA-seq data was simply used as the simulated RPF reads. Each of the five datasets, two used in Figure 3 and the other three in Supplementary Figure S4A, was therefore composed of the RPF reads of 1000 coding genes for positives and the RNA-seq reads for negatives. Figure 3. Performance of RiboCode compared with the supervised methods and de novo method RP-BP.
(A) ROC and precision curves generated with the results of RiboCode, RibORF, ORF-RATER and RP-BP with two simulation datasets. The P-values of the ROC curve differences between RiboCode and the second best method were provided in Supplementary Figure S4B. (B) The numbers of true positives identified by different methods with the RPF data in HEK293 cells and Zebrafish. The cutoffs used for these methods were calibrated so that they produced the same numbers of false positives (RNA). RiboCode and other existing methods were applied on these simulated datasets. Overall performances of the tested methods were assessed by ROC and precision analysis using the R package ROCR. The statistical significance (P-value) of the difference between two ROC curves was inferred with an online tool at http://vassarstats.net/roc_comp.html based on the method in the reference (30). Running of the existing methods All the existing methods were applied with their default settings. The same pre-processed ribosome profiling datasets (simulated or real) and the same transcriptome annotation files were supplied to the different methods, including RiboCode, RiboTaper (version 1.3), RP-BP (version 1.1.8), ORF-RATER and RibORF (version 0.1). For RiboTaper, the values of ‘ORF_pval_multi_ribo’ indicate the statistical significance of the translations and were therefore used for ranking of the ORFs, from low to high.
For RP-BP, the values of ‘bayes_factor_mean’ were used for ranking the predicted ORFs, of which the larger value indicates stronger signal of translation. For RibORF, the value ‘pvalue’ was used to evaluate the possibility of translation of an ORF. Similarly, for ORF-RATER, the value ‘orfrating’ was used. For all these methods, the same predefined read lengths and P-site positions were set as shown in Supplementary Table S8. All our scripts used for running the existing algorithms have been provided in Supplementary File 1, which also includes detailed tutorials to help the users run these algorithms. The scripts and tutorials can also be found on GitHub at https://github.com/xryanglab/ORFcalling. Validations with QTI-seq and MS data The cutoffs were set so that all the methods identified the same total number of ORFs, except ORF-RATER, of which the predicted ORFs are much fewer than any of the other methods. Given that not all the methods were designed for identification of the exact translation initiation sites, the ORFs predicted by different methods but with the same stop codon were considered the same, and the longest ORF was selected for the validations with QTI-seq data and MS data. The types of ORFs from the coding genes were defined based on their coordinates relative to the longest CDS on the genome. The peptide sequences predicted by all the methods were pooled together for searching in the MS/MS data. Annotation of the ORFs from QTI-seq data For each initiation site identified by the QTI-seq data (26), we selected the closest downstream in-frame stop codon, thereby annotating an ORF. If one initiation site has more than one in-frame stop codon in different transcripts of the same gene, only the one harbored in the longest transcript was chosen. Mass spectrometry data collection and processing Human MS/MS data of HEK293 cells were obtained from our previously published study (31) and ProteomeXchange Consortium (PXD002389).
Zebrafish MS/MS data was downloaded from ProteomeXchange Consortium (PXD000479, testis tissue). The peptides were searched using the SEQUEST search engine of Proteome Discoverer (PD) software (version 1.4). The same search criteria as published before (31) were used. The false discovery rate (FDR), calculated using Percolator provided in PD, was set to 0.1 for peptides and proteins. Counting of the RPF reads of the ORFs For the yeast data, the RPF reads on each ORF were counted based on HTSeq-count (27,32) in intersection-strict mode. The RiboCode package provides a function ORF_counts for this purpose. Only the RPF reads with length between 27 and 29 nt, which were found to exhibit strong 3-nt periodicity, were used for counting. Due to the potential accumulation of ribosomes around the starts and ends of the coding regions (9,33), reads aligned to the first 15 and last 5 codons were excluded from the counting of RPF reads for the ORFs longer than 100 nt. Note that it is optional for the function ORF_counts to include or exclude the reads close to the start and the stop codon. The raw read counts of each ORF across the three conditions were further subjected to median-of-ratios normalization (34). RESULTS Methodology design of RiboCode The methodology of RiboCode primarily relies on evaluation of the 3-nt periodicity of the RPF reads aligned by the P-sites on the RNA transcripts. Considering the usually distorted patterns of RPF read allocations and potentially high noise level of the ribosome profiling data, we adapted the Wilcoxon signed-rank test to assess the oddness of consistently higher in-frame reads along the whole ORF.
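The periodicity assessment described in the Materials and Methods can be illustrated with a minimal Python sketch. It assumes a standard one-sided Wilcoxon signed-rank test with a normal approximation (without tie or continuity correction) and the unweighted form of Stouffer's method. The modification of the Wilcoxon test used in RiboCode is not reproduced here, and all function names are hypothetical rather than part of the RiboCode package.

```python
from math import sqrt
from statistics import NormalDist

_NORM = NormalDist()  # standard normal distribution

def split_frames(psite_density):
    """Split a per-nucleotide P-site vector into F0, F1, F2 (one value per codon)."""
    n = len(psite_density) - len(psite_density) % 3
    return psite_density[0:n:3], psite_density[1:n:3], psite_density[2:n:3]

def wilcoxon_greater(x, y):
    """One-sided signed-rank P-value for x > y (normal approximation).

    Simplification: codons with equal values in both frames (including
    both zero) are dropped, and tied absolute differences get average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return 1.0 - _NORM.cdf((w_plus - mean) / sd)

def stouffer(pvalues):
    """Combine one-sided P-values with Stouffer's unweighted Z method."""
    zs = []
    for p in pvalues:
        q = min(max(1.0 - p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside (0, 1)
        zs.append(_NORM.inv_cdf(q))
    return 1.0 - _NORM.cdf(sum(zs) / sqrt(len(zs)))

def periodicity_pvalue(psite_density):
    """Integrated P-value for F0 > F1 and F0 > F2 along one candidate ORF."""
    f0, f1, f2 = split_frames(psite_density)
    return stouffer([wilcoxon_greater(f0, f1), wilcoxon_greater(f0, f2)])
```

For example, a vector with reads concentrated on the first nucleotide of every codon, such as [5, 1, 0] repeated over 20 codons, gives a small integrated P-value under this sketch, whereas a flat vector does not.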
The workflow of RiboCode is composed of three major steps: (i) preparing the transcriptome for search of the candidate ORFs, (ii) determining the length range of the RPF reads that are most likely to be from active translation, and identifying the P-site positions in these reads, and (iii) assessing the active translation event via statistical comparisons among the three vectors representing the RPF read densities in and off the reading frame along each candidate ORF. The analysis strategy of RiboCode is illustrated in Figure 1, and the details of the method design are provided in the Materials and Methods section. The performance of RiboCode for de novo translatome annotation Here we compared the performance of RiboCode with those of the existing methods that were designed for de novo annotation of the translatome, including RiboTaper (18) and ORFscore (11). Since the methodology of RiboTaper was based on testing of each annotated exon, and its performance was originally benchmarked at the exon level (18), our comparisons among RiboCode, RiboTaper and ORFscore were similarly executed at the exon level. Note that the other de novo method, RP-BP, does not work on exons, and is therefore included for comparison in the next section. We first used a published ribosome profiling dataset in human HEK293 cells (26), which was the most frequently used dataset for evaluating the existing methods in the literature, including RiboTaper and RP-BP. RPF reads of the consensus coding sequence (CCDS) exons were considered as positives for translation, and the paralleled RNA-seq data was included to mimic the negatives, i.e., simulated RPF reads of untranslated RNA that lack the 3-nt periodicity. We calibrated the cutoffs for all three methods, RiboCode, RiboTaper and ORFscore, to achieve the same false positive rate (∼7.5%, 3215). As a result, RiboCode recovered many more CCDS exons than the other two methods did (Figure 2A, detailed results in Supplementary Table S1).
In addition, unlike the other methods, RiboCode yielded significant distinctiveness when processing the RPF reads and RNA-seq reads, which is not or only slightly dependent on the length, read counts, or coverage of the CCDS exons (Supplementary Figure S1A–C). Indeed, the distributions of the lengths, RPF read counts, and coverages of the results are highly concordant with those of the full CCDS exon set as a background (Figure 2B), suggesting limited bias of RiboCode when annotating the full translatome. Note that the exons with the RPF read count fewer than 10 or the coverage smaller than 0.1 were discarded. The other two methods, however, showed some bias towards the ORFs with high read counts and coverage (Figure 2B, Supplementary Figure S1A–C). The P-value distributions of the results of RiboCode indeed showed a much cleaner separation of the CCDS exons called from the RPF reads and the ones from the RNA-seq reads (Supplementary Figure S1D). To further systematically evaluate the sensitivity and specificity of the three methods, we prepared ROC and precision curves with the results of the different methods applied on five published ribosome profiling datasets, in HEK293 cells (18,26), Zebrafish (11), mouse liver cells (26) and cancer cell line PC3 (3) (results of HEK293 (Gao et al.) and Zebrafish in Figure 2C, and results of mouse liver cell, PC3, and HEK293 (Calviello et al.) in Supplementary Figure S2A). The paralleled RNA-seq data was again used as true negatives. The detailed results are provided in Supplementary Table S1. The statistical significances (P-values) of the performance differences between RiboCode and the second best method, by comparing the ROC curves, were summarized in Supplementary Figure S2B. These test runs illustrated the superior sensitivity and specificity of RiboCode compared to the two other existing methods. The tolerance to the sometimes unavoidable technical noise is important for the broad applications of a method. 
This is especially true for the analysis of ribosome profiling data, given its nature of high noise resulting from contaminations of non-ribosome-bound RNA, regulatory RNA in the ribosomal complex, inappropriate RPF read length selections, and inaccurate P-site position. These noises result in either contamination of the RPF reads or incorrect alignments of the reads, both of which should weaken the 3-nt periodicity. Essentially, such noise can be simulated by shuffling the P-site among the three positions, –1, 0, or +1 in relative to the original position, for a randomly selected subset of the RPF reads, which by definition weakens the overall 3-nt periodicity of the RPF reads. Stress tests of the three methods were performed with such datasets generated from the HEK293 data (26), in which different percentages of the RPF reads were disturbed. The ROC analyses with the results showed that RiboCode consistently out-performed the other two methods with low- to high-noise data (Supplementary Figure S3A). In addition, considering that the sequencing depth of the different ribosome profiling studies could vary significantly, we also tested the performance of the three methods with different numbers of RPF reads. As Supplementary Figure S3B shows, RiboCode was able to deliver relatively good performances, which were not much sacrificed with fewer total RPF reads. Taken together, these tests suggest that RiboCode is of great value for annotating the translatomes with ribosome profiling datasets that are of relatively low quality or with limited number of usable RPF reads. One of the major challenges for the de novo annotation of the translatome is the complicated re-coding events, including the frequently found overlapping off-frame ORFs. The methodology design of RiboCode genuinely allows assessment of the overlapping ORFs, while the two existing methods for de novo translatome annotation, RiboTaper and ORFscore, cannot recover such recoding events by design. 
Here, we used a simulation dataset to test the performance of RiboCode in annotating the actively translated overlapping ORFs. Specifically, with the previously used HEK293 dataset (26), we overlaid the RPF reads of two annotated CCDS with a +1 or +2 frame shift to simulate the RPF reads from an artificial pair of overlapping ORFs. For a negative case, without changing the RPF reads, we randomly assigned an artificial ORF that partly overlaps (with a frame shift) with an annotated CCDS. As Figure 2D shows, such simulation was repeated for 20 times, and RiboCode always exhibited high sensitivity and accuracy in capturing the actively translated overlapping ORFs. Comparisons between RiboCode and other existing methods In addition to the unsupervised de novo methods for annotating the translatome, two other methods, ORF-RATER (21) and RibORF (12), both of which use the strategy of machine learning, can also be used to assess the RNA translation. However, these methods rely on subsets of the ORFs that were pre-defined to be actively translated. Although technically they were not designed for de novo annotation of the translatome, we also performed systematic comparison between these supervised methods and RiboCode. Here, we also included the de novo method, RP-BP, which works at the transcript level and thereby was not included in the previous comparison. We again used the five published ribosome profiling datasets, in HEK293 cells (18,26), Zebrafish (11), mouse liver cells (26) and cancer cell line PC3 (3) to test the four methods. RPF reads of 1000 randomly selected consensus protein-coding genes were considered as positives for translation, and the paralleled RNA-seq data was included to mimic the negatives. The detailed results are provided in Supplementary Table S2. ROC and precision curves were prepared to illustrate the sensitivity and specificity of the four methods (Figure 3A for HEK293 (Gao et al.) 
and Zebrafish, and Supplementary Figure S4A for mouse liver cells, PC3 cells and HEK293 (Calviello et al.)). Again, RiboCode significantly out-performed the two supervised methods and the de novo method RP-BP (Supplementary Figure S4B). Indeed, when controlling the total number of false positives, i.e. transcripts identified as actively translated based on the RNA-seq data, RiboCode recovered many more coding genes based on the RPF data, than the other three methods did (Figure 3B). These test runs therefore indicated the superior sensitivity and specificity of RiboCode compared to the other existing methods. Finally, it is worth noting that owing to the efficient statistical design, RiboCode is very user-friendly and requires little computation resource. Annotation of the full translatome with the ribosome profiling dataset in HEK293 cells (26) took about 8 min with RiboCode on a single-core computer (8 core-minutes), which is trivial compared to RiboTaper (∼20 h on a 16-core server, 3570 core-minutes), ORF-RATER (82 core-minutes) and RP-BP (240 core-minutes) (Supplementary Figure S5). Only RibORF takes the similar computing time (6 core-minutes), but this does not include the time for model training in its machine learning pipeline. Validations of the predicted ORFs by QTI-seq data Multiple studies have reported widespread alternative translation initiation (9,35,36), which is suspected to be context-dependent. A precise annotation of the translation initiation sites is therefore critical for the complete assembly of the translatome. Several bioinformatics tools have been developed for searching of the AUG start codons from the mRNA sequences, and only recently the ribosome profiling data was used for training of the method in calling the AUG and near–cognate start codons from the mRNA sequences (37). 
Experimentally, blockage of elongation from the newly assembled initiation complex with antibiotics such as harringtonine and lactimidomycin (9,35) allows the efficient screening of the translation initiation sites with ribosome profiling. However, such experimental setting is not a common practice in the previous and recent ribosome profiling experiments. After all, one of the primary goals of ribosome profiling is to quantify the translation efficiencies (TE), and blocking the translation elongation would make such application unfeasible. Therefore, it would be greatly beneficial to have a method that can precisely allocate at least some of the translation initiation sites directly from the regular ribosome profiling data. We used a QTI-seq dataset that comprehensively mapped the translation initiation sites of the coding genes in HEK293 cells (26), to test the performances of RiboCode and the other existing methods in correctly annotating the real start codons with the ribosome profiling data in the same cellular context. The results of all the methods, for a complete translatome annotation with the regular ribosome profiling data in HEK293 cells (26), were provided in Supplementary Table S3, in which the detailed information of the ORFs including the initiation sites can be found. However, RP-BP and RibORF were not designed for annotating the translation initiation sites, and therefore they were not included in the following comparison. The accumulation curves were prepared to show the proportions of the presumably true initiation sites (identified by QTI-seq) that were correctly recovered by the three methods with the ribosome profiling data (Figure 4A). It appears that RiboCode is indeed more efficient in annotating the translation initiation sites. It is worth noting that ORF-RATER generated the ORF predictions that were much fewer than all the other methods did (also seen in Figures 3B and 4B). 
As reported in its original article, ORF-RATER was designed to capture the most high-confidence ORFs and expected to have a high false negative rate (21). In fact, the more preferred application scenario for ORF-RATER, by design, would be mining of the ribosome profiling datasets from the untreated cells in parallel with the cells treated with the antibiotics such as harringtonine and lactimidomycin that inhibit translation elongation (21). Figure 4. Validations of the ORFs with QTI-seq data. (A, B) Accumulation curves showing proportions of the initiation sites (A) and the annotated ORFs (B) identified by QTI-seq that were recovered by RiboCode or other existing methods. (C–F) The cutoffs of all the methods (except ORF-RATER) were set so that they yielded the same total number of predicted ORFs, as marked on the accumulation curve in panel (B). The bar plots show the numbers of the previously annotated ORFs (C) and the uncanonical ORFs (D–F) identified by the different methods with ribosome profiling data. The proportions of the annotated ORFs (C), uORFs (D) and overlapping ORFs (E) that are supported by the QTI-seq data were provided next to the bar plots and also marked on the bar plots with darker colors. Under different categories of the ORFs, the highest proportions of validation were highlighted with dark red color.
By capturing the accumulated ribosomes at the initiation sites due to stalled translation elongation, the QTI-seq data was also used to predict the actively translated ORFs of the coding genes, including both the annotated main coding sequence (CDS) and unannotated ORFs, such as the uORFs and the previously discussed overlapping ORFs. We then evaluated the overlaps between the ORFs inferred from the QTI-seq data and the ORFs identified by RiboCode and the existing methods with ribosome profiling data. The accumulation curves (Figure 4B) indicate the proportions of the ORFs from the QTI-seq data that were also identified by the different methods with the ribosome profiling data, and clearly, RiboCode illustrated higher sensitivity to the ORFs identified by QTI-seq (Figure 4B). In other words, with the same total number of predicted ORFs, RiboCode recovered more ORFs that were also supported by QTI-seq data, than the other methods did. These include the previously annotated protein-coding ORFs, uORFs, dORFs and overlapping ORFs (Figure 4C–F). With a pre-set total number of predicted ORFs (9000 as shown on Figure 4B, to fit the result of RibORF), RiboCode identified the largest number of annotated coding ORFs, with the highest validation rate by QTI-seq (Figure 4C). As a result, RiboCode identified fewer of the other types of ORFs (uORFs, overlapping ORFs, and dORFs) than some other methods did (Figure 4D–F), which is expected given the same total number of ORFs identified by each method.
Nevertheless, among the four methods (ORF-RATER excluded due to the small size of its result), RiboCode had the highest validation rates of the predicted uORFs and overlapping ORFs, by QTI-seq (Figure 4D and E). The translatomes assembled by RiboCode and supports from MS data Collectively, the results above illustrate the sensitivity and accuracy of RiboCode for comprehensive de novo annotation of the translatome with ribosome profiling data. We then summarized the different types of ORFs recovered by RiboCode and the other existing methods, with two published ribosome profiling datasets in the HEK293 cell (26) and Zebrafish (11) (Figure 5). The detailed results are provided in Supplementary Tables S3 (HEK293) and S4 (Zebrafish). The protein or peptide products from these ORFs were further validated, in a cell type-specific manner, with published Mass Spectrometry (MS) data of the HEK293 cell and Zebrafish (Figure 5, Supplementary Table S5). With both the HEK293 and Zebrafish data, the total sets of ORFs identified by RiboCode had the highest validation rates, among all the methods, with ORF-RATER excluded for comparison (Figure 5). Furthermore, for various sub-categories of the ORFs, while RP-BP or RiboTaper in some cases delivered slightly higher validation rates, RiboCode in general performed well and in a balanced manner in recovering the uncanonical ORFs that are supported by the MS data (Figure 5). These validated ORFs include many previously unannotated uORFs, dORFs, overlapping ORFs, and ORFs from non-coding genes. Some examples were given in Supplementary Figure S6A–D. Figure 5. De novo annotations of the translatomes and validations with MS data. Bar plots showing the proportions of the ORFs that are supported by the MS data of HEK293 cells or Zebrafish. Provided on each bar plot are the number of ORFs predicted by a particular method (top) and the number of ORFs validated with the specific MS data (bottom).
The validation results of all the ORFs (total) and the different sub-categories of the ORFs are provided. Under each category of the ORFs (four bars in each column, except ORF-RATER), the one with the highest validation rate was outlined by green color. Comparisons of the uncanonical ORFs identified by RiboCode and other existing methods As discussed above, the systematic comparisons among the different methods with simulated and real datasets have illustrated the outstanding performance of RiboCode for de novo annotation of the translatomes. Discovery and functional analyses of the uncanonical ORFs, for example uORFs, are of particular interest in the field of translation. Therefore, we used the HEK293 dataset (Gao et al.) again as an example and summarized the uncanonical ORFs identified by RiboCode and other existing methods (Figure 6A). Compared to each of the existing methods, RiboCode annotated significantly different sets of uORFs, dORFs, and overlapping ORFs (Figure 6A). Next, taking the uORFs as examples, for the ones annotated by both RiboCode and each of the existing methods (numbers in the parentheses in Figure 6A), we found that the ranks of these ORFs by RiboCode and the other methods were largely inconsistent (Supplementary Figure S7A–D).
Taken together, these data indicated that RiboCode and the other existing methods behave differently when identifying and prioritizing the high-confidence uncanonical ORFs. Figure 6. Uncanonical ORFs identified by RiboCode and other existing methods. (A) Total counts of the uORFs, dORFs, and overlapping ORFs that were identified by five different methods. The numbers of ORFs identified by both RiboCode and each of the other four methods were provided in the parentheses. (B, C) Two representative examples from the top 10 uORFs (Supplementary Figure S8) identified by RiboCode. (D) A representative example from the top 10 uORFs (Supplementary Figure S9) identified by RiboTaper. (E) A representative example from the top 10 uORFs (Supplementary Figure S11) identified by RibORF. (F) A representative example from the top 10 uORFs (Supplementary Figure S12) identified by ORF-RATER. (G) Ranks of the five uORF examples above in panels B–F by the five methods.
Therefore, we looked into the top 10 uORFs with the highest confidences inferred by different methods (indicated by P-values for RiboCode and RiboTaper, Bayes factor for RP-BP, ‘pvalue’ for RibORF, and ‘orfrating’ for ORF-RATER), which are listed in Supplementary Figures S8–S12. As shown by these case examples, all the 10 uORFs with the top confidence levels predicted by RiboCode have high in-frame reads, strong 3-nt periodicity, and are relatively long (Supplementary Figure S8), which are all indicative of active translation. For example, the first uORF (ENSG00000183479_152713336) was recovered as the top one by three methods including RiboCode, RiboTaper, and RP-BP (Figure 6B, G, Supplementary Figure S8), whereas RibORF did not render a top rank to this uORF and ORF-RATER completely missed it (Figure 6G, Supplementary Figure S8). In addition, another of these top 10 uORFs (the ninth) was missed by both RP-BP and ORF-RATER, and it was lowly ranked by RiboTaper (305th/491) and RibORF (199th/316) (Figure 6C, G, Supplementary Figure S8). Most of the top 10 uORFs annotated by RiboTaper were also highly ranked by RiboCode (Figure 6G, Supplementary Figure S9), and they indeed showed strong spectrum patterns of translation, except the sixth uORF, which had a small read count and did not show a 3-nt periodicity as strong as the other 9 (Figure 6D, Supplementary Figure S9). Therefore, this indicates a potential misjudgement by RiboTaper, whereas by contrast, RiboCode deprioritized this uORF among the full list of uORFs (304th/414, Figure 6G), which we believe is appropriate. Most of the top 10 uORFs identified by RP-BP have relatively low in-frame read counts (Supplementary Figure S10). It appears that these uORFs were highly ranked because their off-frame read counts were mostly 0, which gave rise to seemingly high ‘in-frame to off-frame’ ratios. 
However, given the imperfect features of ribosome profiling data, including high sequencing noise, errors in P-site locations, and RNA contamination, some of these top-ranked uORFs are likely to be false positives, or at least should not be granted such high priorities. Similar to the results of RP-BP, many of the top uORFs predicted by RibORF also have low in-frame read counts, and their 3-nt periodicities are weak (Supplementary Figure S11, and an example given in Figure 6E). By contrast, these uORFs with little support from the read spectra were left out or deprioritized by RiboCode (Figure 6G, Supplementary Figure S11). For ORF-RATER, the top 11 uORFs all have the same highest score (Supplementary Figure S12). However, seven of them are extremely short (15–21 nt). This makes it questionable whether these small uORFs were actually translated, even though some of them have high in-frame read counts (an example given in Figure 6F). Most of these uORFs were indeed lowly ranked or disregarded by other methods (Figure 6G, Supplementary Figure S12). In summary, the detailed comparisons between the uORFs annotated by different methods, especially for the top-ranked ones, again showed the sensitivity and accuracy of RiboCode for discovery of the uncanonical small ORFs with reliable evidence of translation. Importantly, RiboCode outperformed the other existing methods in prioritizing the most likely translated ORFs, in tolerating distracting noise, and in excluding the misleading data patterns that resulted in false discoveries by other methods.

Application of RiboCode for annotating the context-specific translatomes of yeast

We used the ribosome profiling data in yeast under three conditions: normal, heat shock, and oxidative stress (25), to showcase the application of RiboCode for de novo assembly of the context-specific translatomes.
Figure 7A and B summarize the yeast translatomes under the three conditions (details of the annotated ORFs are provided in Supplementary Table S6). In general, RiboCode identified more uORFs and dORFs being translated in the heat shock and oxidative stress conditions, compared to the normal condition. Next, we compared the RPF read counts of the different ORF types in the translatomes between the stress and normal conditions. The raw and normalized RPF read counts of all the ORFs annotated by RiboCode are provided in Supplementary Table S7. While the previously annotated protein coding genes have similar overall distributions of the RPF read counts, the uORFs and dORFs showed markedly higher RPF read counts under the stress conditions (Figure 7C, Supplementary Figure S13). This is well in line with previous reports about translation of uORFs in multiple organisms in response to various stress signals (7,38–41).

Figure 7. Application of RiboCode for assembly of the yeast translatomes under normal and stress conditions. (A, B) The composition of the translatomes assembled by RiboCode, with the ribosome profiling data of yeast, under normal condition, oxidative stress and heat shock. (C) Distributions of the normalized RPF read counts (log2) of the uORFs, dORFs and annotated CDS, under the three conditions: normal, oxidative stress and heat shock. Mann–Whitney U tests were performed to assess the statistical significance of the difference between the distributions of uORF or dORF under heat shock versus normal or oxidative stress versus normal condition. The P-values are provided in the figure.

Next, we looked into the RPF read counts of the uORFs and the dORFs, together with their downstream or upstream main protein-coding ORFs. In Figure 8A–D, the vertical bars, representing each of these uORFs (Figure 8A and B) or dORFs (Figure 8C and D), were positioned according to the fold-change of the downstream or upstream main protein-coding ORFs, on the background of all the annotated protein-coding genes (Figure 8E and F). These bars were then color-coded based on the fold change of the uORF (Figure 8A and B) or dORF (Figure 8C and D) under the heat shock (Figure 8A and C) or oxidative stress (Figure 8B and D) condition, compared to the normal condition. It appears that the translational up-regulation of some uORFs or dORFs was associated with stress-induced translational repression of the annotated main coding ORF of the same transcripts (the red vertical bars to the left side of the spectra in Figure 8A–D, and some examples shown in Figure 8G–J). Indeed, many previous studies have reported that activation of some uORFs results in translational inhibition of the downstream main protein-coding ORF (7,9,35,41). On the other hand, many of the uORFs or dORFs were positively associated with the translation of the main coding ORF (the red vertical bars to the right side and the blue bars to the left side of the spectra in Figure 8A–D). This could be attributed to the general translational or transcriptional regulation of the mRNA transcripts that harbor the main protein-coding ORF and the uORF or the dORF.
More data and further analysis would be needed to fully elucidate the potential involvement of the uORFs and dORFs in regulating the translation of the main protein coding ORFs.

Figure 8. Associations of the uORFs and dORFs with the main protein coding ORF. (A–F) All the annotated canonical protein coding ORFs were sorted based on the fold change of their normalized RPF read counts upon heat shock (E) or oxidative stress (F) versus the normal condition. On this background, the ORFs with upstream uORFs (A, B) or downstream dORFs (C, D) in the same mRNA transcripts were marked as vertical bars. The color of these bars represents the fold change of the RPF read counts on the uORF (A, B) or dORF (C, D). (G–J) Four examples of the uORF (G, I) or dORF (H, J) that appear negatively associated with the main ORF in response to oxidative stress (G, H) or heat shock (I, J). The colored bar on each nucleotide position represents the count of RPF reads allocated according to its P-site position. Three different colors represent the three frames. The total RPF read counts of the main protein coding ORFs and the uORFs or dORFs are given in the figures.

DISCUSSION

The collection of ribosome profiling data has been quickly expanding, thus shaping the landscapes of translation in various systems in increasing detail. There is a clear need for de novo annotations of the species- and cellular context-dependent translatomes, which have largely lagged behind the genome and transcriptome annotations (42). Recently, multiple bioinformatics methods for this purpose have been developed, and they have been reviewed in (24). The de novo methods, including ORFscore (11), RiboTaper (18), RP-BP (22), and our method RiboCode, were all designed to assess active translation mainly based on the 3-nt periodicity. This feature was also the core of the other machine-learning based methods (12,21). This is because, under the current experimental settings of ribosome profiling, 3-nt periodicity is the strongest and most efficient feature for calling translation from the ribosome-protected RNA fragments. However, in practice, even if the RPF purification and library preparation procedures were properly performed, ribosome profiling data have features, including but not limited to the following, that complicate the data-mining procedure: (i) discrete and sparse RPF reads along the ORF; (ii) uneven distributions of the RPF reads; (iii) contamination from untranslated RNA; (iv) errors of P-site allocation; (v) read duplicates; (vi) limited coverage that varies across different datasets; (vii) highly variable lengths of the candidate ORFs; (viii) various noise levels among the ORFs in the same dataset.
Therefore, it is critical to have an approach that can robustly and precisely assess the 3-nt periodicity due to active translation from such data that are far from ideal. The existing methods used completely different strategies and statistical models for evaluating the spectrum of ribosome profiling data. RiboTaper used the multitaper strategy, a method previously developed for evaluating harmonic spectra (18). RP-BP is an unsupervised Bayesian approach that models the periodicity and evaluates the ORF by comparing with a uniform model (22). ORFscore counts the total in- and out-of-frame reads (11). It ignores the periodicity spectrum, and is not a statistically rigorous method. The other two methods (RibORF and ORF-RATER) (12,21) rely on machine-learning of pre-defined ORFs and thereby are not strictly de novo methods. RiboCode was designed for de novo annotation of the translatome, and is based on a modified Wilcoxon signed-rank test that assesses whether the in-frame reads are consistently higher than the off-frame reads along the whole ORF. RiboCode takes advantage of the Wilcoxon signed-rank test for the following reasons. First, it is insensitive to the potentially strong artificial in- or off-frame RPF signals at small fractions of the codons in the whole ORF. Such artifacts are not rare in ribosome profiling data, due to the occasional RPF read duplicates potentially resulting from PCR amplification bias. The Wilcoxon signed-rank test evaluates the whole spectrum and tolerates some outliers. Second, the Wilcoxon signed-rank test is not distracted by the codons with no RPF read, i.e. no evidence for either active translation or the opposite. Third, this test is insensitive to the background noise due to contamination of the untranslated RNA or errors of P-site allocation. Last but not least, this statistical test is computationally cost-effective, thereby rendering high computation efficiency of RiboCode.
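
The rationale above can be illustrated with a minimal sketch. It is not RiboCode's implementation: the text describes only "a modified Wilcoxon signed-rank test", so the sketch assumes a plain one-sided signed-rank test with a normal approximation (no continuity correction), applied per codon to frame 0 versus frame 1 and to frame 0 versus frame 2, and it reports the larger of the two P-values as a conservative summary. The function names (`frame_counts`, `wilcoxon_greater`, `periodicity_pvalue`) and the toy read counts are hypothetical.

```python
import math


def frame_counts(psite_counts):
    """Split per-nucleotide P-site counts of one ORF into per-codon frame vectors."""
    n_codons = len(psite_counts) // 3
    f0 = [psite_counts[3 * i] for i in range(n_codons)]
    f1 = [psite_counts[3 * i + 1] for i in range(n_codons)]
    f2 = [psite_counts[3 * i + 2] for i in range(n_codons)]
    return f0, f1, f2


def wilcoxon_greater(x, y):
    """One-sided Wilcoxon signed-rank test (alternative: x tends to exceed y).

    Zero differences are dropped, ties get average ranks, and the P-value
    comes from the normal approximation with a tie-corrected variance.
    Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no informative codon at all
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return w_plus, p_value


def periodicity_pvalue(psite_counts):
    """Conservative summary: the larger P-value of frame 0 vs 1 and frame 0 vs 2."""
    f0, f1, f2 = frame_counts(psite_counts)
    _, p01 = wilcoxon_greater(f0, f1)
    _, p02 = wilcoxon_greater(f0, f2)
    return max(p01, p02)


def flatten(codons):
    """Turn a list of (frame0, frame1, frame2) tuples into a flat count vector."""
    return [count for codon in codons for count in codon]


# Toy ORFs (hypothetical counts, one tuple per codon).
periodic = flatten([(9, 1, 0), (7, 0, 1), (0, 0, 0), (12, 2, 1), (5, 1, 1), (8, 0, 0),
                    (0, 0, 0), (6, 1, 2), (10, 3, 1), (4, 0, 1), (11, 1, 0), (7, 2, 2)])
flat = flatten([(2, 3, 2), (1, 1, 4), (3, 2, 1), (0, 0, 0),
                (2, 4, 3), (1, 2, 1), (3, 1, 2), (2, 2, 5)])
spike = [500, 0, 0] + [0] * 27  # one codon with a duplicate-like pile-up

for name, counts in [("periodic", periodic), ("flat", flat), ("spike", spike)]:
    print(name, periodicity_pvalue(counts))
```

In this toy data, the ORF with reads consistently enriched in frame 0 receives a small P-value, whereas the ORF with reads spread evenly over the three frames does not. The single codon carrying 500 in-frame reads does not reach significance on its own, which mirrors the tolerance to duplicate-driven outliers described above, and codons with no reads simply drop out as zero differences.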
Designed for the comprehensive de novo annotation of the translatome with ribosome profiling data, RiboCode presents remarkable advantages. It has higher efficiency and accuracy in calling the actively translated ORFs. Its capability of recovering recoding events such as overlapping ORFs and its consistent performance, which is largely independent of the length, read count and coverage, assure the comprehensiveness of the translatome annotation. In addition, RiboCode's relatively consistent performance with different noise levels and sequencing depths is another valuable feature for processing the various published ribosome profiling data. Last but not least, RiboCode requires very little computational resource, thereby enabling routine large-scale annotations of the context-dependent translatomes with ribosome profiling datasets. The RiboCode package also provides other handy supporting functions, including automatic selection of the reliable read lengths and the P-site locations, counting of the reads of each ORF, and convenient plot functions. We highly recommend RiboCode to the community for the processing of published and future ribosome profiling data to obtain a more comprehensive understanding of the context-specific translatomes.

DATA AVAILABILITY

The RiboCode package is available at https://pypi.python.org/pypi/RiboCode, https://anaconda.org/bioconda/ribocode or https://github.com/xryanglab/RiboCode. A detailed step-by-step instruction of the data pre-processing and usage of RiboCode is also provided. The method requires a genome FASTA file, a GTF file for transcriptome annotation, and the alignment result file of the ribosome profiling data. All our scripts used for running RiboCode and the other existing algorithms have been provided in Supplementary File 1 and are also available at https://github.com/xryanglab/ORFcalling.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors would like to thank Dr.
Xian Cao for proof-reading and thoughtful comments. The authors wish to acknowledge the support from the Gene Sequencing, Protein Chemistry, and Computing core facilities at the National Protein Science Facility (Beijing) and the Center for Biomedical Analysis of Tsinghua University. Z.X. and X.Y. conceived and designed the study. Z.X. developed the algorithm and performed the analyses with help from R.H. and X.X. Z.X., Y.C. and H.D. performed the mass spectrometry data analysis. X.Y. supervised the whole project. Z.X., R.H. and X.Y. wrote the manuscript. All authors have read and approved the final manuscript.

FUNDING

National Key Research and Development Program, Precision Medicine Project [2016YFC0906001 to X.Y.]; National Natural Science Foundation of China [91540109, 81472855 to X.Y.]; Tsinghua University Initiative Scientific Research Program [20131089278 to X.Y.]; Tsinghua–Peking Joint Center for Life Sciences; 1000 talent program (Youth Category). Funding for open access charge: National Natural Science Foundation of China [91540109, 81472855 to X.Y.].

Conflict of interest statement. None declared.

REFERENCES

1. Ingolia N.T., Ghaemmaghami S., Newman J.R., Weissman J.S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 2009; 324:218–223.
2. Thoreen C.C., Chantranupong L., Keys H.R., Wang T., Gray N.S., Sabatini D.M. A unifying model for mTORC1-mediated regulation of mRNA translation. Nature. 2012; 485:109–113.
3. Hsieh A.C., Liu Y., Edlind M.P., Ingolia N.T., Janes M.R., Sher A., Shi E.Y., Stumpf C.R., Christensen C., Bonham M.J. et al. The translational landscape of mTOR signalling steers cancer initiation and metastasis. Nature. 2012; 485:55–61.
4. Su X., Yu Y., Zhong Y., Giannopoulou E.G., Hu X., Liu H., Cross J.R., Ratsch G., Rice C.M., Ivashkiv L.B. Interferon-gamma regulates cellular metabolism and mRNA translation to potentiate macrophage activation. Nat. Immunol. 2015; 16:838–849.
5. Shalgi R., Hurt J.A., Krykbaeva I., Taipale M., Lindquist S., Burge C.B. Widespread regulation of translation by elongation pausing in heat shock. Mol. Cell. 2013; 49:439–452.
6. Liu B., Han Y., Qian S.B. Cotranslational response to proteotoxic stress by elongation pausing of ribosomes. Mol. Cell. 2013; 49:453–463.
7. Gerashchenko M.V., Lobanov A.V., Gladyshev V.N. Genome-wide ribosome profiling reveals complex translational regulation in response to oxidative stress. Proc. Natl. Acad. Sci. U.S.A. 2012; 109:17394–17399.
8. Dunn J.G., Foo C.K., Belletier N.G., Gavis E.R., Weissman J.S. Ribosome profiling reveals pervasive and regulated stop codon readthrough in Drosophila melanogaster. eLife. 2013; 2:e01179.
9. Ingolia N.T., Lareau L.F., Weissman J.S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011; 147:789–802.
10. Fritsch C., Herrmann A., Nothnagel M., Szafranski K., Huse K., Schumann F., Schreiber S., Platzer M., Krawczak M., Hampe J. et al. Genome-wide search for novel human uORFs and N-terminal protein extensions using ribosomal footprinting. Genome Res. 2012; 22:2208–2218.
11. Bazzini A.A., Johnstone T.G., Christiano R., Mackowiak S.D., Obermayer B., Fleming E.S., Vejnar C.E., Lee M.T., Rajewsky N., Walther T.C. et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 2014; 33:981–993.
12. Ji Z., Song R., Regev A., Struhl K. Many lncRNAs, 5′UTRs, and pseudogenes are translated and some are likely to express functional proteins. eLife. 2015; 4:e08890.
13. Menschaert G., Van Criekinge W., Notelaers T., Koch A., Crappe J., Gevaert K., Van Damme P. Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events. Mol. Cell. Proteomics: MCP. 2013; 12:1780–1790.
14. Hao Y., Zhang L., Niu Y., Cai T., Luo J., He S., Zhang B., Zhang D., Qin Y., Yang F. et al. SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci. Brief. Bioinform. 2017; bbx005.
15. Olexiouk V., Crappe J., Verbruggen S., Verhegen K., Martens L., Menschaert G. sORFs.org: a repository of small ORFs identified by ribosome profiling. Nucleic Acids Res. 2016; 44:D324–D329.
16. Banfai B., Jia H., Khatun J., Wood E., Risk B., Gundling W.E. Jr, Kundaje A., Gunawardena H.P., Yu Y., Xie L. et al. Long noncoding RNAs are rarely translated in two human cell lines. Genome Res. 2012; 22:1646–1657.
17. Guttman M., Russell P., Ingolia N.T., Weissman J.S., Lander E.S. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell. 2013; 154:240–251.
18. Calviello L., Mukherjee N., Wyler E., Zauber H., Hirsekorn A., Selbach M., Landthaler M., Obermayer B., Ohler U. Detecting actively translated open reading frames in ribosome profiling data. Nat. Methods. 2016; 13:165–170.
19. Duncan C.D., Mata J. The translational landscape of fission-yeast meiosis and sporulation. Nat. Struct. Mol. Biol. 2014; 21:641–647.
20. Michel A.M., Choudhury K.R., Firth A.E., Ingolia N.T., Atkins J.F., Baranov P.V. Observation of dually decoded regions of the human genome using ribosome profiling data. Genome Res. 2012; 22:2219–2229.
21. Fields A.P., Rodriguez E.H., Jovanovic M., Stern-Ginossar N., Haas B.J., Mertins P., Raychowdhury R., Hacohen N., Carr S.A., Ingolia N.T. et al. A regression-based analysis of ribosome-profiling data reveals a conserved complexity to mammalian translation. Mol. Cell. 2015; 60:816–827.
22. Malone B., Atanassov I., Aeschimann F., Li X., Grosshans H., Dieterich C. Bayesian prediction of RNA translation from ribosome profiling. Nucleic Acids Res. 2017; 45:2960–2972.
23. Chun S.Y., Rodriguez C.M., Todd P.K., Mills R.E. SPECtre: a spectral coherence–based classifier of actively translated transcripts from ribosome profiling sequence data. BMC Bioinformatics. 2016; 17:482.
24. Calviello L., Ohler U. Beyond read-counts: Ribo-seq data analysis to understand the functions of the transcriptome. Trends Genet. 2017; 33:728–744.
25. Gerashchenko M.V., Gladyshev V.N. Translation inhibitors cause abnormalities in ribosome profiling experiments. Nucleic Acids Res. 2014; 42:e134.
26. Gao X., Wan J., Liu B., Ma M., Shen B., Qian S.B. Quantitative profiling of initiating ribosomes in vivo. Nat. Methods. 2015; 12:147–153.
27. Xiao Z., Zou Q., Liu Y., Yang X. Genome-wide assessment of differential translations with ribosome profiling data. Nat. Commun. 2016; 7:11194.
28. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. journal. 2011; 17:10–12.
29. Nagalakshmi U., Wang Z., Waern K., Shou C., Raha D., Gerstein M., Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008; 320:1344–1349.
30. Hanley J.A., McNeil B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982; 143:29–36.
31. Wang X., Tang H., Chen Y., Chi B., Wang S., Lv Y., Wu D., Ge R., Deng H. Overexpression of SIRT3 disrupts mitochondrial proteostasis and cell cycle progression. Protein Cell. 2016; 7:295–299.
32. Anders S., Pyl P.T., Huber W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015; 31:166–169.
33. Ingolia N.T., Brar G.A., Stern-Ginossar N., Harris M.S., Talhouarne G.J., Jackson S.E., Wills M.R., Weissman J.S. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Rep. 2014; 8:1365–1379.
34. Anders S., Reyes A., Huber W. Detecting differential usage of exons from RNA-seq data. Genome Res. 2012; 22:2008–2017.
35. Lee S., Liu B., Lee S., Huang S.X., Shen B., Qian S.B. Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc. Natl. Acad. Sci. U.S.A. 2012; 109:E2424–E2432.
36. Kozak M. Context effects and inefficient initiation at non-AUG codons in eucaryotic cell-free translation systems. Mol. Cell. Biol. 1989; 9:5073–5080.
37. Reuter K., Biehl A., Koch L., Helms V. PreTIS: a tool to predict non-canonical 5′ UTR translational initiation sites in human and mouse. PLoS Comput. Biol. 2016; 12:e1005170.
38. Laing W.A., Martinez-Sanchez M., Wright M.A., Bulley S.M., Brewster D., Dare A.P., Rassam M., Wang D., Storey R., Macknight R.C. et al. An upstream open reading frame is essential for feedback regulation of ascorbate biosynthesis in Arabidopsis. Plant Cell. 2015; 27:772–786.
39. Starck S.R., Tsai J.C., Chen K., Shodiya M., Wang L., Yahiro K., Martins-Green M., Shastri N., Walter P. Translation from the 5′ untranslated region shapes the integrated stress response. Science. 2016; 351:aad3867.
40. Young S.K., Wek R.C. Upstream open reading frames differentially regulate gene-specific translation in the integrated stress response. J. Biol. Chem. 2016; 291:16927–16935.
41. Brar G.A., Yassour M., Friedman N., Regev A., Ingolia N.T., Weissman J.S. High-resolution view of the yeast meiotic program revealed by ribosome profiling. Science. 2012; 335:552–557.
42. Baranov P.V., Michel A.M. Illuminating translation with ribosome profiling spectra. Nat. Methods. 2016; 13:123–124.

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
