RGBM: regularized gradient boosting machines for identification of the transcriptional regulators of discrete glioma subtypes
Mall, Raghvendra; Cerulo, Luigi; Garofano, Luciano; Frattini, Veronique; Kunji, Khalid; Bensmail, Halima; Sabedot, Thais S; Noushmehr, Houtan; Lasorella, Anna; Iavarone, Antonio; Ceccarelli, Michele
doi: 10.1093/nar/gky015
pmid: 29361062
Abstract
We propose a generic framework for gene regulatory network (GRN) inference approached as a feature selection problem. GRNs obtained using machine learning (ML) techniques are often dense, whereas real GRNs are rather sparse. We use a Tikhonov regularization inspired optimal L-curve criterion that utilizes the edge weight distribution for a given target gene to determine the optimal set of transcription factors (TFs) associated with it. Our proposed framework allows the incorporation of a mechanistic active binding network based on cis-regulatory motif analysis. We evaluate our regularization framework in conjunction with two non-linear ML techniques, namely gradient boosting machines (GBM) and random forests (GENIE), resulting in regularized feature selection based methods called RGBM and RGENIE, respectively. RGBM has been used to identify the main transcription factors that are causally involved as master regulators of the gene expression signature activated in the FGFR3-TACC3-positive glioblastoma. Here, we illustrate that RGBM identifies the main regulators of the molecular subtypes of brain tumors. Our analysis reveals the identity and corresponding biological activities of the master regulators characterizing the difference between G-CIMP-high and G-CIMP-low subtypes and between PA-like and LGm6-GBM, thus providing a clue to the yet undetermined nature of the transcriptional events among these subtypes.
INTRODUCTION
Changes in environmental and external stimuli lead to variations in gene expression during the proper functioning of living systems. A vital role is played by the transcription factors (TFs), which are proteins that bind to the DNA in the regulatory regions of specific target genes. These TFs can then repress or induce the expression of target genes.
Many such transcriptional regulations have been discovered through traditional molecular biology experiments and several of these high-quality mechanistic regulatory interactions have been well documented in TF-target gene databases (1–3). With the availability of high-throughput experimental techniques for efficiently measuring gene expression, such as DNA micro-arrays and RNA-Seq, our aim now is to design computational models for reverse engineering of gene regulatory networks (GRN) (4) from such data at genomic scale. The accurate reconstruction of GRNs from diverse expression information sources is one of the most important problems in biomedical research (5), primarily because GRNs can reveal mechanistic hypotheses about differences between phenotypes and sources of diseases (1), which ultimately helps in the identification of therapeutic targets. The problem of inferring GRNs is also one of the most actively pursued problems in computational biology (6) and has resulted in several DREAM challenges. This problem is complicated by the noisy and high-dimensional nature of the data (7), which obscures the regulatory network with indirect connections. Another common challenge is to identify and model the non-linear interactions between TFs and target genes in the presence of relatively few samples compared to the total number of target genes (i.e. n ≪ p, typical in high dimensional statistics (8)). The majority of methods model the expression of an individual target gene as either a linear or non-linear function of the expression levels of TFs (9–12). They then combine the sub-networks obtained for each target gene to construct the final inferred GRN, which often results in dense networks, whereas in reality there are only a few transcriptional regulations between TFs and target genes (6). There is a plethora of research associated with the problem of inferring GRNs from expression data (10–18).
Here, we briefly describe three state-of-the-art methods, ARACNE (19,20), GENIE (21) and ENNET (22), which have been extensively utilized with real data (23,24). ARACNE uses an information theoretic measure, mutual information (25), between the expressions of two genes to generate the corresponding edge weights in the inferred GRN. However, the mutual information values are rarely zero and are plagued by indirect relationships, resulting in many false positives. ARACNE uses a statistical procedure, namely bootstrapping (26), to obtain a minimum threshold for edge weights corresponding to each TF and prunes all connections whose weight is less than the threshold. GENIE (21), ENNET (22) and SCENIC (27) belong to the category of machine-learning (ML) methods based on feature selection, where the expression vector of each target gene is considered as the dependent variable and the columns of the expression matrix corresponding to the list of TFs are the independent variables. GENIE (21), whose novelty is the application of random forests (RF) (8), is an ML technique that exploits an ensemble of several decision trees to solve the regression task. The advantage of RF is that it can capture non-linear relations between the list of TFs and a given target gene and overcomes the small n, large p problem. Recently, a more accurate non-linear ML technique, the gradient boosting machine (GBM) (28), was employed by ENNET (22) and SCENIC (27) for inferring GRNs. ENNET also solves the regression task using decision trees. However, it builds the model additively using a boosting procedure where, during each iteration, it adds a new decision tree to the model. Each tree is learned by optimizing the least squares loss function between the expression of the dependent variable and the estimated expression vector obtained from the model. GENIE participated in the DREAM4 and DREAM5 challenges and ENNET was proposed afterwards.
Moreover, iRafNet (24) is also an RF based ML technique, which was proposed after the DREAM challenges took place. All these methods achieved markedly better AUpr and AUroc values than their competitors. A major drawback of these ML methods (21–22,24,29) is that, due to the lack of regularization, a large number of TFs have connections with an individual target gene. Despite the fact that ML-based methods tend to have better performance on simulated data, their success in real applications to uncover important regulators of biological states has not been as wide as that of the co-expression approaches based on mutual information (30) or correlation (31). This is probably due to the difficulty of designing suitable significance thresholds that can be used to select candidate regulatory connections for the purposes of network interrogation through Master Regulator Analysis (32) or Master Regulator Activity (33). Moreover, most current ML-based approaches do not explicitly model upstream regulators, i.e. network nodes with no incoming connections. Here, we propose a generic framework for GRN inference using decision tree based ML techniques, like GBM and RF, as core models. The reverse-engineering procedure infers an initial set of transcriptional regulations from expression data using either boosting of regression stumps (GBM) or an ensemble of regression stumps (RF). To choose suitable thresholds for selecting the edges in the output network, we apply a notion used for identifying the corner of the L-curve criterion (34) in Tikhonov regularization (35) to the edge weight distributions (RVI scores). This enables us to select candidate true positive regulations without the need to empirically compute the null distribution of the edge weights by bootstrapping, as is done, for example, in ARACNE (19).
We then re-iterate once through the core GBM or RF model using this optimal list of TFs for each target gene to obtain regularized transcriptional regulations. This pruning step helps to reduce the falsely identified edges while sparsifying the GRN at the same time. We also propose a novel heuristic procedure to identify upstream regulators. The proposed framework allows the user to specify an a priori mechanistic active binding network (ABN) of TFs and target genes based on cis-regulatory analysis of active binding sites on the promoters of target genes. This makes it possible to filter out indirect targets and false positives that are due to co-expression alone. In the presence of an ABN, the reconstructed GRN is a subgraph of it, whereas in the absence of such an ABN, the inferred GRN is reconstructed from the expression data alone. The resulting GRN is sparse, directed and weighted. The proposed techniques based on our generic framework are hereafter referred to as Regularized Gradient Boosting Machine (RGBM) and Regularized GENIE (RGENIE) for the GBM and RF core models, respectively. We evaluated RGBM and RGENIE on the DREAM3, DREAM4 and DREAM5 Challenge datasets and on simulated RNA-Seq datasets. Both RGBM and RGENIE achieve higher AUpr and AUroc values than ENNET, GENIE and the winners of these competitions, by up to 10–15%. RGBM outperforms RGENIE on these datasets, which is expected as the performance of ENNET surpasses that of GENIE. RGBM has been used to identify the main regulators of the gene expression signature activated in the FGFR3-TACC3 fusion-positive glioblastoma (36). Here, we evaluate the accuracy of RGBM in identifying true targets of one of these regulators by validating in vitro the top targets in its regulon.
Moreover, we go further and perform a case study by constructing the GRN for glioma tumors using gene expression profiles collected through The Cancer Genome Atlas (TCGA) along with an a priori mechanistic ABN (1) of TFs and their corresponding targets, with the goal of identifying the main regulators of the molecular subtypes of glioma using RGBM. Our analysis reveals the identity and corresponding biological activities of the master regulators driving the transformation of G-CIMP-high into the G-CIMP-low subtypes of IDH-mutant glioma and the main differences between PA-like and LGm6-GBM in the IDH-wild-type glioma. This result is a first step towards understanding the yet undetermined nature of the transcriptional events driving the evolution among these novel glioma subtypes.
MATERIALS AND METHODS
A schematic representation of the RGBM approach is illustrated in Figure 1. We first utilize a mechanistic active binding network (ABN) between TFs and their potential targets if such a network is available or can be constructed. The ABN is then fed as prior information to the proposed ML framework. In the absence of an ABN, the GRN is inferred only from the available expression data. A detailed description of the heterogeneous expression datasets that can be handled by our proposed framework is given in Supplementary Section 1.
Figure 1. Schematic representation of the RGBM approach. (A) First build the active binding network (ABN) and use it as a priori mechanistic network of connections between TFs and target genes, if possible. (B) Illustration of the primary procedure utilized by RGBM. Step1 uses RVI score distribution from a GBM model to rank TFs based on their ability to fit a given target gene. Step2 proposes a regularization step to identify the corner of the discrete L-shaped RVI curve. This results in the optimal set of TFs for a target gene. The proposed regularization step helps to reduce the falsely identified edges associated with a given target gene.
It also identifies upstream regulators by using a simple heuristic cut-off on MRVI scores. Step3 is to re-iterate once through the boosting procedure with the optimal set of TFs for each target gene. Step4 is to infer the regulatory sub-graph for each target gene. (C) The final inferred GRN is obtained by combining the regulatory sub-graphs of all target genes and is much sparser than that obtained via ENNET which uses unregularized GBM model to reverse engineer GRN.
The scores obtained from the GBM model are used to rank TFs based on their capability to potentially regulate a target gene.
We adopt the relative variable importance (RVI) score, which takes values in [0, 1]: a value of RVI(ϕ) = 1 indicates that the TF ϕ was the only feature, among the list of all TFs, required to explain the expression of the target gene, whereas a value of 0 indicates that the TF was not regulating the expression of the given target gene. These RVI scores serve as the edge weights between the list of TFs and the given target gene. We utilized a modified version of the triangle method (37) to locate the corner of the discrete L-shaped RVI curve, as shown in Figure 1B. All the TFs to the left of this position form the optimal set of TFs for that target gene. We also identify the upstream regulators (genes which are not controlled by any regulator and have 0 in-degree in the inferred GRN) using a simple heuristic on the maximum RVI (MRVI) score of all genes. Finally, we re-iterate once through the boosting procedure with the optimal set of TFs for each target gene to assemble the final network. We describe the details of each step of Figure 1 in the following subsections.
Building the ABN
To learn potential regulatory activities between TFs and target genes in the glioma subtypes network, we merge constitutive associations due to active binding sites (ABN) and functional associations due to contextual transcriptional activity (Figure 1A). This makes it possible to filter out false positives and indirect associations that are due to co-expression alone. The active binding network (ABN) is reconstructed from the collection of TF binding sites that are also active, i.e. that fall into unmethylated regions. Binding sites are predicted with the FIMO (Find Individual Motif Occurrences) tool using 2532 unique motif PWMs (Position Weight Matrices) obtained from Jaspar (38) corresponding to 1203 unique TFs (38–41).
The active promoter regions are classified with ChromHMM (v1.10), a Hidden Markov Model that classifies each genome position into 18 different chromatin states (nine states are considered open/active sites: TssA, TssFlnkm, TssFlnkU, TssFlnkD, Tx, EnhG1, EnhG2, EnhA1m, EnhA1) from 98 human epigenomes (42). A binding relationship is considered active if the TF motif signal is significantly (FDR < 0.05) over-represented in the target promoter region (±5 kb TSS, hg19) and, in the same position (at least 1 bp overlapping), the chromatin state is classified as open/active. The ABN consists of 5,850,559 overlapping active sites corresponding to 1,874,570 unique TF associations between 457 TFs and 12,985 target genes.
From the inference problem to a variable selection task
The input of RGBM is a gene expression matrix E and, optionally, the adjacency matrix of the ABN. In the absence of the ABN, we assume that every TF can potentially regulate each target gene in the expression matrix. An element eij, i = 1…N and j = 1…p, of the expression matrix $$E \in \mathbb {R}^{N\times p}$$ represents the expression value of the jth gene in the ith sample. Let Cj be the list of potential TFs for target gene j. For each target gene j ∈ {1, …, p}, we sub-divide the problem of inferring the GRN into p independent tasks. For the jth sub-problem, we get the sub-network corresponding to the outgoing edges from the appropriate TFs to the target gene j. To generate this sub-graph, we first create the dependent vector Yj = E[, j] and a feature expression matrix, i.e. the matrix of independent variables, Xj = E[, Cj] from the expression matrix E (Supplementary Figure S2). Each of the p sub-problems can mathematically be formulated as: \begin{equation*} Y_{j} = h_{j}(X_{j};\gamma _{j}) + \epsilon _{j}, \quad \forall j \in \lbrace 1,\ldots ,p\rbrace \end{equation*} (1)
Here, ϵj represents random noise and hj(Xj; γj) is the parametric function that maps the TF expression Xj to the target Yj while optimizing the parameters γj. Our goal is to identify a small number of TFs which drive the expression of the jth gene using the columns of Xj as input features. Essentially, we have to solve a regression problem while inducing sparsity in the feature space, resulting in a subset of the list of TFs which drive the expression of the jth target gene. This problem, referred to as feature or variable selection, is usually solved with a linear regression from the feature space to the target space (43–46). Such sparsity-inducing linear methods have been utilized for GRN inference (15,47–48). These methods can only capture linear relationships and fail to detect non-linear interactions between the TFs and targets, thereby missing several true positive edges. In our generic framework, we adopt two tree-based ML methods, namely RF (8,49) and GBM (9), as they solve the aforementioned problem using a non-linear mapping. Additionally, they provide a scheme to generate a relative variable importance (RVI) score for each TF, which makes it possible to rank the TFs based on their contributions. The RVI scores are further used as edge weights for the sub-network obtained from Cj and the jth target gene. The RVI score measures how useful each TF is for fitting the expression of the jth target gene given the contribution of all the other TFs for that target.
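The decomposition into p sub-problems described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the RGBM implementation: the function name `build_subproblem`, the representation of E as a list of rows, and the representation of the ABN as a dictionary from target index to allowed TF indices are all assumptions made for the example. Excluding a gene from its own candidate list is also an assumption, borrowed from common practice in tree-based GRN inference.

```python
def build_subproblem(E, j, tf_cols, abn=None):
    """Return (Y_j, X_j, C_j) for target gene j.

    E       : expression matrix as a list of N rows, each with p values.
    tf_cols : column indices of the genes that are TFs.
    abn     : optional dict {target index: set of TF indices} standing in
              for the ABN adjacency matrix (hypothetical representation).
    """
    # Candidate regulators C_j: every TF except the target itself (assumption).
    C_j = [c for c in tf_cols if c != j]
    if abn is not None:
        # With an ABN, keep only TFs that have an active binding site on target j.
        allowed = abn.get(j, set())
        C_j = [c for c in C_j if c in allowed]
    Y_j = [row[j] for row in E]                   # dependent vector E[, j]
    X_j = [[row[c] for c in C_j] for row in E]    # feature matrix E[, C_j]
    return Y_j, X_j, C_j


# Toy example: 3 samples, 4 genes, genes 0-2 are TFs, gene 1 is the target.
E = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]]
Y, X, C = build_subproblem(E, j=1, tf_cols=[0, 1, 2])
Y_abn, X_abn, C_abn = build_subproblem(E, j=1, tf_cols=[0, 1, 2], abn={1: {2}})
```

Without an ABN, every other TF is a candidate for the target; with an ABN, the candidate list shrinks to the TFs it permits, so the inferred sub-network is a subgraph of the ABN by construction.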
The RVI score for a TF ϕ from the core Gradient Boosting Model is computed as (28): \begin{eqnarray*} i^{t}_{j}(R^{t}_{l},R^{t}_{r}) = \frac{w_{l}^{t}w_{r}^{t}}{w_{l}^{t}+w_{r}^{t}} (\gamma _{jl}^{t}-\gamma ^{t}_{jr})^{2} \\ \text{VI}_{j}(\phi ) = \sum _{t=1}^{T} \delta _{j}^{t}(\phi ) \cdot i^{t}_{j}(R^{t}_{l},R^{t}_{r}) \\ \text{RVI}^{\text{GBM}}_{j}(\phi ) = \frac{\text{VI}_{j}(\phi )}{\sum _{\Phi \in C_{j}} \text{VI}_{j}(\Phi )} \end{eqnarray*}Here, $$\delta _{j}^{t}(\phi )=1$$, if TF ϕ results in the optimal split for the tth regression tree and the function $$\delta _{j}^{t}(\cdot ) = 0$$ for all the other TFs at iteration t, $$w_{l}^{t}$$ and $$w^{t}_{r}$$ are the number of observations in the left ($$R^{t}_{l}$$) and right ($$R^{t}_{r}$$) branches of the tree and the coefficients $$\gamma ^{t}_{jk}$$, k ∈ {l, r}, are the parameters of the decision tree as indicated in Equation 1 for the jth target gene. In case of the least-squares (LS) loss, $$\gamma ^{t}_{jl}$$ and $$\gamma ^{t}_{jr}$$ are the averages of all the pseudo-residuals (details in Supplementary Section 2) (22,28) falling in regions $$R^{t}_{l}$$ and $$R^{t}_{r}$$ respectively. Similarly, for least-absolute deviation (LAD) loss, $$\gamma ^{t}_{jl}$$ and $$\gamma ^{t}_{jr}$$ are the median of all the pseudo-residuals (28) in the disjoint regions $$R^{t}_{l}$$ and $$R^{t}_{r}$$ respectively. The TF which results in the optimal split is the one which maximizes the least squares improvement criterion (22,28), i( ·, ·), for regression tree t. For each tree t, we select the TF which can best divide the remaining expressions (pseudo-residuals) of the target gene into two distinct regions. 
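To make the GBM importance computation concrete, the following is a minimal sketch assuming the LS loss, regression stumps (one split per tree) and no subsampling of samples or TFs. The names `best_stump`, `gbm_rvi`, `n_trees` and `shrinkage` are illustrative and are not taken from ENNET or RGBM; the Supplementary algorithms remain the reference for the actual procedure.

```python
def best_stump(X, r):
    """Find the single split maximizing i(R_l, R_r) on pseudo-residuals r.

    Returns (improvement, feature, threshold, gamma_left, gamma_right),
    or None if no feature admits a split.
    """
    n, best = len(r), None
    total = sum(r)
    for f in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][f])
        left_sum = 0.0
        for pos in range(1, n):
            left_sum += r[order[pos - 1]]
            lo, hi = X[order[pos - 1]][f], X[order[pos]][f]
            if lo == hi:
                continue  # no threshold separates equal values
            w_l, w_r = pos, n - pos
            g_l, g_r = left_sum / w_l, (total - left_sum) / w_r  # LS: region means
            gain = w_l * w_r / (w_l + w_r) * (g_l - g_r) ** 2
            if best is None or gain > best[0]:
                best = (gain, f, (lo + hi) / 2.0, g_l, g_r)
    return best


def gbm_rvi(X, y, n_trees=50, shrinkage=0.1):
    """Boost stumps on LS pseudo-residuals and return the RVI of each feature."""
    mean_y = sum(y) / len(y)
    r = [v - mean_y for v in y]          # initial pseudo-residuals
    vi = [0.0] * len(X[0])
    for _ in range(n_trees):
        split = best_stump(X, r)
        if split is None:
            break
        gain, f, thr, g_l, g_r = split
        vi[f] += gain                    # delta = 1 only for the splitting TF
        r = [ri - shrinkage * (g_l if row[f] <= thr else g_r)
             for row, ri in zip(X, r)]
    total = sum(vi)
    return [v / total for v in vi] if total > 0 else vi


# Toy data: the target is a step function of TF 0; TF 1 is uninformative.
X = [[float(i), float((2 * i) % 5)] for i in range(10)]
y = [0.0] * 5 + [1.0] * 5
rvi = gbm_rvi(X, y)
```

In this toy case every tree splits on TF 0, so it receives the whole importance mass and TF 1 receives none. On real expression data the scores are spread over many TFs, which is what motivates the regularization step.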
Similarly, in the case of RF, the RVI score can be represented (21) as: \begin{eqnarray*} i^{t}_{j}(k) = w^{t}_{jk}\sigma ^{2}(R^{t}_{jk}) - w^{t}_{jkl}\sigma ^{2}(R^{t}_{jkl}) - w^{t}_{jkr}\sigma ^{2}(R^{t}_{jkr}) \\ \text{VI}_{j}(\phi ) = \sum _{t=1}^{T} \sum _{k=1}^{d} \delta ^{t}_{jk}(\phi ) \cdot i^{t}_{j}(k) \\ \text{RVI}^{\text{RF}}_{j}(\phi ) = \frac{\text{VI}_{j}(\phi )}{\sum _{\Phi \in C_{j}} \text{VI}_{j}(\Phi )} \end{eqnarray*}
where k represents a node in the regression tree, $$w^{t}_{jk}$$, $$w^{t}_{jkl}$$ and $$w^{t}_{jkr}$$ correspond to the number of samples in node k, the left branch of k and the right branch of k respectively, the function σ²( · ) represents the variance of all the expression values in regions $$R^{t}_{jk}$$, $$R^{t}_{jkl}$$ and $$R^{t}_{jkr}$$, and d is the total number of nodes in the tth regression tree for target gene j. The overall importance of TF ϕ is then computed by summing the i( · ) values of all tree nodes where this variable is used to split. To determine a split into disjoint regions $$R^{t}_{jkl}$$ and $$R^{t}_{jkr}$$, we select the TF (ϕ) which maximizes the function i( · ), thereby indicating that values falling within each region have small variance when compared to the variance obtained from all the expression values at that node. In GENIE3 (21), the authors set $$\delta ^{t}_{jk}(\phi )$$ to 1 for TF ϕ if it maximizes the $$i^{t}_{j}(k)$$ criterion and to 0 for all other TFs. The TFs that are not selected at all obtain a value of 0 as their importance, and those that are selected close to the root nodes of several trees typically obtain high scores. The RVI score for a TF is unit-less, as it is the contribution of that TF relative to the contribution of all other TFs for fitting the expression of a given target gene, as can be seen from the equations above. It takes values in [0, 1]. A large RVI score suggests with high confidence that the corresponding TF is regulating the expression of the given target gene.
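As a small worked illustration of the RF criterion above, the sketch below evaluates the variance-reduction term i(k) for a single node and normalizes raw importances into RVI scores. The function names and the toy numbers are hypothetical. The helper `ls_improvement` is included only as a cross-check: for a single split, the variance-reduction term coincides with the least-squares improvement used for GBM, which is a standard between-group sum-of-squares identity rather than something stated in the text.

```python
from statistics import mean, pvariance


def node_importance(parent, left, right):
    """i(k) = w_k * var(R_k) - w_kl * var(R_kl) - w_kr * var(R_kr)."""
    return (len(parent) * pvariance(parent)
            - len(left) * pvariance(left)
            - len(right) * pvariance(right))


def ls_improvement(left, right):
    """w_l * w_r / (w_l + w_r) * (mean_l - mean_r)^2, the GBM split criterion."""
    w_l, w_r = len(left), len(right)
    return w_l * w_r / (w_l + w_r) * (mean(left) - mean(right)) ** 2


def to_rvi(vi):
    """Normalize raw importances {TF name: VI} so that they sum to 1."""
    total = sum(vi.values())
    return {tf: v / total for tf, v in vi.items()} if total > 0 else dict(vi)


# Toy node: four target-expression values split perfectly into two regions.
parent = [1.0, 1.0, 5.0, 5.0]
left, right = [1.0, 1.0], [5.0, 5.0]
gain = node_importance(parent, left, right)     # 4 * 4.0 - 0 - 0
check = ls_improvement(left, right)             # 2 * 2 / 4 * (1 - 5)^2
scores = to_rvi({"TF_a": gain, "TF_b": 4.0})    # TF_b importance is made up
```

A TF that is never chosen for a split contributes no term to the sum and therefore ends up with an RVI of 0, which is why the tail of the sorted RVI curve can contain exact zeros.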
The core of the GBM model is explained in detail in Supplementary Section 2, and we refer the readers to GENIE (21) for a detailed description of the usage of RF for GRN inference. ENNET (22) utilized LS-Boost (Supplementary Algorithm S1) as the core GBM model for reverse engineering of gene regulatory networks. In our proposed framework, we provide the user with the flexibility of utilizing either LS-Boost (Algorithm S1) or LAD-Boost (Algorithm S2) as the core GBM model for RGBM. This is because it was shown in (28) that LS-Boost performs extremely well for normally distributed expression values whereas LAD-Boost performs better for slash distributed values. Our framework also works well in combination with a core RF model, resulting in a regularized version of GENIE (21), namely RGENIE. We report below the proposed regularization steps in combination with the core GBM model.
Main regularization steps
An important aspect of GRNs is sparsity (6), i.e. only a few TFs regulate a given target gene, and a few genes are not regulated at all and therefore appear as 0 in-degree nodes (6) in the inferred GRN. Thus, the procedure for reverse engineering GRNs should return sparse networks and should be able to detect such 0 in-degree upstream regulators. The adjacency matrix obtained from the core GBM model can be quite dense, as shown in Supplementary Figure S3. However, when the adjacency matrix is converted into an ordered edge-list (ranked in descending order of edge weight), several of the top ranked connections are indeed true positives, whereas many of the small-weighted edges are false positives. Hence, it is possible to reduce the number of falsely identified transcriptional regulations between the TFs and targets, as illustrated below.
The sorted RVI score curve for an individual target gene approximately follows an exponential distribution, as demonstrated empirically in Supplementary Figure S4 for GBM and in Supplementary Figure S7 for RF. In order to identify the optimal set of TFs for each target gene, RGBM uses an idea similar to that used for identifying the corner in the discrete L-curve criterion (34,50) in Tikhonov regularization (35). The problem in the Tikhonov L-curve is to identify the corner of a discrete L-curve that is monotonically decreasing. Several algorithms have been proposed for computing the corner of a discrete L-curve, taking into account the need to capture the overall behavior of the curve and to avoid local corners (34,51–52). RGBM uses a modified variant of the triangle method (37). Specifically, let $$\mathcal {P}_{l}, \mathcal {P}_{m}$$ and $$\mathcal {P}_{n}$$ be three points on the RVI curve satisfying l < m < n and let vm,l denote the vector from $$\mathcal {P}_{m}$$ to $$\mathcal {P}_{l}$$. Then, we define the oriented angle θ(l, m, n) ∈ [0, π] associated with the triplet as the angle between the two vectors vm,n and vm,l, i.e. θ(l, m, n) = ∠(vm,n, vm,l). With this definition, an angle θ(l, m, n) = π means that the three points are collinear, i.e. the curve is flat between $$\mathcal {P}_{l}$$ and $$\mathcal {P}_{n}$$, and the corresponding point $$\mathcal {P}_{l}$$ determines the optimal position (optimal number of TFs) on the RVI curve. The key idea of the triangle method is to consider the following triples of L-curve points: $$(\mathcal {P}_{l},\mathcal {P}_{m}, \mathcal {P}_{n})$$, l = 1, …, n − 2, m = l + 1, …, n − 1, where n corresponds to the TF with the smallest non-zero RVI score (RVIj( · )) for the jth target gene. By using this idea, we identify as the corner the first triple where the oriented angle θ(l, m, n) is either equal to π or is maximum. If the angle θ(l, m, n) = π, then that part of the RVI curve is already “flat” w.r.t. the least contributing TF and the position l represents the optimal number of TFs for the jth target gene. All the TFs to the left of position l (including l) form the optimal set of TFs that regulate the target gene j, as shown in Figure 3. The worst-case complexity of the triangle method is $$O(p^{2})$$. However, if for a given $$\mathcal {P}_{l}$$ the oriented angles satisfy $$\theta (l,m,n) \ge \frac{7\pi }{8}$$ for all $$\mathcal {P}_{m}$$, then the optimal corner corresponds to this l, since the L-curve is almost flat from $$\mathcal {P}_{l}$$ onwards and is hence considered flat (34). Thus, all TFs to the left of $$\mathcal {P}_{l}$$, together with $$\mathcal {P}_{l}$$ itself, constitute the list of regulators for that target gene. This acts as an early stopping criterion and helps to reduce the complexity of the triangle method. The majority of the RVI curves have an approximately exponential distribution, so we can quickly reach the position where the oriented angle first becomes π and avoid unnecessary computations, as indicated in Algorithm 1. From our experiments, we empirically found that the proposed technique requires far fewer steps on average to infer the optimal set of TFs for each target gene because of the exponential nature of the RVI score distribution. Moreover, the computation of the optimal set of TFs for each target gene can be performed in parallel. Another key feature of RGBM is the detection of upstream regulators, i.e. nodes which have no incoming edges in the inferred GRN, for which we devised a simple heuristic. From Figure 4A, we observe that for several target genes the maximum RVI (MRVI) score is >0.5. But there are a few outlier genes for which the MRVI score is much smaller ($$O(10^{-1})$$ or $$O(10^{-2})$$), indicating that the given set of TFs cannot drive the expression of these target genes.
In order to detect these outliers, we transform the MRVI score into the inverse maximum relative variable importance (IMRVI) score using the inverse cumulative density function (Ψ( · )) on the MRVI score distribution as illustrated below: \begin{equation*} \text{IMRVI}_{j} = \Psi ^{-1}(\text{MRVI}_{j}) \end{equation*} (2)
By using this function it becomes easy to identify the heuristic cut-off $$\rho = \mu _{\text{IMRVI}} - 1.64 \times \sigma _{\text{IMRVI}}$$, corresponding to approximately the 5th percentile of the IMRVI score distribution, which is allocated to the outliers. All the genes whose IMRVI score is to the left of the ‘red’ line in Figure 4 are considered as candidate outliers. For these candidate outliers, if the cardinality ($$\#(\cdot )$$) of the optimal set of TFs satisfies $$\#M[,j] \ge \frac{\#(C_{j})}{2}$$, then it is an indication that it is difficult for this set of TFs to fit the expression of the given target gene, as more than half of the TFs are getting low RVI scores, close to the MRVI score for that target gene. We select and prune out such genes as 0 in-degree targets, or upstream regulators, in the final inferred GRN. For example, for the 5th target gene, the MRVI score is ≈0.2, which is very close to the smallest MRVI score in the MRVI score distribution, as depicted in Figure 4A. Moreover, there are 51 TFs with relatively small non-zero RVI scores for the 5th target gene, as shown in Figure 2. Hence, the 5th gene is considered an upstream regulator in the inferred GRN.
Figure 2. Sorted RVI score curves for the first 16 targets of Network 1 from the DREAM4 challenge. The target genes are ordered in row-format from left to right (i.e. target genes 1–4 in row 1, target genes 5–8 in row 2, etc.). We can empirically observe that the sorted RVI curves approximately follow an exponential distribution w.r.t. the number of TFs for each target gene. This is further verified by a near linear fit to the log(RVI) scores w.r.t. the set of TFs, as showcased in Supplementary Figure S4.
Thus, there are only a few TFs which are strongly regulating the expression of a target gene.
Figure 3. Optimal set of TFs obtained from the RVI curve of gene ‘G1’ for Network 1 from the DREAM4 challenge using a triangle method (37) based technique, which is commonly employed for identifying the corner in the Tikhonov L-curve. We can see that the rightmost non-zero RVI score is at an x-axis position close to 80. This indicates that there are at least 20 TFs which had RVI(ϕ) = 0 for gene ‘G1’.
Figure 4. Subfigure A represents the MRVI score distribution for all the 100 targets of Network 1 from the DREAM4 challenge. Subfigure B corresponds to the inverse cumulative density function of the MRVI scores (IMRVI). Here the ‘red’ vertical line represents the heuristic cut-off ρ s.t.
all targets whose MRVI score is less than ρ are selected as 0 in-degree genes and correspond to the 5th percentile of the distribution.
Once we have obtained the optimal set of TFs for an individual target gene, we re-iterate through the core GBM model. All these steps together form the RGBM technique for re-constructing GRNs, as showcased in Algorithm 2 and illustrated via Figure 5.
Figure 5. Final inferred GRN (Afinal) obtained as a result of the RGBM algorithm (Algorithm 2) on Network 1 for the DREAM4 challenge. The final inferred GRN has 1144 edges between 100 nodes whose edge weights are >$$3.3 \times 10^{-15}$$ (machine precision). Afinal is much more sparse in comparison to A1, which is obtained after initial GBM modeling. In network Afinal, we have greatly reduced the number of falsely identified transcriptional regulations in A1. We also identified 4 dense communities or clusters in inferred network A1 using kernel spectral clustering (71). Nodes belonging to a cluster and edges originating from the nodes in these clusters have the same color. The size of each node is proportional to its out-degree. We observe that the communities present in Afinal have fewer edges and thus have much lower density than the clusters in A1.
Post-transcriptional TF activity
TF activity is determined using an algorithm that computationally infers protein activity from gene expression profile data on an individual sample basis. The activity of a TF, defined as a metric that quantifies the activation of the transcriptional program of a specific regulator in each sample Si, is calculated as follows: \begin{equation*} Act(S_{i},\text{TF})=\frac{1}{U} \sum _{k=1}^{U} t^{+}_{ki}- \frac{1}{V} \sum _{j=1}^{V} t^{-}_{ji} \end{equation*} (3)
where $$t^{+}_{ki}$$ is the expression level of the kth positive target of the TF in the ith sample, $$t^{-}_{ji}$$ is the expression level of the jth negative target of the TF in the ith sample, and U (V) is the number of positive (negative) targets present in the regulon of the considered TF. If Act(Si, TF) > 0, the TF is active in that particular sample. If Act(Si, TF) < 0, the TF is inversely activated, and if Act(Si, TF) ≈ 0, it is non-active. To identify the main Master Regulators of glioma subtypes reported in Section 4, we use supervised analysis of the activity function defined in equation (3) using the Wilcoxon test (53).
Cell culture, lentiviral infection and quantitative RT-PCR
Human astrocytes (HA) (54) were cultured in DMEM supplemented with 10% fetal bovine serum (FBS, Sigma).
Cells were routinely tested for mycoplasma contamination using the Mycoplasma Plus PCR Primer Set (Agilent Technologies) and were found to be negative. Cell authentication was performed using short tandem repeats (STR) at the ATCC facility. Human astrocytes were infected with either the lentiviral vector pLOC–vector or pLOC–PPARGC1A. Total RNA was prepared using the Trizol reagent (Invitrogen) and cDNA was synthesized using SuperScript II Reverse Transcriptase (Invitrogen) as described in (32). Quantitative RT–PCR (qRT–PCR) was performed with a Roche 480 thermal cycler, using SYBR Green PCR Master Mix (Applied Biosystems). Primers used in qRT–PCR are listed in Supplementary Table S4. qRT–PCR results were analyzed by the ΔΔCt method using 18S as the housekeeping gene.

RESULTS AND DISCUSSION

For the GBM-based RGBM, we use the parameters corresponding to the optimal parameter settings for ENNET (22). Similarly, for the RF-based RGENIE, we use the parameters which correspond to the optimal parameter setting for GENIE (21). Additional details about the parameter settings for the proposed RGBM and RGENIE models can be found in Supplementary Section 4.

RGBM outperforms state-of-the-art on DREAM Challenge Data

We assessed the performance of the proposed RGBM models, using LS-Boost and LAD-Boost as core models, and of RGENIE, using RF as core model, on universally accepted benchmark networks of 100 or more genes from the DREAM3, DREAM4 and DREAM5 challenges (55–57), and compared them with several state-of-the-art GRN inference methods. For the purpose of comparison, we selected several methods including ENNET (22), GENIE (21), iRafNet (24), ARACNE (19) and the winner of each DREAM challenge. Among all the DREAM challenge networks, we performed experiments on in-silico networks of size 100 from DREAM3 and DREAM4, and on three benchmark networks (two of which are real) from the DREAM5 challenge.
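The benchmark comparisons are reported in terms of AUpr and AUroc, which score a ranked list of candidate TF–target edges against a gold-standard network. The following is a minimal sketch of that kind of scoring, not the official DREAM evaluation code: it assumes the inferred network is given as a dictionary of edge weights, approximates AUpr by average precision, and ignores gold-standard edges that are absent from the candidate list. The function name `auroc_aupr` and the toy values are illustrative assumptions.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target); naming is illustrative


def auroc_aupr(weights: Dict[Edge, float], gold: Set[Edge]) -> Tuple[float, float]:
    """Score a weighted edge list against a gold-standard edge set.

    AUroc is the fraction of (true edge, false edge) pairs in which the true
    edge is ranked first. AUpr is approximated by average precision, i.e. the
    mean of the precision values measured at each true edge in the ranking.
    Tied weights keep their input order (no tie correction is applied).
    """
    ranked = sorted(weights, key=weights.get, reverse=True)
    n_pos = sum(1 for edge in ranked if edge in gold)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false candidate edge")

    tp = fp = 0
    ordered_pairs = 0      # (true, false) pairs ranked in the right order
    precision_sum = 0.0
    for edge in ranked:
        if edge in gold:
            tp += 1
            precision_sum += tp / (tp + fp)
        else:
            fp += 1
            ordered_pairs += tp   # every true edge seen so far outranks this false edge
    return ordered_pairs / (n_pos * n_neg), precision_sum / n_pos


# Toy example with two TFs and two targets (values are made up).
weights = {("TF1", "G1"): 0.9, ("TF2", "G1"): 0.8, ("TF1", "G2"): 0.3, ("TF2", "G2"): 0.1}
gold = {("TF1", "G1"), ("TF1", "G2")}
auroc, aupr = auroc_aupr(weights, gold)
print(f"AUroc = {auroc:.3f}, AUpr = {aupr:.3f}")  # AUroc = 0.750, AUpr = 0.833
```

Because real GRNs are sparse, true edges are heavily outnumbered by false ones, so AUpr is usually the more discriminating of the two metrics.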
The DREAM3 and DREAM4 challenges comprise five in-silico networks whose expression matrices E are simulated using the GeneNetWeaver (58) software. Benchmark networks were constructed as sub-networks of systems of transcriptional regulations from known model organisms, namely Escherichia coli and Saccharomyces cerevisiae. In our experiments, we focus on networks of size 100, which are the largest in the DREAM3 suite. There are several additional sources of information available for these networks, such as knockout, knockdown and wildtype expressions, apart from the time-series information. However, most of the state-of-the-art techniques do not necessarily utilize all these heterogeneous information sources. We showcase the best results generated for the DREAM3 and DREAM4 challenge networks using the optimal combination of information sources for different GRN inference methods in Table 1.

Table 1. Comparison of RGBM and RGENIE with other inference methods on DREAM3 and DREAM4 networks of size 100

DREAM3 experiments

| Methods | Data used | Network 1 AUpr | Network 1 AUroc | Network 2 AUpr | Network 2 AUroc | Network 3 AUpr | Network 3 AUroc | Network 4 AUpr | Network 4 AUroc | Network 5 AUpr | Network 5 AUroc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RGBM (LS-Boost) | KO,KD,WT | 0.699 | 0.903 | 0.888 | 0.965 | 0.597 | 0.900 | 0.571 | 0.861 | 0.460 | 0.787 |
| RGBM (LAD-Boost) | KO,KD,WT | 0.683 | 0.903 | 0.870* | 0.963* | 0.562* | 0.900 | 0.535* | 0.853* | 0.400 | 0.770 |
| ENNET | KO,KD,WT,MTS | 0.627 | 0.901 | 0.865+ | 0.963+ | 0.552+ | 0.892 | 0.522+ | 0.842 | 0.384 | 0.765 |
| RGENIE | KO,KD,WT | 0.521 | 0.870 | 0.821− | 0.899 | 0.456 | 0.812 | 0.478− | 0.778 | 0.356 | 0.718 |
| GENIE | KO,KD,WT | 0.430 | 0.850 | 0.782 | 0.883 | 0.372 | 0.729 | 0.423 | 0.724 | 0.314 | 0.656 |
| iRafNet | KO,KD,WT | 0.528 | 0.878 | 0.812 | 0.901 | 0.484 | 0.864 | 0.482 | 0.772 | 0.364 | 0.736 |
| ARACNE | KO,KD,WT | 0.348 | 0.781 | 0.656 | 0.813 | 0.285 | 0.669 | 0.396 | 0.662 | 0.274 | 0.583 |
| Winner (72) | KO,WT | 0.694 | 0.948 | 0.806 | 0.960 | 0.493 | 0.915 | 0.469 | 0.853 | 0.433 | 0.783 |

DREAM4 experiments

| Methods | Data used | Network 1 AUpr | Network 1 AUroc | Network 2 AUpr | Network 2 AUroc | Network 3 AUpr | Network 3 AUroc | Network 4 AUpr | Network 4 AUroc | Network 5 AUpr | Network 5 AUroc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RGBM (LS-Boost) | KO,KD,WT,MTS | 0.709 | 0.936 | 0.561 | 0.878* | 0.525 | 0.911 | 0.616 | 0.903 | 0.450 | 0.893 |
| RGBM (LAD-Boost) | KO,KD,WT,MTS | 0.682* | 0.924* | 0.525* | 0.895 | 0.490* | 0.907* | 0.566* | 0.903 | 0.413* | 0.885* |
| ENNET | KO,KD,WT | 0.604+ | 0.893 | 0.456+ | 0.856+ | 0.421+ | 0.865+ | 0.506+ | 0.878+ | 0.264+ | 0.828+ |
| RGENIE | KO,WT | 0.448 | 0.902 | 0.330 | 0.792 | 0.374 | 0.834− | 0.362− | 0.840 | 0.218− | 0.773− |
| GENIE | KO,WT | 0.338 | 0.864 | 0.309 | 0.748 | 0.277 | 0.782 | 0.267 | 0.808 | 0.114 | 0.720 |
| iRafNet | KO,TS | 0.552 | 0.901 | 0.337 | 0.799 | 0.414 | 0.835 | 0.421 | 0.847 | 0.298 | 0.792 |
| ARACNE | KO,KD,WT | 0.279 | 0.781 | 0.256 | 0.691 | 0.205 | 0.669 | 0.196 | 0.699 | 0.074 | 0.583 |
| Winner (73) | KO | 0.536 | 0.914 | 0.377 | 0.801 | 0.390 | 0.833 | 0.349 | 0.842 | 0.213 | 0.759 |

Here, we provide the mean AUpr and AUroc values for 10 random runs of the different inference methods. KO, knockout; KD, knockdown; WT, wildtype; MTS, modified smoothed version of the time-series data. *, + and − mark the quality metric values where the RGBM (LAD-Boost), ENNET and RGENIE techniques, respectively, outperform the winner of the DREAM3 and DREAM4 challenges.

We observe from Table 1 that the best source of information for almost all the GRN inference methods is the combination of knockout, knockdown and wildtype expressions for the DREAM3 challenge.
But in the case of the DREAM4 challenge, all available heterogeneous information sources are useful for the RGBM models, whereas knockout, knockdown and wildtype expressions are useful for ENNET and ARACNE, and the knockout and wildtype expressions are optimal for RGENIE and GENIE. Table 1 shows that ARACNE performs the worst on all DREAM3 and DREAM4 challenge datasets. The RF-based methods GENIE, iRafNet and RGENIE are inferior to the GBM-based methods ENNET and RGBM for both the DREAM3 and DREAM4 challenges. However, RGENIE significantly outperforms GENIE w.r.t. the quality metrics AUpr and AUroc on all DREAM3 and DREAM4 challenge datasets. Similarly, RGBM using LS-Boost as the core model significantly outperforms ENNET as well as the winner on several networks for both of these challenges. Both RGBM and RGENIE gain maximum benefit from the proposed regularization steps by removing falsely identified edges, and can efficiently detect 0 in-degree genes. As a result, these methods gain substantially in terms of precision and recall. Overall, RGBM (LS-Boost) clearly performs the best on the majority of the datasets from the DREAM3 and DREAM4 challenges. Figure 6 illustrates the optimal number of TFs identified by the proposed Algorithm 1 for each target gene and passed as network M either to Algorithm S1 or Algorithm S2 to infer the final GRN for Network 1 of the DREAM4 challenge. We observe that several genes (including ‘G5’, ‘G26’, ‘G40’, ‘G42’ etc.) have 0 TFs connected to them and are inferred as 0 in-degree upstream regulators.

Figure 6. Optimal number of TFs for each target gene obtained from the proposed Algorithm 1 for Network 1 of the DREAM4 challenge.

Two benchmark networks in the DREAM5 (6) challenge of different sizes and structure were generated using a prokaryotic model organism (E. coli) and a eukaryotic model organism (S. cerevisiae), corresponding to Network 3 and Network 4 respectively. Only the time-series data of Network 1 was simulated in-silico; the two other sets of expression data were measured in real experiments. DREAM5 was the first challenge where participants were asked to infer GRNs for large-scale real datasets, i.e. for O(10^3) target genes and O(10^2) known TFs. Gold standard networks were obtained from two sources: the RegulonDB database (59) and the Gene Ontology (GO) annotations (60). The E. coli network of the DREAM5 challenge consists of 4297 target genes and 296 TFs, and the corresponding gold standard has 2066 interactions. Similarly, the S. cerevisiae network comprises 5667 targets and 183 TFs, and the corresponding gold standard has 2528 regulatory interactions (6). The results of all the inference methods for DREAM5 expression data using the optimal combination of information sources are summarized in Table 2. Network 2 from DREAM5 was ignored as the gold standard network was not well constructed (6,22).

Table 2. Comparison of RGBM and RGENIE with inference methods on DREAM5 networks of varying sizes

| Methods | Data used | Network 1 AUpr | Network 1 AUroc | Network 3 AUpr | Network 3 AUroc | Network 4 AUpr | Network 4 AUroc |
|---|---|---|---|---|---|---|---|
| RGBM (LS-Boost) | KO,Exp | 0.537 | 0.846* | 0.086 | 0.633* | 0.048 | 0.546 |
| RGBM (LAD-Boost) | KO,Exp | 0.513* | 0.842* | 0.084 | 0.628* | 0.047* | 0.544* |
| ENNET | KO,Exp | 0.432+ | 0.857 | 0.069 | 0.632+ | 0.021 | 0.532+ |
| iRafNet | KO,MTS,Exp | 0.364 | 0.813 | 0.112 | 0.641 | 0.021 | 0.523 |
| RGENIE | Exp | 0.343− | 0.821− | 0.104− | 0.623− | 0.022− | 0.524− |
| GENIE (Winner) | Exp | 0.291 | 0.814 | 0.094 | 0.619 | 0.021 | 0.517 |
| TIGRESS (15) | KO,Exp | 0.301 | 0.782 | 0.069 | 0.595 | 0.020 | 0.517 |
| CLR (18) | Exp | 0.217 | 0.666 | 0.050 | 0.538 | 0.018 | 0.505 |
| ARACNE | Exp | 0.099 | 0.545 | 0.029 | 0.512 | 0.017 | 0.500 |

Here, we provide the mean AUpr and AUroc values for 10 random runs of the different inference methods. KO, knockout; KD, knockdown; WT, wildtype; MTS, modified smoothed version of the time-series data; Exp, steady-state gene expression. *, + and − mark the quality metric values where the RGBM, ENNET and RGENIE techniques, respectively, defeat the winner of the DREAM5 challenge, i.e. GENIE.

RGBM using the LS-Boost core model gives better results than the other methods w.r.t. the evaluation metrics AUpr and AUroc on Network 4, as illustrated in Table 2.
It easily defeats the winner (GENIE) of the DREAM5 challenge and outperforms recent state-of-the-art GRN inference methods iRafNet and ENNET. Similarly, the performance of RGENIE surpasses that of GENIE. However, RGBM performs much better than RGENIE on Network 4 whereas it is defeated by RGENIE w.r.t. AUpr for Network 3. We observe from Table 1 and Table 2 that RGBM based on the LS-Boost model usually has a better performance than RGBM based on the LAD-Boost model for both in-silico and real datasets. Hence, for all further experimental comparisons, we will use RGBM based on the core LS-Boost model.

Interestingly, the predictions for real expression profiles (DREAM5 challenge, Networks 3 and 4) result in extremely low precision-recall values as depicted in Table 2. One of the reasons for the poor performance of all the inference methods for such expression data is the fact that experimentally derived pathways, and consequently gold standards obtained from them, are not necessarily complete, regardless of how well the model organism is known. Additionally, there are regulators of gene expression other than TFs, such as miRNA and siRNA, which also drive the expression of these genes.

RGBM outperforms state-of-the-art on synthetic RNA-Seq data

We conducted additional experiments on simulated RNA-Seq data. We used our R package synRNASeqNet (https://cran.r-project.org/web/packages/synRNASeqNet) to generate RNA-Seq expression matrices. It uses a stochastic Barabási-Albert (BA) model (61) to build random scale-free networks using a preferential attachment mechanism with power exponent α and simulated RNA-Seq counts from a Poisson multivariate distribution (62). For our experiments, we generated 5 RNA-Seq expression (E) matrices comprised of 500 RNA-Seq counts for 50 target genes using power exponent values α ∈ {1.75, 2, 2.25, 2.5, 2.75} respectively and repeated this procedure 10 times.
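To make the simulation setup concrete, the following is a minimal Python sketch of the two ingredients named above: a network grown by preferential attachment with exponent α, and count data drawn from a multivariate Poisson distribution. It is not the synRNASeqNet implementation. The attachment rule (probability proportional to degree^α + 1), the common-shock construction of correlated Poisson counts, the rates `base_rate` and `edge_rate`, and the reading of "500 counts for 50 genes" as a 500 × 50 matrix are all assumptions made for illustration.

```python
import math
import random
from typing import List, Tuple


def ba_network(n_genes: int, alpha: float, rng: random.Random) -> List[Tuple[int, int]]:
    """Grow an undirected network by preferential attachment.

    Each new node attaches to one existing node chosen with probability
    proportional to degree**alpha + 1 (the +1 lets isolated nodes be picked).
    """
    degree = [0] * n_genes
    edges = []
    for new in range(1, n_genes):
        attach_weights = [degree[old] ** alpha + 1.0 for old in range(new)]
        old = rng.choices(range(new), weights=attach_weights, k=1)[0]
        edges.append((old, new))
        degree[old] += 1
        degree[new] += 1
    return edges


def poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson variate with Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_counts(edges, n_genes, n_samples, rng, base_rate=5.0, edge_rate=2.0):
    """Common-shock multivariate Poisson: each gene gets its own Poisson term
    plus one shared Poisson term per incident edge, which correlates neighbours."""
    matrix = []
    for _ in range(n_samples):
        row = [poisson(base_rate, rng) for _ in range(n_genes)]
        for a, b in edges:
            shared = poisson(edge_rate, rng)
            row[a] += shared
            row[b] += shared
        matrix.append(row)
    return matrix


rng = random.Random(0)                            # fixed seed for reproducibility
edges = ba_network(50, 2.0, rng)                  # 50 genes, exponent alpha = 2
counts = simulate_counts(edges, 50, 500, rng)     # 500 samples x 50 genes
print(len(edges), len(counts), len(counts[0]))
```

Larger values of α concentrate the edges on a few hub nodes, which is the regime in which the comparison between GBM-based and RF-based methods changes most.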
In this experiment, we are not provided with any additional information, such as knockout or knockdown, and the active binding network (ABN) is not present. We use evaluation metrics like AUpr and AUroc to compare the proposed RGBM (using LS-Boost) and RGENIE with state-of-the-art GRN inference methods, including ENNET, GENIE and ARACNE. Figure 7 illustrates the performance of various GRN inference methods w.r.t. ROC and PR curves. The performance of RGBM and RGENIE is compared with ENNET, GENIE and ARACNE for five different experimental settings as shown in Table 3.

Figure 7. Comparison of RGBM and RGENIE with ENNET, GENIE3 and ARACNE w.r.t. AUroc and AUpr curves for five different RNA-Seq experiments.

Table 3. Comparison of proposed RGBM and RGENIE techniques with ENNET, GENIE and ARACNE GRN inference methods w.r.t. evaluation metrics AUroc and AUpr for reverse-engineering GRNs from RNA-Seq counts where the underlying ground-truth network follows a BA preferential attachment model with exponent α. Here no additional information (ABN or knockout or knockdown) is available

| Methods | α = 1.75 AUpr | α = 1.75 AUroc | α = 2 AUpr | α = 2 AUroc | α = 2.25 AUpr | α = 2.25 AUroc | α = 2.5 AUpr | α = 2.5 AUroc | α = 2.75 AUpr | α = 2.75 AUroc |
|---|---|---|---|---|---|---|---|---|---|---|
| RGBM | 0.575 | 0.808 | 0.470 | 0.789 | 0.498 | 0.700 | 0.500 | 0.695 | 0.506 | 0.709 |
| ENNET | 0.566 | 0.802 | 0.454 | 0.780 | 0.495 | 0.685 | 0.494 | 0.684 | 0.494 | 0.684 |
| RGENIE | 0.605 | 0.846 | 0.528 | 0.785 | 0.270 | 0.652 | 0.272 | 0.626 | 0.274 | 0.641 |
| GENIE | 0.622 | 0.822 | 0.507 | 0.777 | 0.235 | 0.607 | 0.245 | 0.610 | 0.241 | 0.601 |
| ARACNE | 0.065 | 0.575 | 0.053 | 0.556 | 0.056 | 0.600 | 0.055 | 0.600 | 0.055 | 0.600 |

Here, the evaluation metrics AUroc and AUpr represent the mean value of these evaluation metrics for 10 random runs of each setting.

We can observe from Figure 7 and Table 3 that RGBM performs the best as preferential attachment increases and the degree distribution becomes more skewed for the synthetic RNA-Seq networks. However, for smaller values of α, the RF-based inference methods GENIE and RGENIE are better than RGBM. Their performance w.r.t. the evaluation metric AUpr decreases drastically for increasing values of the preferential attachment exponent α, suggesting that RF-based GRNs are obscured by falsely identified edges and are inferior to GBM-based methods when trying to reverse engineer GRNs where very few TFs (hubs) regulate a majority of the target genes. In both the DREAM challenge and the synthetic RNA-Seq experiments, the GBM-based RGBM almost always outperforms the RF-based RGENIE method.
Hence, we only used the proposed RGBM method for additional experiments as depicted in Supplementary Section 5 and in our real case-study to identify the master regulators of different glioma cancer subtypes.

RGBM identifies the master regulators of glioma cancer subtypes

The results in the previous paragraphs have shown that RGBM is a promising technique to efficiently recover the regulatory structure of small and large gene networks. Here, we apply RGBM for the identification of Master Regulators of tumor subtypes in human glioma, the most frequent primary brain tumor in adults (63). In the cancer field, master regulators (MR) have been defined as gene products (mostly TFs) necessary and sufficient for the expression of particular tumor-specific signatures typically associated with specific tumor phenotypes (e.g. pro-neural vs. mesenchymal). In the case of malignant gliomas, reverse engineering has been used to successfully predict the experimentally validated transcriptional regulatory network responsible for activation of the highly aggressive mesenchymal gene expression signature of malignant glioma (32). A master regulator gene can be defined as a network hub whose regulon exhibits a statistically significant enrichment of the given phenotype signature, which expresses a cellular phenotype of interest, such as tumor subtype. MARINa (MAster Regulator INference algorithm) is an algorithm to identify MRs starting from a GRN and a list of differentially expressed genes (64). This specific algorithm was successfully applied previously (32) to identify Stat3 and C/EBPβ as the two TFs hierarchically placed at the top of the transcriptional network of mesenchymal high-grade glioma. We use MARINa in conjunction with the GRN inferred using RGBM on a Pan-glioma dataset. Recently, the Pan-Glioma Analysis Working Group of The Cancer Genome Atlas (TCGA) project analyzed the largest collection of human glioma ever reported (23).
It has been shown that, using a combination of DNA copy number and mutation information, together with DNA methylation and mRNA gene expression, human gliomas can be robustly divided into seven major subtypes defined as G-CIMP-low, G-CIMP-high, Codel, Mesenchymal-Like, Classic-Like, LGm6-GBM and PA-like (23). The first key division of human glioma is driven by the status of the IDH1 gene, whereby IDH1 mutations are typically characterized by a relatively more favorable clinical course of the disease. IDH1 mutations are associated with a hypermethylation phenotype of glioma (G-CIMP, (65)). However, our Pan-glioma study reported that IDH-mutant tumors lacking co-deletion of Chromosome 1p and 19q are a heterogeneous subgroup characterized predominantly by the G-CIMP-high subtype and less frequently by the G-CIMP-low subgroup. The latter is characterized by relative loss of the DNA hypermethylation profile and worse clinical outcomes, and likely represents the progressive evolution of G-CIMP-high gliomas toward a more aggressive tumor phenotype (23). However, the transcriptional network and the set of MRs responsible for the transformation of G-CIMP-high into G-CIMP-low gliomas remained elusive. Among the large group of IDH-wildtype tumors (typically characterized by a worse prognosis when compared to IDH-mutant glioma), we discovered that, within a particular methylation-driven cluster (LGm6) and at variance with the other methylation-driven clusters of IDH-wildtype tumors, the lower grade gliomas (LGG) display significantly better clinical outcome than GBM tumors (GBM-LGm6). We defined these LGG tumors as PA-like based on their expression and genomic similarity with the pediatric tumor Pilocytic Astrocytoma. However, as for the transition from G-CIMP-high into G-CIMP-low gliomas, the determinants of the malignant progression of PA-like LGG into GBM-LGm6 remained unknown.
Here, we applied our novel computational RGBM approach to infer the MRs responsible for the progression of G-CIMP-high into G-CIMP-low IDH-mutant glioma and those driving progression of PA-like LGG into LGm6-GBM IDH-wildtype tumors respectively. Toward this aim, we first built the Pan-glioma network between 457 TFs and 12 985 target genes. An ABN network was used as a prior for the RGBM algorithm, and for the expression matrix we used the TCGA Pan-glioma dataset (23) including 1250 samples (463 IDH-mutant and 653 IDH-wild-type), 583 of which were profiled with Agilent and 667 with RNA-Seq Illumina HiSeq, downloaded from the TCGA portal. The batch effects between the two platforms were corrected as reported in (66) using the COMBAT algorithm (67) with tumor type and profiling platform as covariates. Subsequently, quantile normalization was applied to the whole matrix. The inferred Pan-glioma RGBM network is shown in Supplementary Figure 8 (F8) and contains 39 192 connections with an average regulon size of 85.8 genes. To identify the MRs displaying the highest differential activity for each group, we ranked MR activity for each TF among all seven glioma subtypes. The top MRs exhibiting differential activity among the glioma groups are shown in Supplementary Figure S9 and their average activity in Figure 8. We found that RGBM-based MR analysis efficiently separates an IDH-mutant dominated cluster of gliomas including each of the three IDH-mutant subtypes (G-CIMP-high, G-CIMP-low and Codel) from an IDH-wildtype group including Mesenchymal-Like, Classic-Like and LGm6-GBM. This finding indicates that RGBM correctly identifies biologically-defined subgroups in terms of the activity of MRs. The MRs characterizing IDH-mutant glioma include known regulators of cell fate and differentiation of the nervous system, indicating that these tumors are driven by a more differentiated set of TFs that are retained from the neural tissue of origin (e.g.
NEUROD2, MEF2C, EMX1, etc.). Conversely, the MRs whose activity is enriched in IDH-wildtype glioma are well-known TFs driving the mesenchymal transformation, immune response and the higher aggressiveness that characterizes the IDH-wildtype glioma (STAT3, CEBPB, FOSL2, BATF and RUNX2, etc.). Remarkably, while the G-CIMP-low subtype showed a general pattern of activation of MRs that places this subtype within the IDH-mutant group of gliomas, when compared to the G-CIMP-high subtype, G-CIMP-low glioma displays a distinct loss of activation of neural cell fate/differentiation-specific MRs (see for example the activity of the crucial neural TFs NEUROD2, MEF2C and EMX1) with corresponding activation of a small but distinct set of TFs that drive cell cycle progression and proliferation (E2F1, E2F2, E2F7 and FOXM1). This finding indicates that the evolution of the G-CIMP-high into the G-CIMP-low subtype of glioma is driven by (i) loss of the activity of neural-specific TFs and (ii) gain of a proliferative capacity driven by activation of cell cycle/proliferation-specific MRs.

Figure 8. Average MR activity in the seven glioma subtypes.

Concerning the progression of PA-like into LGm6-GBM, we note that, despite being sustained by an IDH-wildtype status, PA-like LGG cluster within the IDH-mutant subgroup of glioma, with higher activity of neural cell fate/differentiation-specific MRs and inactive mesenchymal-immune response MRs. Therefore, the evolution of PA-like LGG into LGm6-GBM is marked by gain of the hallmark aggressive MR activity of high grade glioma with corresponding loss of the MRs defining the neural cell of origin of these tumors.
Taken together, the application of the RGBM approach to the recently reported Pan-Glioma dataset revealed the identity and corresponding biological activities of the MRs driving transformation of the G-CIMP-high into the G-CIMP-low subtype of glioma and PA-like into LGm6-GBM, thus providing a clue to the yet undetermined nature of the transcriptional events driving the evolution among these novel glioma subtypes.
RGBM identifies the master regulators of the mechanism of action of FGFR3-TACC3 fusion in glioblastoma
FGFR3-TACC3 fusions are recurrent chromosomal rearrangements that generate in-frame oncogenic gene fusions, first discovered in glioblastoma (GBM) (68) and subsequently found in many other tumors. Currently, FGFR3-TACC3 gene fusions are considered one of the most recurrent chromosomal translocations across multiple types of human cancer (69). Recently, we used RGBM to identify PGC1α and ERRγ as the key MRs that are necessary for the activation of mitochondrial metabolism and oncogenesis of tumors harboring FGFR3-TACC3 (36). In that study, we extensively validated the computational approach using a large set of experimental systems spanning from mouse and human cell cultures in vitro to tumor models of Drosophila, mice and humans in vivo (36). Here, we selected the set of 627 IDH-wildtype gliomas from the expression dataset described above to build the RGBM network. To have a more comprehensive set of regulators, even without the availability of the PWMs, we used a predefined list of 2137 gene regulators/transcription factors (TRs) and an all-ones matrix as ABN, i.e. no prior mechanistic information. The final network contains 300 969 edges (median regulon size: 141) between the 2137 regulators and the 12 985 target genes. The key regulators of this oncogenic alteration were identified as those with the most significant differential activity between eleven FGFR3-TACC3 fusion-positive samples and 616 fusion-negative samples (Supplementary Figure S10).
We then sought to identify and experimentally validate the gene targets of the PGC1α transcriptional co-activator inferred by RGBM in glioma harboring FGFR3-TACC3 gene fusions, which is a context of maximal activity for this MR. Under this scenario, RGBM identified a regulon of positively regulated targets of PGC1α comprising 243 genes. To validate the predictions made by RGBM for PGC1α target genes, we ectopically expressed PPARGC1A (the gene encoding PGC1α) in immortalized human astrocytes and evaluated the changes of expression of the top 30 targets in the regulon predicted by RGBM by quantitative RT–PCR (qRT–PCR). We validated primers for efficient PCR from the cDNA of 22 of the top 30 targets and found that the expression of 17 of the 22 genes (77%) was up-regulated by PGC1α, thus confirming that they are bona fide PGC1α target genes (Figure 9, Supplementary Table S5). This fraction of experimentally validated targets is notably high when compared to similar validation studies of gene network inference algorithms. For example, in (70) the authors performed RNAi-mediated gene knockdown experiments in two colorectal cancer cell lines targeting eight key genes in the RAS pathway and evaluated the percentages of correctly identified targets from several gene network inference algorithms. They report an accuracy of 46% for the gene HRAS.
Figure 9. qRT-PCR from HA–vector or HA–PPARGC1A. Data are fold changes relative to vector (dotted line) of one representative experiment (data are mean±standard deviation, n = 3 technical replicates). P-values were calculated using a two-tailed t-test with unequal variance. P-value: * < 0.05; ** < 0.01; *** < 0.001.
DISCUSSION AND CONCLUSIONS
In this paper, we proposed a novel GRN inference framework, whose core model for deducing transcriptional regulations for each target gene can be either boosting of regression stumps (GBM) or an ensemble of decision trees (RF). We showed that the proposed GBM-based RGBM method provides efficient results with both the LS-Boost and the LAD-Boost loss functions. Similarly, the proposed RF-based RGENIE method easily outperforms GENIE on several in-silico and two real (E. coli and S. cerevisiae) datasets. Our key contributions are:
(i) Sparsifying the GRN inferred from tree-based ML techniques (GBM/RF) using a Tikhonov regularization inspired optimal L-curve criterion on the edge-weight distribution obtained from the RVI scores of a target gene to determine the optimal set of TFs associated with it.
(ii) Proposing a simple heuristic based on the maximum variable importance score for all the genes to detect nodes with 0 in-degree, i.e. genes which are not regulated by other genes and are upstream regulators.
(iii) Incorporating prior knowledge in the form of a mechanistic active binding network.
(iv) Showing that RGBM beats several state-of-the-art GRN inference methods like ARACNE, ENNET and GENIE w.r.t. the evaluation metrics AUpr and AUroc by 10–15% for various DREAM challenge datasets.
(v) Showing through synthetic RNA-Seq experiments that random-forest based methods are inferior to gradient boosting machines for inferring GRNs where very few TFs (hubs) regulate a majority of the target genes.
(vi) Identifying the main regulators of the different molecular subtypes of brain tumors, i.e. master regulators driving transformation of the G-CIMP-high into G-CIMP-low and PA-like into LGm6-GBM subtypes of glioma.
(vii) Identifying and validating the main regulators of the mechanism of action of FGFR3-TACC3 fusion in glioblastomas.
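The idea behind the first contribution, keeping only the TFs that sit before the corner of a target gene's sorted importance curve, can be illustrated with a generic elbow heuristic. The sketch below is not the RGBM implementation and does not reproduce the paper's L-curve criterion: it substitutes a simple "farthest point below the chord" rule, and the names `knee_index`, `select_regulators` and the toy scores are hypothetical.

```python
def knee_index(scores):
    """Index of the elbow of a descending score curve.

    Both axes are rescaled to [0, 1]; the elbow is the point lying farthest
    below the chord joining the first and last points. If no point lies
    below the chord, len(scores) is returned (nothing is cut).
    """
    n = len(scores)
    top, bottom = scores[0], scores[-1]
    if n < 3 or top == bottom:
        return n
    best_i, best_d = n, 0.0
    for i, s in enumerate(scores):
        x = i / (n - 1)
        y = (s - bottom) / (top - bottom)
        # Signed gap to the chord from (0, 1) to (1, 0); positive means below it.
        d = 1.0 - x - y
        if d > best_d:
            best_i, best_d = i, d
    return best_i


def select_regulators(importance):
    """Keep the TFs ranked strictly before the elbow of the importance curve."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    cut = knee_index([score for _, score in ranked])
    return [tf for tf, _ in ranked[:cut]]


# Hypothetical importance scores of five TFs for one target gene.
toy_importance = {"TF_A": 0.9, "TF_B": 0.8, "TF_C": 0.1, "TF_D": 0.05, "TF_E": 0.02}
selected = select_regulators(toy_importance)
```

In this toy case the two dominant TFs are retained and the flat tail of low scores is discarded, which is the sparsifying effect the contribution describes. A real application would apply such a rule per target gene to the scores produced by GBM or RF.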
AVAILABILITY
RGBM is available for download on CRAN at https://cran.r-project.org/web/packages/RGBM.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR online.
ACKNOWLEDGEMENTS
We would like to thank all reviewers for their valuable suggestions that helped to significantly improve this paper.
FUNDING
MiUR (Ministero dell’Università e della Ricerca) [FIRB2012-RBFR12QW4I]; Fondazione Biogem. Funding for open access charge: Qatar Computing Research Institute.
Conflict of interest statement. None declared.
REFERENCES
1. Plaisier C.L., O’Brien S., Bernard B., Reynolds S., Simon Z., Toledo C.M., Ding Y., Reiss D.J., Paddison P.J., Baliga N.S. Causal mechanistic regulatory network for glioblastoma deciphered using systems genetics network analysis. Cell Syst. 2016; 3: 172–186.
2. ENCODE Project Consortium. The ENCODE (ENCyclopedia of DNA elements) project. Science. 2004; 306: 636–640.
3. Han H., Shim H., Shin D., Shim J.E., Ko Y., Shin J., Kim H., Cho A., Kim E., Lee T. et al. TRRUST: a reference database of human transcriptional regulatory interactions. Sci. Rep. 2015; 5: 11432.
4. van Someren E.P., Wessels L.F.A., Backer E., Reinders M.J.T. Genetic network modeling. Pharmacogenomics. 2002; 3: 507–525.
5. Karlebach G., Shamir R. Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell Biol. 2008; 9: 770–780.
6. Marbach D., Costello J.C., Küffner R., Vega N.M., Prill R.J., Camacho D.M., Allison K.R., Kellis M., Collins J.J., Stolovitzky G. et al. Wisdom of crowds for robust gene network inference. Nat. Methods. 2012; 9: 796–804.
7. Gardner T.S., Faith J.J. Reverse-engineering transcription control networks. Phys. Life Rev. 2005; 2: 65–88.
8. Friedman J., Hastie T., Tibshirani R.J. The Elements of Statistical Learning. 2001; 1: NY: Springer Series in Statistics.
9. Friedman N., Linial M., Nachman I., Pe’er D. Using Bayesian networks to analyze expression data. J. Comput. Biol. 2000; 7: 601–620.
10. Segal E., Wang H., Koller D. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics. 2003; 19: i264–i272.
11. Perrin B.E., Ralaivola L., Mazurie A., Bottani S., Mallet J., d’Alché-Buc F. Gene networks inference using dynamic Bayesian networks. Bioinformatics. 2003; 19: ii138–ii148.
12. Yu J., Smith V.A., Wang P.P., Hartemink A.J., Jarvis E.D. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics. 2004; 20: 3594–3603.
13. Qi J., Michoel T. Context-specific transcriptional regulatory network inference from global gene expression maps using double two-way t-tests. Bioinformatics. 2012; 28: 2325–2332.
14. Prill R.J., Marbach D., Saez-Rodriguez J., Sorger P.K., Alexopoulos L.G., Xue X., Clarke N.D., Altan-Bonnet G., Stolovitzky G. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One. 2010; 5: e9202.
15. Haury A.C., Mordelet F., Vera-Licona P., Vert J.P. TIGRESS: trustful inference of gene regulation using stability selection. BMC Syst. Biol. 2012; 6: 1.
16. Ceccarelli M., Cerulo L., Santone A. De novo reconstruction of gene regulatory networks from time series data, an approach based on formal methods. Methods. 2014; 69: 298–305.
17. Markowetz F., Spang R. Inferring cellular networks–a review. BMC Bioinformatics. 2007; 8: 1.
18. Faith J.J., Hayete B., Thaden J.T., Mogno I., Wierzbowski J., Cottarel G., Kasif S., Collins J.J., Gardner T.S. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007; 5: e8.
19. Margolin A.A., Nemenman I., Basso K., Wiggins C., Stolovitzky G., Favera R.D., Califano A. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006; 7: S7.
20. Zoppoli P., Morganella S., Ceccarelli M. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics. 2010; 11: 154.
21. Irrthum A., Wehenkel L., Geurts P. et al. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010; 5: e12776.
22. Sławek J., Arodź T. ENNET: inferring large gene regulatory networks from expression data using gradient boosting. BMC Syst. Biol. 2013; 7: 1.
23. Ceccarelli M., Barthel F.P., Malta T.M., Sabedot T.S., Salama S.R., Murray B.A., Morozova O., Newton Y., Radenbaugh A., Pagnotta S.M. et al. Molecular profiling reveals biologically discrete subsets and pathways of progression in diffuse glioma. Cell. 2016; 164: 550–563.
24. Petralia F., Wang P., Yang J., Tu Z. Integrative random forest for gene regulatory network inference. Bioinformatics. 2015; 31: i197–i205.
25. Cover T.M., Thomas J.A. Elements of Information Theory. 2012; John Wiley & Sons.
26. Efron B., Tibshirani R.J. An Introduction to the Bootstrap. 1994; CRC press.
27. Aibar S., González-Blas C.B., Moerman T., Wouters J., Imrichová H., Atak Z.K., Hulselmans G., Dewaele M., Rambow F., Geurts P. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods. 2017; 14: 1083–1086.
28. Friedman J.H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 2001; 29: 1189–1232.
29. Lim N., Şenbabaoğlu Y., Michailidis G., d’Alché-Buc F. OKVAR-Boost: a novel boosting algorithm to infer nonlinear dynamics and interactions in gene regulatory networks. Bioinformatics. 2013; 29: 1416–1423.
30. Califano A., Alvarez M.J. The recurrent architecture of tumour initiation, progression and drug sensitivity. Nat. Rev. Cancer. 2016; 17: 116–130.
31. Langfelder P., Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008; 9: 559.
32. Carro M.S., Lim W.K., Alvarez M.J., Bollo R.J., Zhao X., Snyder E.Y., Sulman E.P., Anne S.L., Doetsch F., Colman H. et al. The transcriptional network for mesenchymal transformation of brain tumours. Nature. 2010; 463: 318–325.
33. Alvarez M.J., Shen Y., Giorgi F.M., Lachmann A., Ding B.B., Ye B.H., Califano A. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nat. Genet. 2016; 48: 838–847.
34. Hansen P.C., Jensen T.K., Rodriguez G. An adaptive pruning algorithm for the discrete L-curve criterion. J. Comput. Appl. Math. 2007; 198: 483–492.
35. Calvetti D., Morigi S., Reichel L., Sgallari F. Tikhonov regularization and the L-curve for large discrete ill-posed problems. J. Comput. Appl. Math. 2000; 123: 423–446.
36. Frattini V., Pagnotta S.M., Tala J.J., Fan M.V., Russo S.B., Garofano L., Lee L., Zhang J., Shi P., Lewis G. et al. A metabolic function associated with FGFR3-TACC3 gene fusions. Nature. 2018; 553: 222–227.
37. Castellanos J.L., Gómez S., Guerra V. The triangle method for finding the corner of the L-curve. Appl. Numer. Math. 2002; 43: 359–373.
38. Mathelier A., Fornes O., Arenillas D.J., Chen C., Denay G., Lee J., Shi W., Shyr C., Tan G., Worsley-Hunt R. et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2016; 44: D110–D115.
39. Jolma A., Yan J., Whitington T., Toivonen J., Nitta K.R., Rastas P., Morgunova E., Enge M., Taipale M., Wei G. et al. DNA-binding specificities of human transcription factors. Cell. 2013; 152: 327–339.
40. Zhao Y., Stormo G.D. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 2011; 29: 480–483.
41. Kulakovskiy I.V., Vorontsov I.E., Yevshin I.S., Soboleva A.V., Kasianov A.S., Ashoor H., Ba-Alawi W., Bajic V.B., Medvedeva Y.A., Kolpakov F.A. et al. HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. 2016; 44: D116–D125.
42. Ernst J., Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods. 2012; 9: 215–216.
43. Tibshirani R.J. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological). 1996; 58: 267–288.
44. Meier L., Van De Geer S., Bühlmann P. The group lasso for logistic regression. J. R. Stat. Soc.: Ser. B (Statistical Methodology). 2008; 70: 53–71.
45. Tibshirani R.J., Saunders M., Rosset S., Zhu J., Knight K. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc.: Ser. B (Stat. Methodol.). 2005; 67: 91–108.
46. Zou H., Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc.: Ser. B (Stat. Methodol.). 2005; 67: 301–320.
47. Omranian N., Eloundou-Mbebi J.M.O., Mueller-Roeber B., Nikoloski Z. Gene regulatory network inference using fused LASSO on multiple data sets. Sci. Rep. 2016; 6: 20533.
48. Rajapakse J.C., Mundra P.A. Stability of building gene regulatory networks with sparse autoregressive models. BMC Bioinformatics. 2011; 12: 1.
49. Liaw A., Wiener M. Classification and regression by randomForest. R News. 2002; 2: 18–22.
50. Hansen P.C. The L-curve and its use in the Numerical Treatment of Inverse Problems. 1999; IMM, Department of Mathematical Modelling, Technical University of Denmark.
51. Hansen P.C., O’Leary D.P. The use of the L-curve in the regularization of discrete ill-posed problems. SIAM J. Sci. Comput. 1993; 14: 1487–1503.
52. Hansen P.C. Regularization tools: A Matlab package for analysis and solution of discrete ill-posed problems. Numer. Algorith. 1994; 6: 1–35.
53. Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. 1945; 1: 80–83.
54. Sonoda Y., Ozawa T., Hirose Y., Aldape K.D., McMahon M., Berger M.S., Pieper R.O. Formation of intracranial tumors by genetically modified human astrocytes defines four pathways critical in the development of human anaplastic astrocytoma. Cancer Res. 2001; 61: 4956–4960.
55. Prill R.J., Marbach D., Saez-Rodriguez J., Sorger P.K., Alexopoulos L.G., Xue X., Clarke N.D., Altan-Bonnet G., Stolovitzky G. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One. 2010; 5: e9202.
56. Marbach D., Prill R.J., Schaffter T., Mattiussi C., Floreano D., Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. U.S.A. 2010; 107: 6286–6291.
57. Marbach D., Schaffter T., Mattiussi C., Floreano D. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol. 2009; 16: 229–239.
58. Schaffter T., Marbach D., Floreano D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011; 27: 2263–2270.
59. Gama-Castro S., Salgado H., Peralta-Gil M., Santos-Zavaleta A., Muniz-Rascado L., Solano-Lira H., Jimenez-Jacinto V., Weiss V., Garcia-Sotelo J.S., Lopez-Fuentes A. et al. RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (gensor units). Nucleic Acids Res. 2011; 39: D98–D105.
60. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 2000; 25: 25–29.
61. Albert R., Barabási A.L. Statistical mechanics of complex networks. Rev. Mod. Phys. 2002; 74: 47.
62. Johnson N.L., Kotz S., Balakrishnan N. Discrete Multivariate Distributions. 1997; 165: NY: Wiley.
63. Wen P.Y., Kesari S. Malignant gliomas in adults. N. Engl. J. Med. 2008; 359: 492–507.
64. Lefebvre C., Rajbhandari P., Alvarez M.J., Bandaru P., Lim W.K., Sato M., Wang K., Sumazin P., Kustagi M., Bisikirska B.C. et al. A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers. Mol. Syst. Biol. 2010; 6: 377.
65. Noushmehr H., Weisenberger D.J., Diefes K., Phillips H.S., Pujara K., Berman B.P., Pan F., Pelloski C.E., Sulman E.P., Bhat K.P. et al. Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell. 2010; 17: 510–522.
66. Mall R., Cerulo L., Bensmail H., Iavarone A., Ceccarelli M. Detection of statistically significant network changes in complex biological networks. BMC Syst. Biol. 2017; 11: 32.
67. Johnson W.E., Li C., Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007; 8: 118–127.
68. Singh D., Chan J.M., Zoppoli P., Niola F., Sullivan R., Castano A., Liu E.M., Reichel J., Porrati P., Pellegatta S. et al. Transforming fusions of FGFR and TACC genes in human glioblastoma. Science. 2012; 337: 1231–1235.
69. Lasorella A., Sanson M., Iavarone A. FGFR-TACC gene fusions in human glioma. Neuro-oncology. 2017; 19: 475–483.
70. Olsen C., Fleming K., Prendergast N., Rubio R., Emmert-Streib F., Bontempi G., Haibe-Kains B., Quackenbush J. Inference and validation of predictive gene networks from biomedical literature and gene expression data. Genomics. 2014; 103: 329–336.
71. Mall R., Langone R., Suykens J.A.K. Kernel spectral clustering for big data networks. Entropy. 2013; 15: 1567–1586.
72. Yip K.Y., Alexander R.P., Yan K.K., Gerstein M. Improved reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data. PLoS One. 2010; 5: e8121.
73. Pinna A., Soranzo N., De La Fuente A. From knockouts to networks: establishing direct cause-effect relationships through graph analysis. PLoS One. 2010; 5: e12912.
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Highly selective retrieval of accurate DNA utilizing a pool of in situ-replicated DNA from multiple next-generation sequencing platforms
Lim, Hyeonseob; Cho, Namjin; Ahn, Jinwoo; Park, Sangun; Jang, Hoon; Kim, Hwangbeom; Han, Hyojun; Lee, Ji Hyun; Bang, Duhee
doi: 10.1093/nar/gky016; pmid: 29361040
Abstract
Scalable and cost-effective production of error-free DNA is critical to meet the increased demand for such DNA in the field of biological science. Methods based on ‘Dial-out PCR’ have enabled high-throughput error-free DNA synthesis from a microarray-synthesized DNA pool by labeling with retrieval PCR tags and retrieving error-free DNA whose sequence is identified via next generation sequencing (NGS). However, most of the retrieved products contain byproducts due to background amplification of redundantly labeled DNAs. Here, we present a highly selective method for retrieving desired DNA from a pool of millions of DNA clones from NGS platforms. Our strategy is based on replicating entire sequence-verified DNA molecules from NGS plates to obtain a population-controlled DNA pool. Using the NGS-replica pool, we could perform improved and selective retrieval of desired DNA from the replicated DNA pool compared to other dial-out PCR based methods. To evaluate the method, we tested this strategy using the 454, Illumina and Ion Torrent platforms to produce NGS-replica pools. As a result, we observed a highly selective retrieval yield of over 95%. We anticipate that applications based on this method will enable the preparation of high-fidelity sequenced DNA from heterogeneous collections of DNA molecules.
INTRODUCTION
Scalable production of error-free DNA may provide a variety of genetic material in diverse fields of biological science. Currently, steps related to controlled-pore glass (CPG)-based oligonucleotide synthesis (1), molecular cloning (2), and selection by Sanger sequencing (3) are standard protocols for in vitro production of high-fidelity oligonucleotides (Supplementary Figure S1A). Due to the prohibitive cost of traditional oligonucleotide synthesis for high-throughput biological applications, oligonucleotides cleaved from microarrays have been used for large-scale DNA preparation as a low-cost alternative (4–7).
However, a large proportion of microarray-derived oligonucleotides contain synthesis errors, depending on the size of the oligonucleotides and the synthesis method (8). Selection of error-free oligonucleotides has previously been inefficient due to laborious cloning procedures and costly Sanger sequencing. While some error-reduction methods have been developed (9–12), these methods still involve labor- and cost-intensive efforts. Recently, Matzas et al. (13) made an advancement in accurate gene synthesis with respect to retrieval of sequence-verified DNA (Supplementary Figure S1B, upper). They prepared programmable oligonucleotides cleaved from a microarray, performed 454 sequencing (14), and utilized a robotic pick-in-place pipette to retrieve error-free sequence reads from the sequencing flow cell. In addition, Lee et al. reported ‘Sniper Cloning’ (15), which enables fast retrieval of targets on NGS platforms using laser-pulse technology (Supplementary Figure S1B, lower). However, these bead-based mega-clone strategies are currently limited to 454-based technologies and are not applicable to the widely used Illumina sequencing platforms (16). Alternatively, a PCR-based DNA retrieval method termed ‘dial-out PCR’ (17–19), which can specifically amplify desired DNAs from a pool of complex DNA libraries, was also reported (Supplementary Figure S2A). In this method, designed oligonucleotides are synthesized on a programmable microarray, ∼20-bp degenerate barcoded nucleotides (‘dial-out tags’) (20) are attached to them as flanking sequences, and the resulting oligonucleotides are subsequently read by NGS. The dial-out tags serve as both DNA identification tags and PCR priming loci to selectively retrieve the desired sequences. Dial-out PCR is cost-effective and does not require any specialized equipment, such as robotic pipettes or laser platforms, to retrieve clones.
One of the dial-out PCR based methods recently demonstrated a useful strategy utilizing combinatorial barcode tags (CBT), termed a ‘static tag library’ (19), whose dial-out PCR primers can be reused in subsequent experiments. Using this CBT approach, the great expense of preparing a myriad of barcoded primer pairs could be saved. However, dial-out PCR-based retrieval of error-free DNA has a scalability limitation. Only a sub-population of the prepared libraries is subjected to NGS, which results in a discrepancy between the pool used for dial-out PCR (‘pre-NGS pool’) and the pool sequenced on the NGS flow cell. Thus, it was not possible to determine whether a selected dial-out tag labeled a single molecule or was redundantly used in the pre-NGS pool (Supplementary Figure S2B). If the tags of a retrieval target were misidentified as unique, non-targeted products would also be retrieved and left in the background. The previous report utilizing the ‘static tag library’ (i.e. CBT library) (19) is in agreement with this trend. Almost 22% of the retrieved product consisted of byproducts when ca. 5 million pairs of dial-out tags were used for labeling only 250 target DNAs. We expect that dial-out PCR-based retrieval may become less specific when applied to an increasingly complex DNA mixture. To prevent misidentification of a unique pair of dial-out tags, a new method is needed that controls the number of molecules so that the population of the DNA pool used for dial-out PCR matches the NGS data. Here, we present a method that replicates the library pool from an NGS plate or a flow cell whose entire clonal population is comprehensively sequenced. Using this method, we successfully reduced the pool's DNA population size to millions of clones (‘NGS-replica pool’); the NGS-replica pool was then used as the dial-out PCR template instead of the pre-NGS pool (Figure 1).
This population size is significantly smaller than that of the pre-NGS pool, which leads to a lower probability of redundant labeling with CBT. To evaluate the method, libraries from 454-, Illumina- and Ion Proton-based sequencing platforms were replicated and used for retrieving microarray-derived DNA or sheared genomic DNA. Then, retrieval fidelity and selectivity were compared to those of the general dial-out PCR method.
Figure 1. Schematic of in situ replication of DNA molecules from next-generation sequencing (NGS) platforms and subsequent PCR-based retrieval of target sequences. (A) Process flow chart for PCR-based methods for the retrieval of error-free DNA targets from an NGS-replica pool. (B) Preparation strategy of 454 GS Junior sequencing-based retrieval. Combinatorial barcode-tagged (CBT) pools were processed from microarray-synthesized oligonucleotides and subsequently ligated to the sheared genomic DNA as flanking sequences. The library was replicated in a sealed NGS plate. (C) Preparation strategy of a pre-NGS pool (MiSeq and Ion Proton). The barcoded library (cgc50 pool) was directly synthesized on a microarray. (D) Schematic of library replication in a MiSeq flow cell. (E) Schematic of library replication using melt-off DNA in the Ion Proton system. This process could be automatically performed using an Ion OneTouch™ ES system.
Moreover, we extended our method to a tag-directed assembly method which utilizes full- or sub-assembled fragments as retrieval targets whose lengths are usually longer than the typical sequencing read length (21). In the previous method, the retrieval step follows a round of full- or sub-assembly (up to almost 500 bp), barcode tagging, shotgun sequencing and de novo assembly of NGS data. By performing these steps, this method can use longer building blocks and reduce the number of fragments to retrieve. However, this method is limited by the absence of an appropriate population-control method. To mitigate this limitation, only a serial dilution step is used to reduce the population, so a large amount of NGS data is needed to cover all constructs in the pool. We expected that our method would be useful for controlling the population without the dilution step. Thus, we tested whether the fully-assembled error-free KRAS (570 bp) and GFP (810 bp) genes could be selectively retrieved from the assembled construct.
MATERIALS AND METHODS
Simulation for demonstrating specificity of the CBT-based labeling method
Prior to the experiment, we investigated the specificity of the CBT-based labeling method for the pre-NGS pool (Supplementary Figure S3) and the NGS-replica pool by counting the number of DNA molecules per CBT. For this simulation, we carried out the following procedures using Python scripts: (i) A virtual library (DNA pool) was composed of 100 million unique molecules (equivalent to ca. 0.03 ng NGS library containing ca. 300 bp fragments), and each DNA molecule was simplified as a unique integer.
(ii) Each DNA molecule in the virtual library was randomly tagged with one of 2000 forward barcodes (f1, f2, f3, …, f2000) and one of 2000 reverse barcodes (r1, r2, r3, …, r2000), which could generate 4 × 10^6 CBTs. The resulting sequences were considered the ‘pre-NGS pool’. (iii) One hundred thousand barcode-tagged DNA molecules (the throughput of a 454 GS Junior sequencer) were randomly picked from the pre-NGS pool, and this library was considered the ‘NGS-replica pool’. (iv) The number of DNA molecules per CBT was counted for all CBTs in the pre-NGS and NGS-replica pools. This value indicates the tagging specificity of the library; for example, a value of 1 denotes a unique CBT.
CBT-labeled DNA library preparation for 454 sequencing-based experiment
For the preparation of DNA substrates, human genomic DNA (NA 12878) was utilized as a model. We sheared 1 μg genomic DNA into 180-bp fragments using an M220 focused ultrasonicator™ (Covaris, Woburn, MA, USA), repaired both ends of the DNA fragments, dA-tailed the 3′ ends of the DNA fragments, ligated NEBNext Adaptors (New England BioLabs, Ipswich, MA, USA) for flanking with the CBT library, and cleaved ideoxyU bases on the adaptor sequence. End-repair, dA-tailing, and ligation reactions were performed according to standard protocols of the SPARK™ DNA Sample Prep Kit (Enzymatics, Beverly, MA, USA). After ligation, the product was enriched by PCR using common primers. 14 μl ligated product, 2.5 μl CBT_flk_fwd primer, 2.5 μl CBT_flk_rev primer, 6 μl dH2O, and 25 μl KAPA HiFi polymerase (KAPA Biosystems, Wilmington, MA, USA) were mixed, and the reaction was performed under the following conditions: 5 min at 95°C; 6 cycles of 30 s at 95°C, 30 s at 65°C, 30 s at 72°C; and 10 min at 72°C. The amplified products were termed the ‘sheared gDNA library’.
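The labeling simulation in steps (i) to (iv) above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' script: the function names, the seed, and the scaled-down pool sizes in the demo are assumptions chosen so that the example runs quickly while keeping the same molecules-per-CBT ratios as the text (about 25 for the pre-NGS pool and about 0.025 for the NGS-replica pool).

```python
import random
from collections import Counter


def simulate_cbt_labeling(n_molecules, n_fwd, n_rev, n_sampled, seed=0):
    """Randomly tag molecules with (forward, reverse) barcode pairs.

    Returns two Counters mapping each CBT to the number of molecules it
    labels: one for the full pre-NGS pool, one for a random subsample that
    stands in for the NGS-replica pool.
    """
    rng = random.Random(seed)
    # Step (ii): molecule i gets a random pair, encoded as a single integer.
    tags = [rng.randrange(n_fwd) * n_rev + rng.randrange(n_rev)
            for _ in range(n_molecules)]
    pre_ngs = Counter(tags)
    # Step (iii): pick the sequenced molecules without replacement.
    sampled = rng.sample(range(n_molecules), n_sampled)
    replica = Counter(tags[i] for i in sampled)
    return pre_ngs, replica


def unique_fraction(counts):
    """Step (iv): fraction of observed CBTs that label exactly one molecule."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


if __name__ == "__main__":
    # Scaled-down stand-in for 10^8 molecules, 2000 x 2000 CBTs, 10^5 reads.
    pre_ngs, replica = simulate_cbt_labeling(250_000, 100, 100, 250)
    print("unique CBT fraction, pre-NGS pool:    ", unique_fraction(pre_ngs))
    print("unique CBT fraction, NGS-replica pool:", unique_fraction(replica))
```

Running the full-size parameters from the text should work with the same code, but in pure Python it needs several gigabytes of memory and a few minutes; an array-based implementation would be the natural choice at that scale.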
For preparation of CBT sequences, 2133 forward and 2133 reverse barcode primer sequences were designed, and barcode libraries were synthesized using a programmable microarray (CustomArray, Inc., Bothell, WA, USA) (Figure 1B, Supplementary Table S1). We designed CBT sequences with little similarity and more specificity by following these design principles: (i) melting temperature (Tm) close to 60°C, (ii) designing barcodes with three or more base differences from one another at each nucleotide position and (iii) excluding barcodes with three or more repeated bases (e.g. ‘AAA’) to avoid homopolymer sequencing errors. For amplification of the forward barcoded oligonucleotide pool, 0.5 μl tagged oligonucleotide pool DNA, 2.5 μl 454_fwd primer, 2.5 μl flk_fwd primer, 25 μl KAPA HiFi polymerase, and 19.5 μl dH2O were mixed and placed in a thermal cycler. Polymerase chain reactions were performed under the following conditions: 5 min at 95°C; 20 cycles of 30 s at 95°C, 30 s at 60°C, and 30 s at 72°C; and 10 min at 72°C. Amplicons were electrophoresed on a 2% agarose gel and were purified with a MinElute Gel Extraction Kit (Qiagen, Valencia, CA, USA). The reverse barcode pool was amplified under the same conditions but with 454_rev and flk_rev primers instead of its forward counterparts. After amplification, the forward and reverse barcode pool DNA were flanked to both ends of the sheared human genomic DNA fragments by assembly PCR under the following conditions: 5 min at 95°C; 20 cycles of 30 s at 95°C, 30 s at 60°C, and 30 s at 72°C; and 10 min at 72 °C. The libraries were then subjected to 454 sequencing with a GS Junior sequencer (454 Life Sciences, Branford, CT, USA). NGS analysis of 454 sequencing data Raw FASTA format data were converted to FASTQ format. To align the sequence data to the human reference genome, barcode and adaptor sequences were trimmed, and the barcode information was saved in a new file using an in-house program. 
The trimmed sequences were aligned to hg19 (UCSC Genome Browser) using Novoalign (V2.07.18; http://www.novocraft.com), then the barcode information was re-attached to the aligned data. Improperly barcoded reads were removed, and duplicate reads with the same barcode, loci, and size information were counted. If multiple DNA substrates were tagged with the same barcode, that barcode was removed from the target retrieval list. Then, remaining sequences were considered candidates for retrieval. Replication of the 454 DNA library in a sealed plate The sequenced picotiter plate was removed from the 454 GS Junior sequencer before the final bleaching step, sealed using an in-house prepared gasket (Figure 1B), and filled with a PCR mixture of 60 μl 454_fwd primer, 60 μl 454_rev primer, 804 μl dH2O, 24 μl dNTPs, 12 μl Phusion DNA polymerase, and 240 μl 5X Phusion HF Buffer (New England BioLabs). The sequenced 454 DNA library plate was then replicated in an isothermal incubator via five cycles of 5 min at 95°C and 5 min at 70°C. The replicated pool (approximately 1 ml) was subsequently collected, purified with Agencourt AMPure XP beads (Beckman-Coulter, Indianapolis, IN, USA) and eluted with 50 μl dH2O. For enrichment of the NGS-replica pool, 10 additional cycles of PCR amplification were performed. Three microliters of the NGS-replica pool, 7 μl dH2O, 1 μl 454_fwd primer, 1 μl 454_rev primer and 10 μl KAPA HiFi polymerase were mixed per reaction (n = 8), and PCR was carried out under the following conditions: 5 min at 95°C; 10 cycles of 30 s at 95°C, 30 s at 60°C, and 30 s at 72°C; and 10 min at 72°C. Products were then purified with Agencourt AMPureXP beads. 
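The barcode filtering described in the 454 analysis above (count duplicates that share barcode, locus and size; drop any barcode that tags more than one substrate) can be sketched as follows. This is an illustrative sketch only: the record layout `(fwd_barcode, rev_barcode, chrom, start, length)` and the function name are assumptions, and the in-house program used by the authors is not described in enough detail to reproduce.

```python
from collections import Counter, defaultdict


def select_retrieval_candidates(reads):
    """Keep barcode pairs that tag exactly one substrate.

    reads: iterable of (fwd_barcode, rev_barcode, chrom, start, length)
    tuples for properly barcoded, aligned reads (hypothetical layout).
    Returns {(fwd, rev): ((chrom, start, length), duplicate_count)}.
    """
    substrates_by_cbt = defaultdict(Counter)
    for fwd, rev, chrom, start, length in reads:
        # Reads with identical barcode, locus and size count as duplicates.
        substrates_by_cbt[(fwd, rev)][(chrom, start, length)] += 1

    candidates = {}
    for cbt, substrates in substrates_by_cbt.items():
        if len(substrates) == 1:  # barcode pair is unique to one substrate
            (substrate, n_dup), = substrates.items()
            candidates[cbt] = (substrate, n_dup)
    return candidates


if __name__ == "__main__":
    demo = [
        ("f1", "r1", "chr1", 1000, 180),
        ("f1", "r1", "chr1", 1000, 180),  # duplicate of the read above
        ("f2", "r7", "chr2", 500, 175),
        ("f2", "r7", "chr9", 42, 181),    # same pair, different substrate
    ]
    print(select_retrieval_candidates(demo))
```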
Random-barcode labeled DNA library preparation for testing platform compatibility
Seventy-bp-long building block DNA flanked with restriction enzyme sites (EarI), 19-bp degenerate nucleotide-based barcode sequences (5′-NNNNANNNNTNNNNANNNN-3′ at both ends, with 4^16 ≈ 4 × 10^9 complexity), and 20-bp sequences were designed for the synthesis of 50 cancer-associated genes as target sequences (Figure 1C and Supplementary Table S2). The ‘cgc50 pool’ (3742 unique oligonucleotides) was designed, synthesized and cleaved from the microarray (CustomArray). One microliter of the cgc50 pool, 1 μl illu_flk_fwd primer, 1 μl illu_flk_rev primer, 7 μl dH2O and 10 μl KAPA HiFi polymerase were mixed, and PCR amplification was performed as follows: 5 min at 95°C; 20 cycles of 30 s at 95°C, 30 s at 60°C and 30 s at 72°C; and 10 min at 72°C. Next, either an Illumina adaptor or a Proton adaptor was attached to the product. The Illumina adaptor was ligated using standard protocols of the SPARK™ DNA Sample Prep Kit, and the Proton adaptor was attached using PCR. 1 μl amplified product, 1 μl proton_fwd primer, 1 μl proton_rev primer, 7 μl dH2O and 10 μl KAPA HiFi polymerase were mixed and amplification was carried out as follows: 5 min at 95°C; 20 cycles of 30 s at 95°C, 30 s at 60°C and 30 s at 72°C; and 10 min at 72°C. The libraries were sequenced using the Illumina MiSeq instrument (Illumina, San Diego, CA, USA) and the Ion Proton instrument, respectively.
NGS analysis and barcode verification of Illumina & Ion Proton sequencing data
NGS data were analyzed by the following procedure. (i) The content sequence was obtained from the sequence located between the ‘CTCTTC’ and ‘GAAGAG’ sequences (i.e. EarI) in a raw FASTQ file. (ii) The 19-bp left barcode, located between ‘CTCTTC’ and the left flanking sequence (i.e. ‘GACTCAGTGAGCGGAACGAT’), and the 19-bp right barcode, located between ‘GAAGAG’ and the right flanking sequence (i.e. ‘ATCACCGACTGCCCATAGAG’), were obtained.
(iii) Error-containing contents were filtered out. (iv) Redundant barcode pairs labeling more than two different contents were removed. Then, duplicates were counted.
Replication of the Illumina DNA library in a flow cell
The PCR mixture was injected into the inlet, the flow cell was sealed using sealing film (BioRad, Hercules, CA, USA) (Figure 1D), and the replication reaction was carried out. The PCR mixture consisted of 5 μl illu_fwd primer, 5 μl illu_rev primer, 40 μl dH2O and 50 μl KAPA HiFi polymerase, and the reaction was run under the same conditions used for the picotiter plate-based replication. For enrichment of the NGS-replica pool, PCR was performed for 10 additional cycles. Each reaction (n = 10) consisted of 5 μl NGS-replica pool DNA, 3 μl dH2O, 1 μl illu_fwd primer, 1 μl illu_rev primer and 10 μl KAPA HiFi polymerase, which were mixed and amplified under the following conditions: 5 min at 95°C; 10 cycles of 30 s at 95°C, 30 s at 60°C, and 30 s at 72°C; and 10 min at 72°C. The products were then purified with Agencourt AMPure XP beads.
Replication of Ion Proton DNA library using a melt-off library
Instead of the Ion PI™ Chip, the melt-off waste, whose population is identical to the sequenced population, was used as the template for the NGS-replica pool (Figure 1E). Melt-off DNA was automatically collected from the accessory instrument ‘Ion OneTouch™ ES’, and PCR purification using the MinElute PCR purification kit (Qiagen) was performed for neutralization. The product was amplified in reactions of 1 μl DNA, 1 μl proton_fwd primer, 1 μl proton_rev primer, 7 μl dH2O and 10 μl KAPA HiFi polymerase under the following conditions: 5 min at 95°C; 25 cycles of 30 s at 95°C, 30 s at 60°C and 30 s at 72°C; and 10 min at 72°C. The products were electrophoresed on a 2% agarose gel and purified with a MinElute Gel Extraction Kit (Qiagen).
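The four-step barcode verification for the Illumina and Ion Proton data, described above, might be sketched as follows. Several things here are assumptions rather than details from the text: the read layout (left flank, 19-bp barcode, ‘CTCTTC’, content, ‘GAAGAG’, 19-bp barcode, right flank, with no spacer bases), the function name, the use of a set of designed sequences to decide what counts as error-free, and the choice to drop a barcode pair as soon as it labels more than one distinct content.

```python
import re
from collections import Counter, defaultdict

LEFT_FLANK = "GACTCAGTGAGCGGAACGAT"
RIGHT_FLANK = "ATCACCGACTGCCCATAGAG"

# Assumed layout: flank, 19-bp barcode, EarI site, content, EarI site,
# 19-bp barcode, flank.
READ_RE = re.compile(
    LEFT_FLANK
    + r"(?P<left>[ACGT]{19})CTCTTC(?P<content>[ACGT]+?)GAAGAG(?P<right>[ACGT]{19})"
    + RIGHT_FLANK
)


def verified_barcode_pairs(reads, designed):
    """Return {(left_barcode, right_barcode): (content, duplicate_count)}.

    reads: iterable of read sequences (strings).
    designed: set of designed (error-free) content sequences.
    """
    contents_by_pair = defaultdict(Counter)
    for seq in reads:
        match = READ_RE.search(seq)           # steps (i) and (ii)
        if match is None:
            continue
        if match["content"] not in designed:  # step (iii)
            continue
        contents_by_pair[(match["left"], match["right"])][match["content"]] += 1

    verified = {}
    for pair, contents in contents_by_pair.items():
        if len(contents) == 1:                # step (iv), assumed threshold
            (content, n_dup), = contents.items()
            verified[pair] = (content, n_dup)
    return verified
```

Because error-containing contents are filtered before redundancy is checked, a barcode pair shared by one error-free and one erroneous molecule would pass this sketch; checking redundancy first would be the stricter design.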
CBT-labeled DNA library preparation for Illumina sequencing-based experiment
To use the CBT-based labeling method on the Illumina sequencer, a further set of microarray oligonucleotides, consisting of the cgc50 pool and the CBT library, was redesigned and synthesized to be compatible with the Illumina platform (Supplementary Table S3). The cgc50 pool was then labeled with the CBT library using the same protocol as in the 454-based experiment, with the primers listed in Supplementary Tables S3 and S4. The product was sequenced on the Illumina MiSeq instrument and analyzed, and the NGS-replica pool was obtained from the flow cell.
Retrieval and validation of target DNA fragments for 454-, Illumina-, Ion Proton-based experiment
Primers were prepared by solid-phase oligonucleotide synthesis (Macrogen, Seoul). For retrieval from the random barcode labeled library, considering the low Tm of the 19-bp barcode sequence (Supplementary Figure S4 and Supplementary Tables S5–S8), 3-bp common sequences (e.g. ‘CTC’) were added to the retrieval primers. Then, each primer pair was mixed with 1 μl template (0.1–1 ng per retrieval), 1 μl forward tag primer, 1 μl reverse tag primer, 7 μl dH2O, and 10 μl KAPA HiFi polymerase. The retrieval reaction was performed under the following conditions: 5 min at 95°C; 30 cycles of 30 s at 95°C, 30 s at 60°C and 30 s at 72°C; and 10 min at 72°C. The products were electrophoresed on a 2% agarose gel, and correct bands were size-selected and purified with a MinElute Gel Extraction Kit (Qiagen). If the product band was not clear on the gel, five additional cycles were tested to sharpen it, with the caveat that non-target amplicons could be sharpened as well. The retrieval products were validated by Sanger sequencing (Macrogen). A portion of the PCR products was purified using AMPure XP beads without size-selection, mixed, and subjected to NGS to evaluate the target selectivity of dial-out retrieval.
Then, primer and flanking sequences were trimmed, and the data were aligned to the targeted sequence using Novoalign software. Aligned reads with a mapping quality score of <20 were discarded, and reads that did not result from the retrieval primer (i.e. not containing the primer sequence) were filtered out. The ratio of reads that did not contain a retrieved target was calculated from the cleaned reads.
Construction of synthetic gene libraries and generation of NGS-replica pool
KRAS and GFP were selected as full-assembly targets and were treated as the counterpart of the sub-assembled product in Hiatt et al. (21). Each gene was designed as 60-bp-long building blocks, synthesized by solid-phase oligonucleotide synthesis (Macrogen, Seoul; all sequences related to this experiment are listed in Supplementary Table S9) and pooled per gene. 10 μl of each oligo pool (0.1 μM per oligo) and 10 μl KAPA HiFi polymerase were mixed, and assembly PCR was performed under the following conditions: 5 min at 95°C; 20 cycles of 30 s at 95°C, 30 s at 60°C and 30 s at 72°C; and 10 min at 72°C. The products were size-selected and amplified with 20 cycles of PCR. The assembled products were purified with Agencourt AMPure XP beads. Subsequently, 24-nt degenerate barcodes (‘NNNNANNNNTNNNNANNNNTNNNN’) were attached to both ends of each gene by PCR. 1 μl assembled product, 1 μl NNN_fwd for each gene, 1 μl NNN_rev for each gene, 7 μl dH2O, and 10 μl KAPA HiFi polymerase were mixed and amplified under the following conditions: 5 min at 95°C; 8 cycles of 30 s at 95°C, 30 s at 60°C and 30 s at 72°C; and 10 min at 72°C. The products were purified with Agencourt AMPure XP beads, prepared as an Illumina NGS library using the standard protocols of the SPARK™ DNA Sample Prep Kit, and sequenced on the Illumina MiSeq instrument. Then, the NGS-replica pool of each gene was obtained, and the barcode pair information was extracted and saved as an index file.
Preparation of sheared sequencing library using NGS-replica pool for tag-directed assembly
200 ng of the NGS-replica pool was sheared into 150–450 bp fragments using an M220 focused ultrasonicator™. The product was prepared as an Illumina NGS library using standard protocols of the SPARK™ DNA Sample Prep Kit and sequenced on the Illumina HiSeq platform.
Sequence verification of synthetic gene libraries using tag-directed assembly
From the NGS data, barcode and flanking sequences were trimmed, and the reads were aligned to the reference of each gene using Novoalign software and optionally downsampled using Picard (version 1.128). The barcodes were re-attached to the aligned data. To assemble the whole consensus sequence from the shotgun data, the barcodes were mated based on the index file. Contigs were built from breakpoint reads (Figure 4A) and merged into a consensus whose sequence was determined by selecting the major base at each position across the contigs. If indels existed in a contig, ‘I’ and ‘D’ symbols were used in place of a base. Among these, error-free consensus sequences were selected, retrieved using dial-out PCR and validated by Sanger sequencing. To assess accuracy, error-containing consensus sequences were tested at the same time.
RESULTS
General experimental scheme
We formulated the above-described method based on the following procedures: (i) utilizing a pool of DNA molecules containing sheared human genomic DNA or oligonucleotides cleaved from a microarray, (ii) performing NGS of the DNA pool, (iii) generating the NGS-replica pool via in situ replication of sequenced DNA from the NGS platform and (iv) retrieving the desired DNA from the NGS-replica pool via PCR amplification (Figure 1). We applied this method to 454 GS Junior, Illumina MiSeq, and Ion Proton sequencing. Based on the sequencing capacity of each flow cell, CBT-based or degenerate barcode-based labeling was used to cover the entire population of the NGS-replica pool.
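The consensus step of the tag-directed assembly procedure in the Methods (selecting the major base at each position across contigs, with ‘I’ and ‘D’ standing in for indels) can be illustrated with a small majority-vote sketch. The input representation is an assumption: each contig is given as a start coordinate on the reference plus a string over the symbols A, C, G, T, I and D, already placed in reference coordinates. Marking uncovered positions with ‘N’ is also a choice made for this example, not something stated in the text.

```python
from collections import Counter


def majority_consensus(contigs, length):
    """Merge aligned contigs into one consensus by per-position majority vote.

    contigs: iterable of (start, symbols) pairs in reference coordinates,
             where symbols uses A/C/G/T plus 'I' and 'D' for indels.
    length:  length of the reference construct.
    Positions without coverage are reported as 'N' (assumption).
    """
    columns = [Counter() for _ in range(length)]
    for start, symbols in contigs:
        for offset, symbol in enumerate(symbols):
            pos = start + offset
            if 0 <= pos < length:
                columns[pos][symbol] += 1
    # most_common(1) breaks ties by first-seen order, which is arbitrary here.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


def is_error_free(consensus, reference):
    """A consensus is error-free only if it matches the reference exactly."""
    return consensus == reference


if __name__ == "__main__":
    contigs = [(0, "ACGT"), (2, "GTAC"), (2, "GAAC")]
    cons = majority_consensus(contigs, 6)
    print(cons, is_error_free(cons, "ACGTAC"))
```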
Simulation for predicting success of 454 sequencing-based target DNA retrieval
Before the experiment, we performed a simulation to test whether CBT-based labeling is successful when approximately 2000 × 2000 barcode pairs are used to tag substrates. Based on the Monte Carlo method, we randomly labeled a pre-NGS pool composed of 100 million unique molecules, or an NGS-replica pool containing ∼100 000 DNA clones from 454 Junior sequencing, with 2000 × 2000 barcode pairs (Supplementary Figure S3A). A unique barcode combination means that a single barcode pair labels one substrate exclusively. According to the simulation results, no unique barcode combinations existed in the pre-NGS pool (Supplementary Figure S3B). At least four substrates were redundantly labeled with one CBT, and, in most cases, 10–40 substrates shared one CBT. This trend could explain why our previous dial-out PCR method, which used CBT primers to amplify specific DNA molecules from a pre-NGS pool, was unsuccessful. In contrast, 98.9% of barcode combinations were identified as unique barcodes in the NGS-replica pool simulation.
454 sequencing-based target DNA retrieval
Based on the simulation results, we prepared 2133 × 2133 CBT pairs (see Materials and Methods) for labeling the target DNA library. A human genomic DNA library was prepared as a target and labeled with CBT pairs. We chose sheared genomic DNA as a target library model because it is a highly complex pool of DNA molecules that could help us evaluate the utility of our retrieval strategy. The library (pre-NGS pool) was subjected to sequence verification with a 454 GS Junior sequencer. After aligning the reads to CBT pair sequences and to the reference human genome, 23 308 reads were identified as candidates for retrieval using unique CBT pairs (Figure 2A). All the sequence-verified fragments were labeled uniformly with a low duplication bias (Figure 2B).
In this case, we chose 48 retrieval targets from the retrieval candidates containing no more than 3-bp homopolymers to avoid false discovery caused by homopolymer sequencing errors (22) in the 454 sequencing results. However, we note that homopolymer sequencing errors need not be considered when microarray-synthesized oligonucleotides are used as targets, because only targets that exactly match the desired sequence are selected.
Figure 2. 454 sequencing-based retrieval of target sequences. (A) Distribution of reads, (B) duplicate distribution, and (C) comparison of retrieval yields between pre-NGS and NGS-replica pools using the 454 GS Junior are shown (Red arrows denote non-specific products). (D, E) Plot for selectivity of each retrieved target from NGS-replica pool (D) and pre-NGS pool (E).
To construct an NGS-replica pool from the 454 sequencing flow cell, the picotiter plate was assembled with an in-house gasket (Figure 1B) and filled with PCR mixture without leakage before the replication reaction was performed. We obtained the NGS-replica pool from the picotiter plate. Next, the retrieval process was carried out targeting 48 randomly selected loci of the genome. As a result, 48 targets were retrieved from the NGS-replica pool, whereas none of the targets were selectively amplified from the pre-NGS pool based on agarose gel imaging (Figure 2C; Supplementary Table S5). All bands of retrieval products from the pre-NGS pool appeared smeared.
Although some off-target bands were observed in products from the NGS-replica pool, target bands were sharper than off-target bands except for the primer dimer. To assess the selectivity of retrieval reactions, the products, except for seven short or low-yield targets, were mixed and validated using NGS. Although no difference was observed between the seven excluded targets and the other targets in Sanger sequencing, we excluded them to avoid the drop in NGS quality caused by short library fragments and low input concentration. We expected that the loss of the seven targets would not affect the general trends. According to the NGS data, 91% of contents were confirmed as desired targets retrieved from the NGS-replica pool (Figure 2D), whereas only 0.15% of contents were observed as the targets from the pre-NGS pool (Figure 2E). NGS results were also consistent with the Sanger sequencing and gel image results. Abundance and other properties of the primer did not affect the selectivity. The off-targets had proper flanking sequences on both sides; however, their contents aligned to another locus of the genome. We assumed that these off-targets were redundantly tagged contents or PCR errors (e.g. template switching). From this result, we demonstrated that the use of an NGS-replica pool helps to reduce byproducts of the dial-out PCR reaction.
Illumina- and Ion Proton- sequencing based target DNA retrieval
To apply our method to the Illumina MiSeq and Ion Proton platforms, a different target, labeling strategy, and replication method for obtaining the NGS-replica pool were used. The target library was designed to comprise 3742 oligonucleotides for the synthesis of 50 cancer-associated genes reported in the catalogue of the Cancer Gene Census and was termed the ‘cgc50 pool’.
Twenty-bp degenerate sequences were used for labeling the target library to allow almost unlimited dial-out PCR combinations, because 2133 × 2133 CBT pairs (4.5 million pairs) could not cover the throughput of the Illumina MiSeq (15 million reads) or Ion Proton (60 million reads) platforms. We estimated that a larger number of CBT pairs would be required for dealing with an NGS-replica pool from the Illumina MiSeq sequencer. Based on the simulation result, with 15 000 × 15 000 CBT pairs, 94.0% of barcode combinations would be unique CBT pairs in an NGS-replica pool containing 15 million DNA molecules. Then, 3742 labeled oligonucleotides were synthesized, cleaved from the microarray (Figure 1C), and amplified using PCR. Sequencing adaptors were attached to the library (pre-NGS pool), and Illumina MiSeq and Ion Proton sequencing were carried out. According to the results, read counts showed low bias and were independent of the melting temperatures of the primers and the GC contents of the target DNA (Figure 3 and Supplementary Figure S4). After analyzing the data, 48 error-free DNA fragments were chosen for each experiment.
Figure 3. Illumina sequencing-based retrieval of target sequences. (A) Distribution of reads, and (B) duplicate distribution are shown. (C) Plot of the number of barcode pairs for each error-free fragment sorted in descending order. Nearly all designed error-free oligonucleotides were covered in the MiSeq run.
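The capacity argument above can be checked with a quick back-of-the-envelope estimate. Assuming every molecule picks a CBT uniformly at random, the number of molecules per CBT is approximately Poisson-distributed with mean λ = (molecules) / (CBTs), and the fraction of observed CBTs that label exactly one molecule is λ·e^(−λ) / (1 − e^(−λ)). The sketch below implements this approximation; it is not the authors' Monte Carlo simulation, the function name is hypothetical, and the simulated percentages quoted in the text may differ from it depending on how uniqueness was defined.

```python
import math


def expected_unique_cbt_fraction(n_molecules, n_cbts):
    """Poisson estimate of the fraction of used CBTs labeling one molecule."""
    lam = n_molecules / n_cbts
    # P(count == 1) / P(count >= 1) for a Poisson variable with mean lam.
    return lam * math.exp(-lam) / -math.expm1(-lam)


if __name__ == "__main__":
    scenarios = {
        "454 replica pool, 2133 x 2133 CBTs": (100_000, 2133 ** 2),
        "MiSeq pool (15 M), 2133 x 2133 CBTs": (15_000_000, 2133 ** 2),
        "MiSeq pool (15 M), 15 000 x 15 000 CBTs": (15_000_000, 15_000 ** 2),
    }
    for name, (n_mol, n_cbt) in scenarios.items():
        frac = expected_unique_cbt_fraction(n_mol, n_cbt)
        print(f"{name}: {frac:.3f}")
```

For the 454 case the estimate comes out at about 0.99, in line with the simulation reported earlier, while loading 15 million molecules onto 4.5 million CBT pairs drives the unique fraction down sharply, which is the motivation for switching to degenerate barcodes.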
For in situ replication of the Illumina-sequenced pool, we injected the PCR mixture through the inlet of the MiSeq flow cell, sealed the inlet and outlet holes with an adhesive sealing film, and carried out replication of the entire DNA pool (Figure 1D). In the Ion Proton-based experiment, we collected the melted single-stranded DNA from the sequencing library, purified the DNA, and recovered it as double-stranded DNA via PCR (Figure 1E). The NGS-replica pool of each platform was thus obtained, and the targets were retrieved from the MiSeq-replica pool, the Ion Proton-replica pool and the pre-NGS pool of MiSeq. The pre-NGS pools of the two platforms were assumed to be almost the same, the only difference being the step for NGS adaptor attachment. Therefore, the pre-NGS pool of the Ion Proton was not examined. We observed that 47 targets were retrieved from the MiSeq-replica pool and all of the targets were retrieved from the Ion Proton-replica pool (Supplementary Figure S5A, B and Supplementary Tables S6 and S7). However, in contrast to the retrieval using the pre-NGS pool for 454 (retrieval yield of 0.15%), we noticed that ∼80% of targets (41 targets) were also retrieved from the pre-NGS library for the MiSeq platform. We assumed that the pairs of twenty-bp degenerate tags, which could have 4^40 possible combinations (septillion scale), labeled the molecules uniquely.
Comparison of target DNA retrieval efficiency among three sequencing methods
Specifications and retrieval performance of the three tested platforms are shown in Table 1. Based on information provided by the manufacturers, the three platforms exhibit differences in capacity and possible read length. In terms of retrieval performance, retrieval yields were over 98% with all three sequencing approaches when the NGS-replica pool was used.
Although the same substrate was used, the proportion of error-free fragments sequenced by the MiSeq and Ion Proton platforms differed; 42.1% error-free reads were identified in the MiSeq reads, whereas only 6.64% of fragments were evaluated as error-free from Ion Proton sequencing. We presume that this discrepancy arises from a characteristic weakness of the Ion Proton platform: additional sequencing errors (e.g. homopolymeric insertion or deletion errors) (23) can be introduced when sequencing repeated bases. However, sufficient throughput and in-house programming for strict selection of error-free fragments could resolve this problem of sequencing error. In summary, we demonstrated that our method can be selectively applied to three major sequencing platforms.
Table 1. Specifications and retrieval performance of the three tested sequencing platforms
Sequencing platform | GS Junior | MiSeq | Ion Proton
Retrieval barcode tagging method | CBT | Degenerate | Degenerate
Sequencing throughput (M reads) | 0.1 | 15 | 60
Possible read length (bp) | 400 | 300 × 2 (a) | 200
Barcoding capacity (M reads) | 4.5 | 4300 | 4300
Barcode identified fragments from NGS analysis (%) | 35 | 62 | 10
Error-free fragments (%) | N/A | 42 | 7
Error-free coverage (%) | N/A | 98 | 93
Retrieval yield (pre-NGS pool) (%) | 0 | 85.4 | -
Retrieval yield (NGS-replica pool) (%) | 98 | 98 | 100
Error-free validated proportion (%) | 100 | 100 | 100
(a) Paired-end sequencing strategy could be used in the Illumina MiSeq platform. Information regarding the throughput and possible read length is cited from the platforms’ manufacturers. Barcoding capacity represents the number of possible combinations of barcodes used in each experiment. The values of barcode identified fragments (%), error-free fragments (%) and error-free coverage (%) were determined from NGS data. The values of retrieval yield (pre-NGS pool) (%), retrieval yield (NGS-replica pool) (%) and error-free validated proportion (%) were determined from Sanger sequencing data. In the case of GS Junior sequencing, sheared genomic DNA was used as a substrate; therefore, error-free contents and error-free coverage were not evaluated. In the case of Ion Proton sequencing, the retrieval experiment from the pre-NGS pool was not performed, so its retrieval yield (pre-NGS pool) was not evaluated. Differences observed among retrieval yields (pre-NGS pool) and barcoding capacity of the sequencing platforms were due to differences in barcoding strategies. CBT, combinatorial barcode tag.
Adjusting population-control by mixing control library
Although the sequencing platform can be chosen freely, unique labeling remains difficult when the barcoding capacity of the CBT library is low, as in the previous case where 4.5 million CBT pairs could not cover 15 million clusters in a MiSeq flow cell. We expected that this capacity limit could be overcome by adjusting the population of the MiSeq replica pool to be smaller than the barcoding capacity, by tuning the ratio of a control library such as PhiX or another differently indexed sequencing library. As the ratio of the control library increases, the population ratio of the target library decreases as intended. To test this approach, a pre-NGS library of the cgc50 library labeled with 4.5 million pairs of CBTs was prepared, mixed with the control library so that the target library accounted for 20% of the total sequencing throughput, and applied to the MiSeq sequencing-based protocol. As a result, we obtained an NGS-replica pool with a population of 3 million clones (20% of 15 million). From the NGS-replica pool, a total of 43 targets were retrieved and verified by NGS. The selectivity from the NGS-replica pool was 19.9% (Supplementary Figure S7). Although this is lower than in the 454-based experiment, retrieval selectivity from the NGS-replica pool was much higher than that from the pre-NGS pool, which was 0.07%. We presumed that this decreased selectivity was caused by redundant CBTs that escaped the data filtering procedure and were misidentified as unique CBTs. This presumption can also explain the tendency toward lower selectivity with MiSeq than with 454 sequencing. Additional CBT escapees could have occurred in MiSeq because more redundant CBTs exist in 3 million clones from MiSeq than in 0.07 million clones from the 454-based experiment.
As shown in the simulation (Supplementary Figure S8), 66% of the combinations were identified as unique barcodes in the MiSeq-based simulation, whereas 98.9% were unique in the 454-based simulation. We expected that the use of additional CBTs or adjusting the ratio to less than 20% could improve the selectivity of retrieval. To investigate the selectivity at lower ratios, we first performed a simulation using various fractions of the MiSeq throughput (0.1%, 1%, 2.5%, 5%, 7.5%, 10%, 12.5%, 15%, 17.5% and 20%). We found a decrease in the unique CBT ratio as the sequencing population increased (Supplementary Figure S9A). Although a simulation result does not exactly match a selectivity result, we did find a negative correlation between the sequencing population and the ratio of unique CBTs. Based on the simulation results, a ratio of 5% of the sequencing population was examined. Sixteen targets were then retrieved and sequenced using NextSeq. We found 35% selectivity (on average) using the NGS-replica pool; 0.14% selectivity was observed using the pre-NGS pool (Supplementary Figure S9B and C). The selectivity was still lower than that of the 454-based experiment. However, it was about 15 percentage points higher than at the 20% ratio. Use of a greater number of CBTs would also improve the selectivity.
Tag-directed assembly using NGS-replica library
We expected that our method could be broadly utilized for other synthetic methods that require population-size control. Tag-directed assembly (21) is one example. In contrast to dial-out PCR, a sub-assembled DNA product longer than the usual sequencing read is used as a building block for total assembly. To enable evaluation of the sub-assembled product, barcoding, shearing, NGS, and de novo assembly are performed in the tag-directed assembly method. During the process, serial dilution is performed for population control (21).
Instead of using serial dilution, however, we adapted our method to control the DNA population, and the NGS-replica pool was prepared as a shotgun sequencing library (Figure 4A).

Figure 4. Application to tag-directed assembly. (A) Schematic flow of the population-controlled tag-directed assembly method. (B) Distribution of coverage of each assembly (upper: KRAS and lower: GFP). (C) Distribution of relative depth according to the coordinate along the sequence.

To simplify the application, two constructs with the coding sequences of the KRAS and GFP genes (570 and 810 bp, respectively) were selected as targets. We assembled these genes using multiple overlapping oligos. The products were subsequently labeled with a degenerate barcode containing 20 random bases and were sequenced on a MiSeq instrument to obtain the NGS-replica pool. The barcode pair information obtained from paired-end sequencing data was used as indexes for shotgun data assembly for the secondary MiSeq. We randomly sheared the NGS-replica pools, shotgun sequenced them in HiSeq, and performed in silico assembly for each barcode pair. As a result, 69.8% of the tag pairs identified in MiSeq were found in shotgun sequencing (65.7% for KRAS and 73.8% for GFP). Among them, 30.5% of KRAS pairs and 9.8% of GFP pairs were fully reconstructed using in silico assembly (Figure 4B and Supplementary Table S10). In total, 17.3% of KRAS assemblies and 2.46% of GFP assemblies were identified as error-free. We assumed that the difference in yield arose because the longer construct requires additional contigs to cover its poorly covered central region.
We observed that a minimum of 8 and 20 contigs were needed to assemble the full constructs of KRAS and GFP, respectively, and that most of the contigs were distributed at both ends of the gene (Figure 4C). Among the consensus sequences, 20 error-free and 15 error-containing targets for KRAS, and 1 error-free and 1 error-containing target for GFP, whose retrieval tags had an appropriate melting temperature (55 °C < Tm < 65 °C), were retrieved and validated by Sanger sequencing. Most retrieval products (34 of 37 products) exactly matched the analyzed sequence, including indel and substitution errors (Supplementary Table S11). However, unexpected heterozygous substitution errors were observed in two cases and large deletions were observed in two cases (in one case, both errors were observed simultaneously; Supplementary Figure S10). We assumed that the heterozygous substitution errors might have been introduced during PCR amplification, because the probability that two different molecules carry the same molecular tag pair is very low (ca. 10⁻²⁴). On the other hand, the large deletion errors may be accounted for by misalignment of breakpoint reads. We expect that these unusual errors could be minimized by increasing the sequencing depth.

DISCUSSION

In this study, we developed a method for in situ replication of DNA from various NGS platforms that achieves highly efficient target DNA retrieval. We also overcame the major limitation of previous dial-out PCR methods: the unavailability of cost-effective CBT-based labeling owing to the discrepancy between the DNA in the pre-NGS library and the sequence information generated by NGS. In our 454 GS Junior-based experiment, we introduced the CBT-based labeling method and showed an increase in the selectivity of the dial-out PCR reaction when using the NGS-replica pool. This observation indicates that the primer preparation cost could be reduced to approximately its square root.
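The square-root saving can be made concrete with a small count. The sketch assumes that individually designed tags need one forward and one reverse primer per target, whereas combinatorial tags need only enough forward and reverse primers for their pairings to cover all targets. Both function names are hypothetical and the model ignores any extra barcodes needed to keep labels unique in practice.

```python
import math

def primers_individual(n_targets: int) -> int:
    """One dedicated forward and one dedicated reverse primer per target."""
    return 2 * n_targets

def primers_combinatorial(n_targets: int) -> int:
    """Smallest k forward plus k reverse primers such that k * k
    pairings cover n_targets (k = ceil(sqrt(n_targets)))."""
    if n_targets < 1:
        raise ValueError("n_targets must be at least 1")
    per_side = math.isqrt(n_targets - 1) + 1  # integer ceil(sqrt(n))
    return 2 * per_side

for n in (10_000, 4_500_000, 225_000_000):
    print(n, primers_individual(n), primers_combinatorial(n))
```

The primer count grows linearly with the number of targets in the first case and with its square root in the second, which is the scaling the cost argument relies on.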
Also, if forward and reverse CBT primers were stored in premade plates (usually 10 nmol per synthesis), the primers could be reused 1000 times (10 pmol per retrieval), meaning that the CBT-based labeling method would be 1000-fold less expensive (∼0.002 USD per primer). However, to use CBTs more effectively, specificity should be enhanced for large-scale retrieval. Although our result was encouraging, background amplification could not be ignored in some retrieval reactions performed in the Illumina-CBT-based experiment. We expect that it can be avoided analytically by stringent filtering of CBT pairs. For example, rechecking unique primers using a short-read aligner while considering Hamming distances could help to catch redundant CBT pairs that would otherwise be overlooked. Also, we could improve the selectivity by designing additional barcodes as illustrated above (i.e. 15 000 × 15 000 CBT pairs would yield 94.0% of barcode combinations as unique CBT pairs in an NGS-replica pool containing 15 million DNA molecules). Despite this expandability, the cost of synthesizing 15 000 × 15 000 CBT primers remains high. However, we propose that this obstacle could be overcome by introducing additional tags adjacent to the two flanking CBT sequences (Supplementary Figure S11). We showed that our method is compatible with Illumina and Ion Proton sequencing. Although degenerate barcodes were used as tags instead of CBTs in the first attempt because of the larger capacity of these NGS platforms, we found another advantage of our method even in these cases: the retrieval yield from the MiSeq NGS-replica pool was higher than the yield from its corresponding pre-NGS pool. However, some drop-outs were observed, that is, targets that were analyzed but not retrieved. We tried to find the reason for these drop-outs by investigating secondary structure and potential interactions between primers, but could not clearly explain them.
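The Hamming-distance recheck of CBT tags suggested above can be sketched as follows. This is a brute-force illustration under stated assumptions: equal-length tags, a hypothetical threshold `min_distance`, and an all-against-all comparison in place of the short-read aligner mentioned in the text, which would be needed for realistic tag counts.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))

def ambiguous_tags(tags, min_distance=2):
    """Return the set of tags lying closer than min_distance to some
    other tag, i.e. tags a single error could turn into one another."""
    flagged = set()
    for a, b in combinations(tags, 2):
        if hamming(a, b) < min_distance:
            flagged.update((a, b))
    return flagged

# Toy tags, invented for illustration only.
tags = ["ACGT", "ACGA", "TTTT"]
print(ambiguous_tags(tags))
```

Tags flagged this way would be excluded before primers are chosen; the appropriate threshold depends on the error profile of the platform and is not specified in the text.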
We also showed that CBT-based labeling could be used in the MiSeq-based experiment by adjusting the ratio of the library. Retrieval was performed from 3 million clones, a number equivalent to 30 times that of the GS Junior, and we observed enhanced selectivity compared with the pre-NGS pool. However, the selectivity was lower than in the 454-based experiment because the mixing ratio was not optimized. Further optimization of the experimental procedure should be studied to reduce the drop-outs and background amplification. Although drop-outs and polymerase errors were rarely observed in this study, only a small proportion of the pool was examined. Investigation of the replica pool using repeated experiments and NGS will help to understand the effects of drop-outs and background amplification. Estimation of the appropriate CBT complexity and sequencing population by simulation, or calculation using a general equation (see Supplementary Note 1), are also useful approaches. Exact prediction of selectivity is impossible because of the effects of systematic errors (e.g. PCR bias and template switching error). However, we found that the selectivity increased when we adjusted the population based on the simulation. Sequencing errors could also affect the uniqueness during analysis. However, we think that the effect of sequencing error is negligible because we removed error-containing targets, regardless of whether the error arose from sequencing or synthesis. An error-containing fragment could be misidentified as error-free because of a sequencing error, but the probability is very low. Considering the synthesis error rate (usually one error per 200 bases), the sequencing error rate (0.1% for the Illumina platform and 1% for the Ion Torrent and 454 platforms), and the probability of these errors occurring at the same position, we assumed that the probability would be <0.01%. Despite these improvements, the retrieval reaction is still a laborious process.
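The <0.01% estimate above can be checked with a short calculation. The sketch takes one possible reading of the argument: a synthesis error is masked only if a sequencing error falls on the same base, and the two events are treated as independent. The function name is hypothetical, and the result is an upper bound because the miscall would also have to produce the expected base.

```python
def masking_probability(synthesis_error_rate: float,
                        sequencing_error_rate: float) -> float:
    """Per-base probability that a synthesis error and a sequencing
    error coincide, assuming independence (an assumption of this
    sketch, not a statement from the paper)."""
    return synthesis_error_rate * sequencing_error_rate

SYNTHESIS_ERROR = 1 / 200          # one error per 200 bases, from the text
PLATFORM_ERROR = {
    "Illumina": 0.001,             # 0.1%, from the text
    "Ion Torrent / 454": 0.01,     # 1%, from the text
}

for platform, rate in PLATFORM_ERROR.items():
    p = masking_probability(SYNTHESIS_ERROR, rate)
    print(f"{platform}: {p:.2e} per base ({p * 100:.4f}%)")
```

Both products fall below 0.01% (1e-4), consistent with the bound stated in the text.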
As the target number increases beyond hundreds of thousands, the amount of replica pool will not be sufficient for retrieving all targets. Additional PCR can amplify the template, but additional errors will be introduced. Therefore, developing a high-throughput retrieval procedure is still desirable, and target capture methods utilizing hybridization probes (24) or molecular inversion probes (25) could be a solution to this limitation. In contrast to PCR-based methods, target capture strategies are capable of accurate target enrichment from a heterogeneous pool via a simple reaction. Although these methods sometimes exhibit off-target capture, they can be utilized when a pool of desired targets is required for retrieval by a high-throughput approach. Additionally, an instrument-based DNA retrieval method such as Sniper Cloning, which enables retrieval of thousands of targets in a few hours, could be a labor-saving solution if the optical laser were modified to operate on MiSeq or HiSeq platforms. However, it is not yet possible to adapt this approach efficiently to the Illumina system. We also showed that our system could be applied to tag-directed assembly for controlling the population size. As mentioned, the ability to reduce the population size was improved with our method compared with serial dilution. However, some caution is needed when designing a target. First, assembly or retrieval of a high-GC target, such as the CDKN2A gene, whose GC content is 70%, is difficult because of secondary structure (Supplementary Figure S12). Although an alternative protocol using a high-GC buffer could help the assembly PCR, more PCR errors would be introduced. Second, the length of the assembled product is restricted by the cluster formation limit of the NGS flow cell (the limit is almost 1.5 kb on Illumina devices) for generating the NGS-replica pool.
Although the length of the product is limited, among the 51 710 CDSs of genes reported in the RefSeq database, 19 281 CDSs were shorter than 1000 nt and 30 951 CDSs were shorter than 1500 nt. This indicates that almost 60% of genes could be synthesized using our method. We expect that this method could be used for retrieval of assembled genes for high-throughput synthesis of gene libraries. In summary, we have developed a method for reducing complex libraries and efficiently retrieving desired DNA sequences. Notably, we used the NGS flow cell and melt-off DNA, which is generally discarded after sequencing, as a source of the NGS-replica pool and introduced improvements in the target retrieval yield. Moreover, we demonstrated that CBT-based labeling is suitable for our method and provides a more cost-effective PCR-based alternative, as pre-designed primers can be used for more rapid retrieval of target sequences, even from a complex DNA pool.

DATA AVAILABILITY

Our sequencing data are available at the NCBI Sequence Read Archive (accession number SRP124419).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online.

ACKNOWLEDGEMENTS

We thank members of the Duhee Bang and Ji Hyun Lee laboratories for their critical comments.

FUNDING

Pioneer Research Center Program [NRF-2012-0009557]; Mid-career Researcher Program [2015R1A2A1A10055972]; Bio & Medical Technology Development Program [NRF-2016M3A9B6948494], Basic Science Research Program [NRF-2015R1A2A2A03006577] through National Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning. Funding for open access charge: Mid-career Researcher Program [2015R1A2A1A10055972] through National Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning.

Conflict of interest statement. D.B., H.L., N.C., S.P., H.K. and H.H.
are authors of a patent application for the method described in this paper (METHODS FOR RETRIEVING SEQUENCE-VERIFIED NUCLEIC ACID FRAGMENTS AND APPARATUSES FOR AMPLIFYING SEQUENCE VERIFIED NUCLEIC ACID FRAGMENTS (14/975873), Method of collecting nucleic acid fragments separated from the sequencing process (10-1648252) and Method of collecting sequence-verified nucleic acid fragments and the equipment for amplifying sequence-verified nucleic acid fragments (10-1576709)). The remaining authors declare no competing financial interest.

REFERENCES

1. Caruthers M.H. Gene synthesis machines: DNA chemistry and its uses. Science. 1985; 230: 281–285.
2. Holton T.A., Graham M.W. A simple and efficient method for direct cloning of PCR products using ddT-tailed vectors. Nucleic Acids Res. 1991; 19: 1156.
3. Sanger F., Nicklen S., Coulson A.R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 1977; 74: 5463–5467.
4. Borovkov A.Y., Loskutov A.V., Robida M.D., Day K.M., Cano J.A., Le Olson T., Patel H., Brown K., Hunter P.D., Sykes K.F. High-quality gene assembly directly from unpurified mixtures of microarray-synthesized oligonucleotides. Nucleic Acids Res. 2010; 38: e180.
5. Kim H., Jeong J., Bang D. Hierarchical gene synthesis using DNA microchip oligonucleotides. J. Biotechnol. 2011; 151: 319–324.
6. Kosuri S., Eroshenko N., LeProust E.M., Super M., Way J., Li J.B., Church G.M. Scalable gene synthesis by selective amplification of DNA pools from high-fidelity microchips. Nat. Biotechnol. 2010; 28: 1295–1299.
7. Quan J., Saaem I., Tang N., Ma S., Negre N., Gong H., White K.P., Tian J. Parallel on-chip gene synthesis and application to optimization of protein expression. Nat. Biotechnol. 2011; 29: 449–452.
8. Baker M. Microarrays, megasynthesis. Nat. Methods. 2011; 8: 457–460.
9. Carr P.A., Park J.S., Lee Y.J., Yu T., Zhang S., Jacobson J.M. Protein-mediated error correction for de novo DNA synthesis. Nucleic Acids Res. 2004; 32: e162.
10. Kim H., Han H., Shin D., Bang D. A fluorescence selection method for accurate large-gene synthesis. ChemBioChem. 2010; 11: 2448–2452.
11. Linshiz G., Yehezkel T.B., Kaplan S., Gronau I., Ravid S., Adar R., Shapiro E. Recursive construction of perfect DNA molecules from imperfect oligonucleotides. Mol. Syst. Biol. 2008; 4: 191.
12. Saaem I., Ma S., Quan J., Tian J. Error correction of microchip synthesized genes using Surveyor nuclease. Nucleic Acids Res. 2012; 40: e23.
13. Matzas M., Stahler P.F., Kefer N., Siebelt N., Boisguerin V., Leonard J.T., Keller A., Stahler C.F., Haberle P., Gharizadeh B. et al. High-fidelity gene synthesis by retrieval of sequence-verified DNA identified using high-throughput pyrosequencing. Nat. Biotechnol. 2010; 28: 1291–1294.
14. Margulies M., Egholm M., Altman W.E., Attiya S., Bader J.S., Bemben L.A., Berka J., Braverman M.S., Chen Y.J., Chen Z. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005; 437: 376–380.
15. Lee H., Kim H., Kim S., Ryu T., Kim H., Bang D., Kwon S. A high-throughput optomechanical retrieval method for sequence-verified clonal DNA from the NGS platform. Nat. Commun. 2015; 6: 6073.
16. Bentley D.R., Balasubramanian S., Swerdlow H.P., Smith G.P., Milton J., Brown C.G., Hall K.P., Evers D.J., Barnes C.L., Bignell H.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008; 456: 53–59.
17. Kim H., Han H., Ahn J., Lee J., Cho N., Jang H., Kim H., Kwon S., Bang D. ‘Shotgun DNA synthesis’ for the high-throughput construction of large DNA molecules. Nucleic Acids Res. 2012; 40: e140.
18. Schwartz J.J., Lee C., Shendure J. Accurate gene synthesis with tag-directed retrieval of sequence-verified DNA molecules. Nat. Methods. 2012; 9: 913–915.
19. Klein J.C., Lajoie M.J., Schwartz J.J., Strauch E.M., Nelson J., Baker D., Shendure J. Multiplex pairwise assembly of array-derived DNA oligonucleotides. Nucleic Acids Res. 2016; 44: e43.
20. Shoemaker D.D., Lashkari D.A., Morris D., Mittmann M., Davis R.W. Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. Nat. Genet. 1996; 14: 450–456.
21. Hiatt J.B., Patwardhan R.P., Turner E.H., Lee C., Shendure J. Parallel, tag-directed assembly of locally derived short sequence reads. Nat. Methods. 2010; 7: 119–122.
22. Luo C., Tsementzi D., Kyrpides N., Read T., Konstantinidis K.T. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLoS One. 2012; 7: e30087.
23. Salipante S.J., Kawashima T., Rosenthal C., Hoogestraat D.R., Cummings L.A., Sengupta D.J., Harkins T.T., Cookson B.T., Hoffman N.G. Performance comparison of Illumina and ion torrent next-generation sequencing platforms for 16S rRNA-based bacterial community profiling. Appl. Environ. Microbiol. 2014; 80: 7583–7591.
24. Gnirke A., Melnikov A., Maguire J., Rogov P., LeProust E.M., Brockman W., Fennell T., Giannoukos G., Fisher S., Russ C. et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat. Biotechnol. 2009; 27: 182–189.
25. Yoon J.K., Ahn J., Kim H.S., Han S.M., Jang H., Lee M.G., Lee J.H., Bang D. microDuMIP: target-enrichment technique for microarray-based duplex molecular inversion probes. Nucleic Acids Res. 2015; 43: e28.

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Selective terminal methylation of a tRNA wobble base
Masuda, Isao; Takase, Ryuichi; Matsubara, Ryuma; Paulines, Mellie June; Gamper, Howard; Limbach, Patrick A; Hou, Ya-Ming
doi: 10.1093/nar/gky013
pmid: 29361055
Abstract Active tRNAs are extensively post-transcriptionally modified, particularly at the wobble position 34 and the position 37 on the 3′-side of the anticodon. The 5-carboxy-methoxy modification of U34 (cmo5U34) is present in Gram-negative tRNAs for six amino acids (Ala, Ser, Pro, Thr, Leu and Val), four of which (Ala, Ser, Pro and Thr) have a terminal methyl group to form 5-methoxy-carbonyl-methoxy-uridine (mcmo5U34) for higher reading-frame accuracy. The molecular basis for the selective terminal methylation is not understood. Many cmo5U34-tRNAs are essential for growth and cannot be substituted for mutational analysis. We show here that, with a novel genetic approach, we have created and isolated mutants of Escherichia coli tRNAPro and tRNAVal for analysis of the selective terminal methylation. We show that substitution of G35 in the anticodon of tRNAPro inactivates the terminal methylation, whereas introduction of G35 to tRNAVal confers it, indicating that G35 is a major determinant for the selectivity. We also show that, in tRNAPro, the terminal methylation at U34 is dependent on the primary m1G methylation at position 37 but not vice versa, indicating a hierarchical ranking of modifications between positions 34 and 37. We suggest that this hierarchy provides a mechanism to ensure top performance of a tRNA inside of cells. INTRODUCTION Transfer RNAs (tRNAs) are fundamental for translation of the genetic code. Although these nucleic acids are transcribed with only four nucleotides (A, C, G, U), the diversity of the four building blocks is substantially expanded by post-transcriptional modifications, which enhance the structure and activity of the L-shaped molecules in all living cells (1). Collectively, more than 100 different chemical moieties have been introduced to modify tRNAs in all three domains of life (2). Each chemical moiety is created after transcription by an enzyme or a pathway of enzymes. 
Most of these modifying enzymes target position 34 of tRNA, the wobble position of the anticodon, or position 37, on the 3′-side of the anticodon, to produce chemical moieties that improve the quality of decoding. Although the cellular genomic space dedicated to genes for tRNA modification enzymes is large, few of these enzymes are understood at the mechanistic level. A critical barrier to progress is the lack of the ability to isolate tRNA molecules to study modifications, particularly when a tRNA is required for growth and cannot be readily changed with mutations for enzymatic analysis. Another critical barrier is the insufficient understanding of the inter-dependence between positions 34 and 37 when each harbors a distinct modification. Little is known whether the two modifications are independent of each other or whether one determines the activity of the other. Overcoming these critical barriers is an important step forward to address the biology of tRNA in living cells. We focus on the cmo5 modification of the U34 wobble base in Gram-negative tRNAs (3), which is associated with isoacceptors specific for six amino acids: Ala, Ser, Pro, Thr, Leu and Val (4,5). While the unmodified U34 can read all four nucleotides in bacteria (6–10), the cmo5 modification improves the quality of reading each (11–15) while the additional s2 modification restricts the reading specificity to A, G and less so U (16). The cmo5U modification is synthesized via multiple enzymatic reactions (Figure 1A). First, U34 is hydroxylated at the 5-position to form 5-hydroxyU (ho5U34) by an as-yet unknown enzyme (15). Second, ho5U34 is converted to cmo5U34 by the combined action of two S-adenosyl methionine (AdoMet)-dependent enzymes (17). Specifically, CmoA transfers the carbon dioxide of prephenate to the sulfonyl methyl of AdoMet to generate carboxy-S-AdoMet (Cx-AdoMet), and subsequently CmoB transfers the carboxy-methyl of Cx-AdoMet to the 5-hydroxy of ho5U34 to synthesize cmo5U34. 
Because prephenate is derived from chorismate, biosynthesis of cmo5U is linked through chorismate to biosynthesis of aromatic amino acids and vitamins (18,19). Interestingly, the carboxyl of cmo5U34 in four of the six families of tRNA isoacceptors (Ala, Ser, Pro, Thr) is further methylated to mcmo5U34 (Figure 1A) in a terminal methyl transfer reaction catalyzed by CmoM (5). The presence of the methyl ester in mcmo5U34 further ensures reading frame accuracy by suppressing +1-frameshifts (5). However, the selectivity for four out of six families of tRNAs by CmoM is not understood at the molecular level, thus limiting our insight into the cmo5U34 pathway.

Figure 1. The maturation of mcmo5U34 in E. coli tRNA. (A) Chemical structures in the pathway of modifying U34 to ho5U34 to cmo5U34 and to mcmo5U34 catalyzed by enzymes. The symbol ‘?’ indicates unknown. (B, C) Sequence and cloverleaf structure of E. coli UGG tRNAPro and GGG tRNAPro encoded by the proM and proL genes, respectively. The wobble base in each is shown in red and the G34U mutation in the proL gene to generate a chimeric tRNA is indicated by an arrow. (D) Growth analysis of E. coli JM109 cmoM-KO cells, showing that expression of the maintenance chimeric tRNAPro in proM-KO cells was sufficient to promote growth (green), albeit slower relative to cells containing chromosomal proM (black), while growth of cells maintained by the chimeric tRNAVal (blue) was only slightly slower relative to cells containing valUXYTZ genes on the chromosome (black). Growth of cells containing all tRNA genes (proM+, valUXYTZ+, black), or lacking valUXYTZ and maintained by G34U-valW (blue), or lacking proM and maintained by G34U-proL (green) are shown. Errors are standard deviation (SD), n = 3.

The four tRNA families with the terminal methyl group in mcmo5U34 share G35 in the middle position of the anticodon, whereas the two without the methyl group share A35 in common (Supplementary Figure S1). This raises the hypothesis that G35 is a major determinant for recognition by CmoM. In the crystal structure of the ribosomal 30S subunit in complex with an A-site-bound anticodon–stem–loop of tRNAVal, which lacks the terminal methylation, the carboxylate of the unmethylated cmo5U34 in tRNAVal interacts with the N6 of the neighboring base A35 to stabilize the wobble pairing with all four bases at the third position of the codon (14). If A35 is replaced with G35, the N6 is replaced with an O6, which would block the interaction. The addition of a methyl ester to the cmo5 moiety in the form of mcmo5U34 would neutralize the repulsion between the O6 and the negative charge of the carboxylate, thus providing a basis for the hypothesis that the terminal methylation is necessary for stabilizing the pairing with anticodons that contain G35.
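The G35 hypothesis can be restated as a simple rule over the anticodon. The sketch below is only an illustration of that rule: the function name is hypothetical, anticodons are written 5′ to 3′ so that the three letters correspond to positions 34, 35 and 36, and the rule ignores every other determinant of CmoM recognition, including the state of position 37.

```python
def predicted_terminal_methylation(anticodon: str) -> bool:
    """Hypothesis-level prediction of cmo5U34 -> mcmo5U34 conversion:
    requires U at position 34 (the base carrying cmo5) and G at
    position 35 (the proposed determinant for CmoM)."""
    if len(anticodon) != 3:
        raise ValueError("anticodon must have exactly three bases")
    base34, base35 = anticodon[0], anticodon[1]
    return base34 == "U" and base35 == "G"

# Anticodons named in this paper: UGG (tRNAPro), UAC (tRNAVal),
# and their position-35 variants UAG and UGC.
for name, anticodon in [("tRNAPro UGG", "UGG"), ("tRNAPro UAG", "UAG"),
                        ("tRNAVal UAC", "UAC"), ("tRNAVal UGC", "UGC")]:
    print(name, predicted_terminal_methylation(anticodon))
```

Whether the rule holds experimentally is exactly what the mutational analysis in this paper sets out to test.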
However, the hypothesis has not been tested or validated, for at least two reasons. First, the substrate for CmoM, cmo5U34-tRNA, must be isolated from cells in a cmoM-knockout (cmoM-KO) background. An in vitro reconstitution of the cmo5U34 substrate from a tRNA transcript is not yet possible due to the lack of information on the enzyme that synthesizes ho5U34. Second, tRNA isoacceptors harboring cmo5U34 are usually essential for cell growth (e.g. those for Ser, Leu, Pro and Thr) and cannot be readily made with a site-specific substitution inside of cells without compromising cell survival. We show here the development of a genetic approach (Supplementary Figure S2) that enables isolation of mutants of Escherichia coli tRNAPro and tRNAVal with a site-specific substitution at position 35 of the anticodon. This approach is robust and amenable to generating sufficient quantities of tRNA for detailed biochemical and mechanistic studies. It is broadly applicable to isolating mutants of other tRNA species that are essential for growth. Using this approach, we show that substitution of G35 in the UGG isoacceptor of tRNAPro (UGG tRNAPro) abolishes the terminal methyl transfer by CmoM, while introduction of G35 to the UAC isoacceptor of tRNAVal (UAC tRNAVal) confers the activity, supporting the notion that G35 is a major determinant for tRNA recognition by CmoM. Additionally, using tRNAPro isolated by this approach, we explore the inter-dependence of the primary m1G methylation at position 37 present in this tRNA (20) and the terminal methylation at position 34. We show that the m1G37 methylation is important for the terminal methylation at position 34 but not vice versa, providing new insight into the ranking and ordering of the two modifications in the cellular production of a functional tRNA.

MATERIALS AND METHODS

tRNA plasmids

All primers for creating tRNA plasmids are listed (Supplementary Table S1).
The maintenance plasmid for expression of a chimeric tRNA was based on pACYC184. The gene for the G34U mutant of the isoacceptor GGG tRNAPro (G34U-proL, Supplementary Figure S3) with the lpp constitutive promoter and the rrnC transcription terminator sequence was isolated from a derivative of pGFIB (21,22) as a PvuII fragment and cloned into the EcoRV site of pACYC. The G34U mutation was created with Quikchange (Agilent) using primers GGGtoUGG-F and GGGtoUGG-R. The gene for the G34U mutant of the isoacceptor GAC tRNAVal (G34U-valW, Supplementary Figure S4) was created by hybridization of a pair of oligos GACtoUAC-F and GACtoUAC-R, inserted into the EcoRI and PstI sites of pGFIB, isolated as a PvuII fragment with the lpp promoter and rrnC terminator sequence, and cloned into the EcoRV site of pACYC184. The Bacillus subtilis tRNACys/GCA (BscysT) was amplified from pTFMA (23) using primers BscysT-F and BsCysT-R and cloned into pGFIB at EcoRI and PstI sites. The PvuII fragment encoding the gene, the lpp promoter, and the rrnC terminator, was sub-cloned to pACYC184 at the EcoRV site to generate a cysT-KO maintenance plasmid. The test plasmid for expression of a wild-type (WT) or a mutant tRNA was based on pKK223–3. The gene for UGG tRNAPro (encoded by proM) or UAC tRNAVal (encoded by identical genes valUXY and valTZ) was cloned into the EcoRI and PstI sites. Anticodon mutants were created by Quikchange with primers UGGtoUAG-F and UGGtoUAG-R for tRNAPro and primers UACtoUGC-F and UACtoUGC-R for tRNAVal. The amber suppressor mutant of E. coli cysT on pGFIB was made by Quikchange using primers GCAtoCUA-F and GCAtoCUA-R, which changed the anticodon from GCA to CUA. E. coli strains All of the E. coli strains in this study are listed (Supplementary Table S2). The cmoB-KO locus of E. coli cmoB-KO strain (from Dr. Steven Almo) was transferred to the published E. coli trmD-KO strain maintained by trm5 (24). 
After transfer, the kanamycin marker (kanR) of the cmoB-KO locus was removed by the flippase recombinase (FLP) encoded in pCP20 (25). To construct the vehicle for expression of a mutant version of UGG tRNAPro (Supplementary Figure S3), E. coli JM109 was used as the host and transduced with the P1 lysate of an E. coli cmoM-KO strain (Coli Genetic Stock Center, Yale University), and the kanR of the cmoM-KO locus was removed by FLP. The maintenance plasmid expressing the chimeric G34U-proL was introduced into the resulting JM109-cmoM-KO strain. The UGG tRNAPro gene was then deleted from the strain, using the λ Red recombinase of pKD46 (26) to catalyze homologous recombination at the proM locus with a proM-kanR PCR fragment amplified by primers proMKO-F and proMKO-R from pKD4. The kanR for selection of the resultant JM109-cmoM-KO-proM-KO strain was removed by FLP. The resultant strain was transformed with the test pKK223–3 plasmid with proM to generate the expression vehicle for a mutant version of proM. To construct the expression vehicle of a mutant form of UAC tRNAVal (Supplementary Figure S4), the maintenance plasmid carrying the chimeric G34U-valW was introduced into the E. coli JM109-cmoM-KO strain above. E. coli JM109-cmoM-KO carries five genes for tRNAVal (valUXY and valTZ, each encoding the same sequence) and was made into two derivatives, one with the valUXY cluster deleted (using primers valUXYKO-F and valUXYKO-R) and the other with the valTZ cluster deleted (using primers valTZKO-F and valTZKO-R). The kanR of the first resultant strain JM109-cmoM-KO-valUXY-KO was removed by FLP and the strain was transduced with the P1 lysate of the second resultant strain JM109-cmoM-KO-valTZ-KO to generate the JM109-cmoM-KO-valUXYTZ-KO strain. After removal of kanR by FLP, the resultant strain was transformed with the test pKK223–3 plasmid carrying the gene for UAC tRNAVal as a vehicle for expressing a mutant version of the tRNA. To construct the E.
coli tRNACys/GCA(cysT)-KO (Supplementary Figure S5), the kanR (amplified by PCR from pKD4) was used to target the cysT chromosomal locus in JM109 using primers cysTKO-F and cysTKO-R. After the resultant strain was introduced with a maintenance plasmid expressing BscysT, the cysT was removed by λ Red recombinase and the kanR removed by FLP. This cysT-KO locus, together with the maintenance plasmid for expressing BscysT and a pGFIB-derived test plasmid for expressing the amber-reading form of cysT, was transduced into an E. coli XAC-1 strain with an internal amber codon in the chromosomal lacZ (21). Growth assay Strain JM109-cmoM-KO-proM-KO maintained by chimeric G34U-proL, strain JM109-cmoM-KO-valUXYTZ-KO maintained by chimeric G34U-valW, and strain JM109-cmoM-KO-cysT-KO maintained by BscysT, were cultured in LB medium overnight at 37°C. Cells were then inoculated into 25 ml LB at 1:100 at 37°C and the OD600 was monitored over time using Tecan Infinite M200 Pro plate reader (Tecan). The growth assay for strain JM109-cmoM-KO-proM-KO maintained by chimeric G34U-proL, harboring either pKK223–3-G35A-proM or an empty pKK223–3, was performed similarly except that the overnight culture was inoculated to 30 ml fresh LB to OD = 0.04 and 0.4 mM IPTG was added at T = 2 h. Overnight cultures of the two strains were also serially diluted and spotted on LB plates with Amp and Chl and 0.4 mM IPTG and incubated at 37°C. X-gal assay The E. coli strain XAC-1 cysT-KO maintained by pACYC184-BscysT was transformed with the pGFIB-cysT plasmid either with GCA or CUA anticodon, and streaked and grown on an M9 plate containing 20 μg/ml 5-bromo-4-chloro-3-indolyl-β-d-galactopyranoside (X-gal) at 37°C overnight. The amber suppression activity was monitored as the color development by the β-galactosidase-catalyzed hydrolysis of X-gal. Recombinant proteins The E. coli cmoM gene was amplified from E. 
coli MG1655 genomic DNA using primers cmoM-F and cmoM-R and cloned into pET28 (Novagen) at NdeI and NotI sites. This pET28-cmoM was transformed into E. coli BL21(DE3) and over-expressed for 4 h at 37°C in LB containing 0.1 mM IPTG. The dual 6xHis-tagged recombinant CmoM was purified with HisLink Protein Purification Resin (Promega) as described previously (5). The recombinant E. coli TrmD was prepared as described (27–29). The recombinant E. coli prolyl-tRNA synthetase (ProRS) and leucyl-tRNA synthetase (LeuRS) were purified as described (30). Substrate tRNAs All substrate tRNAs for this study were expressed from the tac-controlled pKK223–3 plasmid and isolated from E. coli cells. Each tRNA was pulled down from total RNA by an affinity resin coupled with an oligonucleotide complementary to a sequence of the target tRNA (24,31,32). For tRNAVal, due to its partial sequence homology with UGC tRNAAla, two successive affinity purification steps were performed. The first used the affinity oligo Val-AP1 and the second used the affinity oligo Val-AP2 (Supplementary Table S1). The quality of UGG and UAG tRNAPro in cmo5U34- or mcmo5U34-modified state is shown as a single band in denaturing gel analysis (Supplementary Figure S6A), indicating homogeneity. WT tRNAPro and tRNAVal were each purified from the over-expression plasmid pKK223–3 driven by the tac promoter in cmoM-KO cells. To isolate UGG tRNAPro lacking m1G37 but containing a distinct state of the wobble modification, we used E. coli strain MG1655 trmD-KO as the basis (24). The trmD-KO strain was made into cmoB-KO or cmoM-KO by P1 transduction and the kanR was removed. Cells were then transformed with pKK223–3-proM and grown to saturation in an overnight culture with 0.2% (w/v) arabinose (Ara) to allow expression of trm5. Cells were then freshly diluted into Ara-free LB at 1:100 and grown for 3 h, followed by a second dilution into fresh Ara-free LB at 1:5 and growth for 2 h. 
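The two-step dilution described above can be checked with a short calculation. The following is only an illustrative sketch: it computes the cumulative culture dilution from the stated 1:100 and 1:5 steps and, under the simplifying assumption (not made in the text) that pre-existing m1G37-tRNA is neither newly synthesized nor degraded once trm5 expression is off and is only partitioned between daughter cells, estimates how many doublings would be needed to reach a given residual fraction. The function names are hypothetical.

```python
import math


def cumulative_dilution(*fold_dilutions):
    """Multiply successive fold dilutions (e.g. 100 for a 1:100 step)."""
    total = 1.0
    for fold in fold_dilutions:
        total *= fold
    return total


def doublings_to_reach(fraction_remaining):
    """Cell doublings needed for passive dilution to reach a residual fraction.

    Assumption (hypothetical model): no new synthesis and no degradation,
    so the per-cell level halves with every doubling.
    """
    return math.log2(1.0 / fraction_remaining)


total_fold = cumulative_dilution(100, 5)  # 1:100 followed by 1:5
needed = doublings_to_reach(0.05)         # residual level of 5%

print(f"cumulative culture dilution: 1:{total_fold:.0f}")
print(f"doublings needed to fall to 5%: {needed:.2f}")
```

Under this simplified model, a little more than four doublings during the two Ara-free growth periods would bring the residual level to 5%; the actual depletion kinetics depend on tRNA turnover and growth rate, which the sketch does not model.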
After these two dilutions, cells were harvested and primer extension analysis showed that the cellular level of m1G37-tRNA was below 5%. Kinetic analysis Methylation or aminoacylation was assayed at 37°C by monitoring the incorporation of [3H-methyl] from AdoMet or of [3H]-amino acid into tRNA as acid-precipitable counts (33–37). Aliquots were removed over time and acid-precipitable counts were measured. Data for steady-state assays of Vo as a function of tRNA concentration were fit to the hyperbolic Michaelis–Menten equation [y = m1 × x/(m2 + x)] (38), where m1 and m2 were kcat and Km, respectively. Data for single-turnover assays of a time course were fit to a single exponential equation [y = m1 × (1 – exp(–m2 × x))] (38,39), where m1 and m2 were the plateau level and the kobs, respectively. Data of kobs as a function of enzyme concentration were fit to the hyperbolic Michaelis–Menten equation. For CmoM methylation, steady-state assays of plateau levels were performed with 1.0 μM enzyme, 1.0 μM tRNA, and 20 μM [3H-methyl]-AdoMet in a buffer of 50 mM HEPES–KOH (pH 7.5), 100 mM KCl, 10 mM MgCl2, and 7 mM β-mercaptoethanol. Substrate tRNAs were isolated from a vehicle strain expressing E. coli WT or a mutant version of UGG tRNAPro or UAC tRNAVal. Single-turnover assays were performed with 1–16 μM enzyme, 1.0 μM tRNA, and 20 μM [3H-methyl]-AdoMet in the same buffer. The substrate UGG tRNAPro with m1G37 was isolated from a vehicle strain expressing the WT tRNA, whereas the substrate UGG tRNAPro without m1G37 was isolated from the same vehicle strain with trmD-KO maintained by the Ara-controlled trm5 on pACYC, which also expressed the lpp-controlled chimeric G34U mutant of GGG tRNAPro. For TrmD methylation, steady-state assays were performed with 10 nM enzyme, 0.5–12.5 μM tRNA, and 28.9 μM [3H-methyl]-AdoMet in a buffer of 100 mM Tris–HCl, pH 8.0, 24 mM NH4Cl, 4 mM DTT, 0.024 mg/ml BSA, and 6 mM MgCl2. The three states of UGG tRNAPro were isolated from an E. 
coli trmD-KO strain (24) that expressed the tRNA from the pKK223–3 plasmid. The ho5U34-state was isolated from the strain with additional cmoB-KO (with the kanR removed), the cmo5U34-state was isolated from the strain with additional cmoM-KO, and the mcmo5U34-state was isolated from the trmD-KO strain. For aminoacylation with Pro, the UAG variant of tRNAPro/UGG was isolated from E. coli Pro vehicle strain (Supplementary Table S2). Plateau charging assays were conducted using 1.0 μM ProRS, 1.0 μM tRNAPro/UAG, 20 μM [3H]-Pro in a buffer of 20 mM KCl, 50 mM HEPES pH 7.5, 4 mM DTT, 0.2 mg/ml BSA, 10 mM MgCl2, and 2 mM ATP. For aminoacylation of the tRNA with Leu, LeuRS and [3H]-Leu were used instead. LC–MS/MS analyses WT and mutant tRNAPro species (3 μg), purified from cells, were digested with RNase T1 (50 U/μg tRNA) in a 220 mM ammonium acetate buffer at 37°C for 2 h. Upon completion, the samples were lyophilized and resuspended in 10 μl of the mobile phase A buffer (8 mM TEA/220 mM HFIP, pH 7.0). For liquid chromatography and mass spectrometry (LC–MS/MS) analysis, tRNA fragments from T1 digestion were separated in a Poroshell 120-EC-C18 column (1 × 50 mm, 2.7 μm particle size) using mobile phase A and mobile phase B (50% buffer A in methanol). The column oven was set at 60°C with the Thermo Surveyor HPLC attached to the Thermo LTQ-XL linear ion trap mass spectrometer. The LC gradient starts with 5–95% mobile phase B in 45 min at a flow rate of 60 μl/min. Post column equilibration is set for 10 min prior to the next injection. The MS operating parameters are: capillary temperature of 275°C, spray voltage of 3 kV and sheath, and auxiliary and sweep gases at 35, 14, 10 arbitrary units respectively. Data were recorded in negative mode, with a scan range from 600 to 2000 m/z. The product ion scans were obtained using collision induced dissociation at the normalized collision energy of 35%. 
All of the predicted tRNA sequences were obtained from the genomic tRNA database (http://gtrnadb.ucsc.edu/). The prediction of m/z and fragmentation of each was calculated by the Mongo Oligo Mass Calculator (http://mods.rna.albany.edu/masspec/Mongo-Oligo). RESULTS AND DISCUSSION A genetic approach We chose E. coli UGG tRNAPro, encoded by the proM gene, as an example. This isoacceptor is a substrate for CmoM to convert cmo5U34 to mcmo5U34 (Figure 1B) (5). The other Pro isoacceptors in E. coli contain the GGG and CGG anticodon, respectively. Of the three, the UGG isoacceptor is essential for growth (15), the elimination of which would leave cells unable to efficiently translate the Pro codon CCA, ultimately leading to cell death. This growth-essentiality makes it impossible to introduce mutations to G35 to test recognition by CmoM. To overcome this limitation, we developed a genetic approach (Supplementary Figure S2), in which we created a plasmid-borne chimeric tRNAPro to maintain cell growth, enabling us to remove the native proM gene from the chromosome while expressing a test version of the proM gene from a second plasmid for isolation of its tRNA product for enzymatic analysis. This approach consists of three key components. First, a chimeric tRNAPro was expressed from the strong and constitutive lpp promoter in a pACYC-derived maintenance plasmid (21). The chimera was based on the sequence of the GGG isoacceptor (encoded by the proL gene, Figure 1C) but was made with the G34U mutation to carry the UGG anticodon for reading the Pro codon CCA. The remaining Pro codons would be read by the other two isoacceptors whose genes were intact on the chromosome. Second, the native proM was removed from its chromosomal locus in an operon containing tRNA genes argX, hisR, and leuT (Supplementary Figure S3). This was achieved in an E. coli cmoM-KO strain by λ Red recombination (26). 
The cmoM-KO strain was necessary as the host to produce tRNA species containing the cmo5U34 wobble base but lacking the terminal methylation for evaluation as a substrate for CmoM. Third, the test version of proM was then expressed from pKK223–3 that was compatible with the maintenance pACYC. Expression of the test proM was driven from the IPTG-inducible tac promoter to produce high levels of the tRNA product for purification and kinetic analysis. In this 3-component system, expression of the maintenance chimeric tRNAPro in proM-KO cells was sufficient to promote growth (Figure 1D), although the growth was slower relative to cells with proM intact on the chromosome. This establishes an E. coli vehicle strain that could be used to produce a mutant version of the proM tRNA from the test plasmid without losing cell viability. Using a similar genetic approach, we created an E. coli vehicle strain that was used to produce a mutant version of the UAC tRNAVal (Supplementary Figure S4). The WT isoacceptor is encoded in five genes of the same sequence on the chromosome (valU, valX, valY, valT and valZ), each of which produces a tRNA transcript that is modified to cmo5U34 without the CmoM-catalyzed terminal methylation (5). In this vehicle strain, all of the isoacceptor genes were removed from the chromosome, while cell viability was maintained by expression of a chimeric tRNAVal based on the sequence of the valW-encoded GAC isoacceptor but made to contain the UAC anticodon. Growth of cells maintained by the chimeric tRNAVal was only slightly slower relative to cells containing valUXYTZ genes (Figure 1D). Expression from the test plasmid that harbored a mutant form of one of the valUXYTZ genes then provided a mechanism to isolate the mutant tRNA for enzymatic analysis. We showed that this genetic approach was applicable to other single-gene tRNA species. For example, cysT is a single gene for GCA tRNACys in E. coli. 
Using the same genetic approach, we created a cysT-KO vehicle strain that was maintained viable by the homologous gene of BscysT in a maintenance plasmid and able to express the amber-reading version of cysT from a test plasmid (Supplementary Figure S5). The viability was consistent with our finding that BscysT tRNA is readily charged by the endogenous E. coli cysteinyl-tRNA synthetase (CysRS) (23). The amber-suppressor mutant of E. coli cysT was used to test its adaptor activity during live-cell protein synthesis. Indeed, the cysT-KO vehicle system, when introduced to strain XAC-1 harboring lacZ with an amber mutation, conferred a blue color on X-gal indicator plates (Supplementary Figure S5), indicating suppression of the amber mutation. These data demonstrate the adaptor activity of the amber-reading test tRNACys during protein synthesis in an E. coli strain where the essential single-gene cysT has been eliminated, validating the broad utility of our genetic approach. G35 as the major determinant for recognition by CmoM With the genetic approach in hand, we tested the importance of G35 as a determinant for tRNA recognition by CmoM. We created an E. coli vehicle strain that produced the G35A mutant of UGG tRNAPro and a separate strain that produced the reciprocal A35G mutant of UAC tRNAVal (Figure 2A and B). Each mutant tRNA, as well as its WT counterpart (over-expressed from the pKK223–3 plasmid), was expressed in the cmoM-KO strain to abort the terminal methylation and to accumulate tRNA in the cmo5U34 precursor state. Each was produced to high levels and purified to homogeneity (Supplementary Figure S6A) from binding to a complementary oligonucleotide on a solid support. As shown by LC-MS/MS, each purified tRNA contained cmo5U34 in the precursor state, as exemplified by the UAG variant of tRNAPro (Figure 2C, Supplementary Figure S6B), and was used as a substrate for CmoM. Figure 2. 
G35 is the major determinant of tRNA recognition by CmoM. (A, B) Sequence and cloverleaf structure of E. coli UGG tRNAPro and UAC tRNAVal. The substitution at position 35 in the anticodon of each is indicated by an arrow. (C) The purified mutant UAG tRNAPro contains the precursor state cmo5U34 as shown by LC-MS/MS. (D) Plateau methylation by CmoM is high for the WT UGG but low for the variant UAG of tRNAPro. (E) Plateau methylation by CmoM is low for the WT UAC but high for the variant UGC of tRNAVal. Data for WT tRNAs are in dark blue, while those for mutant tRNAs are in red. Errors are SD, n = 3. (F) Titration of the single-turnover rate constant kobs of methyl transfer to E. coli UGG tRNAPro as a function of CmoM concentration showed saturation kinetics. (G) Titration of the single-turnover rate constant kobs of methyl transfer to E. coli UGC variant of tRNAVal as a function of CmoM concentration showed no saturation. (H) Kinetic parameters of kchem, Kd (tRNA), and kchem/Kd derived from fitting the data of UGG tRNAPro in (F) to a hyperbolic equation. The kchem/Kd value for the UGC variant of tRNAVal was determined from fitting the data of kobs at the enzyme concentration of 1 μM in (G) to a linear equation. Errors are SD, n = 3. N.D. = not determined. Assays were performed in single turnover conditions at 37°C, where the enzyme was 1–16 μM, tRNA was 1.0 μM, and [3H-methyl]-AdoMet was 20 μM in a buffer containing 50 mM HEPES–KOH, pH 7.5, 100 mM KCl, 10 mM MgCl2 and 7 mM β-mercaptoethanol. Recombinant E. coli CmoM was over-expressed with a His-tag at both the N- and C-termini and purified to greater than 95% homogeneity. Using [3H-methyl]-AdoMet as the methyl donor, we monitored the CmoM-catalyzed incorporation of [3H-methyl] from the donor to tRNA as acid-precipitable counts on filter pads as we have shown with other AdoMet-dependent methyl transferases (27–29,38–40). Assays in a steady-state multiple-turnover condition showed that the G35 base in the anticodon is indeed the major determinant for recognition and discrimination of tRNA by CmoM. While the WT UGG tRNAPro was fully methylated by CmoM, the single G35A substitution abolished the methylation (Figure 2D). Conversely, while the WT UAC tRNAVal was an inactive substrate for CmoM, the single A35G substitution was sufficient to confer full methylation to this tRNA (Figure 2E). 
Thus, despite the different sequence contexts of tRNAPro and tRNAVal, the G35 base alone determines the methylation by CmoM. To more closely evaluate the role of G35 in the methylation by CmoM, we determined the kinetic parameters of individual tRNA species (Figure 2F, G). CmoM is a member of the class I AdoMet-dependent methyl transferases (PDB: 4HTF), which use the Rossmann dinucleotide-fold to accommodate the methyl donor (41,42). We have shown that enzymes of the Rossmann-fold are rate-limited by the release of products (30,38,43), rather than by the chemistry involving the SN2 nucleophilic attack on the sulfonium center of AdoMet. This kinetic signature indicates that the rate of chemistry is faster than the release of products and that, to probe the chemistry, kinetic assays must be performed under single-turnover conditions, rather than under steady-state conditions that monitor product release (38). We developed single-turnover assays for CmoM, where the enzyme was in excess of each tRNA to permit rapid equilibrium binding in one turnover, such that the rate constant kobs of the turnover was not limited by binding but by the chemistry of methyl transfer (44). A titration of kobs as a function of enzyme concentration then provided the basis to determine the saturating kobs as kchem (the intrinsic rate constant of methyl transfer) and the enzyme concentration that achieved half of the saturation kinetics as the thermodynamic dissociation constant Kd for the tRNA. Analysis of kobs as a function of CmoM concentration for the WT UGG tRNAPro showed that all data points of single-turnover assays were fit to a hyperbolic equation without a lag phase (Figure 2F), supporting the notion that methyl transfer was performed in rapid binding equilibrium. 
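The two-stage analysis described above (a single-exponential time course giving kobs, then a hyperbolic titration of kobs against enzyme concentration giving kchem and Kd) can be sketched in a few lines of Python. This is a minimal illustration, not the fitting procedure used in the study: the parameter estimation here uses a double-reciprocal linear regression as a simple stand-in for a nonlinear hyperbolic fit, the data are synthetic and noise-free, and all function names and numerical values are hypothetical.

```python
import math


def single_exponential(t, plateau, k_obs):
    """Product formed at time t: y = plateau * (1 - exp(-k_obs * t))."""
    return plateau * (1.0 - math.exp(-k_obs * t))


def hyperbola(enzyme_uM, k_chem, k_d):
    """kobs as a function of enzyme concentration: kchem * [E] / (Kd + [E])."""
    return k_chem * enzyme_uM / (k_d + enzyme_uM)


def fit_hyperbola_double_reciprocal(enzyme_uM, k_obs):
    """Estimate (kchem, Kd) from 1/kobs = (Kd/kchem) * (1/[E]) + 1/kchem.

    A stand-in for nonlinear regression; with real, noisy data a direct
    nonlinear fit is generally preferable.
    """
    xs = [1.0 / e for e in enzyme_uM]
    ys = [1.0 / k for k in k_obs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # Kd / kchem
    intercept = mean_y - slope * mean_x  # 1 / kchem
    k_chem = 1.0 / intercept
    k_d = slope * k_chem
    return k_chem, k_d


# Synthetic titration over the 1-16 uM enzyme range used in the assays,
# generated with illustrative (hypothetical) parameters.
enzyme = [1.0, 2.0, 4.0, 8.0, 16.0]
synthetic_kobs = [hyperbola(e, 0.02, 5.0) for e in enzyme]
k_chem_fit, k_d_fit = fit_hyperbola_double_reciprocal(enzyme, synthetic_kobs)
print(f"kchem = {k_chem_fit:.4f} s^-1, Kd = {k_d_fit:.2f} uM")
```

At an enzyme concentration equal to Kd, `hyperbola` returns half of kchem, which matches the definition of Kd given in the text as the concentration that achieves half of the saturation kinetics.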
The kobs of each turnover was a composite term, representing all of the steps from binding to methyl transfer, including a conformational transition of the enzyme-tRNA-AdoMet complex that might have occurred before chemistry. The absence of a lag phase indicated that all steps leading to and including the chemistry were fast and that the kinetics was limited by the rate of methyl transfer. While we were unable to determine the kinetics of the UAG mutant of tRNAPro or the WT UAC tRNAVal, due to extremely low levels of activity, we were able to measure the kinetics of the UGC variant of tRNAVal (Figure 2G). Interestingly, the latter showed a linear increase of kobs as a function of CmoM concentration, indicating that the substrate tRNA was non-optimal (45) and that a conformational transition of the enzyme-tRNA-AdoMet complex was necessary to adopt an active state. For example, this conformational transition might involve re-arrangement of the anticodon stem-loop of tRNAVal, which differs from that of tRNAPro and all other CmoM substrates by having a C30–G40 pair instead of a G30–C40 pair (Supplementary Figure S1). The kinetic distinction between the WT tRNAPro and the active mutant of tRNAVal demonstrates that our single-turnover assays have the ability to discern subtle differences between substrates during methyl transfer. The saturation kinetics of the WT UGG tRNAPro showed that the single-turnover parameters for CmoM (kchem = 0.018 ± 0.003 s−1; Kd (tRNA) = 5.3 ± 1.9 μM, and kchem/Kd = 0.0034 ± 0.0013 s−1μM−1, Figure 2H) are well within the range of other AdoMet-dependent tRNA methyl transferases (27,29,38–40). For the UGC variant of tRNAVal, because partial saturation kinetics began to appear at 8 μM of the enzyme, fitting the data to a hyperbolic equation showed kchem ∼ 0.66 s−1, Kd ∼ 26 μM, and kchem/Kd ∼ 0.025 s−1μM−1 (not shown). 
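As a quick arithmetic check of the values quoted above, the efficiency kchem/Kd can be recomputed from the reported kchem and Kd, and the hyperbolic relation can be compared with its low-enzyme linear limit, where kobs approaches (kchem/Kd) × [E]. The sketch below uses only the central values from the text and ignores their uncertainties; the function names are hypothetical.

```python
def efficiency(k_chem, k_d):
    """Catalytic efficiency kchem/Kd in s^-1 uM^-1."""
    return k_chem / k_d


def hyperbola(enzyme_uM, k_chem, k_d):
    """kobs = kchem * [E] / (Kd + [E])."""
    return k_chem * enzyme_uM / (k_d + enzyme_uM)


# Central values quoted in the text (errors not propagated here).
eff_pro = efficiency(0.018, 5.3)   # WT UGG tRNAPro
eff_val = efficiency(0.66, 26.0)   # UGC variant of tRNAVal (approximate fit)
fold = eff_val / eff_pro

# With Kd ~ 26 uM, kobs at 1 uM enzyme is close to the linear limit.
kobs_exact = hyperbola(1.0, 0.66, 26.0)
kobs_linear = eff_val * 1.0
ratio = kobs_exact / kobs_linear   # equals Kd / (Kd + [E])

print(f"tRNAPro kchem/Kd = {eff_pro:.4f}")
print(f"tRNAVal kchem/Kd = {eff_val:.3f}")
print(f"fold difference  = {fold:.1f}")
print(f"exact/linear kobs at 1 uM = {ratio:.3f}")
```

The ratio of the exact to the linear kobs is Kd/(Kd + [E]), so the linear approximation holds to within a few percent only when the enzyme concentration is far below Kd, which is why a near-linear titration is expected for a weakly binding substrate.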
To independently validate the kinetics, we performed single-turnover assays to measure kchem/Kd under conditions where the enzyme was sub-Kd such that the kobs was a close representation of the efficiency. Kinetic analysis with the enzyme at 1 μM (∼25-fold below the Kd) showed a kchem/Kd of 0.024 ± 0.003 s−1μM−1 (Figure 2H), closely matching the value estimated from the hyperbolic fit. This result reveals an approximately 7-fold increase in efficiency for the mutant tRNAVal relative to the WT tRNAPro (Figure 2H) and is consistent with the faster kinetics of the mutant tRNA to reach the plateau (Figure 2E). Thus, once the initial non-optimal enzyme-tRNA-AdoMet conformation was overcome in the mutant tRNA, the structure of the transformed complex was favorable for high efficiency of methyl transfer in a process driven by G35. These data support the hypothesis that the single G35 base is the major determinant of tRNA recognition by CmoM. To validate the design principle of our genetic approach, which is particularly useful for tRNA species essential for cell survival, we tested the hypothesis that mutations in an essential tRNA would create cellular toxicity. Using UGG tRNAPro as an example, where the UGG anticodon is a major determinant of charging of tRNAPro by ProRS (46), we showed that the creation of the UAG anticodon mutation, which reads Leu codons CU[G/A], reduced specific charging with Pro to 90% and increased mis-charging with Leu to 10% (Supplementary Figure S6C). While charging was assayed in vitro with purified enzymes, the mutant UAG tRNAPro was purified from E. coli cells lacking only cmoM but retaining genes for all other modification enzymes, indicating that the data obtained from the mutant tRNA should be relevant in vivo. Indeed, growth analysis showed that cells expressing the UAG variant of tRNAPro had reduced viability relative to cells without it, even with expression of the maintenance UGG-proL gene (Supplementary Figure S6D, E). 
The reduced viability is consistent with the notion that the charged UAG tRNAPro has delivered Pro to Leu codons CU[G/A], resulting in mis-translation during protein synthesis. This toxicity can only be identified in our genetic approach, where a basal level of viability is maintained, but not in an environment, where cell death occurs due to substitution of the natural UGG anticodon in the chromosome. Dependence on m1G37 for terminal methylation by CmoM UGG tRNAPro provides an opportunity to explore the inter-dependent relationship of the modifications at positions 34 and 37. In this tRNA, mcmo5U is a complex modification conserved at position 34 among γ-proteobacteria (5), whereas m1G37 is a primary methylation (to the N1 of guanine) conserved across all bacterial species (both Gram-positive and Gram-negative) (47,48). Here we focused on the relationship between these two modifications. The terminal methylation is not essential for growth and its functional role in UGG tRNAPro has not been characterized, although it has an effect on reading-frame accuracy for another mcmo5U34-containing tRNA (encoded by alaT, alaU and alaV, Supplementary Figure S1) (5). In contrast, the m1G37 methylation is the major determinant in UGG tRNAPro for reading-frame accuracy and its elimination substantially increases the frequency of ribosomal shifts in both enzyme- and cell-based assays (24,32). Given that the two types of methylation occur in distinct chemical structure and space with distinct evolutionary distribution, we addressed the question of whether they are independent of each other or whether one is dependent on the other. Answering this question would shed light on the development of a functional UGG tRNAPro. We showed that the presence of m1G37 affected the terminal methylation level by comparing CmoM activity on UGG tRNAPro isolated with and without the primary methylation. The tRNA with m1G37 was isolated from the E. 
coli vehicle strain described above, whereas that without m1G37 was isolated from a derivative of the vehicle strain that also lacked the trmD gene. Because trmD is essential for growth (24), a simple knockout cannot be made. In the derivative, the chromosomal trmD was eliminated, but cell viability was maintained by the Ara-controlled expression of the human counterpart trm5 (40) from the pACYC maintenance plasmid (Supplementary Figure S7) (24). Cells were grown to saturation in the presence of Ara to maintain the activity of trm5 and 1% of these cells were then transferred to fresh media without Ara and grown for 3 h, followed by another transfer of 20% cells to fresh media and growth for 2 h. This serial dilution turned off trm5 and depleted cellular m1G37-tRNA to less than 5% of total tRNA (49) before isolation of UGG tRNAPro. Single-turnover kinetic analysis of CmoM showed a 5-fold reduction in kchem for the tRNA lacking m1G37 versus with m1G37 (Figure 3A, B), indicating a 5-fold reduction in the number of enzyme molecules engaged in methyl transfer per unit of time. Thus, the presence of m1G37 is important for the action of CmoM. Although the tRNA lacking m1G37 had a 6-fold more favorable Kd for the enzyme, resulting in an overall kchem/Kd similar to that of the control, we note that the driver for the overall level of methylation is determined by kchem in the context of a cell. This was based on the observation that the CmoM methylation to UGG tRNAPro progressively increases to the highest level in the stationary phase (5). The importance of m1G37 for the level of terminal methylation at U34 by CmoM is supported by the crystal structure of the ribosomal 30S subunit with the anticodon–stem–loop of tRNAPro (50), showing that the anticodon loop would be disordered without m1G37 but become organized when the methylation was in place. 
This ordering and organization of the anticodon loop helped to align the three anticodon bases to properly pair with the cognate codon. The dependence on m1G37 for the action of CmoM suggests that the organization of the anticodon by the primary methylation enables the wobble base to be better recognized for the terminal methylation. Figure 3. m1G37 is required for the terminal methylation in mcmo5U34 in E. coli UGG tRNAPro. (A) Titration of kobs as a function of CmoM concentration for the terminal methylation of mcmo5U34 in tRNA with (+) and without (–) m1G37. Errors are SD, n = 3. (B) Kinetic parameters derived from the titration in (A). Errors are SD, n = 3. (C) Analysis of kinetic parameters of TrmD for the ho5U34-, the cmo5U34-, and the mcmo5U34-state of the tRNA. The sequence and structure of the anticodon-stem loop of each state is shown and the status of the wobble modification (in red) was confirmed by LC–MS/MS previously (Supplementary Figure S6B). Note that the tRNA isolated from wt cells contained 40% mcmo5U34 and 60% cmo5U34 based on the ratio of peak areas of LC–MS/MS. (D) A model showing the hierarchical ranking of m1G37 over mcmo5U34 relative to the rest of the residues in the anticodon loop. In this ranking, the importance of each residue is shown by the size of the circle, showing that m1G37 is required for the wobble modification but not vice versa and that the wobble modification is important for the tRNA decoding quality. Conversely, we showed that the m1G37 methylation by TrmD was unaffected by the terminal methylation in mcmo5U34. For broader impact, we explored not only the terminal methylation but also the two precursor states of the wobble modification. Using the trmD-KO strain that we constructed (24), which provided tRNA lacking m1G37 for analysis of TrmD methylation, we isolated UGG tRNAPro from cells with the wobble base in three different states: the ho5U34-state was isolated from the strain with additional cmoB-KO, the cmo5U34-state from the strain with additional cmoM-KO, and the mcmo5U34-state from the trmD-KO strain. Because the tRNA sequence in each case had no mutation, each was purified by over-expression of the wt gene from the pKK223–3 plasmid in the trmD-KO strain harboring the additional gene knockout of choice. The modification status of each state was previously confirmed by LC-MS/MS in the strain expressing trmD (Supplementary Figure S6B). Kinetic analysis of TrmD was performed in a steady-state multi-turnover condition. This was appropriate for TrmD, which unlike CmoM is a non-Rossmann-fold enzyme and is rate-limited at the chemistry of methyl transfer (38). 
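For orientation, the steady-state relation used for TrmD can be evaluated across the assay range. The sketch below is illustrative only: it applies the Michaelis–Menten equation with the enzyme concentration (10 nM) and tRNA range (0.5–12.5 μM) stated in the Methods, and with the kcat and Km central values reported for the mcmo5U34 state in Figure 3C. The function name and the choice of tRNA concentrations are hypothetical, and errors are not propagated.

```python
def initial_velocity(k_cat, k_m, enzyme_uM, trna_uM):
    """Michaelis-Menten initial velocity in uM per second."""
    return k_cat * enzyme_uM * trna_uM / (k_m + trna_uM)


K_CAT = 0.29      # s^-1, mcmo5U34 state (Figure 3C)
K_M = 2.7         # uM, mcmo5U34 state (Figure 3C)
ENZYME = 0.010    # uM, i.e. 10 nM TrmD as in the Methods

# Velocities at the low end, at Km, and at the high end of the tRNA range.
for trna in (0.5, 2.7, 12.5):
    v0 = initial_velocity(K_CAT, K_M, ENZYME, trna)
    print(f"[tRNA] = {trna:5.1f} uM -> v0 = {v0:.5f} uM/s")

catalytic_efficiency = K_CAT / K_M  # s^-1 uM^-1
print(f"kcat/Km = {catalytic_efficiency:.3f} s^-1 uM^-1")
```

At a tRNA concentration equal to Km the velocity is half of kcat × [E], so the stated tRNA range spans concentrations from well below to several-fold above Km, which is what a hyperbolic fit needs to constrain both parameters.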
Because the chemistry step was rate-limiting for TrmD, steady-state assays that monitored product release (38) would faithfully report the chemistry events of TrmD. For the fully modified mcmo5U34-state, the TrmD kinetic parameters kcat (0.29 ± 0.01 s−1) and Km (2.7 ± 0.1 μM) (Figure 3C) are within a 2-fold range of those previously published for the unmodified transcript of the tRNA (38), indicating that post-transcriptional modifications are not critical for TrmD. Indeed, the kcat, Km and kcat/Km values for all three states of the tRNA were similar to each other (Figure 3C), supporting the notion that the TrmD activity does not depend on the status of the wobble modification. Our results establish a hierarchical ranking of the TrmD-catalyzed m1G37 over the modification state at the wobble base in UGG tRNAPro (Figure 3D). While we only studied the effect of m1G37 on the terminal methylation of the wobble base, we suggest that the effect would be maintained throughout all states of the wobble modification, because the primary methylation is required to organize the anticodon structure to permit each modification step. Thus, m1G37 is important not only for reading-frame accuracy of the tRNA, but also for initiating a complex modification process to the wobble base to improve the decoding quality of the latter. This new finding raises the possibility that the maturation of a functional UGG tRNAPro must have acquired m1G37 before mcmo5U34. Without the former to maintain the reading frame and cell survival (24), the latter would not form and the quality of tRNA decoding would be poor. Thus, the priority placement of m1G37 over mcmo5U34 has biological significance in that tRNA must support cell growth before it improves the decoding quality. The interdependent relationship between tRNA positions 34 and 37 has been probed for two other cases. 
In one, the formation of Cm32 (2′-O-methylation of C32) and Gm34 (2′-O-methylation of G34) in tRNAPhe is found to drive the conversion of m1G37 to yW (wybutosine) (51). In the second, the synthesis of t6A37 (threonylcarbamoyl adenosine) in tRNAIle is found to precede aminoacylation by IleRS and possibly lysidinylation of C34 by TilS (52). The hierarchical ranking of the first case, placing positions 32 and 34 over 37, is opposite from our finding, whereas that of the second, placing position 37 over 34, is similar. To determine if there are rules governing the hierarchy, further studies are needed. For example, four other tRNA families with cmo5U34 or mcmo5U34 contain a different modification at position 37 (Supplementary Figure S1). These are m6A37 (N6-methyl-A37) for UAC tRNAVal (53), ms2i6A37 (2-methylthio-N6-isopentenyl-A37) for UGA tRNASer (54,55), ct6A37 (cyclic N6-threonylcarbamoyl-A37) for UGU tRNAThr (56,57), and m1G37 for UAG tRNALeu (20). Little is known about the hierarchical relationship between the two positions in the development of each tRNA. Based on our work here, we suggest that each tRNA has a priority placement of modifications between positions 34 and 37 to ensure the functional state of the tRNA in the context of a cell.
SUPPLEMENTARY DATA
Supplementary data are available at NAR online.
ACKNOWLEDGEMENTS
We thank Dr. Steven Almo for the E. coli cmoB-KO strain.
FUNDING
NIH [GM081601 and GM114343 to Y.M.H.]; JSPS postdoctoral fellowship (to I.M.); Defense Threat Reduction Agency HDTRA [1-15-1-0033 to P.L.]. Funding for open access charge: NIH [GM081601 to Y.M.H., GM114343 to Y.M.H.].
Conflict of interest statement. None declared.
REFERENCES
1. Suzuki T. Grosjean H 2005; 12: Springer-Verlag Berlin and Heidelberg, GmbH & Co. 23–69.
2. Machnicka M.A., Milanowska K., Osman Oglou O., Purta E., Kurkowska M., Olchowik A., Januszewski W., Kalinowski S., Dunin-Horkawicz S., Rother K.M. et al. 
MODOMICS: a database of RNA modification pathways–2013 update. Nucleic Acids Res. 2013; 41: D262–D267.
3. Murao K., Saneyoshi M., Harada F., Nishimura S. Uridin-5-oxy acetic acid: a new minor constituent from E. coli valine transfer RNA I. Biochem. Biophys. Res. Commun. 1970; 38: 657–662.
4. Sorensen M.A., Elf J., Bouakaz E., Tenson T., Sanyal S., Bjork G.R., Ehrenberg M. Over expression of a tRNALeu isoacceptor changes charging pattern of leucine tRNAs and reveals new codon reading. J. Mol. Biol. 2005; 354: 16–24.
5. Sakai Y., Miyauchi K., Kimura S., Suzuki T. Biogenesis and growth phase-dependent alteration of 5-methoxycarbonylmethoxyuridine in tRNA anticodons. Nucleic Acids Res. 2016; 44: 509–523.
6. Osawa S., Jukes T.H., Watanabe K., Muto A. Recent evidence for evolution of the genetic code. Microbiol. Rev. 1992; 56: 229–264.
7. Grosjean H., de Crecy-Lagard V., Marck C. Deciphering synonymous codons in the three domains of life: co-evolution with specific tRNA modification enzymes. FEBS Lett. 2010; 584: 252–264.
8. Lagerkvist U. Unconventional methods in codon reading. Bioessays. 1986; 4: 223–226.
9. Rogalski M., Karcher D., Bock R. Superwobbling facilitates translation with reduced tRNA sets. Nat. Struct. Mol. Biol. 2008; 15: 192–198.
10. Claesson C., Lustig F., Boren T., Simonsson C., Barciszewska M., Lagerkvist U. Glycine codon discrimination and the nucleotide in position 32 of the anticodon loop. J. Mol. Biol. 1995; 247: 191–196.
11. Takai K., Takaku H., Yokoyama S. Codon-reading specificity of an unmodified form of Escherichia coli tRNA1Ser in cell-free protein synthesis. Nucleic Acids Res. 
1996; 24: 2894–2899.
12. Phelps S.S., Malkiewicz A., Agris P.F., Joseph S. Modified nucleotides in tRNALys and tRNAVal are important for translocation. J. Mol. Biol. 2004; 338: 439–444.
13. Samuelsson T., Elias P., Lustig F., Axberg T., Folsch G., Akesson B., Lagerkvist U. Aberrations of the classic codon reading scheme during protein synthesis in vitro. J. Biol. Chem. 1980; 255: 4583–4588.
14. Weixlbaumer A., Murphy F.V.t., Dziergowska A., Malkiewicz A., Vendeix F.A., Agris P.F., Ramakrishnan V. Mechanism for expanding the decoding capacity of transfer RNAs by modification of uridines. Nat. Struct. Mol. Biol. 2007; 14: 498–502.
15. Nasvall S.J., Chen P., Bjork G.R. The modified wobble nucleoside uridine-5-oxyacetic acid in tRNAPro(cmo5UGG) promotes reading of all four proline codons in vivo. RNA. 2004; 10: 1662–1673.
16. Yokoyama S., Watanabe T., Murao K., Ishikura H., Yamaizumi Z., Nishimura S., Miyazawa T. Molecular mechanism of codon recognition by tRNA species with modified uridine in the first position of the anticodon. Proc. Natl. Acad. Sci. U.S.A. 1985; 82: 4905–4909.
17. Kim J., Xiao H., Bonanno J.B., Kalyanaraman C., Brown S., Tang X., Al-Obaidi N.F., Patskovsky Y., Babbitt P.C., Jacobson M.P. et al. Structure-guided discovery of the metabolite carboxy-SAM that modulates tRNA function. Nature. 2013; 498: 123–126.
18. Bjork G.R. A novel link between the biosynthesis of aromatic amino acids and transfer RNA modification in Escherichia coli. J. Mol. Biol. 1980; 140: 391–410.
19. Hagervall T.G., Jonsson Y.H., Edmonds C.G., McCloskey J.A., Bjork G.R. Chorismic acid, a key metabolite in modification of tRNA. J. Bacteriol. 1990; 172: 252–259. 
20. Qian Q., Bjork G.R. Structural requirements for the formation of 1-methylguanosine in vivo in tRNA(Pro)GGG of Salmonella typhimurium. J. Mol. Biol. 1997; 266: 283–296.
21. Hou Y.M., Schimmel P. A simple structural feature is a major determinant of the identity of a transfer RNA. Nature. 1988; 333: 140–145.
22. Hou Y.M., Schimmel P. Evidence that a major determinant for the identity of a transfer RNA is conserved in evolution. Biochemistry. 1989; 28: 6800–6804.
23. Hou Y.M., Motegi H., Lipman R.S., Hamann C.S., Shiba K. Conservation of a tRNA core for aminoacylation. Nucleic Acids Res. 1999; 27: 4743–4750.
24. Gamper H.B., Masuda I., Frenkel-Morgenstern M., Hou Y.M. Maintenance of protein synthesis reading frame by EF-P and m1G37-tRNA. Nat. Commun. 2015; 6: 7226.
25. Cherepanov P.P., Wackernagel W. Gene disruption in Escherichia coli: TcR and KmR cassettes with the option of Flp-catalyzed excision of the antibiotic-resistance determinant. Gene. 1995; 158: 9–14.
26. Datsenko K.A., Wanner B.L. One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl. Acad. Sci. U.S.A. 2000; 97: 6640–6645.
27. Christian T., Sakaguchi R., Perlinska A.P., Lahoud G., Ito T., Taylor E.A., Yokoyama S., Sulkowska J.I., Hou Y.M. Methyl transfer by substrate signaling from a knotted protein fold. Nat. Struct. Mol. Biol. 2016; 23: 941–948.
28. Christian T., Evilia C., Williams S., Hou Y.M. Distinct origins of tRNA(m1G37) methyltransferase. J. Mol. Biol. 2004; 339: 707–719.
29. Christian T., Hou Y.M. 
Distinct determinants of tRNA recognition by the TrmD and Trm5 methyl transferases. J. Mol. Biol. 2007; 373: 623–632.
30. Zhang C.M., Perona J.J., Ryu K., Francklyn C., Hou Y.M. Distinct kinetic mechanisms of the two classes of aminoacyl-tRNA synthetases. J. Mol. Biol. 2006; 361: 300–311.
31. Liu C., Gamper H., Liu H., Cooperman B.S., Hou Y.M. Potential for interdependent development of tRNA determinants for aminoacylation and ribosome decoding. Nat. Commun. 2011; 2: 329.
32. Gamper H.B., Masuda I., Frenkel-Morgenstern M., Hou Y.M. The UGG isoacceptor of tRNAPro is naturally prone to frameshifts. Int. J. Mol. Sci. 2015; 16: 14866–14883.
33. Lahoud G., Goto-Ito S., Yoshida K., Ito T., Yokoyama S., Hou Y.M. Differentiating analogous tRNA methyltransferases by fragments of the methyl donor. RNA. 2011; 17: 1236–1246.
34. Sakaguchi R., Giessing A., Dai Q., Lahoud G., Liutkeviciute Z., Klimasauskas S., Piccirilli J., Kirpekar F., Hou Y.M. Recognition of guanosine by dissimilar tRNA methyltransferases. RNA. 2012; 18: 1687–1701.
35. Sakaguchi R., Lahoud G., Christian T., Gamper H., Hou Y.M. A divalent metal ion-dependent N1-methyl transfer to G37-tRNA. Chem. Biol. 2014; 21: 1351–1360.
36. Naganuma M., Sekine S., Chong Y.E., Guo M., Yang X.L., Gamper H., Hou Y.M., Schimmel P., Yokoyama S. The selective tRNA aminoacylation mechanism based on a single G*U pair. Nature. 2014; 510: 507–511.
37. Hamann C.S., Hou Y.M. An RNA structural determinant for tRNA recognition. Biochemistry. 1997; 36: 7967–7972.
38. Christian T., Lahoud G., Liu C., Hou Y.M. 
Control of catalytic cycle by a pair of analogous tRNA modification enzymes. J. Mol. Biol. 2010; 400: 204–217.
39. Christian T., Lahoud G., Liu C., Hoffmann K., Perona J.J., Hou Y.M. Mechanism of N-methylation by the tRNA m1G37 methyltransferase Trm5. RNA. 2010; 16: 2484–2492.
40. Christian T., Gamper H., Hou Y.M. Conservation of structure and mechanism by Trm5 enzymes. RNA. 2013; 19: 1192–1199.
41. Schubert H.L., Blumenthal R.M., Cheng X. Many paths to methyltransfer: a chronicle of convergence. Trends Biochem. Sci. 2003; 28: 329–335.
42. Martin J.L., McMillan F.M. SAM (dependent) I AM: the S-adenosylmethionine-dependent methyltransferase fold. Curr. Opin. Struct. Biol. 2002; 12: 783–793.
43. Liu C., Gamper H., Shtivelband S., Hauenstein S., Perona J.J., Hou Y.M. Kinetic quality control of anticodon recognition by a eukaryotic aminoacyl-tRNA synthetase. J. Mol. Biol. 2007; 367: 1063–1078.
44. Christian T., Evilia C., Hou Y.M. Catalysis by the second class of tRNA(m1G37) methyl transferase requires a conserved proline. Biochemistry. 2006; 45: 7463–7473.
45. Dupasquier M., Kim S., Halkidis K., Gamper H., Hou Y.M. tRNA integrity is a prerequisite for rapid CCA addition: implication for quality control. J. Mol. Biol. 2008; 379: 579–588.
46. Cusack S., Yaremchuk A., Krikliviy I., Tukalo M. tRNAPro anticodon recognition by Thermus thermophilus prolyl-tRNA synthetase. Structure. 1998; 6: 101–108.
47. Bjork G.R., Jacobsson K., Nilsson K., Johansson M.J., Bystrom A.S., Persson O.P. A primordial tRNA modification required for the evolution of life? EMBO J. 2001; 20: 231–239. 
48. Bystrom A.S., Bjork G.R. The structural gene (trmD) for the tRNA(m1G)methyltransferase is part of a four polypeptide operon in Escherichia coli K-12. Mol. Gen. Genet. 1982; 188: 447–454.
49. Masuda I., Sakaguchi R., Liu C., Gamper H., Hou Y.M. The temperature sensitivity of a mutation in the essential tRNA modification enzyme tRNA methyltransferase D (TrmD). J. Biol. Chem. 2013; 288: 28987–28996.
50. Maehigashi T., Dunkle J.A., Miles S.J., Dunham C.M. Structural insights into +1 frameshifting promoted by expanded or modification-deficient anticodon stem loops. Proc. Natl. Acad. Sci. U.S.A. 2014; 111: 12740–12745.
51. Guy M.P., Podyma B.M., Preston M.A., Shaheen H.H., Krivos K.L., Limbach P.A., Hopper A.K., Phizicky E.M. Yeast Trm7 interacts with distinct proteins for critical modifications of the tRNAPhe anticodon loop. RNA. 2012; 18: 1921–1933.
52. Thiaville P.C., El Yacoubi B., Kohrer C., Thiaville J.J., Deutsch C., Iwata-Reuyl D., Bacusmo J.M., Armengaud J., Bessho Y., Wetzel C. et al. Essentiality of threonylcarbamoyladenosine (t6A), a universal tRNA modification, in bacteria. Mol. Microbiol. 2015; 98: 1199–1221.
53. Golovina A.Y., Sergiev P.V., Golovin A.V., Serebryakova M.V., Demina I., Govorun V.M., Dontsova O.A. The yfiC gene of E. coli encodes an adenine-N6 methyltransferase that specifically modifies A37 of tRNA1Val(cmo5UAC). RNA. 2009; 15: 1134–1141.
54. Schweizer U., Bohleber S., Fradejas-Villar N. The modified base isopentenyladenosine and its derivatives in tRNA. RNA Biol. 2017; 14: 1–12.
55. Motorin Y., Bec G., Tewari R., Grosjean H. 
Transfer RNA recognition by the Escherichia coli delta2-isopentenyl-pyrophosphate:tRNA delta2-isopentenyl transferase: dependence on the anticodon arm structure. RNA. 1997; 3: 721–733.
56. Miyauchi K., Kimura S., Suzuki T. A cyclic form of N6-threonylcarbamoyladenosine as a widely distributed tRNA hypermodification. Nat. Chem. Biol. 2013; 9: 105–111.
57. Kimura S., Miyauchi K., Ikeuchi Y., Thiaville P.C., Crecy-Lagard V., Suzuki T. Discovery of the beta-barrel-type RNA methyltransferase responsible for N6-methylation of N6-threonylcarbamoyladenosine in tRNAs. Nucleic Acids Res. 2014; 42: 9350–9365.
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
A rapid fluorescent indicator displacement assay and principal component/cluster data analysis for determination of ligand–nucleic acid structural selectivity
del Villar-Guerra, Rafael; Gray, Robert D; Trent, John O; Chaires, Jonathan B
doi: 10.1093/nar/gky019 pmid: 29361140
Abstract
We describe a rapid fluorescent indicator displacement assay (R-FID) to evaluate the affinity and the selectivity of compounds binding to different DNA structures. We validated the assay using a library of 30 well-known nucleic acid binders containing a variety of chemical scaffolds. We used a combination of principal component analysis (PCA) and hierarchical clustering analysis (HCA) to interpret the results obtained. This analysis classified compounds based on selectivity for AT-rich, GC-rich and G4 structures. We used the FID assay as a secondary screen to test the binding selectivity of an additional 20 compounds selected from the NCI Diversity Set III library that were identified as G4 binders using a thermal shift assay. The results showed G4 binding selectivity for only a few of the 20 compounds. Overall, we show that this R-FID assay, coupled with PCA and HCA, provides a useful tool for the discovery of ligands selective for particular nucleic acid structures.
INTRODUCTION
Nucleic acids are dynamic biomolecules that can adopt a wide variety of structures (1–3). In addition to the familiar duplex structure, nucleic acids can form a variety of non-canonical triplex (4,5), quadruplex (6) or i-motif (3,7,8) structures. Some of these structures have been detected in living cells (9–11) where they may play key roles in a variety of biological processes, including replication, transcription, oncogene expression, and telomere functions (11–17). Identification of ligands that selectively bind to duplex, triplex, i-motif, and G-quadruplex (G4) structures is an important and challenging goal of drug discovery (12,18–32). G4 structures, in particular, have emerged as promising therapeutic targets for anti-cancer therapies (13,18,27–29,33–42). A major challenge is to identify compounds that bind specifically to one of the several possible G4 structures without binding to duplex DNA. 
Over the last decades, a number of ligands that bind G4 structures have been described (38,43). Unfortunately, none have successfully progressed completely through clinical trials to become useful drugs. This failure might be because most G4 binders found to date are polyaromatic compounds with poor drug-like properties and nonselective binding (42). Indeed, few G4 binders have been rigorously evaluated for selectivity with respect to their interaction with other nucleic acid structures, in part because there is no convenient assay for evaluating their structure-selective binding. A number of potential binding assays have been developed, including chip-based (44), ligand fishing (45), FRET melting (46) and equilibrium dialysis (21,23,25) methods. However, these methods may be expensive, time consuming or not amenable to high-throughput or even rapid screening. The fluorescent indicator displacement assay (FID) (27,28,47–56), which relies on displacement of a fluorescent DNA-binding ligand by the test compound, was adapted for the discovery of new G4 binding ligands. However, most previous FID studies evaluated binding only to a single G4 structure and did not assess binding to other DNA structures or to different G4 folds. Here, we describe an efficient and inexpensive rapid FID assay (R-FID) designed to explore the structural selectivity of binding. The assay measures the affinity and selectivity of a library of compounds against eight different DNA sequences and structures (four duplex, two triplex and two G4 structures). It was tested using a library of 30 compounds with defined DNA binding modes and preferences. We caution that all FID assays are indirect measures of compound binding, requiring competition with the prebound signal ligand. While test compound binding is registered in favorable cases, the competition reaction must always be kept in mind since it may lead to underestimation of binding affinity. 
Importantly, we also developed a chemometric approach for evaluation of the large data set to facilitate visualization of the affinity, structural selectivity and binding mode of the compound library. Traditional graphical methods for the analysis of large data sets are often ineffective for easily visualizing binding selectivity. Data analysis therefore becomes a bottleneck in drug discovery. We present here a novel application of principal component analysis (PCA) and hierarchical cluster analysis (HCA) as powerful tools for analysis of large FID data sets. We demonstrate that PCA and HCA can be used to visualize binding properties and to clearly identify the binding selectivity of a library of compounds. This visualization provides a direct and easily interpretable method to depict complex, often hidden, relationships between affinity and selectivity. One may question why yet another variation of the FID assay is needed. As part of a drug discovery platform it is essential, once a ligand is found that hits the desired target, to evaluate its binding specificity. We envision this FID assay as a rapid secondary screen for that purpose. Primary high-throughput screening of large compound libraries for target hits can be reliably done by thermal shift assays, particularly differential scanning fluorometry. The FID assay we describe will fill an important need by providing a rapid specificity screen to identify the best candidates to continue along the drug discovery platform.
MATERIALS AND METHODS
Preparation of oligonucleotide samples
The sequences, names of the oligonucleotides and buffers used in this study are given in Table 1. Oligonucleotides were purchased as desalted lyophilized powders from Integrated DNA Technologies, Inc. (Coralville, IA) and used without further purification. Stock solutions were prepared at 2 mM concentration in Milli-Q water and stored at 4°C for at least 24 h before use. 
Further dilutions to working concentrations were prepared using the following buffers optimized for specific DNA structures: Buffer A (6 mM K2HPO4, 4 mM KH2PO4, 15 mM KCl, pH 7.2), Buffer B (1 mM K2HPO4, 9 mM KH2PO4, 15 mM KCl, pH 6.2), Buffer C (6 mM K2HPO4, 4 mM KH2PO4, 15 mM KCl, 1 mM MgCl2, pH 7.2). To induce quadruplex, triplex or duplex formation, stock solutions of each oligonucleotide were diluted with the appropriate buffer to ∼200 μM and annealed by heating in a boiling water bath for 20 min followed by overnight cooling to room temperature. The annealed samples were stored at 4°C. DNA strand concentrations were estimated from the absorbance at 260 nm after thermal denaturation for 5 min at 90°C using molar extinction coefficients supplied by the manufacturer. Oligonucleotide solutions that contained 1% (v/v) dimethyl sulfoxide (DMSO) were prepared by adding the necessary volume of DMSO to the previously annealed nucleic acid solution. All nucleic acid structures were characterized by circular dichroism and UV absorbance spectroscopies, and by thermal denaturation (see Supplementary Material).
Table 1. Oligonucleotide sequences and their molar extinction coefficients used in this study

| ID | Name | Sequence (a) | ϵ (M−1 cm−1) at 260 nm (b) | Buffer (c) |
|---|---|---|---|---|
| 1 | AT Duplex | 5′-ATA TAT ATC CCC ATA TAT AT-3′ | 205700 | A |
| 2 | GC Duplex | 5′-GCG CGC GCT TTT GCG CGC GC-3′ | 166300 | A |
| 3 | A/T Duplex | 5′-AAA AAA AAC CCC TTT TTT TT-3′ | 191300 | A |
| 4 | H-Tel Duplex | 5′-GGG TTA GGG TTT TCC CTA ACC C-3′ | 202200 | A |
| 5 | AT Triplex | 5′-AAA AAA AAC CCC TTT TTT TTC CCC TTT TTT TT-3′ | 284900 | C |
| 6 | GC Triplex | 5′-CCC CCC CCT TTT GGG GGG GGT TTT CCC CCC CC-3′ | 261600 | B |
| 7 | H-Tel i-Motif | 5′-CCC TAA CCC TAA CCC TAA CCC T-3′ | 193700 | B |
| 8 | H-Tel Quadruplex | 5′-AGG GTT AGG GTT AGG GTT AGG G-3′ | 228500 | A |
| 9 | c-Myc Quadruplex | 5′-TGA GGG TGG GTA GGG TGG GTA A-3′ | 228700 | A |
| 10 | c-Myc i-Motif | 5′-TTC CCC ACC CTC CCC ACC CTC CCC TAA-3′ | 222200 | B |
| 11 | A/U RNA duplex | 5′-rArArA rArArA rArArC rCrCrC rUrUrU rUrUrU rUrU-3′ | 202900 | C |
| 12 | DNA-RNA duplex | 5′-GGG TTA GGG TTT TrCrC rCrUrA rArCrC rC-3′ | 202900 | A |

(a) The strand orientation is 5′ to 3′ from left to right, and ‘r’ denotes a ribonucleotide.
(b) Molar nearest neighbor extinction coefficients of the single-strand at 260 nm.
(c) Buffer conditions: A = 6 mM K2HPO4, 4 mM KH2PO4, 15 mM KCl, pH 7.2; B = 1 mM K2HPO4, 9 mM KH2PO4, 15 mM KCl, pH 6.2; C = 6 mM K2HPO4, 4 mM KH2PO4, 15 mM KCl, 1 mM MgCl2, pH 7.2.
Sample preparation
Thiazole Orange (TO) and DMSO (99.9% purity) were purchased from Sigma-Aldrich (St. Louis, MO, USA) and used without further purification. The test compounds were purchased from commercial sources and were used without further purification. Stock solutions of test compounds were prepared by dissolving a weighed amount in DMSO to give a concentration of 1–10 mM and stored in the dark at −20°C to prevent light-induced degradation. Working solutions of test compounds were prepared immediately before use at 15 μM concentration in the appropriate buffer containing 1% (v/v) DMSO. Serial dilutions of the stock ligand solutions contained 1% (v/v) DMSO. Stock solutions for FID screening were made in the same buffer used for the DNA preparation (Table 1). Oligonucleotide stock solutions (6 μM) were prepared in the appropriate buffer solutions with 1% (v/v) of DMSO using the pre-folded oligonucleotide stock solution (200 μM) prepared as described above.
Thiazole orange displacement assay
FID assays were conducted in duplicate at room temperature (20°C) without temperature control with a Tecan Safire2 microplate reader (Tecan US, Durham, NC, USA) with the following parameters: excitation and emission bandwidth of 5 nm, gain of 125, integration time of 200 μs and 16 reads. Emission spectra were measured at 1 nm intervals from 510 to 750 nm with an excitation wavelength of 500 nm. Assays were carried out in 96-well NBS™ black, flat bottom polystyrene microplates (Cat# 3650, Corning Inc., NY, USA). Each assay solution contained 1 μM TO, 2 μM oligonucleotide and 5 μM test compound in 150 μl. 
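The working and final concentrations quoted above are related by simple dilution arithmetic (C1·V1 = C2·V2). The sketch below is only a consistency check: it assumes, which the text does not state, that each 150 μl well is assembled from three equal 50 μl additions, and the 3 μM TO working solution is a hypothetical value back-calculated from the 1 μM final concentration.

```python
def final_concentration(stock_um: float, added_ul: float, total_ul: float) -> float:
    """Dilution arithmetic: C2 = C1 * V1 / V2 (concentrations in micromolar)."""
    return stock_um * added_ul / total_ul


TOTAL_UL = 150.0   # final well volume given in the text
ADDED_UL = 50.0    # ASSUMPTION: three equal additions per well

# 15 uM and 6 uM are the working solutions given in the text;
# the 3 uM TO working solution is an assumed, illustrative value.
working_solutions_um = {
    "test compound": 15.0,
    "oligonucleotide": 6.0,
    "TO (assumed)": 3.0,
}

for name, stock in working_solutions_um.items():
    final = final_concentration(stock, ADDED_UL, TOTAL_UL)
    print(f"{name}: {final:.1f} uM final")
```

Under these assumptions the 3-fold dilution reproduces the stated 5 μM compound and 2 μM oligonucleotide concentrations.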
The instrumental gain was adjusted using the fluorescence of the most fluorescent sample (1 μM TO and 2 μM AT triplex 5). A representation of a 96-well plate assay is depicted in Supplementary Figure S1. This configuration allowed testing five compounds and eight oligonucleotide structures per plate along with appropriate controls. The percentage TO displacement (%FID) was calculated from the corrected fluorescence emission intensity at 527 nm (λex = 500 nm) using Equation (1). The fluorescence intensity F was corrected by subtracting the fluorescence signal of the test compound/DNA mixture and that of unbound TO:
\begin{equation*}
\begin{aligned}
\%\mathrm{FID} &= 100 - \left(100 \times \frac{F}{F_0}\right)\\
F &= F_{(\mathrm{Ligand}+\mathrm{DNA}+\mathrm{TO})} - F_{(\mathrm{Buffer}+\mathrm{TO})} - F_{(\mathrm{DNA}+\mathrm{Ligand})}\\
F_0 &= F_{(\mathrm{DNA}+\mathrm{TO})} - F_{(\mathrm{Buffer}+\mathrm{TO})}
\end{aligned}
\tag{1}
\end{equation*}
The %FID was calculated using the fluorescence intensity at the TO emission maximum of 527 nm rather than using the integrated intensity over the wavelength range 510–750 nm as has been used in previous implementations of the FID method (28,47,49,54,57). We observed that interference can be introduced when the integrated intensity is used instead of the single wavelength intensity, especially when fluorescent compounds are tested (Supplementary Figures S2 and S3).
Principal component analysis (PCA) and cluster analysis
Multivariate data analysis, principal component analysis, and cluster analysis of the FID assay data were performed with R software version 3.2.3 (10 December 2015), using the R package FactoMineR (58). These programs are freely available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org. 
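As an illustration of Equation (1), the following minimal sketch computes %FID from four single-wavelength intensity readings. The function name, argument names and the example intensities are hypothetical and chosen only to make the arithmetic easy to follow; they are not values from the study.

```python
def percent_fid(f_ligand_dna_to: float,
                f_buffer_to: float,
                f_dna_ligand: float,
                f_dna_to: float) -> float:
    """Percentage TO displacement following Equation (1).

    All arguments are emission intensities at a single wavelength
    (527 nm in the text) for the named well contents.
    """
    # Corrected signal of the test well: remove unbound TO and the
    # intrinsic signal of the compound/DNA mixture.
    f = f_ligand_dna_to - f_buffer_to - f_dna_ligand
    # Reference signal of DNA-bound TO without test compound.
    f0 = f_dna_to - f_buffer_to
    if f0 <= 0:
        raise ValueError("DNA + TO signal must exceed the Buffer + TO background")
    return 100.0 - 100.0 * f / f0


# Made-up intensities (arbitrary units), for illustration only.
example = percent_fid(f_ligand_dna_to=4300.0, f_buffer_to=200.0,
                      f_dna_ligand=100.0, f_dna_to=10200.0)
print(f"%FID = {example:.1f}")
```

With these invented readings F = 4000 and F0 = 10000, so 60% of the TO signal is lost; a well in which the compound displaces no TO and contributes no signal of its own gives 0%.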
Hierarchical clustering on the principal components (59) was performed using the Euclidean distance as a measure of dissimilarity between individuals and the Ward criterion as the agglomeration method. The initial clusters obtained by hierarchical clustering were further consolidated by partitional clustering using the k-means algorithm. The results of the cluster analysis are presented as dendrograms in which the horizontal axis represents the compounds and the vertical axis represents the degree of dissimilarity between the individuals.
RESULTS AND DISCUSSION
Oligonucleotide design and characterization
A library of 12 oligonucleotides that form representative duplex, triplex, i-motif and G4 structures was designed for use in the new FID assay. Unimolecular structures were selected to avoid potential problems related to concentration and the molecularity of folding. Hairpin duplex structures are represented by deoxyribo-oligonucleotides 1–4 and the RNA-containing oligonucleotides 11–12. These structures represent B-form DNA with AT- and GC-rich base compositions, an A-form RNA and a DNA–RNA hybrid. Sequences 5 and 6 form AT and GC intramolecular triplex DNA structures, respectively. These triplex structures (H-DNA structures) are important in several biological processes (60). Oligonucleotides 7 and 10 represent i-motif structures (7,61) corresponding to the human telomeric sequence (62) and the c-MYC NHE III1 wild-type C-rich promoter sequence (3). Oligonucleotides 8 and 9 form representative G4 structures. H-Tel (8) is the human telomeric repeat sequence (63,64) that is important in cancer (16,39) and which is polymorphic depending on environmental conditions (63,65–67). An antiparallel hybrid (‘3+1’) structure is formed under the conditions of the assay. c-Myc (9) is a sequence variant of the wild-type NHE III1 sequence of the c-Myc promoter that forms a homogeneous parallel G-quadruplex structure in potassium solution (68). 
Each of these nucleic acid structures has a different stability and requires optimization of experimental conditions to ensure its formation. The influence of buffer composition, pH, DMSO and TO on the folding and stability of the structures was evaluated using UV-Vis absorption, circular dichroism (CD) spectroscopy (69–74), thermal difference spectra (TDS) (75), and UV/CD thermal denaturation (Supplementary Figures S4–S15). A summary of the physical and spectroscopic characteristics of the oligonucleotides under different buffer conditions is given in Supplementary Table S1. The effect of TO and DMSO (used as a solvent for the ligands) on the stability and folding of the triplex oligonucleotides was also evaluated (Supplementary Figures S16–S18). CD and melting experiments showed that TO and DMSO stabilized the triplex structures without changing their conformation. In summary, these experiments show that all of the test oligonucleotides formed the desired structure and were sufficiently stable (Tm > 25°C) to be used in the FID assay. The affinity of TO for the oligonucleotides in the structure array was determined by fluorimetric titrations carried out under the optimized buffer conditions (76). The results are summarized in Figure 1 and Supplementary Table S2. The corresponding binding isotherms and emission spectra are in Supplementary Figures S4–S15, panels H and I. The degree of fluorescence enhancement of TO binding depended on the DNA structure (Supplementary Table S2). In summary, the titrations showed that TO binds with similar affinity (Ka ∼ 9 × 10^5 to 6 × 10^6 M^−1) to all of the oligonucleotide structures with the exceptions of i-motifs 7 and 10 and RNA duplexes 11 and 12, where lower affinities (Ka < 6 × 10^5 M^−1) were observed. Because of their low TO affinity, these structures were omitted from the test panel. 
A simple 1:1 DNA:TO model provided adequate fits to all of our titration data for the protocol used, except for c-Myc G4, where a secondary binding process was observed at high c-Myc:TO ratios. This weak apparent binding may result from TO-mediated quadruplex dimerization, as recently reported for some viral G4s (77). Since the contribution of the low-affinity binding was negligible under the conditions of the FID assay, this process was ignored in evaluating the equilibrium constant for the TO:c-Myc G4 interaction.

Figure 1. (A) Saturation curves for fluorescence titrations of TO with different oligonucleotides. (B) Bar graph showing binding constants of TO for different DNA structures used.

FID assay design

Assay conditions were optimized to give high sensitivity and a high signal-to-noise ratio, and to minimize the effects of different binding stoichiometries. Corrections were applied to account for any interference from intrinsic ligand fluorescence or from inner filter effects. To obtain the maximum sensitivity, the gain of the microplate reader was set to maximize the fluorescence of AT triplex 5, which exhibited the largest change in emission intensity. The reagent concentrations of 1 μM TO and 2 μM nucleic acid were selected because these concentrations gave ∼75% saturation (Figure 1), a point in the binding isotherm that is sensitive to competitive displacement and that minimizes potential problems from the binding of multiple probe molecules to the structure (26,56). The lower oligonucleotide/TO ratio used in this work differs from previous assays where excess TO was used (27,28,49,55,56). The lower ratio ensures that only ligands with high affinity will be detected.
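The choice of 1 μM TO and 2 μM nucleic acid can be checked against the 1:1 binding model with a short calculation. The sketch below is illustrative only: it assumes that "saturation" refers to the fraction of TO bound, uses the exact quadratic solution of the 1:1 mass balance, and the function name `fraction_probe_bound` is hypothetical rather than taken from the original analysis.

```python
import math


def fraction_probe_bound(ka, probe_total, dna_total):
    """Fraction of probe bound in a 1:1 complex.

    ka is the association constant in M^-1; concentrations are in M.
    Uses the exact quadratic solution of the 1:1 mass-balance equations.
    """
    kd = 1.0 / ka
    s = probe_total + dna_total + kd
    complex_conc = (s - math.sqrt(s * s - 4.0 * probe_total * dna_total)) / 2.0
    return complex_conc / probe_total


if __name__ == "__main__":
    # Assumed example values spanning the Ka range reported for TO.
    for ka in (9e5, 3e6, 6e6):
        f = fraction_probe_bound(ka, probe_total=1e-6, dna_total=2e-6)
        print(f"Ka = {ka:.0e} M^-1: {100 * f:.0f}% of TO bound")
```

Under these assumptions, a Ka in the middle of the reported range places the assay on the steep part of the isotherm, which is where displacement by a competitor changes the signal most.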
Supplementary Figure S19 shows a simulation of the expected TO displacement as a function of ligand binding affinity. This simulation uses the exact concentrations in our assay and the experimentally determined TO binding constants for each structure, and assumes 1:1 binding. The simulations show that, under the conditions of our assay, the onset of TO displacement requires a ligand affinity of >10⁵ M⁻¹, and complete displacement requires a ligand affinity of >10⁸ M⁻¹. The effect of test compound concentration was also examined. The results show that good discrimination between oligonucleotides in the library was obtained at a compound concentration of 5 μM (Supplementary Figure S20). We also assessed the inner filter effect of the fluorescent ligands ethidium bromide, doxorubicin, NMM580, TmPyP2 and TmPyP4 and of a non-fluorescent ligand, netropsin, which was selected as a negative control because its absorption spectrum does not overlap with the absorption or emission spectra of TO. The results show that the absorbance of these compounds at 5 μM is <0.06 at the wavelengths of TO excitation (500 nm) and emission (527 nm) (Supplementary Figures S21 and S22). Therefore, the inner filter effect is negligible. For instance, 5 μM doxorubicin and TmPyP2 have similar absorbances at 500 and 528 nm (Supplementary Figure S21), but their %FID profiles for individual duplex structures differ (Supplementary Figure S20C and E). This shows that the difference in binding selectivity for these structures results from different affinities rather than from an inner filter effect. In addition, the importance of assessing the spectroscopic properties of potential ligands for possible interference with the FID assay is illustrated by a %FID > 100% for TmPyP4 (Supplementary Figure S20F). This apparent anomaly most likely results from fluorescence of the ligand at the emission wavelength of TO (Supplementary Figure S22).
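The kind of displacement simulation described above (Supplementary Figure S19) can be sketched as follows. This is not the authors' code: it assumes a single class of sites with strictly 1:1 binding for both TO and ligand, solves the DNA mass balance numerically by bisection, and uses an illustrative TO binding constant of 3 × 10⁶ M⁻¹ together with the assay concentrations quoted in the text (2 μM nucleic acid, 1 μM TO, 5 μM ligand). All function and parameter names are hypothetical.

```python
def free_dna(dna_total, probe_total, ka_probe, ligand_total, ka_ligand):
    """Free DNA concentration (M) when probe and ligand compete 1:1 for DNA.

    The mass balance is monotonic in free DNA, so bisection is sufficient.
    """
    def balance(d):
        bound_probe = ka_probe * d * probe_total / (1.0 + ka_probe * d)
        bound_ligand = ka_ligand * d * ligand_total / (1.0 + ka_ligand * d)
        return d + bound_probe + bound_ligand - dna_total

    lo, hi = 0.0, dna_total
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if balance(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def probe_displacement(ka_ligand, ka_probe=3e6, dna_total=2e-6,
                       probe_total=1e-6, ligand_total=5e-6):
    """Fraction of initially bound probe that is displaced by the ligand."""
    def bound_probe(ka_l, l_tot):
        d = free_dna(dna_total, probe_total, ka_probe, l_tot, ka_l)
        return ka_probe * d * probe_total / (1.0 + ka_probe * d)

    without_ligand = bound_probe(0.0, 0.0)
    with_ligand = bound_probe(ka_ligand, ligand_total)
    return 1.0 - with_ligand / without_ligand


if __name__ == "__main__":
    for exponent in range(4, 10):
        ka = 10.0 ** exponent
        print(f"Ka(ligand) = 1e{exponent} M^-1: "
              f"{100 * probe_displacement(ka):.1f}% TO displaced")
```

With these assumed parameters the displacement is small near 10⁵ M⁻¹ and close to complete near 10⁸ M⁻¹, in line with the thresholds stated in the text. The exact numbers depend on the TO binding constant chosen for each structure.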
FID assay: reference library

We validated the FID assay for nucleic acid structural selectivity with a reference library of 30 nucleic acid ligands tested against eight oligonucleotide structures. The ligand library contains various intercalators (e.g. ethidium bromide), minor groove binders (e.g. netropsin), triplex binders (e.g. coralyne) and G4 ligands (e.g. NMM) (Supplementary Table S3, Supplementary Figure S23). In addition, a variety of other chemical structures, including heterocycles, carbohydrates and peptides, each with different physical, chemical, spectroscopic and binding properties, were included in the reference library. The results of the FID assay of the 30 ligands binding to eight nucleic acid sequences and structures (four duplex, two triplex and two quadruplex) are presented as a bar graph in Figure 2. The data report the results of 240 individual binding interactions. Mostly positive values of %FID were observed, along with a few negative ones. Positive values can be explained by a fluorescence decrease resulting from displacement of TO from the nucleic acid by the competitor ligand. In these cases, %FID is directly related to the affinity of a compound for a given nucleic acid structure: the higher the %FID, the higher the affinity of the compound for that structure. For example, netropsin showed 84% FID for AT duplex 1 compared to 6% for GC duplex 2. This is consistent with the known selectivity of netropsin for AT over GC duplexes (23,25).

Figure 2. Results of FID assay performed with 30 nucleic acid binders and eight oligonucleotide structures. Each bar represents %FID. The error bars correspond to standard deviation of the %FID values from two different plate measurements.
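For readers who want to reproduce the quantity plotted in Figure 2, the sketch below shows one common way to compute %FID and its replicate statistics. It assumes the conventional definition %FID = 100 × (1 − F/F₀), where F₀ is the fluorescence of the TO–nucleic acid complex without ligand and F the fluorescence with ligand, both already corrected for background. The exact correction scheme used in this work is not restated here, and the function names are hypothetical.

```python
from statistics import mean, stdev


def percent_fid(f_with_ligand, f_without_ligand):
    """Percentage of fluorescent intercalator displaced.

    Positive values mean fluorescence dropped on adding the ligand;
    negative values mean fluorescence increased.
    """
    return 100.0 * (1.0 - f_with_ligand / f_without_ligand)


def summarize_plates(fid_values):
    """Mean and sample standard deviation of %FID across replicate plates."""
    return mean(fid_values), stdev(fid_values)


if __name__ == "__main__":
    # Made-up fluorescence readings (arbitrary units) from two plates.
    plates = [percent_fid(18.0, 100.0), percent_fid(14.0, 100.0)]
    avg, sd = summarize_plates(plates)
    print(f"%FID = {avg:.1f} +/- {sd:.1f}")
```

Note that the formula itself allows values below 0% or above 100% whenever the ligand adds fluorescence or quenches beyond the reference, which is why such values call for a spectroscopic explanation rather than an affinity interpretation.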
Negative values of %FID are more difficult to rationalize (54). A number of factors might account for a TO fluorescence enhancement leading to a negative %FID: (a) interaction between TO and the test compound; (b) increased binding affinity of TO induced by the test compound; (c) photophysical behavior of the compound similar to that of TO. To further investigate the negative %FID observed with α-naphthoflavone, coralyne and berberine interacting with the A/T duplex, the AT triplex and H-Tel G4 structures, we determined absorbance, excitation and emission spectra of TO in the presence and absence of ligand under the same experimental conditions used for the FID assay (Supplementary Figures S24–S28). Both α-naphthoflavone (Supplementary Figure S24) and coralyne (Supplementary Figures S25–S26) enhanced TO fluorescence in the presence and in the absence of oligonucleotide. This accounts for the observed negative %FID and suggests an interaction between these compounds and TO. For berberine (Supplementary Figures S27–S28), significant differences were observed in the fluorescence spectra when the experiment was conducted in a quartz cuvette compared to a 96-well plate. The %FID observed for berberine using a quartz cuvette was –1%, in contrast to –15% observed when a 96-well plate was used. This suggests non-specific binding of berberine to the plastic plate.

Statistical data analysis: principal component and cluster analysis

Traditional methods of reporting FID HTS data, such as bar graphs, are cumbersome for detailed specificity analysis because it is difficult to grasp global trends. A major goal of the present study was to develop an efficient, quantitative approach for evaluating ligand–nucleic acid interactions more effectively. Principal component analysis (PCA) and hierarchical cluster analysis (HCA) provide the tools for such an approach.
PCA is a multivariate statistical method that reduces the dimensionality of data without significant loss of information by using derived variables (principal components) (78–81). It is an unsupervised method that imposes no prior classification on the data. PCA is particularly useful for finding hidden patterns and aiding in the interpretation of high-dimensional, complex data sets (82). PCA projects the multi-dimensional data onto a new, lower-dimensional subspace. The axes in this new subspace, known as principal components (PCs), are linear combinations of the original variables with their origin located at the center of the multi-dimensional data. The PCs are mutually orthogonal (uncorrelated) and retain as much of the information in the original variables as possible. The first principal component (PC1) corresponds to the axis that contains the greatest variation of the data. The second principal component (PC2) is orthogonal to the first and oriented in the direction of the second greatest amount of variation of the data, and so on. In addition to PCA, we used HCA as an unsupervised classification method to find clusters of compounds. HCA calculates the distance between observations in multidimensional space and forms clusters of individuals based on the similarity of their variables. The results of PCA and HCA performed on the FID data of our reference library of 30 nucleic acid binders and 8 oligonucleotide structures are shown as biplot graphs, PC1–PC2 (Figure 3, left panel) and PC2–PC3 (Figure 3, right panel) (83). The variables (oligonucleotides) are represented by arrows and the individuals (compounds) by points colored according to their cluster. (We note that these plots were constructed using mean values of %FID. As one referee suggested, unaveraged primary data could be used in these graphs to show the possible variability in the clusters.)
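The projection described above can be illustrated with a minimal, dependency-free PCA. This is a sketch, not the procedure used in this work (the analysis here relied on the FactoMineR package (58)): it centers the data, forms the covariance matrix without rescaling the variables, and extracts the leading components by power iteration with deflation. Rows stand for compounds and columns for oligonucleotide structures, and all function names are hypothetical.

```python
import math


def center(rows):
    """Subtract the column means so the origin sits at the data centroid."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector by power iteration."""
    p = len(matrix)
    v = [1.0 / (i + 1) for i in range(p)]  # fixed, non-symmetric start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    return sum(v[i] * mv[i] for i in range(p)), v


def pca(rows, n_components=2):
    """Return (loadings, explained variance ratios, scores)."""
    x = center(rows)
    cov = covariance(x)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    components, ratios = [], []
    for _ in range(n_components):
        eigenvalue, v = leading_eigenpair(cov)
        components.append(v)
        ratios.append(eigenvalue / total_variance)
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(r[j] * c[j] for j in range(p)) for c in components]
              for r in x]
    return components, ratios, scores
```

In a biplot, the `scores` give the positions of the individuals (compounds) and the `components` give the directions of the variable arrows (oligonucleotides). Whether the variables are standardized before the analysis changes the result, so that choice should match the original protocol when reproducing the figures.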
The results of HCA performed on the PCs, after consolidation by an additional partitional clustering with a k-means algorithm (84), are shown as dendrograms in Supplementary Figures S29 and S30. The horizontal axis represents the compounds and the vertical axis measures the distance or similarity between individuals in the clusters. The more similar the affinity and selectivity of the individual ligands for the different DNA structures, the lower the nodes are in the tree. The levels at which the hierarchical tree was cut to initialize the k-means algorithm are represented by rectangles.

Figure 3. PCA and HCA of FID results for 30 nucleic acid binders and 8 oligonucleotide structures. Left: biplot of PC1 and PC2. Right: biplot of PC2 and PC3. The center of each cluster is represented by a larger point size.

Interpretation of the PC1–PC2 biplot

Four well-separated clusters are observed in the biplot of PC1–PC2 (Figure 3, left), which explains 89% of the variance of the data. Cluster 1 (black) includes the compounds that exhibit low or null capacity for displacement of TO (low %FID) from all the nucleic acid structures. Cluster 2 (blue) and cluster 3 (green) contain compounds that are situated close to the origin of PC1 and that showed medium affinity. Cluster 3 (green) contains the minor groove binders pentamidine, netropsin, berenil, DAPI and Hoechst, which have a selectivity for AT sequences (23,25). Cluster 4 (red) contains the DNA intercalators ethidium bromide, doxorubicin and WP762 and the porphyrins TmPyP4 and TmPyP2, which exhibit high values of %FID for almost all of the DNA structures.
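The two-stage clustering described above (a Ward hierarchical tree cut at a chosen level, followed by k-means consolidation started from that cut) can be sketched in plain Python. This is an illustrative reimplementation under simplifying assumptions, not the FactoMineR routine used for the figures: the points would be the compounds' PC scores, the tree is simply agglomerated until `k` clusters remain, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def ward_clusters(points, k):
    """Merge clusters by Ward's criterion until k remain.

    At each step the pair whose merge gives the smallest increase in
    within-cluster sum of squares is joined. Returns lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([points[i] for i in clusters[a]])
                cb = centroid([points[i] for i in clusters[b]])
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def kmeans_consolidate(points, clusters, max_iter=100):
    """Refine a hierarchical partition with k-means started from its centroids."""
    centers = [centroid([points[i] for i in c]) for c in clusters]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(len(centers)):
            members = [points[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers
```

Starting k-means from the hierarchical cut rather than from random centers makes the final partition reproducible, and it lets individuals that were merged early into a poorly fitting branch move to a closer cluster.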
Analysis of the loading factors (the magenta arrows) reveals that all of the oligonucleotides have a positive correlation with, and a similar contribution to, PC1, as indicated by the length and direction of the projection of the variables onto PC1 (Figure 3, left). The PC1–PC2 biplot shows that oligonucleotides with similar AT or GC base composition cluster together along PC2. Low correlation between AT-rich (AT triplex, AT duplex and A/T duplex) and GC-rich (GC triplex and GC duplex) oligonucleotides is observed along PC2. H-Tel duplex, H-Tel G4 and c-Myc G4, with similar AT and GC content (∼50% AT), are situated between the AT-rich and GC-rich oligonucleotides. PC1 can therefore be interpreted in terms of global affinity for the different DNA structures, and PC2 as showing different ligand affinities for AT- or GC-rich oligonucleotides. The projection of a compound onto a particular oligonucleotide vector is interpreted as its relative affinity for that DNA structure. For instance, the projections of the compounds WP762, methyl blue, NMM580 and methyl green onto the GC duplex vector within the PC1–PC2 biplot (Figure 3, left) show that the order of affinity for GC duplex (2) is approximately WP762 > methyl blue > NMM580 > methyl green, which agrees with the affinity ranking obtained from the %FID. In summary, the PC1–PC2 biplot allows easy visualization and classification of compounds based on their affinity for the DNA structures. PC1 separates the compounds based on their global affinity and PC2 separates them based on their AT or GC affinity. In this way, compounds situated at the left, center and right correspond to compounds with high, medium or low affinity for the different DNA structures.

Interpretation of the PC2–PC3 biplot

The PC2–PC3 biplot (Figure 3, right) provides a visualization of the selectivity of a compound for a particular DNA structure.
Three groups of variables hidden in the PC1–PC2 biplot are now observed in the PC2–PC3 biplot (Figure 3, right). The groups point in different directions, with an angle between them of ∼120°. The first group contains AT-rich structures (AT duplex, A/T duplex and AT triplex) and points to the bottom left corner of the PC2–PC3 biplot. The second group contains GC-rich structures (GC duplex, GC triplex and H-Tel duplex) and points to the bottom right corner. The third group contains only the G4 structures H-Tel and c-Myc and points in the positive direction of PC3. This orientation of the variables reflects differences in the structures of these oligonucleotides. Therefore, information about the selectivity for a particular DNA structure can be obtained from the location of the compounds in the PC2–PC3 biplot. The PC2–PC3 biplot shows four well-separated clusters of compounds. Cluster 1 (black) contains the minor groove binders pentamidine, netropsin, berenil, DAPI and Hoechst, which are selective for AT in duplexes and triplexes (25). Cluster 2 (red) contains the DNA intercalators ethidium bromide, actinomycin D, doxorubicin and WP762, and also the minor groove ligand chromomycin A3, which prefers GC base pairs (85). These compounds have higher affinity for GC-rich DNA structures (25). Cluster 3 (green) contains NMM580 and methylene blue, which are selective for G4s (21,25,86,87). Cluster 4 (blue) includes the remaining compounds, which have low selectivity. To summarize, the PC2–PC3 biplot shows a higher level of discrimination between different DNA structures. PC2 separates compounds based on AT or GC base composition and PC3 discriminates based on G4 binding.

Application to the NCI diversity set III library

In order to demonstrate the utility of this FID assay as a secondary screen in the context of a drug discovery platform, we used the NCI diversity set III library, which contains 1598 compounds, as a test.
As a primary screen of the library, differential scanning fluorometry was used to identify compounds that bound to a FRET-labeled G4 structure, the human telomere sequence 5′-AGGG(TTAGGG)₃ in the hybrid form. Supplementary Figure S31 shows the distribution of Tm shifts obtained, with several compounds showing ΔTm > 10 degrees, indicative of tight binding. Based on their apparent affinity and compound availability, we selected 20 of these, with Tm shifts ranging from 6 to 31 degrees (Supplementary Figures S31–S32), for secondary screening using R-FID. The results of this screen are shown in Figure 4.

Figure 4. Results of the FID assay for the 20 compounds with the highest Tm shift from the NCI diversity set III library and eight oligonucleotide structures. Each bar represents the %FID of a given compound for an oligonucleotide sequence. The error bars correspond to standard deviation of the %FID values from two different plate measurements.

To our surprise, not all of the 20 compounds that registered as hits in the thermal shift assay were able to displace TO from the unlabeled G4 hybrid structure used in our FID assay, even though the identical sequence was used in both assays. No correlation was observed between the magnitude of ΔTm and the %FID recorded (Supplementary Figure S33). To understand this result, several differences between the FID and thermal shift assays need to be emphasized. First, the thermal shift assay requires a FRET-labeled G4 structure, which might influence binding, while the FID assay does not.
Second, the thermal shift assay uses a large molar excess of ligand over G4 (250:1), while a much lower molar ratio (5:1) is used in the FID assay, perhaps resulting in differing extents of ligand binding. Third, the FID assay is by design a competition binding assay, so the apparent ligand affinity is relative to the affinity of the TO probe and is reduced from the true affinity. Recall that, as discussed above, a ligand binding constant of >10⁵ M⁻¹ is required to register TO displacement in the FID assay. Even with these considerations, the lack of correlation between the thermal shift and FID assays remains puzzling, and additional explanations for the differences are needed. First, it is possible that the ligand binds to a different site than TO, so that instead of displacement a ternary (ligand-G4-TO) complex forms. For example, if TO is bound by an ‘end pasting’ mode and stacked on a terminal G-quartet, the ligand might bind within a groove and not perturb the bound TO. Second, it is possible that the FRET labels in the thermal shift assay somehow form a binding pocket that facilitates ligand binding; since they are not present in the FID assay, no such ligand binding site would be available. We conclude that a positive FID confirms and validates ligand binding registered by the primary thermal shift assay, but that a valid initial hit might be invisible in the FID assay. Since the intent of our FID assay is to serve as a secondary screen to assess binding selectivity, we focused further analysis on the eight compounds that showed appreciable FID (>10%) in their interaction with the human telomere hybrid G4. Five of these eight compounds had ΔTm > 25°C in the primary screen. The most striking result to emerge from inspection of Figure 4 is that only one of the eight G4 hybrid binders exhibited any specificity. Compounds 2, 6, 14, 15 and 18 (Figure 4, Supplementary Figure S32) bind to all structures in the FID assay.
Compound 1 has a selectivity for AT-oligonucleotides and a shape similar to the classical minor groove binders. Compound 7 is interesting because it shows a clear preference for GC-oligonucleotides, with significant binding to the GC triplex structure and GC-rich duplexes in addition to binding to the two quadruplex structures. Compound 8 emerges as the most selective compound for G4 structures. Interestingly, although it was selected for binding to an antiparallel hybrid G4 structure, it clearly binds preferentially to the c-myc parallel G4 structure. It shows some binding to AT-rich triplex and duplex structures, but not to GC-rich duplex or triplex structures. The key point is that the FID assay fulfilled its intended purpose as a secondary screen for binding specificity. While it is disappointing that most hits from the primary thermal shift assay seem to lack specificity, it is essential to find that out sooner rather than later to avoid further development of less than ideal compounds for targeting G4 structures. Perhaps such lack of specificity is a contributing reason for the unsuccessful discovery of G4 targeted drugs so far. A PCA analysis provides further insight into the selectivity of the eight compounds, with the results shown in Figure 5 (Supplementary Figures S34–S35). In this analysis, the FID data of Figure 4 for the eight compounds showing TO displacement for the hTel G4 was combined with the FID data for the 30 reference compounds. The interesting results that emerge are that compound 8 clusters with known quadruplex binders, compound 1 clusters with known groove binders, and compound 7 clusters with known GC-specific intercalators. The remaining five compounds have little or no selectivity. These results illustrate the utility of the FID assay and PCA analysis for defining specificity and possible binding modes. 
Inspection of the structure of compound 8 shows that it is similar to crystal violet, which is in our set of reference compounds, but seems to have improved selectivity for G4 binding.

Figure 5. PCA and HCA of FID results for the combination of 30 nucleic acid binders and the eight compounds with the highest thermal shift (ΔTm) and %FID for H-Tel quadruplex from the NCI diversity set III library with 8 oligonucleotide structures. Left: biplot of PC1 and PC2. Right: biplot of PC2 and PC3. The center of each cluster is represented by a larger point size.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Institutes of Health [CA35635 to J.B.C., GM077422 to J.B.C., J.O.T.]; James Graham Brown Foundation. Funding for open access charge: James Graham Brown Foundation.

Conflict of interest statement. None declared.

REFERENCES

1. Neidle S. Principles of Nucleic Acid Structure. 2007; San Diego: Academic Press, Inc.
2. Sinden R.R. DNA Structure and Function. 1994; San Diego: Academic Press Inc.
3. Dai J., Hatzakis E., Hurley L.H., Yang D. I-Motif structures formed in the human c-MYC promoter are highly dynamic–insights into sequence redundancy and I-motif stability. PLoS ONE. 2010; 5: e11647.
4. Frank-Kamenetskii M.D., Mirkin S.M. Triplex DNA Structures. Annu. Rev. Biochem. 1995; 64: 65–95.
5. Felsenfeld G., Davies D.R., Rich A. Formation of a three-stranded polynucleotide molecule. J. Am. Chem. Soc. 1957; 79: 2023–2024.
6. Burge S., Parkinson G.N., Hazel P., Todd A.K., Neidle S. Quadruplex DNA: sequence, topology and structure. Nucleic Acids Res. 2006; 34: 5402–5415.
7. Guéron M., Leroy J.-L. The i-motif in nucleic acids. Curr. Opin. Struct. Biol. 2000; 10: 326–331.
8. Völker J., Klump H.H., Breslauer K.J. The energetics of i-DNA tetraplex structures formed intermolecularly by d(TC5) and intramolecularly by d[(C5T3)3C5]. Biopolymers. 2007; 86: 136–147.
9. Biffi G., Tannahill D., McCafferty J., Balasubramanian S. Quantitative visualization of DNA G-quadruplex structures in human cells. Nat. Chem. 2013; 5: 182–186.
10. Mergny J.-L. Alternative DNA structures: G4 DNA in cells: itae missa est. Nat. Chem. Biol. 2012; 8: 225–226.
11. Buske F.A., Mattick J.S., Bailey T.L. Potential in vivo roles of nucleic acid triple-helices. RNA Biol. 2011; 8: 427–439.
12. Han H., Hurley L.H. G-quadruplex DNA: a potential target for anti-cancer drug design. Trends Pharmacol. Sci. 2000; 21: 136–142.
13. Miller K.M., Rodriguez R. G-quadruplexes: selective DNA targeting for cancer therapeutics. Expert Rev. Clin. Pharmacol. 2011; 4: 139–142.
14. Lipps H.J., Rhodes D. G-quadruplex structures: in vivo evidence and function. Trends Cell Biol. 2009; 19: 414–422.
15. Belotserkovskii B.P., Mirkin S.M., Hanawalt P.C. DNA sequences that interfere with transcription: implications for genome function and stability. Chem. Rev. 2013; 113: 8620–8637.
16. Rhodes D., Lipps H.J. G-quadruplexes and their regulatory roles in biology. Nucleic Acids Res.
2015; 43: 8627–8637.
17. Kouzine F., Wojtowicz D., Baranello L., Yamane A., Nelson S., Resch W., Kieffer-Kwon K.-R., Benham C.J., Casellas R., Przytycka T.M. et al. Permanganate/S1 nuclease footprinting reveals non-B DNA structures with regulatory potential across a mammalian genome. Cell Syst. 2017; 4: 344–356.
18. Hurley L.H. DNA and its associated processes as targets for cancer therapy. Nat. Rev. Cancer. 2002; 2: 188–200.
19. Chaires J.B. A small molecule—DNA binding landscape. Biopolymers. 2015; 103: 473–479.
20. Chaires J.B., Mergny J.-L. Targeting DNA. Biochimie. 2008; 90: 973–975.
21. Ragazzon P., Chaires J.B. Use of competition dialysis in the discovery of G-quadruplex selective ligands. Methods. 2007; 43: 313–323.
22. Chaires J.B. A Competition Dialysis Assay for the Study of Structure-Selective Ligand Binding to Nucleic Acids. Curr. Protoc. Nucleic Acid Chem. 2003; 11, doi:10.1002/0471142700.nc0803s11.
23. Ragazzon P.A., Garbett N.C., Chaires J.B. Competition dialysis: a method for the study of structural selective nucleic acid binding. Methods. 2007; 42: 173–182.
24. Holt P.A., Buscaglia R., Trent J.O., Chaires J.B. A discovery funnel for nucleic acid binding drug candidates. Drug Dev. Res. 2011; 72: 178–186.
25. Ren J., Chaires J.B. Sequence and structural selectivity of nucleic acid binding ligands. Biochemistry. 1999; 38: 16067–16075.
26. Boger D.L., Fink B.E., Brunette S.R., Tse W.C., Hedrick M.P. A simple, high-resolution method for establishing DNA binding affinity and sequence selectivity. J. Am. Chem. Soc. 2001; 123: 5878–5891.
27. Largy E., Hamon F., Teulade-Fichou M.-P. Development of a high-throughput G4-FID assay for screening and evaluation of small molecules binding quadruplex nucleic acid structures. Anal. Bioanal. Chem. 2011; 400: 3419–3427.
28. Largy E., Saettel N., Hamon F., Dubruille S., Teulade-Fichou M.P. Screening of a chemical library by HT-G4-FID for discovery of selective G-quadruplex binders. Curr. Pharm. Des. 2012; 18: 1992–2001.
29. Alcaro S., Musetti C., Distinto S., Casatti M., Zagotto G., Artese A., Parrotta L., Moraca F., Costa G., Ortuso F. et al. Identification and characterization of new DNA G-quadruplex binders selected by a combination of ligand and structure-based virtual screening approaches. J. Med. Chem. 2013; 56: 843–855.
30. Kumar S., Xue L., Arya D.P. Neomycin−neomycin dimer: an all-carbohydrate scaffold with high affinity for AT-rich DNA duplexes. J. Am. Chem. Soc. 2011; 133: 7361–7375.
31. Glass L.S., Bapat A., Kelley M.R., Georgiadis M.M., Long E.C. Semi-automated high-throughput fluorescent intercalator displacement-based discovery of cytotoxic DNA binding agents from a large compound library. Bioorg. Med. Chem. Lett. 2010; 20: 1685–1688.
32. Holt P.A., Ragazzon P., Strekowski L., Chaires J.B., Trent J.O. Discovery of novel triple helical DNA intercalators by an integrated virtual and actual screening platform. Nucleic Acids Res. 2009; 37: 1280–1287.
33. Nasiri H.R., Bell N.M., McLuckie K.I.E., Husby J., Abell C., Neidle S., Balasubramanian S. Targeting a c-MYC G-quadruplex DNA with a fragment library. Chem. Commun. 2014; 50: 1704–1707.
34. Watkins D., Ranjan N., Kumar S., Gong C., Arya D.P.
An assay for human telomeric G-quadruplex DNA binding drugs. Bioorg. Med. Chem. Lett. 2013; 23: 6695–6699.
35. Ranjan N., Arya D. Targeting C-myc G-quadruplex: dual recognition by aminosugar-bisbenzimidazoles with varying linker lengths. Molecules. 2013; 18: 14228–14240.
36. Schoonover M., Kerwin S.M. G-quadruplex DNA cleavage preference and identification of a perylene diimide G-quadruplex photocleavage agent using a rapid fluorescent assay. Bioorg. Med. Chem. 2012; 20: 6904–6918.
37. Yang D., Okamoto K. Structural insights into G-quadruplexes: towards new anticancer drugs. Future Med. Chem. 2010; 2: 619–646.
38. Neidle S. Quadruplex nucleic acids as novel therapeutic targets. J. Med. Chem. 2016; 59: 5987–6011.
39. Maizels N. G4‐associated human diseases. EMBO Rep. 2015; 16: 910–922.
40. Balasubramanian S., Hurley L.H., Neidle S. Targeting G-quadruplexes in gene promoters: a novel anticancer strategy. Nat. Rev. Drug Discov. 2011; 10: 261–275.
41. Neidle S. Human telomeric G-quadruplex: The current status of telomeric G-quadruplexes as therapeutic targets in human cancer. FEBS J. 2010; 277: 1118–1125.
42. Hansel-Hertsch R., Di Antonio M., Balasubramanian S. DNA G-quadruplexes in the human genome: detection, functions and therapeutic potential. Nat. Rev. Mol. Cell Biol. 2017; 18: 279–284.
43. Islam M.K., Jackson P.J., Rahman K.M., Thurston D.E. Recent advances in targeting the telomeric G-quadruplex DNA sequence with small molecules as a strategy for anticancer therapies. Future Med. Chem. 2016; 8: 1259–1290.
44.
Lytton-Jean A.K.R., Han M.S., Mirkin C.A. Microarray detection of duplex and triplex DNA binders with DNA-modified gold nanoparticles. Anal. Chem. 2007; 79: 6037–6041.
45. Xu N., Yang H., Cui M., Wan C., Liu S. High-performance liquid chromatography–electrospray ionization-mass spectrometry ligand fishing assay: a method for screening triplex DNA binders from natural plant extracts. Anal. Chem. 2012; 84: 2562–2568.
46. De Cian A., Guittat L., Shin-ya K., Riou J.-F., Mergny J.-L. Affinity and selectivity of G4 ligands measured by FRET. Nucleic Acids Symp. Ser. 2005; 49: 235–236.
47. Boger D.L., Tse W.C. Thiazole orange as the fluorescent intercalator in a high resolution fid assay for determining DNA binding affinity and sequence selectivity of small molecules. Bioorg. Med. Chem. 2001; 9: 2511–2518.
48. Lewis M.A., Long E.C. Fluorescent intercalator displacement analyses of DNA binding by the peptide-derived natural products netropsin, actinomycin, and bleomycin. Bioorg. Med. Chem. 2006; 14: 3481–3490.
49. Monchaud D., Allain C., Teulade-Fichou M.-P. Development of a fluorescent intercalator displacement assay (G4-FID) for establishing quadruplex-DNA affinity and selectivity of putative ligands. Bioorg. Med. Chem. Lett. 2006; 16: 4842–4845.
50. Monchaud D., Allain C., Teulade-Fichou M.P. Thiazole orange: a useful probe for fluorescence sensing of G-quadruplex-ligand interactions. Nucleosides Nucleotides Nucleic Acids. 2007; 26: 1585–1588.
51. Tse W.C., Boger D.L. A fluorescent intercalator displacement assay for establishing DNA binding selectivity and affinity. Acc. Chem. Res. 2004; 37: 61–69.
52. Tse W.C., Boger D.L.
A fluorescent intercalator displacement assay for establishing DNA binding selectivity and affinity. Curr. Protoc. Nucleic Acid Chem. 2001; 20: 8.5.1– 8.5.11. 53. Jamroskovic J., Livendahl M., Eriksson J., Chorell E., Sabouri N. Identification of compounds that selectively stabilize specific G-quadruplex structures by using a thioflavin T-displacement assay as a tool. Chemistry . 2016; 22: 18932– 18943. Google Scholar CrossRef Search ADS PubMed 54. Tran P.L.T., Largy E., Hamon F., Teulade-Fichou M.-P., Mergny J.-L. Fluorescence intercalator displacement assay for screening G4 ligands towards a variety of G-quadruplex structures. Biochimie . 2011; 93: 1288– 1296. Google Scholar CrossRef Search ADS PubMed 55. Chan D.S.-H., Yang H., Kwan M.H.-T., Cheng Z., Lee P., Bai L.-P., Jiang Z.-H., Wong C.-Y., Fong W.-F., Leung C.-H.et al. Structure-based optimization of FDA-approved drug methylene blue as a c-myc G-quadruplex DNA stabilizer. Biochimie . 2011; 93: 1055– 1064. Google Scholar CrossRef Search ADS PubMed 56. Monchaud D., Allain C., Bertrand H., Smargiasso N., Rosu F., Gabelica V., De Cian A., Mergny J.L., Teulade-Fichou M.P. Ligands playing musical chairs with G-quadruplex DNA: A rapid and simple displacement assay for identifying selective G-quadruplex binders. Biochimie . 2008; 90: 1207– 1223. Google Scholar CrossRef Search ADS PubMed 57. Riechert-Krause F., Autenrieth K., Eick A., Weisz K. Spectroscopic and calorimetric studies on the binding of an indoloquinoline drug to parallel and antiparallel DNA triplexes. Biochemistry . 2012; 52: 41– 52. Google Scholar CrossRef Search ADS PubMed 58. Le S., Josse J., Husson F. FactoMineR: An R package for multivariate analysis. J. Stat. Softw. 2008; 25: 1– 18. Google Scholar CrossRef Search ADS 59. Husson F., Josse J., Pagès J. Principal component methods—hierarchical clustering—partitional clustering: why would we need to choose for visualizing data. 
2010; Technical Report https://francoishusson.files.wordpress.com/2017/02/hcpc_husson_josse.pdf. 60. Praseuth D., Guieysse A.L., Hélène C. Triple helix formation and the antigene strategy for sequence-specific control of gene expression. Biochim. Biophys. Acta (BBA) - Gene Struct. Expression . 1999; 1489: 181– 206. Google Scholar CrossRef Search ADS 61. Phan A.T., Leroy J.-L. Intramolecular i-motif structures of telomeric DNA. J. Biomol. Struct. Dyn. 2000; 17: 245– 251. Google Scholar CrossRef Search ADS PubMed 62. Phan A.T., Guéron M., Leroy J.-L. The solution structure and internal motions of a fragment of the cytidine-rich strand of the human telomere. J. Mol. Biol. 2000; 299: 123– 144. Google Scholar CrossRef Search ADS PubMed 63. Wang Y., Patel D.J. Solution structure of the human telomeric repeat d[AG3(T2AG3)3] G-tetraplex. Structure . 1993; 1: 263– 282. Google Scholar CrossRef Search ADS PubMed 64. Parkinson G.N., Lee M.P.H., Neidle S. Crystal structure of parallel quadruplexes from human telomeric DNA. Nature . 2002; 417: 876– 880. Google Scholar CrossRef Search ADS PubMed 65. Dai J., Carver M., Yang D. Polymorphism of human telomeric quadruplex structures. Biochimie . 2008; 90: 1172– 1183. Google Scholar CrossRef Search ADS PubMed 66. Lane A.N., Chaires J.B., Gray R.D., Trent J.O. Stability and kinetics of G-quadruplex structures. Nucleic Acids Res. 2008; 36: 5482– 5515. Google Scholar CrossRef Search ADS PubMed 67. Ambrus A., Chen D., Dai J., Bialis T., Jones R.A., Yang D. Human telomeric sequence forms a hybrid-type intramolecular G-quadruplex structure with mixed parallel/antiparallel strands in potassium solution. Nucleic Acids Res. 2006; 34: 2723– 2735. Google Scholar CrossRef Search ADS PubMed 68. Ambrus A., Chen D., Dai J., Jones R.A., Yang D. Solution structure of the biologically relevant G-quadruplex element in the human c-MYC promoter. implications for G-quadruplex stabilization. Biochemistry . 2005; 44: 2048– 2058. 
Google Scholar CrossRef Search ADS PubMed 69. Vorlíčková M., Kejnovská I., Bednářová K., Renčiuk D., Kypr J. Circular dichroism spectroscopy of DNA: from duplexes to quadruplexes. Chirality . 2012; 24: 691– 698. Google Scholar CrossRef Search ADS PubMed 70. Bishop G.R., Chaires J.B. Characterization of DNA structures by circular dichroism. Curr. Protoc. Nucleic Acid Chem. 2002; 11: 7.11.1– 7.11.8. Google Scholar CrossRef Search ADS 71. Paramasivan S., Rujan I., Bolton P.H. Circular dichroism of quadruplex DNAs: applications to structure, cation effects and ligand binding. Methods . 2007; 43: 324– 331. Google Scholar CrossRef Search ADS PubMed 72. Jaumot J., Eritja R., Navea S., Gargallo R. Classification of nucleic acids structures by means of the chemometric analysis of circular dichroism spectra. Anal. Chim. Acta . 2009; 642: 117– 126. Google Scholar CrossRef Search ADS PubMed 73. Kypr J., Kejnovská I., Renčiuk D., Vorlíčková M. Circular dichroism and conformational polymorphism of DNA. Nucleic Acids Res. 2009; 37: 1713– 1725. Google Scholar CrossRef Search ADS PubMed 74. del Villar-Guerra R., Gray R.D., Chaires J.B. Characterization of quadruplex DNA structure by circular dichroism. Curr. Protoc. Nucleic Acid Chem. 2017; 68: 17.8.1– 17.8.16. Google Scholar CrossRef Search ADS 75. Mergny J.-L., Li J., Lacroix L., Amrane S., Chaires J.B. Thermal difference spectra: a specific signature for nucleic acid structures. Nucleic Acids Res. 2005; 33: e138. Google Scholar CrossRef Search ADS PubMed 76. Qu X., Chaires J.B. Analysis of drug-DNA binding data. Methods Enzymol. 2000; 321: 353– 369. Google Scholar CrossRef Search ADS PubMed 77. Krafčíková P., Demkovičová E., Víglaský V. Ebola virus derived G-quadruplexes: Thiazole orange interaction. Biochim. Biophys. Acta (BBA) - Gen. Subj. 2017; 1861: 1321– 1328. Google Scholar CrossRef Search ADS 78. Jolliffe I.T. Principal Component Analysis . 2002; NY: Springer. 79. Wold S., Esbensen K., Geladi P. 
Principal component analysis. Chemom. Intell. Lab. Syst. 1987; 2: 37– 52. Google Scholar CrossRef Search ADS 80. Ringner M. What is principal component analysis. Nat Biotech . 2008; 26: 303– 304. Google Scholar CrossRef Search ADS 81. Lever J., Krzywinski M., Altman N. Points of significance: principal component analysis. Nat. Methods . 2017; 14: 641– 642. Google Scholar CrossRef Search ADS 82. Ivosev G., Burton L., Bonner R. Dimensionality reduction and visualization in principal component analysis. Anal. Chem. 2008; 80: 4933– 4944. Google Scholar CrossRef Search ADS PubMed 83. Gbriel K.R. The biplot graphic display of matrices with application to principal component analysis. Biometrika . 1971; 58: 453– 467. Google Scholar CrossRef Search ADS 84. Ding C., He X.F. Dai H, Srikant R, Zhang C. Advances in Knowledge Discovery and Data Mining, Proceedings . 2004; 3056: Berlin: Springer-Verlag Berlin. 414– 418. Google Scholar CrossRef Search ADS 85. Barceló F., Ortiz-Lombardía M., Martorell M., Oliver M., Méndez C., Salas J.A., Portugal J. DNA binding characteristics of mithramycin and chromomycin analogues obtained by combinatorial biosynthesis. Biochemistry . 2010; 49: 10543– 10552. Google Scholar CrossRef Search ADS PubMed 86. Nicoludis J.M., Barrett S.P., Mergny J.-L., Yatsunyk L.A. Interaction of human telomeric DNA with N-methyl mesoporphyrin IX. Nucleic Acids Res. 2012; 40: 5432– 5447. Google Scholar CrossRef Search ADS PubMed 87. Nicoludis J.M., Miller S.T., Jeffrey P.D., Barrett S.P., Rablen P.R., Lawton T.J., Yatsunyk L.A. Optimized end-stacking provides specificity of N-methyl mesoporphyrin IX for human telomeric G-quadruplex DNA. J. Am. Chem. Soc. 2012; 134: 20446– 20456. Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Complete genomic and transcriptional landscape analysis using third-generation sequencing: a case study of Saccharomyces cerevisiae CEN.PK113-7D
Jenjaroenpun, Piroon; Wongsurawat, Thidathip; Pereira, Rui; Patumcharoenpol, Preecha; Ussery, David W; Nielsen, Jens; Nookaew, Intawat
doi: 10.1093/nar/gky014 pmid: 29346625
Abstract
Completion of eukaryal genomes can be a difficult task because of the highly repetitive sequences along the chromosomes and the short read lengths of second-generation sequencing. Saccharomyces cerevisiae strain CEN.PK113-7D, widely used as a model organism and a cell factory, was selected for this study to demonstrate the superior capability of very long sequence reads for de novo genome assembly. We generated long reads using two common third-generation sequencing technologies (Oxford Nanopore Technology (ONT) and Pacific Biosciences (PacBio)) and used short reads obtained using Illumina sequencing for error correction. Assembly of the reads derived from all three technologies resulted in complete sequences for all 16 yeast chromosomes, as well as the mitochondrial chromosome, in one step. Further, we identified three types of DNA methylation (5mC, 4mC and 6mA). Comparison between the reference strain S288C and strain CEN.PK113-7D identified chromosomal rearrangements against a background of similar gene content between the two strains. We identified full-length transcripts through ONT direct RNA sequencing technology. This allows for the identification of transcriptional landscapes, including untranslated regions (UTRs) (5′ UTR and 3′ UTR), as well as differential gene expression quantification. About 91% of the predicted transcripts could be consistently detected across biological replicates grown either on glucose or ethanol. Direct RNA sequencing identified many polyadenylated non-coding RNAs, rRNAs, telomere-RNA, long non-coding RNA and antisense RNA. This work demonstrates a strategy to obtain complete genome sequences and transcriptional landscapes that can be applied to other eukaryal organisms.

INTRODUCTION
The genome of the most well studied eukaryotic model organism, Saccharomyces cerevisiae strain S288c, was sequenced and released in 1996; it was the first complete, high quality genome sequence of a eukaryal organism (1).
Since then, the development of DNA sequencing technologies has yielded scientific breakthroughs that enable us to obtain and analyze genomic DNA sequences at a faster, more economical pace (2). As of August 2017, the NCBI genome database lists >500 Saccharomyces genomes and 4600 eukaryal sequenced genomes. However, <1% (35 genomes) of these are classified as ‘complete genomes’, which harbor contiguous chromosomal sequence(s) without gaps (definition by NCBI); this includes 1 animal (Caenorhabditis elegans), 29 fungi and 5 protists. All of these have relatively small genome sizes, most <100 Mb. Thus, >99% of the eukaryal genomes are drafts, and virtually all of the larger genomes are incomplete. There are many limitations of draft genomes, which can lead to misinterpretations (e.g. see (3)). Complete genome status should be the objective for a genome-sequencing project, even though draft genomes can provide most of the coding sequences that are sufficient to gain functional insights about an organism. Once a genome is obtained, transcriptional analysis can be performed to improve gene annotation and identify dynamic signatures of gene expression. Traditional RNA-Seq has been widely employed in several studies as a powerful tool for transcriptomics (4). The small percentage of complete genome sequences in both eukaryotes (as mentioned above) and prokaryotes (5) strongly indicates the difficulties and high cost of properly locating and ordering DNA segments obtained from assemblers as well as resolving ambiguities or discrepancies among reads for a completely assembled genome sequence (3). One major limitation lies in reads generated from DNA sequencing technologies. First-generation sequencing technology (6) can generate moderately long and accurate reads, but is a slow and expensive method for obtaining the high sequencing depth required for genome assemblers. 
Second-generation sequencing technologies (7) (such as Illumina, 454 and Ion Torrent) can generate massive amounts of reads with high accuracy, although the reads are too short to allow for the de novo assembly of complete genomes (8), resulting in pieces of DNA rather than chromosomal-sized contiguous sequences. Nevertheless, DNA spanning technologies from BioNanoGenomics, 10X Genomics, and Dovetail cHiCago sequencing company can produce long pieces of DNA sequence (with a mean span length of 30–250 kb depending on the technology) from short reads. Third-generation sequencing technologies can generate very long reads at the single-molecule level, though the error rate is high. Pacific Biosciences (PacBio) has developed Single Molecule Real Time (SMRT) technology that offers two sequencing strategies: continuous long read (CLR) and circular consensus long read (CCS). The error rate of raw reads derived from the CLR approach is around 13% (7,9). The high level of error can be reduced to ≤1% (7) with the CCS approach (multiple passes), which involves sequencing shorter DNA pieces, typically shorter than 25 kb, several times. Oxford Nanopore Technologies (ONT) has developed a portable sequencing device called MinION that is able to perform single-molecule DNA sequencing (10) and, recently, cDNA sequencing (11). The DNA sequencing by ONT also offers two chemistries, 1D and 1D², for the latest version of flow cell R9.4/R9.5. The raw reads generated by 1D chemistry have a sequencing error rate similar to PacBio CLR, with possible read lengths of more than 300 kb (10). By using 1D² chemistry, the mean error rate can be improved to <4%, although the throughput will be reduced by half when compared to the 1D chemistry. In addition, both PacBio and ONT can directly detect DNA methylation (12–14), providing additional valuable information for epigenetics. The S.
cerevisiae strain CEN.PK113-7D, the offspring of parental strains ENY.WA-1A and MC996A, is used extensively in academic and industrial research, especially in metabolic engineering and systems biology, due to a combination of ease of genetic manipulation and a fast growth rate (15). Based on systems biology analysis by Canelas et al. (16), the phenotypic differences between CEN.PK113-7D and S288C are mainly observed in protein metabolism and ergosterol biosynthesis. Having a high quality complete genome for this strain is important for a detailed mechanistic understanding at the systems biology level. Otero et al. (17) first performed whole-genome sequencing of the CEN.PK113-7D strain using short reads (35 bp) with 18X coverage to identify single nucleotide variations (SNVs) compared to the S288c strain. Some of these SNVs were related to metabolic differences between the two strains. Later, Nijkamp et al. (18) performed a de novo assembly of the CEN.PK113-7D strain genome, with sequences from a GS FLX+ system, 454 Life Sciences (average read length of 350 bp) and Illumina short reads (2 × 50 bp). The result was a draft genome sequence, with 565 contiguous DNA sequences (contigs) instead of the contiguous 16 chromosomes. Recently, third-generation sequencing was used to sequence the genomes of S. cerevisiae strain S288C and other isolates (19–21). The promising results from the de novo assembly of ONT+Illumina and PacBio+Illumina gave a high degree of sequence scaffold continuity—>99% accuracy when compared to the reference genome sequence of S288C. However, the sequence of all 16 chromosomes was still not complete (19). In this study, we performed whole-genome sequencing of the CEN.PK113-7D strain with a combination of three sequencing technologies: PacBio, ONT and Illumina (22) to obtain a complete genome sequence by de novo assembly. We also performed a genome comparison between the S288C and CEN.PK113-7D strains. 
Further, we identified the transcriptional landscapes of the CEN.PK113-7D strain under diauxic growth conditions, using ONT direct RNA sequencing technology.

MATERIALS AND METHODS

Genomic DNA extraction and cell cultivation
Saccharomyces cerevisiae CEN.PK113-7D (MATa MAL2-8c SUC2, obtained from Dr Peter Kötter, Frankfurt, Germany) was cultivated overnight in 15 ml of yeast extract peptone dextrose (YPD) medium (10 g/l of yeast extract, 20 g/l of peptone and 20 g/l of glucose). We used the Blood & Cell Culture DNA Mini Kit (Qiagen, Hilden, Germany) to extract genomic DNA from 3 ml of the overnight yeast culture (∼5 × 10⁸ cells). The protocol recommended by the manufacturer was modified for yeast cells in the following steps: (i) the lyticase digestion was extended to 1 h, (ii) the spheroplasts were centrifuged at 2000 × g for 5 min, (iii) proteinase K digestion was performed at 60°C for 2.5 h, (iv) RNase A was added after the proteinase K digestion, (v) RNase A incubation was performed overnight at 37°C and the final elution volume was reduced to 1 ml of buffer QF and (vi) after the precipitation with isopropanol, the DNA was spooled by inverting the tube, recovered with a pipette tip, washed in 1 ml of cold 70% ethanol, dried at room temperature for 10–20 min and dissolved in 0.1× TE buffer (1 mM Tris, 0.1 mM EDTA, pH 8.0).

RNA extraction and cell cultivation
We extracted RNA from S. cerevisiae CEN.PK113-7D cultivated in 50 ml of defined media as previously described (23) with 20 g/l of glucose, 7.5 g/l of (NH₄)₂SO₄, 14.4 g/l of KH₂PO₄ and the pH adjusted to 6.5 with NaOH. We sampled the culture at two different time points: mid-exponential growth on glucose (∼4.3 × 10⁷ cells/ml) and oxidative growth on the ethanol/fermentation products (∼2.6 × 10⁸ cells/ml).
At each sampling point, we quickly transferred 15 ml of the sample into a 50 ml conical tube half-filled with ice pellets, centrifuged it at 2000 × g for 5 min, snap froze it in liquid nitrogen and stored it at −80°C. We used the RNeasy Mini Kit (Qiagen, Germany) to extract RNA from 3 to 4 × 10⁸ frozen cells with the protocol recommended by the manufacturer. The cells were disrupted in 2-ml tubes filled with 500 mg of acid-washed glass beads (425–600 μm particle size of Lysing Matrix C, MP Biomedicals) using a FastPrep-24 Instrument (MP Biomedicals, California, USA) at 6.0 m/s for 40 s. We used the total RNA obtained from the three biological replicates for direct RNA sequencing, following the manufacturer's recommendation of 500 ng of poly-A RNA as the starting amount.

Library preparation and genomic DNA sequencing by PacBio
We produced one PacBio library using the SMRTbell™ Template Prep Kit version 1.0 according to the manufacturer's instructions. In brief, we sheared 10 μg of genomic DNA per library into 20 kb fragments using the Megaruptor system, followed by an exo VII treatment, DNA damage repair, and end-repair before ligation of hairpin adaptors to generate a SMRTbell™ library for circular consensus sequencing. We then subjected the library to exo treatment and PB AMPure bead wash procedures for clean-up before it was size-selected with the BluePippin system with a cut-off value of 9000 bp. We used one SMRTcell™ to sequence the DNA library on the PacBio Sequel instrument using the Sequel 2.0 polymerase and 600 min of movie time. The high quality PacBio reads are deposited in an SRA database under BioProject:PRJNA398797, SRP116559.

Library preparation and genomic DNA sequencing by ONT
We performed genomic sequencing using the Rapid Sequencing Kit for genomic DNA SQK-RAD002 (ONT, USA). The protocol of the library preparation is provided in Supplementary link. The DNA library was eluted and loaded onto a flow cell for sequencing.
We accomplished the flow cell loading in three steps: (i) draw back a small volume to remove any bubbles, (ii) prime the flow cell and (iii) add 75 μl of sample to the flow cell via the sample port in a dropwise fashion. We sequenced the genomic DNA on a single R9.5/FLO-MIN107 flow cell on a MinION Mk1B for 48 h. We then base-called the signal files (.fast5) using Albacore version 1.2.6 (ONT, USA). The high quality DNA reads from ONT are deposited in an SRA database under accession number BioProject: PRJNA398797, SRP116559.

Library preparation and direct RNA sequencing by ONT
We performed direct RNA sequencing using the Direct RNA Sequencing protocol for the MinION with the SQK-RNA001 kit (ONT, USA), which recommends 500 ng of poly-A RNA for input. We purified poly-A RNA from total RNA of either the glucose or the ethanol condition. In all, we used about 222–550 ng of poly-A RNA purified from glucose and ethanol conditions as the input for library preparation, in adherence to the kit protocol. The protocol of the library preparation is provided in Supplementary link. We then loaded the library onto a flow cell (in the same way as for the DNA sequencing described previously) and sequenced the polyadenylated RNA on a single R9.5/FLO-MIN107 flow cell on a MinION Mk1B for 48 h. For base calling, we used the locally installed software Albacore version 2.1.0. The high quality RNA reads from ONT are deposited in an SRA database under accession number BioProject:PRJNA398797, SRP116559.

Bioinformatics and statistical analysis

De novo assembly and polishing
We first filtered the raw reads from both ONT sequencing runs using a mean quality score cutoff of 9 in the Albacore software (version 1.2.6) to obtain high quality ONT reads. We obtained high quality PacBio reads from SMRT Link software using the default setting. Only reads longer than 500 bases were kept as high-quality reads and used for further analyses.
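The two read filters described above (a mean quality cutoff and a minimum read length) can be illustrated with a minimal Python sketch. It is not the Albacore or SMRT Link implementation: the function names `mean_phred` and `keep_read` are hypothetical, the Phred+33 encoding is assumed, and averaging the per-base error probabilities (rather than the raw scores) is an assumption about how a mean quality score might be derived.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality, obtained by averaging per-base error probabilities.

    Assumes FASTQ Phred+33 encoding; this is an illustrative choice, not
    necessarily how the base caller computes its own mean quality score.
    """
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(sequence, quality_string, min_quality=9.0, min_length=500):
    """Keep a read only if it is longer than min_length and passes the quality cutoff."""
    return len(sequence) > min_length and mean_phred(quality_string) >= min_quality
```

With these defaults, a read of exactly 500 bases is discarded, since the text keeps only reads longer than 500 bases.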
We identified the overlap between the high quality ONT and PacBio reads using GraphMap software version 0.5.2 (24). For de novo assembly of the reads, we used Canu software version 1.5 (25) (at default parameters) with three strategies: (i) use ONT reads alone, (ii) use PacBio reads alone and (iii) use both ONT and PacBio reads. We refer to the contigs obtained from these genome assemblies as ONT_assembly, PacBio_assembly and OP_assembly, respectively. We used Pilon software version 1.22 (26) to further polish the assembled contigs with Illumina reads from our previously published data (22) to obtain a high quality genome.

Genome comparison, computational annotation, and methylation analysis
With the complete S288C genome and annotation information (version R64) from the Saccharomyces Genome Database (SGD), we used MUMMER software version 3 for global genome comparisons of the assemblies with the S. cerevisiae strain S288C genome (27). We selected the best assembly result (OP_assembly) to perform genome annotations. We annotated the open reading frames (ORFs) of coding sequences (CDSs) and non-coding RNA sequences on the CEN.PK113-7D genome by a similarity search of the S288C ORF sequences (query) against the CEN.PK113-7D genome sequence using Blat software version 36 (28). In addition, we employed ab initio CDS calling using AUGUSTUS software version 2.5.5 (29) to identify possible new CDSs in the CEN.PK113-7D genome that were probably not present in S288C. For the local genome comparisons, we used LAST software version 1.04.00 (30) to identify synteny, inversion, and translocation events between S288C and CEN.PK113-7D chromosomes. Further, we called the possible DNA methylations at the signal level of DNA sequencing using Nanopolish (default parameters) to identify 5mC methylation (14) for ONT reads.
For PacBio reads, we used blasr (31) and employed kinetic tools from SMRT Link software version 4.0 to identify 4mC and 6mA methylations, using a P-value cut-off of 0.001 and a read coverage of >30. We summarized the genome features and comparison results and plotted them for global visualization using Circos software version 0.69-4 (32).

Transcriptional landscape analysis
We first filtered the raw reads obtained from direct RNA sequencing with Albacore (version 1.2.6) using a quality score cutoff of 8 to obtain high quality reads. Then we employed GraphMap software version 0.5.2 (24) to align the high quality reads on the CEN.PK113-7D complete genome to identify transcriptional landscapes. We used two strategies (direct chromosome alignment and transcript model guided alignment) to map the direct RNA sequence reads. We quantified the gene expression levels based on the transcript model guided alignments by counting the number of mapped reads with respect to the transcript location using bedtools software version 2.26 (33). We performed the differential gene expression analysis in ethanol versus glucose conditions with the negative binomial statistical approach in the DESeq2 package (34). The P-value of each individual transcript was corrected for multiple testing using the Benjamini–Hochberg method to generate adjusted P-values. We used the PIANO package (36) to perform the gene set analysis of Gene Ontology (GO), which is a controlled vocabulary describing gene functions and their relationships, organized in a hierarchical structure (35). We selected the GO terms that have an adjusted enrichment P-value of less than 10e-6 and plotted a heatmap. In addition, we re-analyzed the Illumina RNA-Seq data from our previous study (22) by mapping the reads only to the CEN.PK113-7D complete genome using the Stampy aligner version 1.0.31 with the default parameters (37).
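The Benjamini–Hochberg adjustment mentioned above can be sketched in a few lines of Python. This is a plain illustration of the step-up procedure, not the code used by DESeq2; the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each raw P-value is scaled by the number of tests divided by its rank, and the running minimum keeps a smaller raw P-value from receiving a larger adjusted value than a bigger one.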
To compare the dynamic range of direct RNA sequencing (this study) with traditional RNA-Seq (Illumina RNA-Seq data from our previous study (22)), we calculated the mean coverage depth based on the mapped reads for each transcript for both datasets (see detail of calculation in the Supplementary text). We used the distribution of the mean coverage depth of each biological replicate for the dynamic range comparison. To identify UTR regions, we developed an in-house Python script that maps the 5′ and 3′ gene boundaries detected by direct RNA sequencing by searching for a sharp reduction in signal at both ends of the mapped reads. The regions between gene boundaries and ORFs can be defined as 5′ and 3′ UTRs at the 5′ and 3′ ends of a given transcript, respectively. The details of all bioinformatics commands used and the Python script are provided in Supplementary text.

RESULTS

Third-generation sequencing long reads
We generated high quality sequencing reads with third-generation sequencing using both ONT and PacBio. We obtained about 130 000 reads from ONT MinION, corresponding to 830 million bases (Mb) of data, with an N50 (the shortest sequence length at 50% of sequenced bases) of 12 500 bases; this corresponds to a 69-fold genome coverage for the yeast genome. Using the CCS chemistry of PacBio, we generated a higher number of reads (∼739 000), with 4900 Mb of data and an N50 of 8700 bases. Although the PacBio reads had a shorter average length, the larger number of reads resulted in a 408-fold coverage, about six times greater than that obtained with the ONT sequencing. The details of the third-generation sequencing reads are provided in Supplementary Table TS1. The distribution of the read lengths (Figure 1A) shows that ONT generated longer reads than PacBio. We investigated the overlap of reads between ONT and PacBio and found a high level of overlap (Figure 1B), even though the number of reads generated from PacBio was 5.6 times that from ONT.
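As a worked illustration of the two summary statistics quoted above, the following Python sketch computes an N50 from a list of read lengths and a fold coverage from a total base count. The function names are hypothetical, and the 12 Mb genome size used in the usage note is a rounded assumption rather than a value stated in the text.

```python
def n50(read_lengths):
    """Shortest read length such that reads of that length or longer hold at least 50% of all bases."""
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def fold_coverage(total_bases, genome_size):
    """Average sequencing depth: sequenced bases divided by genome length."""
    return total_bases / genome_size
```

Assuming a genome of roughly 12 Mb, `fold_coverage(830e6, 12e6)` gives about 69 and `fold_coverage(4900e6, 12e6)` gives about 408, in line with the coverage values reported for ONT and PacBio.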
Surprisingly, about 13% and 12% of the reads were specific (non-overlapping) for ONT and PacBio sequencing, respectively. The non-overlapping reads may reflect differences in sample preparation; the PacBio library preparation has a DNA size selection procedure, whereas ONT does not have any size selection.

Figure 1. Summary of DNA sequencing reads from ONT and PacBio. (A) Histogram showing the distribution of read lengths of high quality DNA sequencing reads. (B) The read overlap plot between ONT and PacBio. The red and blue colors represent the DNA sequencing reads from ONT and PacBio, respectively.

De novo genome assembly
We performed de novo assembly using the Canu software (25), with three strategies: (i) use ONT reads only, (ii) use PacBio reads only and (iii) use both ONT and PacBio reads. For each strategy, we polished the assembly (base correction) using short reads (Illumina) and the Pilon software (26). The resulting de novo assembly for all three methods produced full-length, contiguous DNA sequences for nearly all of the chromosomes and the mitochondrial genome, with a length comparable to the S288C chromosomes (see the assembly statistics in Supplementary Table TS2). Notably, in all three cases, the assembled 2-micron plasmid is much longer than the known length of around 6.3 kb, as shown in Table 1. Interestingly, the ONT_assembly (obtained from strategy 1) has the best results, in terms of the correct number of known chromosomes (18 contigs = 16 chromosomes plus mitochondria plus 2-micron plasmid).
The PacBio_assembly (obtained from strategy 2) has 19 contigs, caused by a broken mitochondrial chromosome. The OP_assembly (obtained from strategy 3) has the highest number of contigs (21 contigs), with three additional pieces of telomere DNA and two additional pieces associated with the 2-micron plasmid, which had a similar size when ONT or PacBio reads were used alone. Unexpectedly, a contig derived from OP_assembly joined ChrVII with ChrXIII at their telomeric regions, as shown in Supplementary Figure S1. We investigated the mapped reads from ONT, PacBio, and Illumina on the joined chromosome region, found a clear breakpoint, and then separated ChrVII and ChrXIII.

Table 1. Summary of de novo assembly results of S. cerevisiae strain CEN.PK113-7D obtained from the three strategies, compared with the genome of S. cerevisiae strain S288C

| Feature | CEN.PK113-7D ONT_assembly (69X) | CEN.PK113-7D PacBio_assembly (408X) | CEN.PK113-7D OP_assembly (477X) | S288C SGD |
| --- | --- | --- | --- | --- |
| chrI | 224 821 | 241 274 | 235 019 | 230 218 |
| chrII | 806 426 | 820 406 | 827 088 | 813 184 |
| chrIII | 319 119 | 369 115 | 367 275 | 316 620 |
| chrIV | 1 504 163 | 1 518 811 | 1 518 534 | 1 531 933 |
| chrV | 577 655 | 593 818 | 579 053 | 576 874 |
| chrVI | 272 158 | 278 189 | 286 399 | 270 161 |
| chrVII | 1 123 142 | 1 138 579 | 1 137 891 (**2 051 454) | 1 090 940 |
| chrVIII | 560 935 | 577 405 | 577 241 | 562 643 |
| chrIX | 441 593 | 461 772 | 452 809 | 439 888 |
| chrX | 764 537 | 759 892 | 777 694 | 745 751 |
| chrXI | 679 352 | 690 875 | 680 699 | 666 816 |
| chrXII | 1 117 833 | 1 078 292 | 1 114 766 | 1 078 177 |
| chrXIII | 912 255 | 913 070 | 913 563 (**2 051 454) | 924 431 |
| chrXIV | 776 471 | 801 157 | 778 019 | 784 333 |
| chrXV | 1 091 066 | 1 101 379 | 1 105 746 | 1 091 291 |
| chrXVI | 948 593 | 965 730 | 979 195 | 948 066 |
| chrmt | 86 132 | *53 964, *27 063 | 86 343 | 85 779 |
| Total length (main+mt) | 12 206 251 | 12 462 516 | 12 417 334 | 12 157 105 |
| 2-micron | 147 349 | 71 725 | 144 927 | 6 318 |
| 2-micron | – | – | 61 136 | – |
| telomere | – | – | 76 064 | – |
| telomere | – | – | 49 485 | – |
| telomere | – | – | 45 615 | – |
| # contigs in total | 18 | 19 | 21 | 18 |
| # Ab initio CDS | 5 624 | 5 531 | 5 554 | 5 465 |
| Avg identity*** | 99.3 | 99.6 | 99.6 | NA |

*The lengths of the two broken contigs of the mitochondrial chromosome. **The length of the misassembled contig before it was manually broken. ***Average identity of the assembly contigs compared with Illumina reads before the polishing step. NA = not applicable.

The 2-micron plasmid sequence obtained from all of our assemblies (ONT, PacBio and OP) is longer than the reported length of the 2-micron plasmid for strain S288C. We further investigated the long 2-micron contigs and found that the Canu assembler (25) had difficulty discriminating the extra depth from the multi-copy 2-micron plasmids (see Supplementary Figure S2). The complete genomes (all chromosomes) from the three assembly strategies were very similar, with a DNA identity of 99.95% and a similar number of CDS ORFs predicted by the ab initio method AUGUSTUS (29). We decided to use OP_assembly to represent the whole genome of S. cerevisiae strain CEN.PK113-7D for further analysis and comparison because we believe that combining the reads gives the highest sequencing depth, leading to a high confidence genome sequence.
Moreover, the OP_assembly has the highest average identity (see Table 1) when compared to Illumina reads if the broken mitochondrial chromosome found in the PacBio_assembly is not considered. We further evaluated the assembly completeness by identifying telomeric repeats, which we found on both ends of all of the main chromosomes, as illustrated in Figure 2. The S. cerevisiae strain CEN.PK113-7D complete genome has been deposited in GenBank (accession numbers CP022966–CP022982).

Figure 2. The complete CEN.PK113-7D genome obtained from de novo assembly and its comparisons. (A) A Circos plot shows the genome comparisons. Lane (a) The CEN.PK113-7D chromosomes (I-XVI) and the mitochondrial chromosome (mt) are plotted in different colors. Lane (b) S288C chromosomes; the crossed vertical lines represent the confident chromosome rearrangement regions (>1 kb) between the two strains. Lane (c) The crossed vertical line plot shows the location on the chromosome of the DNA methylation sites 4mC, 6mA and 5mC, illustrated from the outer ring to the inner ring, respectively. The methylation sites that are not located in the upstream region of ORFs are plotted in gray. The red and blue crossed vertical lines represent the 4mC and 6mA methylation sites located in the upstream region of ORFs, respectively. Lane (d) DNA sequencing depth coverage plots; yellow, red and blue represent the data obtained from Illumina, PacBio and ONT reads, respectively. Lane (e) Gray bars represent the location of the assembled contigs obtained from Nijkamp et al. The cyan color represents the missing regions (gaps) that could not be captured by the short-read assembly from Nijkamp et al. Lane (f) The Venn diagram compares the hits of S288C CDS ORFs on the CEN.PK113-7D genome. The star indicates the additional ORFs from ab initio gene calling obtained with AUGUSTUS software.
On the right-hand side, the Circos plots show the results obtained from chromosomal rearrangement analysis between CEN.PK113-7D and S288C for synteny in panel (B) and translocation in panel (C). The chromosomes of CEN.PK113-7D are plotted in white on the top. The chromosomes of S288C are plotted in different colors on the bottom. The telomere regions are marked in black on the end of each chromosome. A close-up of the inversion is provided in Supplementary Figure S5.

Comparative genomics between S288C and CEN.PK113-7D

The CEN.PK113-7D genome was first compared to the S288C genome by MUMmer (27), yielding a global average DNA identity of 99.5% (see the dot plot in Supplementary Figure S3). The number of identified SNVs is 24 071; this number is comparable with previous reports (18,22). Interestingly, the number of insertion-deletions (INDELs) detected is 13 732, which is around four times higher than reported from experiments using short reads (18,22), possibly due to the homopolymer problems commonly observed when using PacBio and ONT. The complete CEN.PK113-7D genome is shown in Figure 2A.a. The read coverage from ONT, PacBio, and Illumina, as illustrated in Figure 2A.d, reveals unusually high sequencing depth, linked with high DNA copy numbers, for the mitochondrial chromosome and also in the middle of chromosome XII, which contains a cluster of repeated rRNA genes. In S288C, rRNA genes are also found in the long repeat region of 9.1 kb on chromosome XII. To ensure that the assembly results are valid, we investigated the read alignment over the region and found some long reads that span the long repeat region, as seen in Supplementary Figure S4. A larger mitochondrial DNA content has previously been reported to accompany an increase in cell growth and nuclear DNA replication (38), reflecting the mid-log phase sampling point.
DNA methylation plays important roles in various cellular regulation pathways and is also known to be responsible for epigenetic modification, which is associated with human diseases (39). Methylation in the upstream region of coding sequences can slow down the transcription process. S. cerevisiae has been used as an expression host to study higher eukaryote 5-methylcytosine (5mC), because the yeast is thought to contain no 5mC, as reported by Hattman et al. (40) and Capuono et al. (41). Using third-generation DNA sequencing technologies, 4-methylcytosine (4mC) and 6-methyladenine (6mA) can be captured on PacBio reads (12) and 5mC can be captured on ONT reads (14). Results of the DNA methylation analysis are illustrated in Figure 2A.c. The Nanopolish software (14) identified only 40 5mC sites, compared to thousands of methylation sites for 4mC and 6mA, and none of the 5mC sites are located in the upstream region of ORFs of the CEN.PK113-7D genome; this is consistent with other results obtained experimentally by LC–MS/MS methods (41). The SMRT Link software identified 6946 4mC and 4688 6mA sites, with 359 and 297 sites, respectively, located in the upstream region of ORFs. The upstream region is defined as the 200 bp before the start codon, corresponding to the typical length of the yeast core promoter as reported by Lubliner et al. (42).

All assembled contigs from the previous study of Nijkamp et al. (18) can be almost perfectly mapped to our assembled genome with a 99.8% DNA identity, as illustrated in Figure 2A.e, supporting the quality of our genome. The assembly based on short reads (Figure 2A.e) shows the difficulty in mapping the terminal regions of the chromosomes, close to the telomeres. Moreover, the assembly based on short reads missed the mitochondrial chromosome and the middle region of chromosome XII, where we found the unusual sequencing depth in our study.
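The upstream-region rule used in the methylation analysis above (a site counts as upstream if it lies within the 200 bp before the start codon) can be illustrated with a small sketch. This is not the authors' pipeline: the function names, the 1-based inclusive coordinate convention, and the `(start, end, strand)` ORF tuples are assumptions made for illustration, and only a single chromosome is considered.

```python
UPSTREAM_BP = 200  # window length stated in the text


def upstream_window(start, end, strand, width=UPSTREAM_BP):
    """Return the 1-based inclusive (lo, hi) window upstream of one ORF.

    Assumes start <= end in chromosome coordinates; on the minus strand
    the start codon sits at `end`, so the window lies after it.
    """
    if strand == "+":
        return max(1, start - width), start - 1
    return end + 1, end + width


def count_upstream_sites(sites, orfs, width=UPSTREAM_BP):
    """Count methylation sites that fall in at least one upstream window.

    sites: iterable of 1-based positions on one chromosome (hypothetical input)
    orfs:  iterable of (start, end, strand) tuples on the same chromosome
    """
    windows = [upstream_window(s, e, strand, width) for s, e, strand in orfs]
    return sum(any(lo <= pos <= hi for lo, hi in windows) for pos in sites)


if __name__ == "__main__":
    orfs = [(1000, 2000, "+"), (3000, 4000, "-")]
    sites = [799, 800, 999, 1000, 4001, 4200, 4201]
    print(count_upstream_sites(sites, orfs))
```

A site is counted once even if it falls in the upstream windows of two neighbouring ORFs; whether the original analysis counted such sites once or twice is not stated in the text.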
Due to the high percentage DNA identity of the two yeast genomes (CEN.PK113-7D and S288C), which belong to the same species, we used the annotated protein-encoding ORFs (5,996 ORFs) of the strain S288C genome to directly query the CEN.PK113-7D genome using the Blat software (28), and identified 5,969 ORFs with hits, as illustrated in Figure 2A.f. The hits resulted in 6,173 loci (annotated as CDS ORFs) on the CEN.PK113-7D chromosomes, indicating that some genes had been duplicated. We found that 23 ORFs were absent in the CEN.PK113-7D genome (see supplementary Table TS3). This is fewer than previously reported by Nijkamp et al. (18), indicating problems from unknown gaps that possibly derived from collapsed tandem repeats in the assembly based on short reads (see supplementary Table TS3). Eighteen of the absent ORFs are in the set of previously reported missing genes; only five of the absent ORFs are uniquely identified in this study, possibly due to the different versions of the S288C genome annotation used in the two studies. To look for possible additional ORFs in CEN.PK113-7D, we employed ab initio gene calling (29), which yielded 52 ORFs that have high similarity to known proteins in the Uniprot database, indicating a high confidence for these additional ORFs (see supplementary Table TS3). Furthermore, direct sequence queries with all 417 non-translated RNA genes of S288C (e.g. tRNA, rRNA, snRNA, snoRNA) identified 412 loci in the CEN.PK113-7D genome. We used LAST software (30) for a detailed chromosome comparison between the CEN.PK113-7D and S288C genomes and identified a total of 555 regions of chromosomal rearrangements. Considering only the regions >1 kb, there are 35 regions identified as synteny, translocation, or inversion of the chromosomes, illustrated in Figure 2A.b (see supplementary Table TS4).
We further examined the 32 regions that contain ORFs and found 12 synteny regions on chromosomes IV, VIII, IX, and XII as well as two inversion regions on chromosome VII, as illustrated in Figure 2B (see Supplementary Figure S5). The two largest synteny regions are 50 kb on chromosome IV with 28 ORFs and 13.5 kb on chromosome IX with 7 ORFs. The two inversion regions carry three retrotransposon-related ORFs. We also found 19 chromosome translocations with 35 ORFs on 9 chromosomes, as illustrated in Figure 2C. Interestingly, chromosome VII of S288C translocates into many chromosomes of CEN.PK113-7D (see supplementary Table TS4).

CEN.PK113-7D transcriptional landscape and quantification

We explored the transcriptional landscape using direct RNA sequencing over two metabolic stages of diauxic growth: respiro-fermentative growth on glucose and oxidative growth on ethanol. Averaged across four biological replicates, we obtained ∼530 000 high quality reads with an N50 of 1150 bases, corresponding to ∼509 Mb (59X of total transcript length), for growth on glucose and ∼623 000 high quality reads with an N50 of 1263 bases, corresponding to ∼623 Mb (72X of total transcript length), for growth on ethanol (see details in supplementary Table TS5). We then evaluated the error rate of the aligned direct RNA sequence reads based on the reference genome sequence following Quick et al. (43) and found that, on average, a direct RNA sequencing read has 88% identity and 12% error (see details in supplementary Table TS6). As shown in Figure 3A, the length distributions of the high quality direct RNA sequencing reads obtained from the two growth conditions have similar shapes, indicating a transcriptome signature of CEN.PK113-7D that can be captured by sequencing. Moreover, the distribution of direct RNA sequencing read lengths agrees with the distribution of transcript lengths obtained from gene annotations.
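The two summary statistics quoted above, read N50 and fold coverage of the total transcript length, follow standard definitions and can be sketched in a few lines. The function names are illustrative, and the numbers in the usage lines are the rounded values from the text, not a re-analysis of the data.

```python
def n50(lengths):
    """Return the read N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no read lengths given")


def fold_coverage(total_bases, reference_length):
    """Return sequencing depth as total sequenced bases over reference length."""
    return total_bases / reference_length


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
    # Total transcript length implied by the rounded figures in the text
    # (about 509 Mb at 59X for glucose, about 623 Mb at 72X for ethanol).
    print(509e6 / 59, 623e6 / 72)
```

Both conditions imply a total transcript length of roughly 8.6 Mb, so the reported yields and coverage values are mutually consistent.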
The direct RNA sequencing reads were further aligned to the CEN.PK113-7D genome, and the level of expression of individual transcripts, with respect to the gene calling and annotation, was determined by simply counting the number of reads mapped to the individual transcripts. We found that ∼91% of the predicted transcripts (5433 of 5994) can be consistently detected across the four biological replicates of growth on either glucose or ethanol. Under the same criterion, most of the 492 non-translated transcripts (398, or 81%) did not pass (see supplementary Table TS7). The absence of non-translated transcripts is likely due to the experimental method of extracting transcripts, which relied on the presence of a poly(A) tail through the poly(A) selection strategy. This would exclude polymerase III transcripts. We further explored the direct RNA sequence reads mapped to the 479 known spliced genes in the genome and found that 80 spliced genes (17%) were not expressed at all in any of the growth conditions used in the experiments. We found only 10 spliced genes (2%) that had direct RNA sequence reads covering less than 95% of the total exon length. The rest (389 spliced genes) had mapped direct RNA sequence reads covering their exons (see supplementary Table TS8).

Figure 3. Summary of the direct RNA sequencing data. (A) The histogram shows the read length distribution of high quality reads obtained from yeast cells grown on ethanol (magenta) and glucose (cyan), together with the distribution of expected transcript lengths derived from the ORF annotation. (B) Bar plots of the detected highly expressed transcripts, presented as an average normalized count with standard error over four biological replicates for each growth condition. The transcripts that are constitutively expressed, highly expressed in ethanol growth and highly expressed in glucose growth are shown in the left, middle and right boxes, respectively. (C) The bubble scatter plots show the relationship between the fraction of full-length transcripts detected by direct RNA sequencing, the transcript length and the level of transcript expression. The violin-boxplots on the right show the overall distribution of the fraction of detected full-length transcripts.

Only a few transcripts are highly expressed. There are only 22 transcripts with >5000 direct RNA sequencing reads mapped for either growth condition, as illustrated in Figure 3B. As expected, the well-known glyceraldehyde-3-phosphate dehydrogenase (GAPDH) is one of the most abundant mRNAs. It is the most abundant transcript during growth on glucose, and the third-most abundant during growth on ethanol. Besides this, TDH2, which is the homolog of GAPDH, is also highly expressed under both growth conditions.
Transcripts of three key enzymes of the glycolytic pathway, 3-phosphoglycerate kinase (PGK1), glycerate phosphomutase (GPM1) and fructose-1,6-bisphosphate aldolase (FBA1), are highly expressed under both growth conditions. In addition, transcripts encoding the translation elongation factor TEF1 and its paralog TEF2 were found to be constitutively highly expressed. The constitutively high expression of PGK1 and TEF1,2 is in agreement with a study by Partow et al., who reported high performance of the promoters of these transcripts in a yeast expression vector (44). The transcripts of enolase (ENO1), the major form of pyruvate decarboxylase (PDC1), which is key for alcoholic fermentation, and alcohol dehydrogenase (ADH1), which is involved in ethanol production, were specifically highly expressed during growth on glucose, as expected, clearly reflecting the respiro-fermentative metabolism. Faster growth of cells on glucose than on ethanol resulted in overexpression of three ribosome-related transcripts (RPL41A, RPS31, RPS12) and a transcript encoding a cell wall mannoprotein (CCW12). On the other hand, high expression of transcripts encoding heat shock proteins (HSP26, HSP12) and oxidative stress protection, together with overexpression of many stress-related transcripts (SIP18, TMA10, GRE1, DDR2, HOR7), was specifically observed during growth on ethanol, reflecting oxidative stress. Moreover, CIT2, encoding citrate synthase (peroxisomal isozyme), was overexpressed during growth on ethanol, indicating that the glyoxylate shunt is active.

The ONT technology enables very long sequencing reads, a capability we explored for the detection of full-length transcripts in the direct RNA sequencing data. Direct RNA sequence reads that cover at least 95% of the total transcript length were considered full-length transcript reads and were used to calculate the fraction of full-length transcripts detected.
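The full-length criterion described above can be written down as a short sketch. It assumes that each aligned read has already been reduced to a pair of transcript identifier and number of transcript bases covered; that input format, the function name, and the reading of "95% covered" as "at least 95%" are assumptions for illustration and not the authors' implementation.

```python
from collections import defaultdict

FULL_LENGTH_MIN = 0.95  # coverage threshold stated in the text


def full_length_fraction(alignments, transcript_lengths, threshold=FULL_LENGTH_MIN):
    """Return, per transcript, the fraction of its reads that are full-length.

    alignments:         iterable of (transcript_id, covered_bases) pairs
    transcript_lengths: dict mapping transcript_id to annotated length in bases
    """
    total = defaultdict(int)
    full = defaultdict(int)
    for transcript_id, covered in alignments:
        total[transcript_id] += 1
        if covered / transcript_lengths[transcript_id] >= threshold:
            full[transcript_id] += 1
    return {tid: full[tid] / n for tid, n in total.items()}


if __name__ == "__main__":
    lengths = {"geneA": 1000, "geneB": 200}
    reads = [("geneA", 960), ("geneA", 940), ("geneA", 1000), ("geneB", 100)]
    print(full_length_fraction(reads, lengths))
```

Transcripts without any mapped read are absent from the result rather than reported as zero, which keeps undetected transcripts out of any downstream distribution plot.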
As seen in the violin-boxplots in Figure 3C, for most of the detected transcripts the full-length fraction is around 70%, with a small influence of the growth condition (see supplementary Figure S6 for violin-boxplots of detected full-length transcripts for individual samples). As expected, the fraction of detected full-length transcripts declined with increasing transcript length but was independent of expression level, as illustrated in the bubble plots of Figure 3C (see supplementary Figure S7 for a detailed plot of detected full-length transcripts versus transcript expression level). It is interesting that direct RNA sequencing can detect full-length transcripts over 5 kb. The heterogeneity of individual transcript lengths may reflect information about RNA turnover.

An important goal of transcriptome analysis is the identification of differential gene expression. We first evaluated the intrinsic variability of the transcriptome data using principal component analysis and found clear separation (90% of variance captured by PC1) between the two growth conditions, as illustrated in Figure 4A. A simple count of the number of direct RNA sequence reads mapped to individual transcripts can be used as a proxy for quantification of gene expression, an approach very similar to traditional short-read RNA-Seq. Therefore, we employed the DESeq2 method (34) to estimate transcript-level statistics of transcriptional changes between growth on ethanol and growth on glucose, as summarized in the violin-boxplot of adjusted P-values and the MA plot illustrated in Figure 4B and C. We further evaluated the biological sense of the differential gene expression results using gene-set enrichment analysis (36), as illustrated in Figure 4D. The enriched GO terms identified are reasonable in terms of the known physiology of the classic diauxic growth pattern in yeast.
For example, the GO terms related to transcription and translation processes were up-regulated in growth on glucose, which is in agreement with the higher growth rate on glucose than on ethanol. It is known that after glucose depletion, ethanol, which is a fermentative product, will be utilized through oxidative metabolism; this is in agreement with the up-regulated GO terms related to the TCA cycle, glyoxylate shunt, and mitochondrial electron transport. Lack of nutrients and accumulation of toxic metabolites during growth on ethanol were revealed by the up-regulated GO terms for responses to stress, catabolic processes, and beta-oxidation.

Figure 4. Summary of the results of transcript quantification and differential gene expression analysis. (A) Principal component analysis plot of individual samples (circles) of yeast cells grown on ethanol (magenta) and glucose (cyan). (B) Violin-boxplot (square root transformed y-axis) showing the distribution of adjusted P-values calculated using the DESeq2 method. The red line represents the adjusted P-value cut-off of 1e–20. (C) MA plot obtained from the DESeq2 package. The red dots represent the transcripts that had adjusted P-values lower than the cut-off. The logFC represents the expression ratio of ethanol growth over glucose growth. (D) Heatmap of the directional enrichment scores from gene-set enrichment analysis of gene ontology using the PIANO package. Magenta represents the up-regulated scores on ethanol growth and cyan represents the up-regulated scores on glucose growth. (E) Bar plots comparing the sequencing library size between the ONT and Illumina datasets in terms of number of reads and amount in gigabases. The average values are presented with standard error over quadruplicates for each growth condition. (F) Violin-boxplots comparing the dynamic range of the library size (Gb) corrected read count (log2) of the Illumina and ONT datasets across biological replicates (e = ethanol growth, g = glucose growth, b = batch growth, c = chemostat growth).

We further compared the dynamic range of transcript detection between direct RNA sequencing using ONT and traditional RNA-Seq (using Illumina data obtained from our previous study (22)).
Based on the read mapping results shown in Figure 4E, the number of mapped reads obtained from the Illumina dataset is about ten times higher than that from the ONT dataset, due to the different read lengths (200 bp for Illumina, compared to >1000 bp for ONT). Therefore, the total length of mapped reads (that is, the total number of bp sequenced) was used instead to compare the sequencing depth fairly. We found that the ONT dataset has about half the amount of the Illumina dataset, corresponding to about 64X and 118X of transcript length, respectively (see Supplementary Table TS9 for more details). The distribution of the library size-corrected mean coverage depth across the transcripts for each biological replicate of the Illumina and ONT datasets is illustrated in Figure 4F to compare dynamic ranges (see supplementary Figure S8 for the same data without library size correction). Both datasets have similar dynamic ranges across the different biological replicates, except e0 and g1, which have much lower sequencing depths than the other replicates (see Supplementary Table TS9). That the dynamic range is comparable even though the direct RNA sequencing data have about half the sequencing depth might reflect the different methodologies. For RNA-Seq, the RNA is first converted to cDNA, then amplified, sequenced, and mapped back to the transcript. In contrast, for direct RNA sequencing, the RNA is sequenced directly.

We examined regions of chromosomes to get an idea of the local transcriptional landscape structures. Figure 5A illustrates the simultaneous detection of mature and premature mRNA for the CENPK_0H0066W (RPL27A, ribosomal 60S subunit protein L27A) locus. Figure 5A shows the results of mapping the reads using GraphMap software (24) with non-guided mapping (Figure 5A, upper panel), compared to transcript model guided mapping (Figure 5A, lower panel), which results in a very clean mapping signal for the exons.
In another example (Figure 5B), we found an unexpected region that shows evidence of missed polymerase II termination on the first ORF, with transcription continuing until the termination of the second gene. These two genes are located in the region from 492 500 to 494 500 on chromosome VIII. The two ORFs, CENPK_0H0281W (PTH1, peptidyl-tRNA hydrolase) and CENPK_0H0282W (ERG9, farnesyl-diphosphate farnesyl transferase), are illustrated in Figure 5B. We then compared the evidence for missed polymerase II termination at the locus between direct RNA sequence reads (upper panel) and 'traditional' RNA-Seq results from short reads (lower panel). The long reads give a clear signal in support of read-through from the first transcript. In contrast, the short reads aligned in the region between the two ORFs do not cover it firmly, resulting in a lower-confidence signal that could lead to a missed identification. We further explored the region and found that CENPK_0H0281W has no polyadenylation signal sequence (see Supplementary Figure S9); thus, it is likely that properly terminated full-length transcripts of this gene would not be enriched in the poly(T) purification step. This reveals uncommon transcriptional regulation of the second gene (a homolog of ERG9), which is a key gene in the sterol biosynthesis pathway in yeast. Moreover, the coverage plots clearly show that direct RNA sequence reads provide a homogeneously distributed signal over any ORF transcript. This even distribution is not seen in traditional RNA-Seq results. The 'bumpy' grey distribution seen in the lower panel of Figure 5B, compared to the relatively smooth grey band in the upper panel, means that the short reads have a less homogeneous distribution, possibly due to uneven amplification; this needs further investigation at a similar sequencing depth.

Figure 5. Transcriptional landscape structure examples illustrated by snapshots of mapped reads using the IGV software. For panels A–F, (i) shows the transcript structure(s), (ii) indicates depth coverage, (iii) details the mapped long reads of the direct RNA sequence data, with red and blue representing the forward and reverse direction of mapped reads, respectively, and (iv) shows details of the mapped short reads of the RNA-Seq data. (A) The different read alignment strategies show that the non-guided exon alignment, illustrated in the upper panel, visually detects premature transcripts that are missed in the guided exon alignment, illustrated in the lower panel. (B) The presence of a dual transcript of PTH1 and ERG9. (C) Evidence of telomere RNA. (D) Evidence of polyadenylated long non-coding RNA. (E) Evidence of polyadenylated ribosomal RNA. (F) The presence of an antisense transcript. (G) The length distribution of 5′UTRs (upper panel) and 3′UTRs (lower panel) of this study compared with Nagalakshmi et al., represented in red and green, respectively.
Surprisingly, we found some high-confidence non-coding exon ORFs that have direct RNA sequence reads mapped (Figure 5C, D, and E). The locus CENPK_0L0245C, homologous to 35S rRNA (RDN37-1,2) and predicted to be processed into 25S and 18S rRNA, has direct RNA sequence reads mapped to it. This indicates that transcripts from this locus are polyadenylated and that these rRNA genes might be transcribed by polymerase II rather than polymerase I. These findings are consistent with the discovery of polyadenylation on yeast rRNA by Kuai and coworkers (45), who used specific primers to probe polyadenylated rRNA. Interestingly, we found a signal of direct RNA sequence reads in the region around the locus CENPK_0L0245C in both ethanol and glucose conditions (see Supplementary Figure S10). This region of chromosome XII has an unusually high DNA sequencing coverage depth, as shown in Figure 2A.d, and contains many copies of rRNA genes, based on predicted gene annotations. We found a polyadenylated transcript of the telomerase RNA gene (TLC1), required for telomere replication, at the locus CENPK_0B0178W, which is consistent with the study of Chapon et al. (46). In addition, we found polyadenylation of a long non-coding RNA transcript at the locus CENPK_0J0330C, homologous to LTR1, which is involved in mating-type control of gametogenesis; polyadenylation is a key regulator of such molecules (47). We also found a polyadenylated antisense transcript from the 5′ region of CENPL_0L0204W, homologous to regulation of RNA polymerase I (RRN5), which encodes a transcription factor that is a member of the upstream activation factor (UAF) family (Figure 5F). It is well known that UTRs can impact mRNA processing, gene expression, and protein synthesis. We identified most of the 5′ UTRs and 3′ UTRs of CEN.PK113-7D using our direct RNA sequencing data and compared them with those found in strain S288C, as reported by Nagalakshmi et al. (48).
As shown in the histograms in Figure 5G, the 5′ and 3′ UTRs identified in both studies are in agreement.

DISCUSSION

DNA library prep and read length are important for assembly

The large size and many repetitive regions of eukaryotic genomes have made it almost impossible to obtain complete contiguous chromosome sequences by de novo assembly with current short-read-based technology. The short read lengths obtained from second-generation sequencing make the problem mathematically hard in spite of high coverage, even though several algorithms have been developed to help overcome this problem (49). In our study, we used third-generation sequencing methods (with both ONT and PacBio) that yield very long reads, leading to successful de novo assembly of complete sequences of all 16 main yeast chromosomes in one step. Interestingly, even though the sequencing depth coverage of ONT reads was 6-fold lower than for the PacBio reads, the ONT_assembly yielded contiguous chromosome sequences for all 16 chromosomes. The PacBio_assembly had a small problem assembling the mitochondrial chromosome, which could be the result of excessive DNA sequencing depth in combination with the short length of the mitochondrial chromosome. We could possibly improve this by adjusting the coverage depth when performing the assembly. The slightly better results obtained with the ONT_assembly compared to the PacBio_assembly could be due to the longer N50 of the reads, which would provide longer pieces of DNA to anchor contigs. The shorter read length distribution of PacBio reflects a different way of preparing the starting DNA for sequencing library preparation. Based on the PacBio sequencing protocol, we sheared the DNA before we made the sequencing library. This optimizes the CCS chemistry, as performance drops for DNA fragments longer than 25 kb. In contrast, we did not shear DNA for ONT sequencing library preparation.
It is likely that shearing of the DNA is not completely random, and assembly across some breakage ‘hot spots’ could be difficult. Further, it is reasonable to assume that having very long reads is an important factor in obtaining the complete sequence of eukaryal organisms through de novo assembly.

Repetitive regions and high copy numbers of small pieces of DNA in the genome cause difficulty with de novo assembly

Large eukaryal genomes typically contain several highly repetitive regions; these represent one of the biggest technical challenges when performing de novo assembly on short reads (50). The yeast S. cerevisiae, which is the subject of this study, has few repetitive regions (compared to plant or animal genomes, for example), except for telomere regions. These repetitive regions caused assembly errors in joining the ends of chromosomes VII and XIII when the sequencing depth was increased by combining the ONT and PacBio reads. In addition, the OP_assembly resulted in an additional three contigs from telomeres that cannot be joined to any chromosome. This kind of problem will be amplified in larger eukaryal genomes, such as human genomes, which contain repetitive sequences in more than half of the genome (51). Obtaining very long reads that can cover repetitive regions will, however, reduce assembly difficulties. S. cerevisiae has many copies of the 2-micron plasmid, with a size of around 6.3 kb, which is shorter than the N50 length of ONT and PacBio reads. Interestingly, the most abundant read lengths, observed as a spike in the histogram in Figure 1A, are similar to the size of the 2-micron plasmid. We further found that more than 7,899 ONT reads and >1400 PacBio reads cover the full length of the 2-micron plasmid; however, few of the PacBio reads that match the 2-micron plasmid are full-length (see the distribution in Supplementary Figure S11).
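Because the discussion above compares read sets by their N50, a minimal sketch of how a read-length N50 is commonly computed may be useful. The function name and the toy read lengths are illustrative assumptions and are not taken from the study's data or pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy read lengths in bp (hypothetical values, for illustration only)
toy_reads = [2000, 3000, 4000, 5000, 6000, 25000]
print("N50 of toy reads:", n50(toy_reads))
```

A few very long reads can dominate the N50, which is one reason unsheared libraries tend to report a higher value than sheared ones.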
The high abundance of 2-micron plasmid reads might confuse the assembler into connecting them into longer multi-copy plasmid chimeric assemblies (see Supplementary Figure S2). Therefore, natural plasmids, which are commonly found in fungi and some plants (52), need to be carefully annotated during genome assembly.

Genome annotation: the next important step

Genome annotation is the next important step after genome sequences are obtained. Annotation is a challenging and time-consuming task that requires manual curation from experts and the research community to obtain high-quality results (53). Even with the most studied eukaryote, S. cerevisiae, high-quality gene annotation requires curation. Therefore, we have provided the CEN.PK113-7D genome browser for the yeast community to curate and validate the current annotation. The browser, built with the JBrowse software (54), includes processed information on the complete genome sequence, automated gene annotation, DNA methylation sites, direct RNA sequence alignments, and 5′ and 3′ UTR location predictions. The CEN.PK113-7D genome browser is freely available at http://genomebrowser.uams.edu/cenpk1137/.

Direct RNA sequencing enables single molecule quantification

Traditional RNA-Seq by short reads requires the conversion of transcripts to complementary DNA (cDNA) and amplification before measurement through second-generation sequencing. These two procedures introduce artifacts and biases, as seen in the non-uniform signal of the coverage plots in Figure 5B (ii), lower panel; direct RNA sequencing, however, provides a solution through single-molecule detection (Figure 5B (ii), upper panel).
The total RNA and mRNA concentrations in the cell vary dynamically between cellular states, which directly affects analysis of the transcriptome. This has commonly been simplified by assuming that cells produce similar levels of RNA per cell and by using similar amounts of total RNA as starting material without an internal spike-in control, leading to erroneous interpretations, as reported by Loven et al. (55). Therefore, using a known amount of mRNA as starting material, without amplification, direct RNA sequencing gave accurate transcript quantification in differential gene expression analysis using negative binomial statistics and in functional enrichment analysis. Furthermore, our differential analysis results suggested that there is no need to develop a new statistical analysis pipeline to analyze direct RNA sequence data, as existing tools can be employed effectively. The error-prone long reads obtained from direct RNA sequencing are the main limitation in studying RNA editing and modifications. However, the long reads are good for transcript abundance detection whether or not a reference genome is available. As reported in this study, around 30% of detected transcripts are not full-length, which may impact transcript quantification if no reference genome is available. The full-length transcript detection capability of direct RNA sequencing allows us to (i) accurately identify the structure and boundary of the transcript, (ii) detect unexpected transcriptional events and (iii) capture transcript heterogeneities and dynamics, which are important phenomena in elucidating transcriptional regulation. The standard direct RNA sequencing protocol relies on the enrichment of poly-A transcripts; thus, only polyadenylated transcripts can be detected. This means that eukaryotic transcripts derived from polymerase I and III, as well as prokaryotic transcripts, are not covered by the current protocol.
CONCLUSION

Here we show that combining long reads from third-generation sequencing technology with mature bioinformatics analysis allows for full assembly of a eukaryal genome. We demonstrated that the superior scaffolding of long reads, obtained from careful extraction of high molecular weight DNA that minimizes shearing, enables the de novo assembly of a high quality, complete eukaryotic genome sequence. These results imply the transition from a ‘draft genome era’ to the ‘complete genome era’, allowing for a solid foundation for comparative genomics. Nevertheless, financial feasibility remains a major barrier, as the cost per bp sequenced with third-generation sequencing technologies is still higher than with second-generation sequencing. Transcriptional landscape identification by direct RNA sequencing enables accurate determination of the encoded mRNA location, differential gene expression quantification, and structure identification of polyadenylated transcripts, free from the bias of DNA amplification. We believe that direct RNA sequencing will become a versatile tool for transcriptome analysis in the ‘complete genome era’ of the future. It should be noted that the results presented here are for a relatively small, well-defined organism (yeast). In dealing with larger genomes from animals and plants, a combination of higher sequencing depths and longer reads may be needed for genome assembly to overcome their bigger genome size, higher complexity, and higher ploidy. Similarly, transcriptome analysis through direct RNA sequencing for higher eukaryotes will require more sequencing depth than for S. cerevisiae, which has a low number of spliced genes, to be able to quantify transcriptional isoforms.

AVAILABILITY

The data have been deposited in the SRA database under BioProject: PRJNA398797, SRP116559. The CEN.PK113-7D genome browser is freely available at http://genomebrowser.uams.edu/cenpk1137/.
SUPPLEMENTARY DATA

Supplementary Data are available at NAR online.

ACKNOWLEDGEMENTS

The authors would like to acknowledge support of the National Genomics Infrastructure (NGI)/Uppsala Genome Center and UPPMAX for providing assistance in massive parallel sequencing and computational infrastructure. Work performed at NGI/Uppsala Genome Center was funded by RFI/VR and Science for Life Laboratory, Sweden. The Swedish Research Council (VR-2013-4504) is acknowledged for partial financial support of IN. The authors would like to thank the reviewers, especially Nicolas Delhomme, PhD, Researcher at SLU and Manager of the UPSC Bioinformatics Platform, for their valuable comments, which helped improve the manuscript. This manuscript was edited by the Office of Grants and Scientific Publications at the University of Arkansas for Medical Sciences.

Author Contributions: I.N. and J.N. designed and conceived the genome sequencing project. P.J. and I.N. performed computational analysis. T.W. performed MinION sequencing for DNA and direct RNA sequencing as well as genome sequence annotation and submission. R.P. performed cell cultivation, DNA extraction and RNA extraction and coordinated with the sequencing facility for PacBio sequencing. P.P. assisted with computational analysis. D.W.U. participated in design and supervised the study. I.N., P.J. and T.W. wrote the first version of the manuscript. All authors have read and approved the final version.

FUNDING

Arkansas Research Alliance; Helen Adams & Arkansas Research Alliance Professor & Chair; (NIH/NIGMS) [1P20GM121293]; Knut and Alice Wallenberg Foundation; Novo Nordisk Foundation. Funding for open access charge: UAMS startup fund.

Conflict of interest statement. None declared.

REFERENCES

1. Goffeau A., Barrell B.G., Bussey H., Davis R.W., Dujon B., Feldmann H., Galibert F., Hoheisel J.D., Jacq C., Johnston M. et al. Life with 6000 genes. Science. 1996; 274:546–567.
2. Heather J.M., Chain B. The sequence of sequencers: the history of sequencing DNA. Genomics. 2016; 107:1–8.
3. Mardis E., McPherson J., Martienssen R., Wilson R.K., McCombie W.R. What is finished, and why does it matter. Genome Res. 2002; 12:669–671.
4. Wang Z., Gerstein M., Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009; 10:57–63.
5. Land M., Hauser L., Jun S.R., Nookaew I., Leuze M.R., Ahn T.H., Karpinets T., Lund O., Kora G., Wassenaar T. et al. Insights from 20 years of bacterial genome sequencing. Funct. Integr. Genomics. 2015; 15:141–161.
6. Smith L.M., Sanders J.Z., Kaiser R.J., Hughes P., Dodd C., Connell C.R., Heiner C., Kent S.B., Hood L.E. Fluorescence detection in automated DNA sequence analysis. Nature. 1986; 321:674–679.
7. Goodwin S., McPherson J.D., McCombie W.R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 2016; 17:333–351.
8. Baker M. De novo genome assembly: what every biologist should know. Nat. Methods. 2012; 9:333–337.
9. Rhoads A., Au K.F. PacBio sequencing and its applications. Genomics Proteomics Bioinformatics. 2015; 13:278–289.
10. Jain M., Olsen H.E., Paten B., Akeson M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 2016; 17:239.
11. Byrne A., Beaudin A.E., Olsen H.E., Jain M., Cole C., Palmer T., DuBois R.M., Forsberg E.C., Akeson M., Vollmers C. Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nat. Commun. 2017; 8:16027.
12. Flusberg B.A., Webster D.R., Lee J.H., Travers K.J., Olivares E.C., Clark T.A., Korlach J., Turner S.W. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods. 2010; 7:461–465.
13. Rand A.C., Jain M., Eizenga J.M., Musselman-Brown A., Olsen H.E., Akeson M., Paten B. Mapping DNA methylation with high-throughput nanopore sequencing. Nat. Methods. 2017; 14:411–413.
14. Simpson J.T., Workman R.E., Zuzarte P.C., David M., Dursi L.J., Timp W. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods. 2017; 14:407–410.
15. van Dijken J.P., Bauer J., Brambilla L., Duboc P., Francois J.M., Gancedo C., Giuseppin M.L., Heijnen J.J., Hoare M., Lange H.C. et al. An interlaboratory comparison of physiological and genetic properties of four Saccharomyces cerevisiae strains. Enzyme Microb. Technol. 2000; 26:706–714.
16. Canelas A.B., Harrison N., Fazio A., Zhang J., Pitkanen J.P., van den Brink J., Bakker B.M., Bogner L., Bouwman J., Castrillo J.I. et al. Integrated multilaboratory systems biology reveals differences in protein metabolism between two reference yeast strains. Nat. Commun. 2010; 1:145.
17. Otero J.M., Vongsangnak W., Asadollahi M.A., Olivares-Hernandes R., Maury J., Farinelli L., Barlocher L., Osteras M., Schalk M., Clark A. et al. Whole genome sequencing of Saccharomyces cerevisiae: from genotype to phenotype for improved metabolic engineering applications. BMC Genomics. 2010; 11:723.
18. Nijkamp J.F., van den Broek M., Datema E., de Kok S., Bosman L., Luttik M.A., Daran-Lapujade P., Vongsangnak W., Nielsen J., Heijne W.H. et al. De novo sequencing, assembly and analysis of the genome of the laboratory strain Saccharomyces cerevisiae CEN.PK113-7D, a model for modern industrial biotechnology. Microb. Cell Fact. 2012; 11:36.
19. Giordano F., Aigrain L., Quail M.A., Coupland P., Bonfield J.K., Davies R.M., Tischler G., Jackson D.K., Keane T.M., Li J. et al. De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms. Sci. Rep. 2017; 7:3935.
20. Istace B., Friedrich A., d’Agata L., Faye S., Payen E., Beluche O., Caradec C., Davidas S., Cruaud C., Liti G. et al. de novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer. Gigascience. 2017; 6:1–13.
21. Goodwin S., Gurtowski J., Ethe-Sayers S., Deshpande P., Schatz M.C., McCombie W.R. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 2015; 25:1750–1756.
22. Nookaew I., Papini M., Pornputtapong N., Scalcinati G., Fagerberg L., Uhlen M., Nielsen J. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res. 2012; 40:10084–10097.
23. Verduyn C., Postma E., Scheffers W.A., Van Dijken J.P. Effect of benzoic acid on metabolic fluxes in yeasts: a continuous-culture study on the regulation of respiration and alcoholic fermentation. Yeast. 1992; 8:501–517.
24. Sovic I., Sikic M., Wilm A., Fenlon S.N., Chen S., Nagarajan N. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat. Commun. 2016; 7:11307.
25. Koren S., Walenz B.P., Berlin K., Miller J.R., Bergman N.H., Phillippy A.M. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017; 27:722–736.
26. Walker B.J., Abeel T., Shea T., Priest M., Abouelliel A., Sakthikumar S., Cuomo C.A., Zeng Q., Wortman J., Young S.K. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014; 9:e112963.
27. Kurtz S., Phillippy A., Delcher A.L., Smoot M., Shumway M., Antonescu C., Salzberg S.L. Versatile and open software for comparing large genomes. Genome Biol. 2004; 5:R12.
28. Kent W.J. BLAT–the BLAST-like alignment tool. Genome Res. 2002; 12:656–664.
29. Stanke M., Schoffmann O., Morgenstern B., Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 2006; 7:62.
30. Kielbasa S.M., Wan R., Sato K., Horton P., Frith M.C. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011; 21:487–493.
31. Chaisson M.J., Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012; 13:238.
32. Krzywinski M., Schein J., Birol I., Connors J., Gascoyne R., Horsman D., Jones S.J., Marra M.A. Circos: an information aesthetic for comparative genomics. Genome Res. 2009; 19:1639–1645.
33. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26:841–842.
34. Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15:550.
35. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000; 25:25–29.
36. Varemo L., Nielsen J., Nookaew I. Enriching the gene set analysis of genome-wide data by incorporating directionality of gene expression and combining statistical hypotheses and methods. Nucleic Acids Res. 2013; 41:4378–4391.
37. Lunter G., Goodson M. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2011; 21:936–939.
38. Blank H.M., Li C., Mueller J.E., Bogomolnaya L.M., Bryk M., Polymenis M. An increase in mitochondrial DNA promotes nuclear DNA replication in yeast. PLoS Genet. 2008; 4:e1000047.
39. Robertson K.D. DNA methylation and human disease. Nat. Rev. Genet. 2005; 6:597–610.
40. Hattman S., Kenny C., Berger L., Pratt K. Comparative study of DNA methylation in three unicellular eucaryotes. J. Bacteriol. 1978; 135:1156–1157.
41. Capuano F., Mulleder M., Kok R., Blom H.J., Ralser M. Cytosine DNA methylation is found in Drosophila melanogaster but absent in Saccharomyces cerevisiae, Schizosaccharomyces pombe, and other yeast species. Anal. Chem. 2014; 86:3697–3702.
42. Lubliner S., Regev I., Lotan-Pompan M., Edelheit S., Weinberger A., Segal E. Core promoter sequence in yeast is a major determinant of expression level. Genome Res. 2015; 25:1008–1017.
43. Quick J., Quinlan A.R., Loman N.J. A reference bacterial genome dataset generated on the MinION portable single-molecule nanopore sequencer. GigaScience. 2014; 3:22.
44. Partow S., Siewers V., Bjorn S., Nielsen J., Maury J. Characterization of different promoters for designing a new expression vector in Saccharomyces cerevisiae. Yeast. 2010; 27:955–964.
45. Kuai L., Fang F., Butler J.S., Sherman F. Polyadenylation of rRNA in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. U.S.A. 2004; 101:8581–8586.
46. Chapon C., Cech T.R., Zaug A.J. Polyadenylation of telomerase RNA in budding yeast. RNA. 1997; 3:1337–1351.
47. Beaulieu Y.B., Kleinman C.L., Landry-Voyer A.M., Majewski J., Bachand F. Polyadenylation-dependent control of long non-coding RNA expression by the poly(A)-binding protein nuclear 1. PLoS Genet. 2012; 8:e1003078.
48. Nagalakshmi U., Wang Z., Waern K., Shou C., Raha D., Gerstein M., Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008; 320:1344–1349.
49. Henson J., Tischler G., Ning Z. Next-generation sequencing and large genome assemblies. Pharmacogenomics. 2012; 13:901–915.
50. Treangen T.J., Salzberg S.L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 2011; 13:36–46.
51. de Koning A.P., Gu W., Castoe T.A., Batzer M.A., Pollock D.D. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011; 7:e1002384.
52. Griffiths A.J. Natural plasmids of filamentous fungi. Microbiol Rev. 1995; 59:673–685.
53. Yandell M., Ence D. A beginner's guide to eukaryotic genome annotation. Nat. Rev. Genet. 2012; 13:329–342.
54. Buels R., Yao E., Diesh C.M., Hayes R.D., Munoz-Torres M., Helt G., Goodstein D.M., Elsik C.G., Lewis S.E., Stein L. et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 2016; 17:66.
55. Loven J., Orlando D.A., Sigova A.A., Lin C.Y., Rahl P.B., Burge C.B., Levens D.L., Lee T.I., Young R.A. Revisiting global gene expression analysis. Cell. 2012; 151:476–482.

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Diff-seq: A high throughput sequencing-based mismatch detection assay for DNA variant enrichment and discovery
Aggeli, Dimitra; Karas, Vlad O; Sinnott-Armstrong, Nicholas A; Varghese, Vici; Shafer, Robert W; Greenleaf, William J; Sherlock, Gavin
doi: 10.1093/nar/gky022 pmid: 29361139
Abstract

Much of the within-species genetic variation is in the form of single nucleotide polymorphisms (SNPs), typically detected by whole genome sequencing (WGS) or microarray-based technologies. However, WGS produces mostly uninformative reads that perfectly match the reference, while microarrays require genome-specific reagents. We have developed Diff-seq, a sequencing-based mismatch detection assay for SNP discovery without the requirement for specialized nucleic-acid reagents. Diff-seq leverages the Surveyor endonuclease to cleave mismatched DNA molecules that are generated after cross-annealing of a complex pool of DNA fragments. Sequencing libraries enriched for Surveyor-cleaved molecules result in increased coverage at the variant sites. Diff-seq detected all mismatches present in an initial test substrate, with specific enrichment dependent on the identity and context of the variation. Application to viral sequences resulted in increased observation of variant alleles in a biologically relevant context. Diff-seq has the potential to increase the sensitivity and efficiency of high-throughput sequencing in the detection of variation.

INTRODUCTION

The rapid advances in low-cost, high-throughput sequencing have enabled numerous resequencing applications, ranging from clinical oncology (1) to evolutionary dynamics (2,3). For many such applications, the goal of resequencing is the identification of sequence variants in a population of different genomes. This polymorphism detection problem often requires brute-force, high-depth shotgun sequencing of genomic DNA isolated from a population of cells, and painstaking bioinformatics analyses to confidently distinguish real genetic polymorphisms from a background of sequencing errors. For rare or infrequent polymorphisms, this approach often results in an overwhelming excess of reads that exactly match the reference genome, whereas reads containing true variants are only a tiny fraction of the total (4–6).
Even for small genomes, such as viral genomes, several hundred-fold coverage is required for confident detection of variants present at ∼1% frequency (7,8), while techniques that enable variant calling well below the error rate of the platform require extremely high coverage data (9) or engineered redundancies in sequencing (often involving molecular barcodes). If the specific polymorphism to be detected is known a priori, a variety of powerful and elegant approaches designed to detect specific locations of sequence variation may be employed. For example, rolling circle amplification (10), molecular inversion probes (11) and mismatch ligation with bioluminescence detection (12), all based on DNA ligase and often on nuclease activities, can be used to detect the presence of specific alleles. Similarly, Taqman, molecular beacons, and related assays can also be used to detect specific, targeted alleles (13–17). However, all of these methods require a priori knowledge of the reference sequence of the genome and/or alleles under interrogation, and often involve the construction of sophisticated probes to detect individual alleles. By contrast, mismatch detection assays rely on the base-pairing quality of DNA, and subsequent enzymatic detection of mispaired bases (18–21), and are thus agnostic to the exact identity of the underlying mutation. Mismatch endonucleases act on the mismatched sites of heterohybrid DNA, generated by denaturation and reannealing of a population of DNA molecules, to produce fragments resolvable by electrophoresis, enabling detection of variation across whole genes, or even across small genomes (22–24). Genomic mismatch scanning and other platforms, such as the tiling array and the mismatch endonuclease array-based methodology (MENA) use DNA hybridization and mismatch endonucleases to uncover single nucleotide polymorphisms (SNPs) at genomic scales (25–29). 
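The coverage requirement noted above for variants at ∼1% frequency can be illustrated with a simple binomial model. The sketch below is not from the cited studies: the function name, the threshold of five supporting reads and the neglect of sequencing error are all assumptions made here for illustration.

```python
from math import comb


def prob_at_least(k, depth, freq):
    """P(X >= k) for X ~ Binomial(depth, freq).

    Models the chance of seeing at least k variant-supporting reads at a
    site covered by `depth` reads when the variant allele frequency is
    `freq`. Sequencing errors are ignored (a simplifying assumption).
    """
    below = sum(
        comb(depth, i) * freq**i * (1 - freq) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - below


# Hypothetical requirement: at least 5 supporting reads for a 1% variant
for depth in (100, 500, 1000):
    p = prob_at_least(5, depth, 0.01)
    print(f"depth {depth:>4}x: P(>=5 variant reads) = {p:.3f}")
```

Under this model the detection probability rises steeply with depth, which is consistent with the need for several hundred-fold coverage; a realistic caller would also have to separate true variants from platform errors.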
We aimed to couple mismatch detection with high-throughput sequencing to allow for the detection of polymorphisms across a DNA sample. This de novo polymorphism detection method allows for the identification of variation that could occur anywhere in a genome, and furthermore specifically targets sequencing capacity to the variant positions and their genomic context. Our method, which we refer to as differential sequencing (Diff-seq), aims to increase the sensitivity of high throughput sequencing for the detection of rare variation, and can be directly applied to small genomes or amplicons. The enzymatic foundation of Diff-seq is the Surveyor endonuclease, which cuts heterohybrid DNA molecules at the sites of mispaired bases. By denaturing and reannealing a complex pool of DNA fragments, we generate a pool of heterohybrid double stranded DNA (dsDNA) molecules, which contain mismatches at positions of genetic variation. These heterohybrids are then digested with Surveyor endonuclease, and the generated fragments are targeted for inclusion in a high-throughput sequencing library, resulting in substantial enrichment for DNA fragments with polymorphic sites. Diff-seq thus enables the identification of the variant position within the sequencing read, and determination of the variant base. We first applied Diff-seq to a simple 1 kb test substrate with 0–4 mismatches to demonstrate its efficacy, then further demonstrated its performance on simple but mutation-dense populations of Human Immunodeficiency Virus (HIV) molecules. Diff-seq enabled the detection of polymorphic sites between two clones when the clones were mixed in a variety of stoichiometries. We finally applied Diff-seq to DNA molecules derived from HIV population samples (8), and showed that Diff-seq can increase the observation frequency of variant positions in biologically relevant samples. 
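To make the heterohybrid step more concrete, the following sketch estimates what fraction of reannealed duplexes would carry a mismatch at a biallelic site. It assumes idealized random reannealing, in which each top strand pairs with a bottom strand drawn independently from the pool; this model and the function name are illustrative assumptions, not a description of the authors' analysis.

```python
def heteroduplex_fraction(p):
    """Expected fraction of mismatched duplexes at a biallelic site.

    p is the frequency of the variant allele in the DNA pool. Under random
    reannealing, a duplex is mismatched when a variant top strand pairs
    with a reference bottom strand or vice versa: p*(1-p) + (1-p)*p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return 2.0 * p * (1.0 - p)


# Illustrative mixing ratios of two clones (hypothetical values)
for p in (0.5, 0.1, 0.01):
    frac = heteroduplex_fraction(p)
    print(f"variant frequency {p:.2f} -> mismatched duplexes {frac:.4f}")
```

In this idealized picture an equal mix of two clones maximizes the substrate available to a mismatch endonuclease, while for rare variants nearly every variant strand ends up in a heteroduplex.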
MATERIALS AND METHODS Preparation and amplification of 1 kb model substrate pET17b (Novagen, Madison, WI, USA) derivatives were generated by the introduction of single point mutations via QuikChange PCR (primers in Supplementary Table S1) and cloning into E. coli. 1 kb variants were amplified in 50 μl reactions from 4 ng of either pET17b or derivatives, using PrimeSTAR (TaKaRa, Mountain View, CA, USA) and primers VK41 and VK42 (Supplementary Table S1), each at a final concentration of 0.4 μM, in the following conditions: 98°C for 10′, 35 cycles of 98°C for 10″, 55°C for 5″, 72°C for 1′. Preparation of DNA from viral clones and populations Viral clones, whose sequences included the reverse transcriptase, integrase and protease regions of the pol gene, were a generous gift of Mark Winters and Mark Holodniy. The clones were amplified in 50 μl reactions from 4 ng of each clone with PrimeSTAR and primers DAo43 and DAo44 (Supplementary Table S1), each at a final concentration of 0.4 μM, in the following conditions: 98°C for 10′, 30 cycles of 98°C for 10″, 59.5°C for 5″, 72°C for 2′. RNA preparation, RT-PCR and amplification of the population viral genomes have been described previously (8), with the exception that a single ∼1 kb amplicon, including the reverse transcriptase and protease-encoding sequences, was generated in a single PCR. The consensus sequence of the population with PID 5248 (8) was synthesized by Life Technologies (Carlsbad, CA, USA) and had the most frequently appearing nucleotide at every position, as determined by Sanger sequencing. Diff-seq The overall protocol, whose steps are described below, is summarized in Figure 1. Figure 1. Differential sequencing method. Hybrid DNA molecules are generated by thermal melting and reannealing. Mismatched molecules become substrate for Surveyor and are tagged with a biotinylated nucleotide for subsequent selection. Sequencing adaptors are introduced in two ligation steps. 
A type IIS recognition site, engineered in the first ligating oligo, is used to ensure a homogeneously sized library. After the second ligation step the library is amplified, quantified with qPCR and sequenced. All purification steps were carried out using the MinElute Reaction Cleanup Kit (QIAGEN, Redwood City, CA, USA) and the products were eluted in 20 μl EB buffer, unless otherwise stated. Oligonucleotides were synthesized by IDT. Double stranded oligonucleotides used in the ligation reactions were generated by mixing complementary single stranded oligonucleotides isostoichiometrically in 50 mM NaCl and 10 mM Tris pH 7.5, boiling at 95°C for 5′ and incubating at room temperature for 30′. Reannealing reaction PCR products were purified with QIAquick PCR Purification Kit (QIAGEN, Redwood City, CA, USA), concentrated to 150–200 ng/μl and the dsDNA concentrations were estimated with a Qubit fluorometer using the Quant-iT dsDNA HS kit (Invitrogen, Waltham, MA, USA). A total of 3 μg of DNA was denatured at 95°C for 10′ and reannealed by decreasing the temperature by 5°C every 10′ down to 20°C, on a thermocycler, in 10 mM Tris pH 7.5 and 50 mM NaCl and at a DNA concentration of 125–150 ng/μl. S1 nuclease digestion The reannealed products were digested with 50 units of S1 Nuclease (Thermo Fisher Scientific, Santa Clara, CA, USA) per μg of DNA in 1× S1 buffer at 40–60 ng/μl DNA concentration for 1 h 30′ at 25°C. 
The digested products were purified with the QIAquick PCR purification kit and eluted in 50 μl EB buffer. Blocking 3′ ends with 2′,3′-dideoxycytidine-5′-triphosphate (ddCTP) and Terminal Transferase (TdT) The 3′ ends were blocked with 3–5 nmol ddCTP (Affymetrix, Santa Clara, CA, USA) and 20 units TdT (NEB, Ipswich, MA, USA) in 64 μl 1× TdT buffer supplemented with 0.25 mM CoCl2, for 1 h at 37°C, and the products were purified. Mismatch digestion 800–1000 ng DNA, approximately half the amount recovered from the previous reaction, was digested with 2 μl Surveyor endonuclease, in the presence of 2 μl Enhancer (Mutation Detection Kit, IDT, Redwood City, CA, USA), and 20 units Ampligase (Epicentre, Madison, WI, USA) in 1× Ampligase buffer at a DNA concentration of 20–25 ng/μl for 50 min at 42°C, and the products were purified. 3′ end extension with biotin-14-dCTP nucleotides (B-dCTP) and TdT The newly generated 3′ ends were extended with 80–140 pmol B-14-dCTP (Invitrogen, Waltham, MA, USA) and 20 units TdT in 27 μl 1× TdT buffer supplemented with 0.25 mM CoCl2 for 1 h at 37°C, and the products were purified. Ligation Each of the oligos DAo83, DAo84 and DAo85 was reannealed to oligo DAo97 and the resulting double stranded oligos were mixed isostoichiometrically to yield the ligating oligo pool (6.7 μM per species). The purified DNA products were ligated to 1.5 μl ligating oligo pool in the presence of 2000 units T4 ligase (NEB, Ipswich, MA, USA) in 33 μl 1× T4 ligase buffer and 10% polyethylene glycol (PEG) 4000 for a minimum of 5 h at 16°C, and the products were purified. Type IIS restriction endonuclease digestion The DNA was digested with 50 units AcuI endonuclease (NEB, Ipswich, MA, USA) in 25 μl 1× CutSmart buffer supplemented with 65 μM S-adenosylmethionine at 37°C for a minimum of 4 h. AcuI recognition sites were introduced during the previous step and digestion resulted in a homogeneously-sized library. 
Biotinylated fragments pulldown with streptavidin magnetic beads 25 μl M270 beads (Invitrogen, Waltham, MA, USA) were washed twice in Bind and Wash buffer (B&W, 5 mM Tris pH 7.5, 1 M NaCl, 0.5 mM EDTA) and resuspended in 25 μl 2× B&W. The beads were then added to the AcuI reaction and the slurry was incubated for 1 h 30′ in a roller drum at room temperature. The beads were washed once in 1× B&W + 0.1% Tween 20, three times in 1× B&W, once in 1× saline-sodium citrate (SSC) buffer (Sigma-Aldrich, St Louis, MO, USA) and once in water. Ligation The washed beads were resuspended in 10 μl ligation reaction (2000 units T4 ligase (NEB, Ipswich, MA, USA), ∼50 pmol ds oligo DAo98/DAo99 in 1× ligation buffer and 10% PEG 4000). The reactions were incubated at 16°C for a minimum of 5 h on a rocker. The beads were then washed once in 1× B&W + 0.1% Tween 20, three times in 1× B&W, once in 1× SSC buffer (Sigma-Aldrich, St Louis, MO, USA) and resuspended in 20 μl water. Amplification and library sequencing 5 μl of the bead suspension was used in an amplification reaction to introduce Nextera Illumina sequencing adaptors and read indices to the biotinylated DNA strand for multiplexed sequencing (Supplementary Table S1). The fragments were amplified for 6 cycles with PrimeSTAR in a 50 μl reaction, using the following PCR conditions: 98°C for 10′, 6 cycles of 98°C for 10″, 66°C for 5″, 72°C for 10″. The amplification product was purified and quantified in a One Step qPCR instrument (Thermo Fisher Scientific, Santa Clara, CA, USA) against PhiX Control v3 (Illumina, Santa Clara, CA, USA), with Illumina flow cell adaptor sequences (Supplementary Table S1) to determine the number of amplification cycles available prior to PCR saturation. The cycling conditions were 98°C for 10′, 30 cycles of 98°C for 10″, 63°C for 5″, 72°C for 10″. 
A fraction of the rest of the library was re-amplified using conditions identical to the qPCR conditions for another 10–12 cycles, then purified, quantified once more by qPCR against PhiX and paired-end sequenced on a MiSeq instrument for 150 cycles. Processing of sequencing reads Processing of sequencing reads prior to downstream analysis is depicted in Supplementary Figure S1. Briefly, paired-end reads were merged using FLASH version 1.2.11 (30). The merged reads were then deduplicated and adaptors and auxiliary sequences (introduced by the first ligation reaction, including the AcuI restriction site and the UMIs) were trimmed using cutadapt version 1.9.1 (31). Reads that were less than eight nucleotides long, and reads that did not start with a G were excluded from further analysis. The 5′ G was trimmed from the remaining reads, which were then aligned to the reference sequence using bowtie2 version 2.2.6 (32). Nextera libraries preparation Nextera libraries were prepared using a modified Illumina Nextera protocol as described (33) and paired-end sequenced on a MiSeq for 150 cycles. The data were analyzed and processed using a custom python script in conjunction with other freely available software. Briefly, the sequencing reads were trimmed using cutadapt version 1.9.1 (31), then quality- and length-filtered. The filtered reads were then aligned to the reference sequence with bwa version 0.7.15 (34), and the aligned reads were sorted and indexed using Picard version 2.7.1 (http://broadinstitute.github.io/picard). SNPs were called using the GATK software (35). The values used in Figure 4 and Supplementary Figure S6 represent averages from 2 technical replicates for the Nextera data and for the Diff-seq data for dilution 1:1. The values representing the rest of the dilutions for Diff-seq data were each derived from a single experiment. 
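The read-filtering rule described under "Processing of sequencing reads" (discard reads shorter than eight nucleotides or not starting with G, then trim the 5′ G) can be sketched in Python as follows. This is an illustrative sketch rather than the authors' script: the function name is hypothetical, reads are assumed to be upper-case strings that have already been merged, deduplicated and adaptor-trimmed, and the length check is assumed to apply before the G is removed.

```python
MIN_LENGTH = 8  # reads shorter than this are excluded, per the text


def filter_and_trim(reads):
    """Keep reads that are long enough and start with G, then drop that 5' G.

    Assumptions (not from the paper): reads are upper-case strings and the
    length threshold is evaluated before the leading G is trimmed.
    """
    kept = []
    for seq in reads:
        if len(seq) < MIN_LENGTH:
            continue  # too short to align reliably
        if not seq.startswith("G"):
            continue  # lacks the expected leading G
        kept.append(seq[1:])  # trim the 5' G before alignment
    return kept


if __name__ == "__main__":
    # Toy reads for illustration only.
    example = ["GACGTACGT", "ACGTACGTA", "GACG"]
    print(filter_and_trim(example))
```

The surviving reads would then be passed to the aligner, as in the pipeline described above.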
RESULTS Method description Diff-seq generates a sequencing library that is enriched for loci that are polymorphic in the input DNA (Figure 1, Supplementary Figure S1). To achieve this enrichment, Diff-seq harnesses the mismatch cleavage activity of the CELII endonuclease (commercially available as Surveyor Nuclease) (23). First, a population of DNA molecules is thermally denatured and then reannealed by gradual return to 20°C, to create mismatch-containing heterohybrid molecules. S1 nuclease is used to eliminate poorly reannealed molecules and excess single-stranded DNA. Free 3′ ends are extended with ddCTP using TdT, which blocks them from participating in subsequent enzymatic steps. The reannealed DNA is then digested using Surveyor, which specifically cuts both strands 3′ of a mismatch position (23,24), resulting in dsDNA molecules with a reactive 3′ overhang at one end that should correspond to the mismatched base(s). Surveyor digestion is carried out in the presence of Ampligase to reduce non-specific Surveyor cleavage (36). The newly-generated 3′ ends are then extended with B-dCTP using TdT. These extended 3′ ends are then used as substrates for a ligation reaction that introduces: (i) the primer sequence for the forward Illumina sequencing read (along with some buffer sequence after the B-dCTP extension), (ii) unique molecular identifiers (UMIs) between the primer sequence and the buffer sequence and (iii) an AcuI (type IIS endonuclease) recognition site, oriented such that the cut site will be within the captured fragment. The ligation reaction is designed to capture fragments that had incorporated up to 3 B-dCTP nucleotides. AcuI restriction digestion then generates a homogeneously-sized library, eliminating potential PCR and sequencing size biases. After digestion, biotinylated fragments are captured using streptavidin magnetic beads. 
Then, while the library is still attached to the beads, a second ligation reaction introduces sequence for the reverse Illumina sequencing read. Finally, sequencing adaptors and library-specific indices are introduced via amplification directly from the bead-attached material. A small aliquot of this first round of amplification is used to monitor PCR saturation via qPCR with SybrGreen against the Illumina PhiX library, and determine the number of additional cycles remaining prior to saturation. After the second amplification, the final libraries are quantified by qPCR and then sequenced on an Illumina MiSeq. The assay generates sequencing reads that align at and around variant positions (Supplementary Figures S2A and B). Supplementary Figure S2B shows the total coverage frequencies for a single mismatch case, and the fully-matched control. From the library structure (see Figure 1 and Supplementary Figure S1), we expected that the variant position should lie at the beginning of each processed, unaligned read, so we focused our efforts on calling variants on specific parts of the aligned read, rather than the whole read. We therefore converted the total coverage data to ‘Diff-seq coverage’, which we used for subsequent analyses. The first ligation step is designed to capture molecules extended with up to three biotinylated CTPs. However, even after G-trimming of the unaligned reads (see Materials and Methods), there is still uncertainty as to where within the next two or three bases the variant position lies when the reads start with G or GG, respectively, as the particular Gs may have been part of the extension, or may represent a variant base. Thus, we retained up to three consecutive bases (depending on their identity), located at the beginning of forward-aligned or at the end of reverse-aligned reads to generate the Diff-seq coverage track (Supplementary Figure S2C). 
Depending on the length of the extension, the context of the mismatch, as well as the identity of the variation itself, signal on neighboring positions can be comparable to the signal in the variant position, as shown in the next section. The efficiency of the Diff-seq protocol, calculated as the number of unique molecules that mapped to the reference genome, divided by the calculated number of molecules present at the beginning of the experiment, was ∼0.00065% using 3 μg of starting material (Supplementary Table S2). We assumed that only half of the molecules made it into the library preparation (only the heterohybrids), and that each of these molecules could result in exactly two sequencing reads (each one capturing the mismatch site from different orientations). This second assumption holds true only for molecules with a single polymorphic site. Application of Diff-seq to a simple genome To develop and test Diff-seq, we applied it to a known sequence with defined mismatches. This test substrate was a 985 bp sequence originating from the pET17b vector and derivatives containing single base-pair substitutions, which were introduced singly in three different positions. By assaying this simple substrate, we tested the extent to which the method can detect different types of variation, and examined biases introduced by the context of the variation (Figure 2 and Supplementary Figure S3). All Diff-seq libraries were compared to a library generated from DNA fragments that did not contain mismatches (Figure 2A, Supplementary Figure S3F). Variants covering all possible substitutions at a single position were used to test the extent to which Diff-seq could detect different pairs of variants (Figure 2B and Supplementary Figures S3A–C). Identical mismatch types were also introduced into 2 different positions to determine if the local sequence context affects Diff-seq signal (Supplementary Figures S3D and E). 
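The efficiency estimate described above divides unique mapped molecules by the number of molecules at the start of the experiment. A minimal Python sketch of that arithmetic follows. The conversion factor of 650 g/mol per base pair is a common approximation and an assumption here, as are the function names and the particular way the two stated assumptions (half of the molecules are heterohybrids, two reads per molecule) enter the denominator.

```python
AVOGADRO = 6.022e23        # molecules per mole
G_PER_MOL_PER_BP = 650.0   # approximate mass of one dsDNA base pair (assumption)


def dsdna_molecules(mass_ug: float, length_bp: int) -> float:
    """Approximate number of dsDNA molecules in a given mass of a fragment."""
    grams = mass_ug * 1e-6
    return grams / (length_bp * G_PER_MOL_PER_BP) * AVOGADRO


def protocol_efficiency(unique_mapped: float, input_molecules: float,
                        heterohybrid_fraction: float = 0.5,
                        reads_per_molecule: int = 2) -> float:
    """Observed unique molecules divided by the number expected from the input.

    One possible reading of the text: only heterohybrids enter the library and
    each can yield two reads, so expected = input * fraction * reads.
    """
    expected = input_molecules * heterohybrid_fraction * reads_per_molecule
    return unique_mapped / expected


if __name__ == "__main__":
    n = dsdna_molecules(3.0, 985)  # 3 ug of the 985 bp test substrate
    print(f"input molecules: {n:.3e}")
```

With the 3 μg input of the 985 bp substrate, this approximation gives on the order of 10^12 starting molecules, which illustrates why even a low fractional efficiency can still yield a usable number of unique reads.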
Finally, a more complex sample comprising a mixture of all possible mismatch types was also assayed to test the capacity for their simultaneous detection (Figure 2C). The libraries were prepared, sequenced and processed as described above and the resulting numbers of reads for each library at each step are summarized in Supplementary Table S2. Figure 2. Differential sequencing application to a model substrate. Differential sequencing libraries derived from a 1 kb sequence were prepared as described in Figure 1 and sequenced. The Diff-seq coverage frequency for each position of the reference strand is plotted against the position, and color-coded according to the nucleotide base identity. Positive and negative values represent values for the forward and reverse strand, respectively. (A) Sample with no variation. (B) Sample with G:G|C:C variation at position 477. (C) Sample with multiple variant positions, assembled from four individually reannealed samples mixed isostoichiometrically. The arrow points to the variant position 328. For B and C, plots zoomed in along the x-axis at the variant positions are shown. The context of the variation is annotated in gray in each graph with the mismatched bases in black and the strands that represent forward and reverse orientations are annotated with bold italicized and regular font, respectively. 
Variation introduced at position 477 of the pET17b fragment was used to examine identity-based biases; our data show that G/C variation generally gives more signal than other variant combinations (Figure 2B and Supplementary Figures S3A–C), consistent with a Surveyor nuclease preference for G:G/C:C mismatched pairs (23,24). Earlier work has suggested context-dependent Surveyor digestion activity of mismatched DNA (24). To examine our method for possible biases due to sequence context surrounding the variant position, we assayed samples that contained a single type of variation within two different contexts (positions 328 and 477 in Supplementary Figures S3D and E) on the same substrate. In order to ensure that each DNA molecule had up to one mismatch only, the samples were generated by isostoichiometrically mixing the relevant populations post-reannealing. Surprisingly, we found that variation at position 328 resulted in Diff-seq coverage predominantly at position 327, regardless of the exact identity of the contributing bases at position 328, suggesting that Diff-seq signal is context-dependent. We attribute this to possible context-dependent hotspots for detection by the method and/or favorable DNA reannealing alternatives. Finally, we assayed all four variant types (alleles G and C at position 477, alleles A and T also at position 477, alleles C and A at position 478 and alleles G and A at position 328) by isostoichiometrically mixing appropriate combinations of reannealed molecules. 
We observed contributions from each relevant base from all variant pairs, suggesting that Diff-seq can identify multiple variant types in a complex mixture of molecules (Figure 2C). Diff-seq application to viral sequences We next applied Diff-seq to viral genome-derived sequences, in order to determine our ability to identify multiple variants within a single DNA molecule, at a range of abundances (Figure 3, Supplementary Table S1). First, we applied Diff-seq to two HIV clones of size 2.7 kb, that originated from a single individual before and after treatment with antivirals (Figure 3A–D). The two clones differ at 59 positions, all of which are SNPs (with 11 located within 2 nucleotides of other SNPs). One clone was mixed with the other at varying relative abundances: 50%, 10%, 5%, 1%, 0.5%, 0.1%, 0.05%, 0.01% and 0%. We observed substantial signal when the minority clone is present at as low as 5% frequency, with substantial signal degradation upon further dilution, though some variants still give Diff-seq coverage at even lower frequencies (Figure 3A). Figure 3. Differential sequencing application to viral genomes. (A) Differential sequencing libraries derived from a mixture of 2 viral clones were prepared as described in Figure 1 and sequenced on an Illumina MiSeq. The Diff-seq coverage frequencies, aggregated across both strands, are shown for a series of experiments where the contribution of the rare variant ranges from 50% to 0. Panels B–D focus on the 5% rare clone frequency dataset. (B and C) The log2 differential Diff-seq coverage between the 5% rare variant frequency and the control datasets were plotted against the log2 Diff-seq coverage per position for all positions of the template, considering (B) all and (C) the minor alleles. (D) ROC curves for the polymorphic sites called using different variables as predictors. 
The variables used are the sums of the per strand Diff-seq coverage enrichment scores per position (see text for details). (E and F) ROC curves for the sums of the per strand minor allele Diff-seq coverage enrichment scores per position, for PCR amplified viral population samples after Diff-seq application. (E) The sample corresponds to PID 5248 (8). The consensus sequence served as control and as a high-frequency variant, within which the population was diluted to 20%. (F) The sample corresponds to PID 30269 (8). 
For the 5% minority variant frequency dataset, we derived mismatch- and allele-specific contributions to the total and strand-specific signals for each of the variant positions (Supplementary Figure S4A). The mismatch bias, shown in the first column, shows how much detection of the one mismatched pair is favored over the other. The forward and reverse allele biases, shown in the next two columns, represent strand-specific biases for the alternative allele. From these we derived the average and maximum allele bias as figures of merit for variation detection. We also calculated the total allele bias, generated from the total reads aligning to the alternative allele on either strand compared to all reads aligning to reference. The sites have been ordered first by the identity of contributing alleles and then by the trinucleotide context. These data suggest that the detection of one mismatch pair or allele is often favored over the other, and this preference is consistent across positions carrying the same variation. The last column shows the log2 Diff-seq coverage frequencies. This coverage is largely independent of the variation and its context. In particular, five out of the eight least covered SNP positions are located within a very mutation-dense region of the substrate (positions 183, 184, 187, 188 and 190). We also estimated the contribution of the type of variation to the detection of each nucleotide, by calculating the average positional frequencies (across any trinucleotide context) of each relevant base for the 59 SNP positions of the 50% dataset, binned by type (Supplementary Figure S4B). These data suggest that the contribution of each nucleotide to the total signal is independent of the variation type. 
In particular, a G nucleotide gives the majority of the signal regardless of the identity of the other allele, whereas a mismatched C will give the least amount of signal, also regardless of the identity of the other allele (compare C reverse and G forward plots to C forward and G reverse plots). To explore possible variant calling algorithms for our data, we plotted, for the 5% minority variant frequency dataset, the log2 ratio of Diff-seq coverage frequency for experiment vs. control (y-axis) against the log2 Diff-seq coverage frequencies in the experimental sample (x-axis) either for all alleles (Figure 3B) or just the minor alleles (Figure 3C). Most variant sites, along with their neighboring sites, are more highly covered in the 5% frequency dataset when all alleles are considered (Figure 3B). However, when only the minor allele is considered, there is a clear separation of the variant and non-variant sites (Figure 3C). Based on these observations, we constructed potential SNP calling algorithms. We calculated z-scores for each position and strand, for the (i) total, (ii) minor allele, (iii) major allele depending on the nucleotide identity, (iv) minor allele depending on the nucleotide identity, (v) major allele depending on the trinucleotide context and (vi) minor allele depending on the trinucleotide context Diff-seq coverage. We used these per position z-scores (sum of z-scores of the forward and reverse strands) to generate ROC curves for the 5% frequency dataset (Figure 3D). The ranked z-scores for each of the variables used are shown in Supplementary Figure S5. We find that models that consider only the minor allele outperform those that consider the total coverage or the major allele coverage. Surprisingly, inclusion of allele identity and trinucleotide context (cases iv and vi) did not increase the predictive power of the model. 
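The scoring scheme described above, per-strand z-scores summed per position and evaluated with ROC curves, can be illustrated with the following minimal Python sketch. The text does not specify exactly how the z-scores were computed, so this sketch assumes they are taken over the log2 experiment-to-control ratio of minor-allele Diff-seq coverage across all positions. The function names and the pseudocount are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of values; returns zeros if there is no spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def enrichment_scores(exp_fwd, ctl_fwd, exp_rev, ctl_rev, pseudo=1e-6):
    """Per-position score: forward plus reverse z-score of log2(exp / control).

    Assumption: z-scores are computed across positions on each strand; the
    pseudocount avoids division by zero and is not taken from the paper.
    """
    def log_ratio(exp, ctl):
        return [math.log2((e + pseudo) / (c + pseudo)) for e, c in zip(exp, ctl)]

    z_fwd = zscores(log_ratio(exp_fwd, ctl_fwd))
    z_rev = zscores(log_ratio(exp_rev, ctl_rev))
    return [f + r for f, r in zip(z_fwd, z_rev)]


def roc_auc(scores, is_variant):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for s, v in zip(scores, is_variant) if v]
    neg = [s for s, v in zip(scores, is_variant) if not v]
    if not pos or not neg:
        raise ValueError("need at least one variant and one non-variant site")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy coverage frequencies for four positions; position 2 is the variant.
    exp = [0.01, 0.01, 0.20, 0.01]
    ctl = [0.01, 0.01, 0.01, 0.01]
    scores = enrichment_scores(exp, ctl, exp, ctl)
    print(roc_auc(scores, [False, False, True, False]))
```

Summing the two strand-specific z-scores rewards positions that are enriched on both strands, which is one plausible reason a per-position sum is used as the predictor.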
To further test our SNP calling approach we applied Diff-seq to two samples, comprising populations of amplified HIV pol gene isolated from HIV patient plasma (8) (Figure 3E and F, see Supplementary Table S2 for sequencing library metrics). We also sequenced these populations with standard Nextera library preparation to identify variants in these samples. We employed the population consensus sequence for the sample with PID 5248 in Varghese et al. (8), both as a control and as ‘reference genome’ to dilute and reanneal against the population of molecules to be assayed (see Materials and Methods). The second population corresponds to the sample with PID 30269 in Varghese et al. (8). Our SNP calling method performed well for both population samples. It also performed comparably when the population sample with PID 5248 was diluted down to 20% frequency in the consensus sequence (Figure 3E). Comparison of Diff-seq to standard Nextera library preparation for variant discovery To compare Diff-seq to standard, Nextera-based sequencing for variant discovery, we constructed Nextera libraries out of several dilutions of our clonal substrates (Diff-seq data shown in Figure 3A–D) and sequenced them on a MiSeq. We took into account the coverage as represented by the whole read for the Nextera data, and the Diff-seq coverage for the Diff-seq data (Supplementary Figure S2). Figure 4A–C shows the log2 frequencies of the non-reference alleles for all positions that were covered by both library preparations for non-reference:reference genome ratios 1:4, 1:24 and 1:124 (see Supplementary Figure S6 for a more complete set). The Nextera values for the non-variant positions, shown in grey, are presumably indicative of the MiSeq platform error rate. However, the corresponding Diff-seq values for the non-variant positions are increased. 
This detection of non-reference positions could either be the consequence of prior reactions, such as the extension reaction or due to misligation events, or result from true variant sites generated during amplification of the starting material. Our expectation was that the greater the dilution, the more the variant position cloud should deviate above the diagonal, as the Diff-seq data should be enriched in variant sites. For the 1:1 dilution (Supplementary Figure S6), although the SNP position cloud is located the furthest away from the error rate cloud, it does not deviate from the diagonal, and as expected the two methods perform similarly. As the dilution factor increases, Diff-seq has an advantage in the detection of the rare (non-reference) variant. The difference in the detection of rare variants between the two methods is summarized in Figure 4D, using the data shown in panels 4A–C and Supplementary Figure S6. Coverage frequencies of the rare variants can be 8–100× higher using Diff-seq compared to Nextera, depending on the dilution factor, when using only aligned reads and the Diff-seq coverage analysis. When the positional information of the Diff-seq dataset was not used and instead coverage derived from the whole read was considered, the rare variant frequencies were 4–14× higher for the Diff-seq data. Considering that the aligned reads were 25–60% of the total reads for the Diff-seq libraries, the number of reads required ranged from equal to the Nextera library requirement to 8.5 times lower. Figure 4. Comparison between Diff-seq and Nextera library preparations for the detection of variation in viral clones. (A–C) The log2 non-reference allele coverage frequencies for each position of the reference were plotted for the Diff-seq against Nextera library preparations. Non-variant positions, SNPs and dense SNPs are separately annotated. 1:4 (A), 1:24 (B) and 1:124 (C) rare:frequent molecule ratios are shown. 
(D) The log2 average of the ratios of the non-reference allele coverage frequencies Diff-seq/Nextera for the SNPs (positions shown in black or cyan in panels A–C, see Supplementary Figure S6 for all datasets) was plotted against the log2 dilution factor. Two Diff-seq sets of dilution series were employed and the points are colored accordingly. We also compared the costs of the Diff-seq and Nextera library preparation. The cost per Nextera library has been estimated by Baym et al. (33) to be $8 per sample, excluding the sequencing reagents themselves. The cost for the Diff-seq per library was estimated to be $67 per sample, also excluding sequencing reagents, but including the two qPCR quality control runs and assuming that 10 samples were being processed simultaneously. However, the increased library costs are somewhat offset by the potential decreased costs associated with the sequencing reagents themselves. As noted in the previous paragraph, that decrease will depend on the frequency of the variants, and may be up to 8.5 times less than the Nextera library requirements. 
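To make the cost trade-off concrete, the following Python sketch computes the per-sample Nextera sequencing spend at which the two approaches would cost the same, using the $8 and $67 library costs quoted above. It assumes that sequencing cost scales linearly with the number of reads required, which the text does not state; the function names are hypothetical.

```python
NEXTERA_LIBRARY_COST = 8.0    # USD per sample, from the text
DIFFSEQ_LIBRARY_COST = 67.0   # USD per sample, from the text


def total_cost(library_cost: float, sequencing_cost: float) -> float:
    """Library preparation plus sequencing reagents for one sample."""
    return library_cost + sequencing_cost


def break_even_sequencing_cost(read_reduction: float) -> float:
    """Nextera sequencing spend per sample at which both approaches cost the same.

    Assumption: sequencing cost is proportional to reads needed, so Diff-seq
    spends nextera_sequencing / read_reduction. Solving
    8 + S = 67 + S / r for S gives S = 59 / (1 - 1 / r).
    """
    if read_reduction <= 1.0:
        return float("inf")  # no sequencing saving, so Diff-seq never breaks even
    extra = DIFFSEQ_LIBRARY_COST - NEXTERA_LIBRARY_COST
    return extra / (1.0 - 1.0 / read_reduction)


if __name__ == "__main__":
    # 8.5 is the largest read reduction reported in the text.
    print(f"break-even at 8.5x: ${break_even_sequencing_cost(8.5):.2f} per sample")
```

Under this linear-cost assumption, Diff-seq becomes the cheaper option only when the Nextera sequencing spend per sample exceeds the break-even value, and it never does so when there is no reduction in required reads.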
DISCUSSION Here we describe Diff-seq, a new method for the detection of genetic variation within a population of DNA molecules. Diff-seq aims to outperform conventional high-throughput sequencing for SNP detection, though detection depends on both the identity of the contributing alleles and the context within which the variation arises. The method performed better with decreasing SNP density, while densely-spaced SNPs were not readily detected. The genomes we used did not contain indels, so the method was not tested for indel detection. It has been suggested that Surveyor endonuclease is less sensitive towards the detection of indels (37), so it would not be surprising if Diff-seq does not perform as well in that context. One advantage of Diff-seq, compared to other approaches for polymorphism discovery, is that it is less sensitive to sequencing errors. Sequencing capacity in Diff-seq is targeted towards the mismatch endonuclease digestion sites, and polymorphic sites will appear in the first few nucleotides of the sequencing read, significantly limiting the sources of sequencing errors. Because of this fundamental advantage, we anticipate that further assay optimization will increase sensitivity of rare variation detection to extremely low frequencies. We anticipate that polymorphism detection sensitivity may be improved through the use of an optimized reannealing protocol. Use of alternative endonucleases, such as T7E1 that perform better at indel positions (38), could expand the applicability of Diff-seq. Further optimization of other aspects of the method will likely also enhance its utility and ease of use. Currently the protocol requires eight enzymatic steps and a similar number of intermediate purification steps. Similarly to other methods, such as chromatin immunoprecipitation sequencing (ChIP-seq), each step is an opportunity for experimental variability and substrate loss. 
Decreasing or combining steps will certainly improve the efficiency of the protocol, and may also improve signal to background metrics. A stringent library size selection procedure prior to sequencing might also increase the fraction of usable reads and decrease the number of uninformative reads. Furthermore, the type IIS restriction enzyme currently used, AcuI, cuts only 16 bases away from its recognition site. For small genomes, such as the viral genomes used here, this small read size does not greatly affect mappability, however for larger genomes, unique mappability becomes problematic (39); a possible solution may be the use of combined sequence information from the forward and reverse reads that are adjacent to a specific polymorphism. MmeI and EcoP15I, which are type III restriction enzymes, have restriction sites further than 16 bases from their recognition site, yet also require two recognition sites in opposite orientation for efficient digestion, making them impractical for use in the Diff-seq protocol. To overcome this limitation, the type IIS restriction digestion might be eliminated altogether. However, in order to ligate the second sequencing adaptor, the end of the fragment would need to be rendered reactive, for example either by exonucleolytic cleavage of the dideoxynucleotide or by using a reversible terminator. Such an approach could, however, lead to the generation of highly variable DNA fragment lengths, which have strongly differential amplification and clustering efficiency on the Illumina flow cell. This variability might be partially alleviated by employing different sets of enzymes for restriction digestion for the generation of the initial fragments pre-reannealing. As it currently stands, the Diff-seq library preparation is ∼8 times more expensive than the Nextera library preparation, because large amounts of multiple enzymes are needed, and there are multiple clean up steps. 
Finally, the quality control for Diff-seq is also a considerable part of the expense, with a current cost of $30 per qPCR run, which adds ∼$6 to the cost of a library. Future improvements should be targeted at increasing efficiency and decreasing these costs. Diff-seq may be suitable for a variety of applications, from genotyping to estimating DNA polymerase error rates. In principle, any method applied to an already known genome can employ Diff-seq immediately prior to sequencing, to target sequencing capacity towards polymorphic sites, and in this way increase reads covering low frequency alleles. We note however that a potential limitation of the method is that it relies on the generation of mismatches for the detection of variation. Thus, Diff-seq using a diploid genome as the input DNA would only be expected to detect heterozygous alleles, and would not identify homozygous or hemizygous variants. This limitation might be overcome by adding a known consensus sequence or reference genome to the sample. Diff-seq application to more complex genomes will require robust DNA reannealing methods, such as oscillating phenol emulsion reassociation technique (osPERT) (40). Repetitive sequences reanneal more rapidly due to their higher relative abundance, and thus polymorphisms within these regions may be especially amenable to detection using Diff-seq. Alternatively, because of the kinetic separation of annealing times for repetitive regions, such regions can be removed (e.g. by retention on a hydroxyapatite column) (41). Additional annealing dynamics likely arise due to genome complexity, but how these will affect the outcome of Diff-seq is difficult to predict and account for at this stage. Potential application of Diff-seq to more complex genomes opens several exciting possibilities. 
For example, carrying out Diff-seq during exome capture may provide improved coverage of relevant polymorphic regions, potentially decreasing the sequencing resources required for comparable amounts of information. Furthermore, variant discovery in circulating tumor cells has been demonstrated through arduous enrichment approaches coupled with massive sequencing efforts involving multiple independent library preparations to distinguish SNPs from errors introduced during amplification and sequencing (42). Approaches for detecting somatic mutations associated with cancer from cell-free DNA likewise involve extremely deep, molecularly tagged sequencing. Diff-seq, coupled with significant improvements in efficiency, has the potential to substantially reduce the required sequencing capacity of these workflows by moving the mismatch detection ‘up front’ and focusing sequencing capacity on polymorphic regions. DATA AVAILABILITY The custom written software used for data analysis is available at https://github.com/Sherlock-Lab/Diff-seq, https://zenodo.org/badge/latestdoi/101428770. The datasets generated and analyzed during the current study are available in the Sequencing Read Archive, under study accession SRP116147 (https://submit.ncbi.nlm.nih.gov/subs/sra/). Samples description can be found under bioproject accession number PRJNA397463. PCR sequences of population samples can be found under accession numbers GQ212684 and GQ212685 for sample with PID 30269 and GQ206339 and GQ206337 for sample with PID 5248 (8). SUPPLEMENTARY DATA Supplementary Data are available at NAR online. ACKNOWLEDGEMENTS The authors would like to thank Dr Mark Holodniy and Dr Mark Winters for their generous gift of HIV clones. We would also like to thank Mia Jaffe and Viviana Risca for useful discussions and comments. 
FUNDING Stanford Transformative Innovation Fund in Basic Biomedical Sciences Grant (to G.S.); National Institutes of Health Grants P50 HG007735 and U19AI057266 and Rita Allen Foundation and Human Frontiers Science Program (to W.J.G.); US Department of Defense National Defense Science and Engineering Graduate (NDSEG) Fellowship and Stanford Graduate Fellowship (to N.A.S.A.). W.J.G. is a Chan Zuckerberg Biohub investigator. Funding for open access charge: Stanford internal fund. Conflict of interest statement. D.A., V.O.K., W.J.G. and G.S. have applied for a related patent. REFERENCES
1. Hayward N.K., Wilmott J.S., Waddell N., Johansson P.A., Field M.A., Nones K., Patch A.-M., Kakavand H., Alexandrov L.B., Burke H. et al. Whole-genome landscapes of major melanoma subtypes. Nature. 2017; 545: 175–180.
2. Long A., Liti G., Luptak A., Tenaillon O. Elucidating the molecular architecture of adaptation via evolve and resequence experiments. Nat. Rev. Genet. 2015; 16: 567–582.
3. Flowers J.M., Hazzouri K.M., Pham G.M., Rosas U., Bahmani T., Khraiwesh B., Nelson D.R., Jijakli K., Abdrabu R., Harris E.H. et al. Whole-genome resequencing reveals extensive natural variation in the model green alga Chlamydomonas reinhardtii. Plant Cell. 2015; 27: 2353–2369.
4. Huang M., Bai Y., Sjostrom S.L., Hallström B.M., Liu Z., Petranovic D., Uhlén M., Joensson H.N., Andersson-Svahn H., Nielsen J. Microfluidic screening and whole-genome sequencing identifies mutations associated with improved protein secretion by yeast. Proc. Natl. Acad. Sci. U.S.A. 2015; 112: E4689–E4696.
5. Kvitek D.J., Sherlock G. Reciprocal sign epistasis between frequently experimentally evolved adaptive mutations causes a rugged fitness landscape. PLoS Genet. 2011; 7: e1002056.
6. Flaherty P., Natsoulis G., Muralidharan O., Winters M., Buenrostro J., Bell J., Brown S., Holodniy M., Zhang N., Ji H.P. Ultrasensitive detection of rare mutations using next-generation targeted resequencing. Nucleic Acids Res. 2011; 40: e2.
7. Kvitek D.J., Sherlock G. Whole genome, whole population sequencing reveals that loss of signaling networks is the major adaptive strategy in a constant environment. PLoS Genet. 2013; 9: e1003972.
8. Varghese V., Shahriar R., Rhee S.-Y., Liu T., Simen B.B., Egholm M., Hanczaruk B., Blake L.A., Gharizadeh B., Babrzadeh F. et al. Minority variants associated with transmitted and acquired HIV-1 nonnucleoside reverse transcriptase inhibitor resistance: implications for the use of second-generation nonnucleoside reverse transcriptase inhibitors. JAIDS J. Acquired Immune Defic. Syndromes. 2009; 52: 309–315.
9. Acevedo A., Andino R. Library preparation for highly accurate population sequencing of RNA viruses. Nat. Protoc. 2014; 9: 1760–1769.
10. Lizardi P.M., Huang X., Zhu Z., Bray-Ward P., Thomas D.C., Ward D.C. Mutation detection and single-molecule counting using isothermal rolling-circle amplification. Nat. Genet. 1998; 19: 225–232.
11. Hardenbol P., Banér J., Jain M., Nilsson M., Namsaraev E.A., Karlin-Neumann G.A., Fakhrai-Rad H., Ronaghi M., Willis T.D., Landegren U. et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat. Biotech. 2003; 21: 673–678.
12. Xu Q., Huang S.Q., Ma F., Tang B., Zhang C.Y. Controllable mismatched ligation for bioluminescence screening of known and unknown mutations. Anal. Chem. 2016; 88: 2431–2439.
13. Morita A., Nakayama T., Doba N., Hinohara S., Mizutani T., Soma M. Genotyping of triallelic SNPs using TaqMan® PCR. Mol. Cell. Probes. 2007; 21: 171–176.
14. Cheng W., Zhang W., Yan Y., Shen B., Zhu D., Lei P., Ding S. A novel electrochemical biosensor for ultrasensitive and specific detection of DNA based on molecular beacon mediated circular strand displacement and rolling circle amplification. Biosens. Bioelectron. 2014; 62: 274–279.
15. De La Vega F.M., Lazaruk K.D., Rhodes M.D., Wenz M.H. Assessment of two flexible and compatible SNP genotyping platforms: TaqMan® SNP genotyping assays and the SNPlex™ genotyping system. Mut. Res./Fundam. Mol. Mech. Mutagen. 2005; 573: 111–135.
16. Li L., Li C., Zhang S., Zhao S., Liu Y., Lin Y. Analysis of 14 highly informative SNP markers on X chromosome by TaqMan® SNP genotyping assay. Forensic Sci. Int.: Genet. 2010; 4: e145–e148.
17. Tyagi S., Kramer F.R. Molecular beacons: probes that fluoresce upon hybridization. Nat. Biotech. 1996; 14: 303–308.
18. Wagner R., Debble P., Radman M. Mutation detection using immobilized mismatch binding protein (MutS). Nucleic Acids Res. 1995; 23: 3944–3948.
19. Youil R., Kemper B.W., Cotton R.G. Screening for mutations by enzyme mismatch cleavage with T4 endonuclease VII. Proc. Natl. Acad. Sci. U.S.A. 1995; 92: 87–91.
20. Till B.J., Burtner C., Comai L., Henikoff S. Mismatch cleavage by single-strand specific nucleases. Nucleic Acids Res. 2004; 32: 2632–2641.
21. Babon J.J., McKenzie M., Cotton R.G.H. The use of resolvases T4 endonuclease VII and T7 endonuclease I in mutation detection. MB. 2003; 23: 73–82.
22. Bannwarth S., Procaccio V., Paquis-Flucklinger V. Rapid identification of unknown heteroplasmic mutations across the entire human mitochondrial genome with mismatch-specific Surveyor nuclease. Nat. Protoc. 2006; 1: 2037–2047.
23. Oleykowski C.A., Bronson Mullins C.R., Godwin A.K., Yeung A.T. Mutation detection using a novel plant endonuclease. Nucleic Acids Res. 1998; 26: 4597–4602.
24. Qiu P., Shandilya H., D’Alessio J.M., O’Connor K., Durocher J., Gerard G.F. Mutation detection using Surveyor™ nuclease. BioTechniques. 2004; 36: 702–707.
25. Nelson S.F., McCusker J.H., Sander M.A., Kee Y., Modrich P., Brown P.O. Genomic mismatch scanning: a new approach to genetic linkage mapping. Nature Genetics. 1993; 4: 11–18.
26. Colbert T., Till B.J., Tompa R., Reynolds S., Steine M.N., Yeung A.T., McCallum C.M., Comai L., Henikoff S. High-throughput screening for induced point mutations. Plant Physiol. 2001; 126: 480–484.
27. Comai L., Young K., Till B.J., Reynolds S.H., Greene E.A., Codomo C.A., Enns L.C., Johnson J.E., Burtner C., Odden A.R. et al. Efficient discovery of DNA polymorphisms in natural populations by Ecotilling. Plant J. 2004; 37: 778–786.
28. Comeron J.M., Reed J., Christie M., Jacobs J.S., Dierdorff J., Eberl D.F., Robert Manak J. A mismatch endoNuclease array-based methodology (MENA) for identifying known SNPs or novel point mutations. Microarrays. 2016; 5: 7.
29. Till B.J., Reynolds S.H., Greene E.A., Codomo C.A., Enns L.C., Johnson J.E., Burtner C., Odden A.R., Young K., Taylor N.E. et al. Large-scale discovery of induced point mutations with high-throughput tilling. Genome Res. 2003; 13: 524–530.
30. Magoč T., Salzberg S.L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011; 27: 2957–2963.
31. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet j. 2011; 17: 10–12.
32. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012; 9: 357–359.
33. Baym M., Kryazhimskiy S., Lieberman T.D., Chung H., Desai M.M., Kishony R. Inexpensive multiplexed library preparation for megabase-sized genomes. PLoS ONE. 2015; 10: e0128036.
34. Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009; 25: 1754–1760.
35. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M. et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010; 20: 1297–1303.
36. Huang M.C., Cheong W.C., Lim L.S., Li M.H. A simple, high sensitivity mutation screening using Ampligase mediated T7 endonuclease I and Surveyor nuclease with microfluidic capillary electrophoresis. Electrophoresis. 2012; 33: 788–796.
37. Voskarides K., Deltas C. Screening for mutations in kidney-related genes using SURVEYOR nuclease for cleavage at heteroduplex mismatches. J. Mol. Diagn. 2009; 11: 311–318.
38. Vouillot L., Thélie A., Pollet N. Comparison of T7E1 and surveyor mismatch cleavage assays to detect mutations triggered by engineered nucleases. G3. 2015; 5: 407–415.
39. Lee H., Schatz M.C. Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score. Bioinformatics. 2012; 28: 2097–2105.
40. Bruzel A., Cheung V.G. DNA reassociation using oscillating phenol emulsions. Genomics. 2006; 87: 286–289.
41. Britten R.J., Kohne D.E. Repeated sequences in DNA. Science. 1968; 161: 529–540.
42. Lohr J.G., Ly A., Adalsteinsson V.A., Cibulskis K., Choudhury A.D., Rosenberg M., Cruz-Gordillo P., Francis J.M., Zhang C.-Z., Shalek A.K. et al. Whole-exome sequencing of circulating tumor cells provides a window into metastatic prostate cancer. Nat. Biotech. 2014; 32: 479–484.
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
An interplay of miRNA abundance and target site architecture determines miRNA activity and specificity
Brancati, Giovanna; Großhans, Helge
doi: 10.1093/nar/gky201; pmid: 29897601
Abstract MicroRNAs often occur in families whose members share an identical 5′ terminal ‘seed’ sequence. The seed is a major determinant of miRNA activity, and family members are thought to act redundantly on target mRNAs with perfect seed matches, i.e. sequences complementary to the seed. However, recently sequences outside the seed were reported to promote silencing by individual miRNA family members. Here, we examine this concept and the importance of miRNA specificity for the robustness of developmental gene control. Using the let-7 miRNA family in Caenorhabditis elegans, we find that seed match imperfections can increase specificity by requiring extensive pairing outside the miRNA seed region for efficient silencing and that such specificity is needed for faithful worm development. In addition, for some target site architectures, elevated miRNA levels can compensate for a lack of complementarity outside the seed. Thus, some target sites require higher miRNA concentration for silencing than others, contrasting with a traditional binary distinction between functional and non-functional sites. We conclude that changing miRNA concentrations can alter cellular miRNA target repertoires. This diversifies possible biological outcomes of miRNA-mediated gene regulation and stresses the importance of target validation under physiological conditions to understand miRNA functions in vivo. INTRODUCTION MicroRNAs (miRNAs) are small RNAs of about 22 nucleotides that silence target messenger RNAs by binding to partially complementary sequences in their 3′ untranslated regions (3′UTRs). miRNAs are loaded onto an Argonaute (Ago) protein to form the core of the miRNA-induced silencing complex (miRISC), which induces decay or translational repression of the targets (1). Conceptually, miRNAs can be separated into two parts: the ‘seed’, comprising nucleotides two through eight, and the ‘seed-distal’ 3′ end (Figure 1A). 
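The relationship between a miRNA seed (nucleotides two through eight) and a perfect seed match (the complementary sequence in the target) can be illustrated with a short Python sketch. This is a minimal illustration, not a target prediction method. The let-7 sequence used below is taken from public miRNA annotation and not from this text, so it should be treated as an assumption; the function names and the toy 3′UTR fragment are hypothetical. Only perfect Watson-Crick seed matches are found; sites with bulges or G:U wobble pairs are deliberately not detected.

```python
# Watson-Crick complements for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Assumed mature let-7 sequence (5' to 3'), from public annotation.
LET7 = "UGAGGUAGUAGGUUGUAUAGUU"


def seed(mirna: str) -> str:
    """Return the seed: nucleotides 2 through 8 of the miRNA (1-based)."""
    return mirna[1:8]


def seed_match(mirna: str) -> str:
    """Return the target heptamer (5' to 3') that pairs perfectly with the seed."""
    # The target pairs antiparallel to the miRNA, hence the reverse complement.
    return "".join(COMPLEMENT[base] for base in reversed(seed(mirna)))


def find_perfect_seed_matches(utr: str, mirna: str) -> list[int]:
    """Return 0-based start positions of perfect seed matches in a 3'UTR."""
    site = seed_match(mirna)
    return [
        start
        for start in range(len(utr) - len(site) + 1)
        if utr[start:start + len(site)] == site
    ]


if __name__ == "__main__":
    toy_utr = "AAACUACCUCAAA"  # hypothetical fragment with one perfect site
    print(seed(LET7), seed_match(LET7), find_perfect_seed_matches(toy_utr, LET7))
```

Because all members of a miRNA family share the same seed, a search of this kind cannot by itself distinguish between family members, which is why seed-distal pairing is considered separately.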
The seed sequence has emerged as the main determinant for target identification (2). Usually, functional miRNA targets contain ‘seed matches’, heptamers that base pair with perfect Watson-Crick complementarity to the miRNA seed. These were found to be necessary and sufficient for silencing in studies using ectopic miRNA expression (3–5). Structural and biochemical analyses of miRISC have provided an explanation for these results: the seed of a miRNA bound by Ago exists in a pre-arranged conformation, thus reducing the entropic cost of binding and favoring duplex formation with a target (6–8). Figure 1. let-7 becomes dispensable for viability when the lin-41 3′UTR contains perfect seed match sites. (A) Schematic drawing of a miRNA/target duplex with seed (nucleotides 2–8)/seed match and limited seed-distal pairing indicated. Top mRNA, bottom miRNA. (B) The let-7 family with the seed sequence (nucleotides 2–8) highlighted in magenta. (C) The two let-7 complementary sites (LCS1 and LCS2) in the lin-41 3′UTR of C. elegans. Each site contains an imperfect seed match (a bulged A and a G: U wobble, respectively, in bold) to the let-7 family and an extensive seed-distal pairing to let-7 only. The sites are separated by 27 nt of intervening sequence (dashed line). (D, E) Representative images of animals carrying the let-7ts mutation and (D) wild-type lin-41 or (E) the lin-41(xe83[perfect]) allele with perfect seed match to the let-7 family and unchanged seed-distal region. Animals were grown at 25°C. let-7ts: let-7(n2853) X, temperature sensitive lesion. miRNA site legend: magenta = seed/seed match; cyan = let-7 seed-distal binding. miRNAs frequently occur in families that share the seed sequence but differ in the seed-distal part. Given the reliance of target silencing on seed matches, it is assumed that miRNA family members can function redundantly, and most computational approaches that predict miRNA targets make predictions for miRNA families rather than for individual miRNAs (2). Consequently, it was hypothesized that in order to attain specificity among family members, miRNAs require imperfect seed matches. In this scenario, an imperfect seed match impairs binding and activity of most family members, but extensive seed-distal base pairing would enable a specific family member to compensate for the unfavorable seed binding (3). However, high-throughput biochemical capture of Ago-bound miRNA/target duplexes revealed numerous instances of interactions that frequently extended beyond the seed, to involve the seed-distal parts of the miRNA (9–12). In cell culture and in vivo assays, some of the targets that could base pair through their seed-distal parts were silenced preferentially by specific family members (9,12). 
Because such specificity also occurred for target sites with perfect seed matches, these findings argued that seed match imperfections might not be a requirement for miRNA family member specificity. By contrast, specificity of miRNA silencing through seed mismatches would explain why members of the let-7 family of Caenorhabditis elegans have partially non-redundant functions. Indeed, among four members with overlapping expression patterns (13), let-7, miR-48, miR-84 and miR-241 (Figure 1B), only let-7 is essential for viability (14,15). let-7 ensures proper development of C. elegans by repressing one crucial target, lin-41 (16,17), whose 3′UTR contains two functional let-7 binding sites (let-7 complementary sites, LCSs) (18). Both LCSs contain imperfect seed matches, which yield a bulged-out nucleotide and a G:U wobble base pair, respectively (Figure 1C). Moreover, both LCSs exhibit extensive complementarity to the seed-distal sequence of let-7 but to none of its sisters. Here, we test if this miRNA site architecture ensures specific silencing by let-7 and explore miRNA site architectures as a mechanism for the selectivity of different family members towards distinct targets. We show that extensive seed-distal pairing favors miRNA silencing by an individual miRNA family member even when the seed match is perfect, but that an imperfect seed match greatly enhances this family member specificity. Thus, we find that perturbing let-7-specific regulation of lin-41, by introducing a perfect seed match, impairs normal C. elegans development by allowing the let-7 family sisters miR-48, miR-241 and miR-84 to prematurely silence lin-41. Moreover, specificity of targets with perfect or nearly perfect seed matches can be overcome through elevated levels of a miRNA that is incapable of seed-distal pairing. Hence, although sequence-instructed, specificity is not fully hard-wired and can be altered by changes in miRNA expression levels. 
Our observations are consistent with a model where let-7 family miRNAs act as rheostats (19), such that the interplay of target site architecture and miRNA abundance determines the extent of target silencing. This flexible targeting mechanism expands the regulatory potential of miRNA families and indicates that miRNA activity may differ on bona fide targets at a given miRNA concentration. Conversely, alterations in miRNA concentrations may then change the miRNA target repertoire, expanding the range of possible biological outcomes, and revealing a need for target validation under physiological conditions to understand miRNA function in vivo. MATERIALS AND METHODS Worm handling and strains Worms were grown using standard methods at 25°C. The transgenic unc-54 + miRNA sites reporter strains were obtained by single-copy integration into the ttTi5605 locus on chromosome II (20). Injected plasmids were cloned using the MultiSite Gateway Technology (Thermo Fisher Scientific) and the destination vector pCFJ150 (21) or Gibson assembly (22). All strains are listed in Supplementary Table S1. unc-54 + miRNA sites reporters All unc-54 + miRNA sites reporters were constructed using the MultiSite Gateway Technology (Thermo Fisher Scientific) and the destination vector pCFJ150 (21) or Gibson assembly (22). First, the pGB0 vector was obtained via site-directed mutagenesis (23) of the pDONR P2R-P3_p37 vector to insert the AscI restriction site. Then, the pGB01 plasmid was obtained via LR reaction (Gateway LR Clonase II Enzyme mix, Thermo Fisher Scientific; 11791020) of the three entry vectors pdpy-30 x pGFP::H2B x pGB0 and the pCFJ150 backbone. All the plasmids listed in Supplementary Table S2 were obtained via Gibson assembly of the digested pGB01 plasmid and gBlocks® Gene Fragments (Integrated DNA Technologies) listed below. All plasmids were verified by sequencing. 
Transgenic worms were obtained by single-copy integration into the ttTi5605 locus on chromosome II, following the published protocol for injection with low DNA concentration (20). We optimized our previous mCherry reference transgene (16) by replacing the artificial 3′UTR with an endogenous unc-54 3′UTR, to achieve more physiologic and brighter expression. The resulting Pdpy-30::mCherry::H2B::unc-54 transgene was integrated on chromosome I to yield strain HW1454. Genome editing Mutations in the endogenous lin-41 3′UTR sequence were obtained by CRISPR-Cas9 to generate the lin-41(xe83[perfect]), lin-41(xe76[ap427_W-C]), and lin-41(xe99[48-ized]) alleles. Wild-type worms were injected as described in (24) with a mix containing 50 ng/μl pIK155, 100 ng/μl of each pGB48 and plin-41sgRNA, 20 ng/μl repair oligo (see Supplementary Table S4), dpy-10 co-crispr mix containing 100 ng/ml pIK208 (Addgene plasmid #65630) and 20 ng/ml AF-ZF-827 oligo PAGE purified (IDT). Single F1 roller progeny of injected wild-type worms were picked to individual plates and the F2 progeny screened for the mutated allele using PCR assays and sequencing (Supplementary Table S3). The alleles were outcrossed three times to the wild-type strain. let-7 over-expression A let-7(++) strain (HW 1909 [xeSi287, V]) was obtained by injection of the plasmid pGB26, obtained via Gibson assembly of the PCR amplified minimal rescue fragment from (15) and the pIK37 plasmid. Transgenic worms were obtained by single-copy integration into the oxTi365 locus on chromosome V (universal MosSCI strain #EG8082) (25). Reporter quantification For confocal assays, worms were grown at 25°C. Let-7ts worms were maintained at 15°C and adults were transferred to 25°C for 48 h before imaging. Z-stacks of 0.313 μm thickness were acquired in green, red and transmitted light channels at 40× magnification on a Zeiss LSM700 confocal microscope coupled to Zeiss Zen 2010 software equipped with a multi-position tile scan macro. 
The z-stacks were stitched together and compiled into a single image using scripts in Matlab and Fiji (26). For data analysis, late L4 worms were selected based on visual inspection of gonad length and vulva morphology (27). Ten to fourteen vulva cells were selected in the ‘cell counter’ macro in Fiji. Images around these seed points were de-noised using a Richardson-Lucy algorithm and segmented using an Otsu global threshold. Remaining holes were filled using a morphological filter. Signal intensity in the green channel was divided by the red signal intensity for each cell; relative signal intensities were then averaged for each worm. 10–12 vulva cells in 5–10 worms per genotype were quantified, mean signal intensity and SD were calculated and graphed using GraphPad Prism software. Confocal analysis of LIN-29 precocious accumulation Synchronized arrested L1 larvae of animals carrying endogenously tagged LIN-29, lin-29(xe61[lin-29::gfp::3xflag]) (28), in wild-type or lin-41(xe83[perfect]) background, were plated on food and incubated at 25°C on 2% NGM agar plates with Escherichia coli OP50 bacteria and imaged at the L3 stage (20–22 h after plating). Images were acquired in green and transmitted light channels (with Differential Interference Contrast, DIC) with 40×/1.3 oil immersion objective on a Zeiss LSM700 confocal microscope coupled to Zeiss Zen 2010 software. Further image processing was performed with Fiji (26). RESULTS Perfect seed matches in the lin-41 3′ UTR make let-7 miRNA dispensable for animal viability Specific regulation of lin-41 by let-7 and not by its sisters was previously speculated (3) to derive from the imperfect seed-matches in the two let-7 miRNA Complementary Sites (LCS1 and LCS2) in the lin-41 3′UTR (Figure 1C (18,29)). When bound by let-7 family miRNAs, the seed match sequences of LCS1 and LCS2 generate an A-bulge and a G:U wobble pair. Both sites contain seed-distal complementarity to let-7, but not to its sisters. 
However, Broughton and colleagues recently identified a target site in the 3′UTR of dot-1.1 that appeared specific to the let-7 family member miR-48 in the absence of seed match imperfections (9). Given this unexpected finding, we tested the possibility that seed mismatches in LCS1 and LCS2 were similarly dispensable for specific recognition by let-7. To this end, we generated a lin-41 allele, lin-41(xe83[perfect]), which differs from the wild-type allele in two nucleotides: We eliminated the A bulge in LCS1 and converted the G:U wobble pair of LCS2 into a standard Watson-Crick base pair. Strikingly, these two nucleotide changes rescued the larval lethality caused by loss of let-7, both in the let-7(mn112) null mutant strain and the let-7(n2853) temperature-sensitive strain (henceforth let-7ts), which recapitulates the let-7 null phenotype at the restrictive temperature, 25°C ((15), Figure 1D and E). Thus, ≥98% (N = 3, each with n ≥ 200 animals) of lin-41(xe83[perfect]); let-7ts double mutant animals survived into adulthood, as did 100% (N = 2, n ≥ 98 animals) of lin-41(xe83[perfect]); let-7(mn112) double mutant animals, of which 6% subsequently died as adults. These findings suggest that seed mismatches are required to restrict silencing of lin-41 to let-7, because other let-7 family members confer silencing in their absence. A perfect seed match allows redundant activity of the let-7 sisters To confirm that the perfect seed matches of the lin-41(xe83[perfect]) allele allow redundant binding of the let-7 family, we monitored the activity of the four miRNAs through a GFP reporter modified from (16) (Materials and Methods). In our assay, each animal contains a red mCherry reporter, which is used as reference during image analysis, and a GFP reporter, which is the miRNA activity sensor (Figure 2A). Both reporters are driven by the ubiquitous and constitutively active dpy-30 promoter and contain the unc-54 3′UTR, generally thought to be devoid of regulatory elements. 
Finally, each reporter is integrated by Mos1-mediated single copy integration into a distinct genomic location (20).

Figure 2. Redundant activity of the let-7 family in the presence of a perfect seed match. (A) Schematic of the reporters used to monitor miRNA activity in vivo. The depicted GFP transgene unc-54 + let-7 sites reporter contains 111 nucleotides of the lin-41 3′UTR (shaded in blue), which harbor the two let-7 binding sites and the 27 nt-long intervening sequence, grafted into the heterologous, unregulated unc-54 3′UTR. Worms also contain a red mCherry reporter for normalization. Transcription of the single-copy integrated reporters from the ubiquitously active dpy-30 promoter is constitutive. miRNA site legend: magenta = seed/seed match; cyan = let-7 seed-distal binding. (B) Representative confocal images of the vulvae of animals carrying the red mCherry reporter (for normalization) and GFP reporters with the indicated 3′UTRs. These are ‘lin-41 3′UTR full-length’, ‘unc-54’ (CTRL, unregulated) and ‘unc-54 + let-7 sites’ in wild-type and ‘unc-54 + let-7 sites’ in the let-7ts background. Images are merged GFP, mCherry and DIC channels. Red color indicates a greater, and green color a lesser degree of reporter repression. Dashed lines outline the vulvae of the animals, which confirm appropriate late Larval stage 4 (L4). Scale bars 15 μm. (C, D) Quantification of (C) ‘unc-54 + let-7 sites’ reporter, (D) ‘unc-54 + let-7 sites_perfect seed match’ reporter. Each dot represents the average of the GFP signal intensity, obtained by confocal imaging, divided by the mCherry intensity for a single animal per condition. 10–12 vulva cells were quantified per worm. Mean values are normalized to the average value of the GFP/mCherry ratio of the negative control unc-54 3′UTR reporter, which is not silenced. Horizontal line and error bars indicate mean values per condition ± SD. *P < 0.05 and ***P < 0.001, two-tailed unpaired t-test. 
For reference, data obtained for the unc-54, Neg.Control reporter are replotted in panel D; gray shading is bounded by the min-max values of this control.

To monitor let-7 activity, we generated the reporter ‘unc-54 + let-7 sites’ in which only a stretch of 111 nucleotides of the lin-41 3′UTR, comprising LCS1 and LCS2, was transplanted into the unc-54 3′UTR (Figure 2A). Silencing of this minimal target reporter by let-7 was comparable to that of a reporter containing the full-length lin-41 3′UTR (Figure 2B, C and Supplementary Figure S1A), confirming functionality. We focused our analysis on the vulva because lin-41 repression by let-7 in this organ is required and likely sufficient to prevent vulval rupturing (16). As expected, the ‘unc-54 + let-7 sites’ reporter was expressed in young L1 or L2 animals (Supplementary Figure S1B), when the let-7 family levels are low (30). Moreover, it was robustly silenced in older, L4-stage larvae, when let-7 family levels are high (Figure 2B and C). Finally, it was de-silenced in let-7ts animals, but not in animals lacking the three let-7 sisters ([mir-48/mir-241(ndf51)V, mir-84(n4037)X], henceforth mir-48/241/84(–)) (Figure 2C). Therefore, the stretch of 111 nucleotides suffices for efficient and specific let-7-dependent silencing. Next, we generated an ‘unc-54 + let-7 sites_perfect seed match’ reporter, modified to contain LCSs with perfect seed matches, as in the endogenous lin-41(xe83[perfect]) mutation (Figure 1E). Like the ‘unc-54 + let-7 sites’ reporter, the new reporter was expressed in young L1 or L2 animals (Supplementary Figure S1B), but robustly silenced in L4-stage larvae (Figure 2D). However, unlike the ‘unc-54 + let-7 sites’ reporter, the new reporter was only marginally de-repressed in L4-stage larvae lacking let-7 (let-7ts) or the three let-7 sisters (mir-48/241/84(–)) (Figure 2D). 
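The reporter quantification described in Materials and Methods and in the Figure 2 legend (per-cell GFP/mCherry ratio, averaged per worm, then normalized to the negative control) can be illustrated with a minimal Python sketch. The function names and all intensity values below are hypothetical and chosen only for illustration; the sketch also assumes that the reported SD is the sample standard deviation, which the text does not state explicitly.

```python
from statistics import mean, stdev


def worm_ratio(cells):
    """Average the per-cell GFP/mCherry ratios of one worm.

    cells: list of (gfp_intensity, mcherry_intensity) tuples, one per vulva cell.
    """
    return mean([gfp / mcherry for gfp, mcherry in cells])


def normalized_summary(reporter_worms, control_worms):
    """Normalize per-worm ratios to the mean ratio of the negative control.

    Returns (normalized per-worm values, their mean, their sample SD).
    """
    control_mean = mean([worm_ratio(w) for w in control_worms])
    values = [worm_ratio(w) / control_mean for w in reporter_worms]
    return values, mean(values), stdev(values)


# Hypothetical intensities (arbitrary units), two cells per worm for brevity.
control_worms = [
    [(100.0, 100.0), (120.0, 100.0)],
    [(90.0, 100.0), (110.0, 100.0)],
]
reporter_worms = [
    [(21.0, 100.0), (21.0, 100.0)],
    [(42.0, 100.0), (42.0, 100.0)],
]

values, avg, sd = normalized_summary(reporter_worms, control_worms)
print(values, avg, sd)
```

Each element of `values` corresponds to one dot in a plot such as Figure 2C or D, and a normalized value well below 1 indicates silencing relative to the unregulated unc-54 control.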
A seed-distal match establishes specificity to one miRNA in the presence of an imperfect seed match Taken together, the genetic interaction and the reporter assay data presented thus far validate the hypothesis that the seed mismatches in the let-7 complementarity sites of lin-41 are necessary for specific regulation of lin-41 by let-7, to the apparent exclusion of the other family members. However, this conclusion appears at odds with the results of biochemical miRNA–mRNA duplex identification, which indicate preferential target binding by individual family members even in the presence of perfect seed matches (9,12). Thus, to challenge our finding, we sought to reprogram the LCSs to another let-7 family member, miR-48, and test the effect of seed match imperfections. We chose miR-48 because its expression levels and spatial expression patterns appear very similar to those of let-7 (13,14,31). Because structural data suggest that base pairing between nucleotides 13–16 of the miRNA and a target may be favored (8), we started out by generating a reporter with seed-distal base pairing to only these nucleotides. However, this reporter failed to be silenced even in wild-type conditions, i.e. with both let-7 and miR-48 present (Figure 3A and Supplementary Figure S2A). Hence, it appears that more extensive seed-distal complementarity is required for functionality of targets with a sub-optimal seed match. Indeed, an ‘unc-54 + miR-48 sites’ reporter that emulated the LCS architecture by carrying a central bulge in the seed sequence and an extensive seed-distal match to miR-48 (Supplementary Figure S2B), was silenced in L4 stage animals. Moreover, and in agreement with our predictions, the ‘unc-54 + miR-48 sites’ reporter was repressed at the L4 stage in both the presence and absence of let-7 miRNA, but became de-repressed when miR-48 was absent (Figure 3B).

Figure 3. Imperfect seed matches and extensive 3′ pairing confer target specificity. 
(A–C) Reporter quantification as in Figure 2, from which the negative control data (black dots) are also replotted for reference; gray shading is bounded by the min-max values of the negative control. (A) The ‘unc-54 + let-7 sites 13–16miR-48-ized’ reporter contains let-7 complementary sites modified to pair miR-48 at position 13–16 but not other seed-distal nucleotides (gray dots). Results from the unmodified ‘unc-54 + let-7 sites’ reporter in wild-type and let-7ts mutant background are from Figure 2C and included for reference (cyan dots). (B) The ‘unc-54 + miR-48 sites’ reporter combines extensive seed-distal complementarity to miR-48 with seed match imperfections whereas (C) the ‘unc-54 + miR-48 sites_perfect seed match’ reporter contains extensive seed-distal complementarity to miR-48 and perfect seed matches. Horizontal line and error bars indicate mean values per condition ± SD. *P < 0.05 and ***P < 0.001, two-tailed unpaired t-test. (D) Animals carrying the lin-41(ap427[dot-1.1_G:U]) (9) allele die in the absence of miR-48. (E) Survival of strain lin-41(xe76[dot-1.1_W-C]) upon manipulation of let-7 and miR-48 activity. In this strain, a U at position 8 in the two target sites of the lin-41(ap427[dot-1.1_G:U]) allele has been converted to a C, to permit Watson-Crick instead of G:U wobble base-pairing with the let-7 family seed sequence (Supplementary Figure S2C and D). This allele was crossed into a (i) let-7ts, (ii) mir-48(–) or (iii) mir-48(–) let-7(++) background, where let-7(++) denotes let-7 overexpression from a single copy integrated transgene. Insets magnify the central part of the animal body to reveal egg retention (arrow), i.e. an egg-laying defective (Egl) phenotype. let-7ts: let-7(n2853) X, temperature-sensitive lesion, grown at the restrictive temperature 25°C; mir-48(–): mir-48(n4097) V; mir-48/241/84(–): mir-48/mir-241(ndf51) V, mir-84(n4037) X.

Consistent with our results for the let-7 reporters, the specificity of the ‘unc-54 + miR-48 sites’ reporter was largely lost when we modified it to contain perfect seed matches: the resulting ‘unc-54 + miR-48 sites_perfect seed match’ reporter continued to be silenced extensively in both let-7ts and mir-48/241/84(–) animals (Figure 3C). However, silencing appeared marginally impaired in the absence of the let-7 sisters (Figure 3C), mirroring an analogous result for the ‘unc-54 + let-7 sites_perfect seed match’ reporter in let-7ts animals (Figure 2D). We conclude that the imperfect seed match and the extensive 3′ pairing are both important determinants for the robust target specificity of the lin-41 sites. A G:U wobble base-pair in a peripheral seed match location promotes miRNA specificity The duplexes formed between let-7 and lin-41 contain a bulge between nucleotides 4–5 in LCS1 and a G:U wobble base-pair at position 6 in LCS2 (Figure 1C). We wondered if such centrally located ‘imperfections’ were required for specificity. We turned to the miRNA binding site in the dot-1.1 3′UTR, which had been shown to be specific to miR-48 (9). Broughton et al. found that substitution of the let-7 complementary sites in the endogenous lin-41 3′UTR by two copies of the dot-1.1 site rendered animals insensitive to loss of let-7 (9), but made them depend on the presence of miR-48. This finding was attributed to the fact that the site features an extensive seed-distal match to miR-48 (Figure 3D and Supplementary Figure S2C). However, we noticed that the let-7 family/dot-1.1 predicted duplexes exhibited not only perfect Watson–Crick pairing from nucleotides 2–7, but also a G:U wobble pair at position 8 (Supplementary Figure S2C). 
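The pairing terms used here (Watson–Crick pair, G:U wobble, seed positions 2–8) can be made concrete with a minimal Python sketch. It assumes the commonly reported C. elegans let-7 sequence, which should be checked against a sequence database before reuse. The target strings are hypothetical, built only for illustration, and are not the actual lin-41 or dot-1.1 sites. Bulges are not modeled, since they would require a gapped alignment.

```python
# Commonly reported C. elegans let-7 sequence, 5'->3' (assumption: verify before reuse).
LET7 = "UGAGGUAGUAGGUUGUAUAGUU"

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def classify_pair(mirna_nt, target_nt):
    """Label one miRNA/target nucleotide pair as 'WC', 'GU' or 'mismatch'."""
    pair = (mirna_nt, target_nt)
    if pair in WATSON_CRICK:
        return "WC"
    if pair in WOBBLE:
        return "GU"
    return "mismatch"


def seed_pairing(mirna, target_3to5, start=2, end=8):
    """Classify pairing at miRNA positions start..end (1-based, inclusive).

    target_3to5: target nucleotides written 3'->5', so that character k
    faces miRNA position start + k (ungapped, hence no bulges).
    """
    window = mirna[start - 1:end]
    if len(target_3to5) != len(window):
        raise ValueError("target must cover exactly the seed window")
    return {
        pos: classify_pair(m, t)
        for pos, (m, t) in enumerate(zip(window, target_3to5), start=start)
    }


def imperfections(mirna, target_3to5):
    """Return only the seed positions that are not Watson-Crick paired."""
    pairing = seed_pairing(mirna, target_3to5)
    return {pos: kind for pos, kind in pairing.items() if kind != "WC"}


# Hypothetical target sites (3'->5'), derived from the let-7 seed for illustration.
perfect_site = "CUCCAUC"        # full complement of positions 2-8
peripheral_gu_site = "CUCCAUU"  # U opposite miRNA G at position 8
central_gu_site = "CUCCGUC"     # G opposite miRNA U at position 6

for site in (perfect_site, peripheral_gu_site, central_gu_site):
    print(site, imperfections(LET7, site))
```

Because the let-7 family members share the seed, the same classification applies to any sister miRNA over positions 2–8; differences between family members arise only in the seed-distal region, which this sketch does not inspect.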
Although hexameric seed match sites, with complementarity to nucleotides 2–7, are considered canonical and functional (2), genome-wide studies also suggested that they are less functional than heptameric sites that match nucleotides 2–8 (6,32,33). Since G:U wobble base pairs elsewhere in seed-seed match duplexes appear detrimental to silencing (3,4,34–36), we wondered if this ‘peripheral G:U’ in seed match position 8 might affect silencing and specificity. To test this hypothesis, we modified the endogenous target sites in lin-41 to those of dot-1.1, but with the G:U wobbles at positions 8 converted to Watson-Crick G:C pairs, yielding allele lin-41(xe76[dot-1.1_W-C]) (Supplementary Figure S2D). We then compared the reliance of this and the lin-41(ap427[dot-1.1_G:U]) strain, which carried the unmodified G:U-wobble-containing dot-1.1 sites, on let-7 and miR-48 for survival. Whereas both strains were insensitive to loss of let-7 (Figure 3E(i) and (9)), lin-41(ap427[dot-1.1_G:U]) but not lin-41(xe76[dot-1.1_W-C]) required miR-48 for survival into adulthood (Figure 3D and E(ii)). We conclude that the G:U wobble at position eight repels binding by all let-7 family members such that only miR-48 can exert repression by compensating through extensive complementarity of its 3′ seed-distal sequence. Collectively, our data thus reveal that bulges or wobbles in different positions of a seed match can serve to avoid redundancy of the let-7 family and confer strong target specificity. miRNA abundance affects silencing in vivo Although our experiments provided strong evidence that seed mismatches are required for robust specificity among let-7 family members, we consistently observed evidence of residual specificity even for targets that contained a perfect seed match. 
In target reporters containing perfect seed matches, we observed modest but reproducible de-silencing specifically when the family member with the seed-distal match was lost (Figures 2D and 3C), and this was mirrored at the phenotypic level (Figure 3E(ii)). In fact, although lin-41(xe76[dot-1.1_W-C]); mir-48(–) animals survived into adulthood, they exhibited an egg-laying (Egl) defect (Figure 3E (ii), 93%, n = 132), i.e. a partial vulval dysfunction that is consistent with incomplete repression of lin-41 (16). We wondered if this partial specificity could be overridden by increased levels of another miRNA family member. Since we were unable to overexpress mir-48, we tested this possibility by overexpressing let-7. Mos1-mediated single copy integration (25) of a genomic fragment, known to rescue let-7 lethality (15), to a locus on chromosome V that is ∼5 cM apart from mir-48, yielded a ∼2-fold increase in expression levels (data not shown). Consistent with our hypothesis, lin-41(xe76[dot-1.1_W-C]) animals that over-expressed let-7 were no longer Egl in the absence of miR-48 (Figure 3E(iii), compare to E(ii)). We conclude that, in vivo, increased miRNA levels can override the specificity imparted by seed-distal pairing. Seed match imperfections maintain specificity upon miRNA overexpression Since the modest preferential silencing imposed by the seed-distal pairing to miR-48 could be overcome by increasing the levels of let-7 in the presence of a perfect seed match (Figure 3E (ii) and (iii)), we wondered about the effect of let-7 over-expression on sites with more extensive target specificity. Hence, we examined two reporters specific to miR-48 that harbored imperfect seed matches: the previous ‘unc-54 + miR-48 sites’ (Figures 3B and 4A) and the new ‘unc-54 + dot-1.1 sites’ reporter, obtained by inserting two copies of the binding sites from the dot-1.1 3′UTR (Figure 4B). 
Consistent with the in vivo data ((9) and Figure 3E), silencing of both reporters was dependent on miR-48 but not let-7 (Figures 3B, 4A, B and Supplementary Figure S2E). However, the response of the two reporters differed when we overexpressed let-7 in the absence of miR-48. The ‘unc-54 + miR-48 sites’ reporter, with central seed mismatches, was insensitive to a doubling of let-7 expression (Figure 4A). By contrast, silencing of the ‘unc-54 + dot-1.1 sites’ reporter, with peripheral seed mismatches, was restored to almost wild-type level in the same conditions (Figure 4B). This suggests that for miR-48 targets with extensive seed-distal pairing, sensitivity to let-7 levels depends on seed match quality.

Figure 4. Robust miRNA specificity relies on imperfect seed matches. (A, B) Reporter quantification as in Figure 2, from which the negative control data are also replotted for reference. (A) ‘unc-54 + miR-48 sites’ reporter and (B) ‘unc-54 + dot-1.1 sites’ reporter are assayed in worms of the indicated genotypes. Horizontal line and error bars indicate mean values per condition ± SD, *P < 0.05 and ***P < 0.001, two-tailed unpaired t-test. (C) Representative image of a viable lin-41(ap427[dot-1.1_G:U]); mir-48(–) let-7(++) animal. (D) Progeny (n = 99) derived from a cross of lin-41(xe99[48-ized]) with mir-48(–) let-7(++) animals were categorized by phenotype and genotyped to determine the viability of lin-41(xe99[48-ized]); mir-48(–) let-7(++) ‘triple homozygous’ mutant animals. (E) Summary of the effect that different site architectures and miRNA abundance have on silencing lin-41 alleles ‘recoded’ towards miR-48. mir-48(–): mir-48(n4097)V; unc-54(CTRL): wild-type unc-54 3′UTR; let-7(++): let-7 over-expression allele (MosSCI, V).

To confirm this result on a functional level, we tested whether let-7 overexpression could suppress the dependence on miR-48 of animals carrying lin-41 alleles analogous to those in the miR-48-specific reporters, namely the lin-41(ap427[dot-1.1_G:U]) allele and the newly generated lin-41(xe99[48-ized]) allele (Figure 4C and D, respectively). As predicted by the reporter assay, overexpression of let-7 rendered lin-41(ap427[dot-1.1_G:U]); mir-48(–) double mutant animals viable, although Egl (Figure 4C). By contrast, we were unable to obtain viable animals of the lin-41(xe99[48-ized])I; mir-48(–) let-7(++)V genotype (Figure 4D). Instead, we readily observed dead animals, which had burst through the vulva. Genotyping revealed that such animals were homozygous for the three alleles of interest, lin-41(xe99[48-ized]), mir-48(–), and let-7(++) (Figure 4D). [Note that mir-48(–) and let-7(++) are closely linked loci on chromosome V, explaining why we did not find dead animals that were lin-41(xe99[48-ized]); mir-48(–) double mutant but lacked the let-7 over-expression transgene.] 
In contrast, randomly selected wild-type animals were never doubly homozygous for lin-41(xe99[48-ized]) and mir-48(–), irrespective of let-7 transgene status, and only one Egl animal was found to be lin-41(xe99[48-ized]); mir-48(–) let-7(++) mutant. Hence, although an increase in let-7 levels can overcome the specificity to miR-48 imposed by seed-distal matches in combination with a perfect seed (Figure 3E) or in the presence of peripheral seed mismatches (Figure 4C), it cannot do so with a central seed bulge or wobble (Figure 4D), at least within the physiological ranges of the expression levels that we tested. We conclude that specificity arises through seed-distal pairing of a miRNA, but that it is enhanced in extent and robustness by appropriate seed match architecture (Figure 4E). Loss of miRNA specificity impairs robust development Our results suggest that sites with central seed match imperfections, such as LCS1 and LCS2 in the lin-41 3′UTR, are extremely specific to one miRNA, even when a paralogue is highly expressed. We suspected that such robust specificity would be physiologically relevant in the case of lin-41. This is because the let-7 sisters are all expressed prior to let-7, in the L2 stage (30). Given their overlapping spatial expression patterns, lack of mechanisms to prevent let-7 sisters’ action on lin-41 might cause inappropriately early repression of lin-41, as speculated previously (2,3). Consistent with this notion, we found that the ‘unc-54 + let-7 sites_perfect seed match’ reporter was precociously repressed during the L3 stage, whereas the ‘unc-54 + let-7 sites’ reporter was still expressed at the same stage (Figure 5A).

Figure 5. Developmental robustness requires an imperfect let-7 seed match in lin-41. (A) Representative confocal images of skin cells of animals carrying an unc-54 3′UTR reporter (top), an ‘unc-54 + let-7 sites’ (center), or an ‘unc-54 + let-7 sites_perfect seed match’ reporter (bottom). 
At the L3 stage, levels of miR-48 but not let-7 are already high (39). Scale bars 15 μm. (B) Microscopy images of the skin of late L3 worms expressing endogenously tagged LIN-29::GFP (xe61) (28) in wild-type and lin-41(xe83[perfect]) background. Cyan arrowheads point to LIN-29 signal in seam cells, magenta arrows to LIN-29 accumulation in hyp7 cells. Images in the middle are inverted to increase clarity. Worms are staged according to the position of the distal tip cell (green) and gonad length. Scale bars 15 μm. (C) Representative images of wild-type (n = 27) or lin-41(xe83[perfect]) (n = 36) animals treated with hbl-1 RNAi. Percentages of animals with the indicated alae status at the L3/L4 transition are indicated. Gonads are outlined to confirm appropriate staging. The strains used, SX346 and HW2144, additionally contain the mjIs15 and wIs51 transgenes. Scale bars 15 μm.

To test whether this precocious repression of lin-41 had physiological consequences, we examined the accumulation of LIN-29A, a target of LIN-41. In wild-type animals, LIN-41 translationally represses LIN-29A until the L4 stage, when repression is released following let-7 accumulation and consequent LIN-41 downregulation (28). Premature loss of LIN-41 activity causes inappropriately early activation of LIN-29A and thereby precocious execution of the so-called larval-to-adult transition, which includes fusion of hypodermal seam cells into a syncytium and secretion of an adult cuticular structure termed alae (17). We observed LIN-29A levels through use of a lin-29(xe61[lin-29::gfp::3xflag]) strain, in which the endogenous lin-29 locus has been edited to produce GFP-tagged LIN-29A and B isoforms, and in which loss of lin-41 activity yields a specific upregulation of only LIN-29A (28). At mid-L3 larval stage, wild-type animals have LIN-29::GFP signal only in their seam cells (Figure 5B). By contrast, animals carrying the lin-41(xe83[perfect]) allele show additional GFP expression in the major hypodermal syncytium, hyp7, at the same developmental stage (Figure 5B). Therefore, precocious downregulation of lin-41(xe83[perfect]) is responsible for premature LIN-29 translation and accumulation in the hypodermis, as described for other lin-41 loss-of-function alleles (17). The lin-41(xe83[perfect]) animals looked superficially wild-type, but the premature upregulation of LIN-29 was sufficient to promote precocious larval-to-adult transition in a sensitized background. Specifically, the transcription factor HBL-1 inhibits larval-to-adult transition, possibly in parallel to LIN-41 (37,38), and its RNAi-mediated depletion causes partially penetrant and partially expressive precocious alae formation (Figure 5C). 
This phenotype was enhanced when we depleted HBL-1 in lin-41(xe83[perfect]) mutant animals, resulting in fully penetrant precocious secretion of alae (although weak or patched in some cases) (Figure 5C). We conclude that loss of specificity of repression by let-7 alone in the lin-41(xe83[perfect]) background impairs the robustness of temporal patterning through premature LIN-29 accumulation. DISCUSSION It has been an open question to what extent and by which mechanisms miRNA family members can function non-redundantly despite a shared seed sequence. Previously, it was proposed that redundancy was the rule (2). Rare occasions of non-redundant function were hypothesized to require targets with both an imperfect seed match and extensive seed-distal pairing to only one specific family member (3). According to this view, the seed match imperfection impairs silencing by all family members but extensive seed-distal pairing can compensate to facilitate silencing by an individual miRNA. However, this hypothesis has remained untested, and recent observations have challenged it by providing evidence that non-redundant target binding appears wide-spread and that seed-distal pairing may suffice to achieve specificity (9,12). Our systematic study through gene editing and fluorescent reporter analysis with cell-type resolution resolves the discrepant views on specificity-promoting features for the let-7 family: We demonstrate that extensive seed-distal pairing to a specific family member suffices to generate a weak but consistent preference for silencing by this family member. However, more robust discrimination requires an imperfect seed match and depends on the quality of such imperfections: a central bulge or G:U wobble base pair, as in the lin-41 3′UTR, confers the strongest specificity, while a peripheral G:U wobble base pair, as in the dot-1.1 3′UTR, gives an intermediate level. 
The physiological importance of extensive, seed-mismatch-dependent specificity is evident from the decreased developmental robustness that results when perfect let-7 seed matches permit promiscuous silencing of lin-41 by the whole let-7 family. Perfect seed matches can still be compatible with selective targeting by individual miRNAs, but the effect depends on miRNA abundance: A moderate increase in let-7 levels (∼2-fold) could overcome the specificity of a binding site that was silenced by miR-48 and had a perfect seed match. However, it only partially did so when the seed match contained a peripheral G:U wobble, and it was insufficient to override sequence-determined specificity when a site contained a central seed match imperfection. This suggests that in vivo, miRNA binding site architecture, particularly seed match quality, and miRNA abundance act together to determine miRNA activity towards individual targets (Figure 6).

Figure 6. miRNA abundance and architecture of the target site determine mRNA silencing. (top) Extensive complementarity (paired 3′) between a miRNA and a target site allows for efficient and specific silencing, independently of the miRNA level and the presence of imperfections in the seed match. (middle) Abundant miRNAs can silence targets carrying a perfect seed match or a nearly-perfect seed match (e.g. a peripheral G:U wobble), even in the absence of complementarity to the sequence outside the seed. A ‘central mismatch’ repels poorly complementary miRNAs. (bottom) Lowly abundant and poorly complementary miRNAs can silence targets carrying a perfect seed match, but not the ones carrying a seed match imperfection (e.g. peripheral G:U or central bulge). Green shading: functional site; pink shading: nonfunctional site. Magenta: seed/seed match.

The finding that miRNA activity is determined at the level of individual targets has implications beyond the issue of miRNA family member specificity. It contrasts with a view where a miRNA is globally either ‘on’ or ‘off’ in a cell, silencing all of its targets at sufficiently high concentrations and none at low ones. Variable, target site-dependent activity was already entertained in the early days of the miRNA field when miRNAs were likened to rheostats, whose activity is adjusted by two features, namely the extent of target site complementarity to the miRNA and miRNA abundance (19). A lack of explicit experimental testing of such context-dependent function (4) and the rising popularity of the ‘seed-match only’ model caused this hypothesis to fade from view. We propose that it is time to revisit the idea of miRNAs functioning as rheostats and subject it to further testing. We note that target validation experiments that rely, as often done, on ectopic miRNA expression appear to make the implicit assumption that miRNAs are uniformly active on their targets. 
However, if the goal of target validation is to provide insights into pathway biology, physiology and/or pathology, our results and those of others (34) strongly suggest that it must be conducted in a relevant physiological context, avoiding ectopic expression or overexpression of miRNAs. Ideally, validation will also involve functional studies such as those offered by direct manipulation of individual miRNA/target interaction through genome editing. We predict that such efforts will reveal a more nuanced picture of dynamic, context-dependent miRNA target repertoires, and thereby improve our understanding of the diversity of biological outcomes that miRNA-mediated gene regulation can achieve in vivo. SUPPLEMENTARY DATA Supplementary Data are available at NAR Online. ACKNOWLEDGEMENTS We thank Kathrin Kunzer and Lan Xu for their help with C. elegans strain generation. We are grateful to Matyas Ecsedi for initial observations on target specificity of let-7 family members. We thank Florian Aeschimann for reagents and helpful discussions and Iskra Katic for worm injections and reagents. We thank Laurent Gelman and Steven Bourke for help with confocal imaging; Roland Nitschke (Life Imaging Center, University of Freiburg, Germany) and Carl Zeiss (Jena, Germany) for sharing the macro for Multiple Position/Tile Imaging acquisitions; Raphael Thierry, Jan Eglinger and Moritz Kirschmann (University of Zurich) for help with image analysis; and Amy Pasquinelli for C. elegans strains. Some strains were provided by the Caenorhabditis Genetics Center (CGC), which is funded by the National Institutes of Health Office of Research Infrastructure Programs (P40 OD010440). We thank Matyas Ecsedi, Sarah Carl, Benjamin Towbin, Iskra Katic and Witold Filipowicz for a critical reading of the manuscript. FUNDING NCCR RNA & Disease funded by the Swiss National Science Foundation; Novartis Research Foundation through the FMI (to H.G.); Boehringer Ingelheim Fonds PhD Fellowship (to G.B.). 
Funding for open access charge: Internal Funds. Conflict of interest statement. None declared.
REFERENCES
1. Krol J., Loedige I., Filipowicz W. The widespread regulation of microRNA biogenesis, function and decay. Nat. Rev. Genet. 2010; 11: 597–610.
2. Bartel D.P. MicroRNAs: target recognition and regulatory functions. Cell. 2009; 136: 215–233.
3. Brennecke J., Stark A., Russell R.B., Cohen S.M. Principles of microRNA-target recognition. PLoS Biol. 2005; 3: e85.
4. Doench J.G., Sharp P.A. Specificity of microRNA target selection in translational repression. Genes Dev. 2004; 18: 504–511.
5. Lai E.C. Micro RNAs are complementary to 3′ UTR sequence motifs that mediate negative post-transcriptional regulation. Nat. Genet. 2002; 30: 363–364.
6. Chandradoss S.D., Schirle N.T., Szczepaniak M., MacRae I.J., Joo C. A Dynamic Search Process Underlies MicroRNA Targeting. Cell. 2015; 162: 96–107.
7. Parker J.S., Parizotto E.A., Wang M., Roe S.M., Barford D. Enhancement of the seed-target recognition step in RNA silencing by a PIWI/MID domain protein. Mol. Cell. 2009; 33: 204–214.
8. Schirle N.T., Sheu-Gruttadauria J., MacRae I.J. Structural basis for microRNA targeting. Science. 2014; 346: 608–613.
9. Broughton J.P., Lovci M.T., Huang J.L., Yeo G.W., Pasquinelli A.E. Pairing beyond the seed supports microRNA targeting specificity. Mol. Cell. 2016; 64: 320–333.
10. Grosswendt S., Filipchyk A., Manzano M., Klironomos F., Schilling M., Herzog M., Gottwein E., Rajewsky N. Unambiguous identification of miRNA:target site interactions by different types of ligation reactions. Mol. Cell. 2014; 54: 1042–1054.
11. Helwak A., Kudla G., Dudnakova T., Tollervey D. Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell. 2013; 153: 654–665.
12. Moore M.J., Scheel T.K., Luna J.M., Park C.Y., Fak J.J., Nishiuchi E., Rice C.M., Darnell R.B. miRNA-target chimeras reveal miRNA 3′-end pairing as a major determinant of Argonaute target specificity. Nat. Commun. 2015; 6: 8864.
13. Roush S., Slack F.J. The let-7 family of microRNAs. Trends Cell Biol. 2008; 18: 505–516.
14. Abbott A.L., Alvarez-Saavedra E., Miska E.A., Lau N.C., Bartel D.P., Horvitz H.R., Ambros V. The let-7 MicroRNA family members mir-48, mir-84, and mir-241 function together to regulate developmental timing in Caenorhabditis elegans. Dev. Cell. 2005; 9: 403–414.
15. Reinhart B.J., Slack F.J., Basson M., Pasquinelli A.E., Bettinger J.C., Rougvie A.E., Horvitz H.R., Ruvkun G. The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature. 2000; 403: 901–906.
16. Ecsedi M., Rausch M., Großhans H. The let-7 microRNA directs vulval development through a single target. Dev. Cell. 2015; 32: 335–344.
17. Slack F.J., Basson M., Liu Z., Ambros V., Horvitz H.R., Ruvkun G. The lin-41 RBCC gene acts in the C. elegans heterochronic pathway between the let-7 regulatory RNA and the LIN-29 transcription factor. Mol. Cell. 2000; 5: 659–669.
18. Vella M.C., Choi E.Y., Lin S.Y., Reinert K., Slack F.J. The C. elegans microRNA let-7 binds to imperfect let-7 complementary sites from the lin-41 3′UTR. Genes Dev. 2004; 18: 132–137.
19. Bartel D.P., Chen C.Z. Micromanagers of gene expression: the potentially widespread influence of metazoan microRNAs. Nat. Rev. Genet. 2004; 5: 396–400.
20. Frokjaer-Jensen C., Davis M.W., Ailion M., Jorgensen E.M. Improved Mos1-mediated transgenesis in C. elegans. Nat. Methods. 2012; 9: 117–118.
21. Frokjaer-Jensen C., Davis M.W., Hopkins C.E., Newman B.J., Thummel J.M., Olesen S.P., Grunnet M., Jorgensen E.M. Single-copy insertion of transgenes in Caenorhabditis elegans. Nat. Genet. 2008; 40: 1375–1383.
22. Gibson D.G., Young L., Chuang R.Y., Venter J.C., Hutchison C.A. 3rd, Smith H.O. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods. 2009; 6: 343–345.
23. Zheng L., Baumann U., Reymond J.L. An efficient one-step site-directed and site-saturation mutagenesis protocol. Nucleic Acids Res. 2004; 32: e115.
24. Katic I., Xu L., Ciosk R. CRISPR/Cas9 genome editing in Caenorhabditis elegans: evaluation of templates for homology-mediated repair and knock-ins by homology-independent DNA repair. G3 (Bethesda). 2015; 5: 1649–1656.
25. Frokjaer-Jensen C., Davis M.W., Sarov M., Taylor J., Flibotte S., LaBella M., Pozniakovsky A., Moerman D.G., Jorgensen E.M. Random and targeted transgene insertion in Caenorhabditis elegans using a modified Mos1 transposon. Nat. Methods. 2014; 11: 529–534.
26. Schindelin J., Arganda-Carreras I., Frise E., Kaynig V., Longair M., Pietzsch T., Preibisch S., Rueden C., Saalfeld S., Schmid B. et al. Fiji: an open-source platform for biological-image analysis. Nat. Methods. 2012; 9: 676–682.
27. Mok D.Z., Sternberg P.W., Inoue T. Morphologically defined sub-stages of C. elegans vulval development in the fourth larval stage. BMC Dev. Biol. 2015; 15: 26.
28. Aeschimann F., Kumari P., Bartake H., Gaidatzis D., Xu L., Ciosk R., Großhans H. LIN41 post-transcriptionally silences mRNAs by two distinct and position-dependent mechanisms. Mol. Cell. 2017; 65: 476–489.
29. Vella M.C., Reinert K., Slack F.J. Architecture of a validated microRNA::target interaction. Chem. Biol. 2004; 11: 1619–1623.
30. Vadla B., Kemper K., Alaimo J., Heine C., Moss E.G. lin-28 controls the succession of cell fate choices via two distinct activities. PLoS Genet. 2012; 8: e1002588.
31. Martinez N.J., Ow M.C., Reece-Hoyes J.S., Barrasa M.I., Ambros V.R., Walhout A.J. Genome-scale spatiotemporal analysis of Caenorhabditis elegans microRNA promoter activity. Genome Res. 2008; 18: 2005–2015.
32. Baek D., Villen J., Shin C., Camargo F.D., Gygi S.P., Bartel D.P. The impact of microRNAs on protein output. Nature. 2008; 455: 64–71.
33. Grimson A., Farh K.K., Johnston W.K., Garrett-Engele P., Lim L.P., Bartel D.P. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol. Cell. 2007; 27: 91–105.
34. Didiano D., Hobert O. Perfect seed pairing is not a generally reliable predictor for miRNA-target interactions. Nat. Struct. Mol. Biol. 2006; 13: 849–851.
35. Lai E.C., Tam B., Rubin G.M. Pervasive regulation of Drosophila Notch target genes by GY-box-, Brd-box-, and K-box-class microRNAs. Genes Dev. 2005; 19: 1067–1080.
36. Wolter J.M., Le H.H., Linse A., Godlove V.A., Nguyen T.D., Kotagama K., Lynch A., Rawls A., Mangone M. Evolutionary patterns of metazoan microRNAs reveal targeting principles in the let-7 and miR-10 families. Genome Res. 2017; 27: 53–63.
37. Abrahante J.E., Daul A.L., Li M., Volk M.L., Tennessen J.M., Miller E.A., Rougvie A.E. The Caenorhabditis elegans hunchback-like gene lin-57/hbl-1 controls developmental time and is regulated by microRNAs. Dev. Cell. 2003; 4: 625–637.
38. Lin S.Y., Johnson S.M., Abraham M., Vella M.C., Pasquinelli A., Gamberi C., Gottlieb E., Slack F.J. The C elegans hunchback homolog, hbl-1, controls temporal patterning and is a probable microRNA target. Dev. Cell. 2003; 4: 639–650.
39. Esquela-Kerscher A., Johnson S.M., Bai L., Saito K., Partridge J., Reinert K.L., Slack F.J. Post-embryonic expression of C. elegans microRNAs belonging to the lin-4 and let-7 families in the hypodermis and the reproductive system. Dev. Dyn. 2005; 234: 868–877.
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
G-quadruplexes and G-quadruplex ligands: targets and tools in antiviral therapy
Ruggiero, Emanuela; Richter, Sara N
doi: 10.1093/nar/gky187 pmid: 29554280
Abstract G-quadruplexes (G4s) are non-canonical nucleic acid secondary structures that form within guanine-rich strands of regulatory genomic regions. G4s have been extensively described in the human genome, especially in telomeres and oncogene promoters; in recent years the presence of G4s in viruses has attracted increasing interest. Indeed, G4s have been reported in several viruses, including those involved in recent epidemics, such as the Zika and Ebola viruses. Viral G4s are usually located in regulatory regions of the genome and implicated in the control of key viral processes; in some cases, they have also been implicated in viral latency. In this context, G4 ligands have been developed and tested both as tools to study the complexity of G4-mediated mechanisms in the viral life cycle, and as therapeutic agents. In general, G4 ligands showed promising antiviral activity, with G4-mediated mechanisms of action both at the genome and transcript level. This review aims to provide an updated close-up of the literature on G4s in viruses. The current state of the art of G4 ligands in antiviral research is also reported, with particular focus on the structural and physicochemical requirements for optimal biological activity. The achievements and the to-dos in the field are discussed. INTRODUCTION G-quadruplexes (G4s) are nucleic acid secondary structures that can form within DNA (1) or RNA (2) guanine (G)-rich strands, when two or more G-tetrads stack on top of each other and coordinate monovalent cations, such as K+ and Na+. Each tetrad is composed of four G residues that are linked by the sugar–phosphate backbone and connected through Hoogsteen-type hydrogen bonds. G4s are highly polymorphic structures whose topology can be influenced by variations in strand stoichiometry and polarity, as well as by the nature and length of loops and their location in the sequence.
G4s can fold intramolecularly from a single G-rich strand, or intermolecularly through dimerization or tetramerization of separate filaments: research on biologically relevant G4s has mainly focused on monomolecular G4s (3,4); however, intermolecular G4s are gaining increasing attention (5–7). Strand orientation defines the parallel, antiparallel or mixed topology of G4s, which is directly correlated to the conformational state, anti or syn, of the glycosidic bond between the G base and the sugar (1). The anti conformation characterizes a parallel folding, while antiparallel G4s are found to adopt both syn and anti orientations (8) (Figure 1). While RNA G4s are mostly locked in a parallel conformation due to the 2′-hydroxyl group in the sugar which exclusively allows the anti orientation (2), DNA G4s are in principle characterized by higher topological diversity, even though the majority of DNA G4s examined so far adopt the parallel topology.
Figure 1. The G-quadruplex structure. (A) Chemical structure (left) and schematic illustration (right) of a G-tetrad, composed of four guanines linked together through Hoogsteen H-bonds (red dashed lines); M+ represents the monovalent cation coordinated at the center of the tetrad. (B) Different topologies of intramolecular G4 structures.
Computational analysis using different algorithms (9,10) indicated that between 300 000 and around 3 000 000 potential G4-forming sequences may form in the human genome, correlated with specific gene functions (11).
These data have been corroborated by the ‘G4-seq’ high-throughput sequencing method, which identified about 700 000 G4s (12). However, mapping of G4s in chromatin by G4 ChIP-sequencing with an anti-G4 antibody (13) or footprinting (14) retrieved only about 10 000 G4s in highly transcribed regulatory nucleosome-depleted chromatin regions. These data indicate that G4s are mostly suppressed in chromatin and that, in turn, they may influence the occupancy and positioning of nucleosomes. In general, G4 sequences are non-randomly distributed but mainly clustered in pivotal genomic regions, namely telomeres, gene promoters and DNA replication origins (15). Moreover, putative G4-forming sequences have been found in coding and non-coding regions of the human transcriptome, i.e. open reading frames and untranslated regions (UTRs), and in the telomeric repeat-containing RNA (2). This evidence suggests that G4s are likely involved in the regulation of different biological pathways such as replication, transcription, translation and genome instability. In the past years, the resolution of G4 structures (16–18) and the employment of novel visualization approaches (19–21) helped researchers to validate the previous computational predictions, disclosing new aspects of the multifaceted G4 world, e.g. the effective occurrence of G4s within patient-derived cancer tissues (22) or the key role in the pathogenesis of two incurable neurodegenerative diseases, amyotrophic lateral sclerosis and frontotemporal dementia (23). Indeed, the presence of G4s in the human genome and their potential in disease modulation have been extensively investigated, resulting in many good and exhaustive reviews focused on G4 structures (1,8,24,25) and their biological role, particularly in telomeres (26–29) and oncogene promoters (30–35).
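As an illustration of how putative G4-forming sequences are typically counted computationally, the following minimal Python sketch scans both strands of a sequence for the widely used consensus motif G3+N1–7G3+N1–7G3+N1–7G3+ (four runs of at least three guanines separated by loops of one to seven nucleotides). This motif is an assumption made here for illustration; the algorithms cited above may apply different rules and thresholds, which is one reason the reported counts differ so widely. The names `G4_MOTIF`, `reverse_complement` and `find_pqs` are hypothetical helpers, not part of any published tool.

```python
import re

# Assumed consensus motif for a putative quadruplex sequence (PQS):
# four G-runs (>= 3 G each) separated by three loops of 1-7 nucleotides.
G4_MOTIF = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

# Translation table for complementing a DNA sequence.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_pqs(seq):
    """Return non-overlapping motif hits as (strand, start, matched_sequence).

    RNA input is accepted by converting U to T. For the '-' strand, the start
    coordinate refers to the reverse-complemented sequence, not the input.
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for match in G4_MOTIF.finditer(s):
            hits.append((strand, match.start(), match.group(0)))
    return hits


# Usage: the human telomeric repeat matches on the given strand only.
telomeric = "GGGTTAGGGTTAGGGTTAGGG"
print(find_pqs(telomeric))
```

A sequence match of this kind only flags a candidate: as noted above, far fewer G4s are detected in chromatin than are predicted from sequence, so hits require biophysical or in-cell validation.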
G-QUADRUPLEXES IN VIRUSES: PRESENCE AND FUNCTION Besides humans, putative G4-forming sequences have been found in other mammalian genomes (36), yeasts (37), protozoa (38), bacteria (39,40) and viruses, therefore implicating G4s in many human infectious diseases. One review was published in 2015 on the possible role of G4s in the antigenic variation systems of bacteria and protozoa and silencing of two viruses (41). The possible role of G4s in viruses and the use of G4-forming oligonucleotides as antiviral agents were discussed in 2014 (42). Since the number of reports describing the presence of G4s in virus genomes has boomed in the past 2 years and treatment with several G4 ligands has shown potentially interesting therapeutic activity, we here aim at presenting, organizing and discussing an up-to-date close-up of the literature on G4s in viruses and the classes of molecules that have shown antiviral activity by viral G4 targeting. In particular, we first focus on the presence and proposed function of G4s in virus genomes. Next, we present the classes of G4 ligands that have reported successful antiviral activity, with special emphasis on the structural and physicochemical properties that characterize the viral G4/G4 ligand interaction. A general simplified virus life cycle is schematically depicted in Figure 2; a summary of the viruses in which G4s have been reported and of the corresponding G4s is shown in Figure 3. Since the use of G4-forming oligonucleotides as antiviral agents has been more recently addressed by Musumeci et al. (43,44), this topic has not been considered in the present review.
Figure 2. Schematic representation of the viral life cycle. The virus recognizes and binds the host cell surface receptors (step 1) to enter the cell (step 2).
After penetration, the viral genome is uncoated (step 3) and its DNA or RNA nature determines where and how the genome is replicated (step 4): most DNA viruses replicate in the cell nucleus, while the majority of RNA viruses replicate in the cytoplasm of infected cells. After viral mRNA production, viral proteins are expressed in the cytoplasm (steps 5–6). The newly synthesized viral genomes and proteins are then assembled into new virions (step 7), which are released outside the cell (step 8).
Figure 3. Summary of G4s reported in viral genomes. For each virus the following information is shown: virion structure and dimension, genome size and organization; schematic representation of the G4 (red dots) location in the viral genomes or in the mRNA and G4 binding proteins; number of G4s assessed through bioinformatics analysis, according to the corresponding references; G4 ligands reported to date to display antiviral effect and corresponding references.
Human immunodeficiency virus The human immunodeficiency virus (HIV) is the etiological agent of the acquired immune deficiency syndrome (AIDS), which to date affects more than 35 million people worldwide. Although current anti-retroviral therapy keeps disease progression under control, people still die from HIV-related causes; it is therefore necessary to find alternative and effective antiviral targets. HIV belongs to the Retroviridae family; the single-stranded RNA genome is processed by the viral reverse transcriptase and the newly formed double-stranded DNA is integrated into the host cell chromosomes to form the proviral genome, from which viral mRNAs and new genomes are transcribed. The search for G4s in the HIV-1 genome has been quite productive, concerning not only the two RNA viral genome copies, but also the integrated proviral genome, specifically in the long terminal repeat (LTR) promoter region (45–47) and in the nef coding region (48), as reviewed by Metifiot et al. (42). Briefly, the LTR promoter is characterized by a highly conserved G-rich sequence in the U3 region, corresponding to Sp1 and NF-κB binding sites, where three mutually exclusive G4 structures can form, i.e. LTR-II, LTR-III and LTR-IV (46). LTR-IV is a parallel G4 with a bulge at its 3′-end, as ascertained by nuclear magnetic resonance (NMR) characterization (49). LTR-III and LTR-IV exert opposite effects on LTR promoter activity, which is silenced when LTR-III is folded and enhanced by LTR-IV stabilization (49).
In addition, the LTR G4 region is under the control of two nuclear proteins: nucleolin, which upon binding increases LTR G4 stability and thus silences transcription (50) and the heterogeneous nuclear ribonucleoprotein (hnRNP) A2/B1, which unwinds the LTR region, decreasing its promoter activity (51): these data suggest that the balance between G4s acts as a regulatory mechanism in HIV-1 promoter activity. Interestingly, G4-forming sequences are present in the LTR promoter of all primate lentiviruses and display binding sites for transcription factors that are related to G4 regulation (52,53), supporting a role for G4s as crucial control elements for viral transcription, conserved throughout evolution (54). G4s were also found in the U3 region of the HIV-1 RNA genome, where multiple highly stable parallel G4s can form (55). RNA sequences can dimerize through an intermolecular G4 interaction (56), suggesting that the U3 region could represent an additional point of contact between the two viral genome copies. Additionally, such RNA G4s likely contribute to the observed increased genetic recombination rate in the U3 (57). Nef, a viral accessory protein, is an essential factor in proviral DNA synthesis (58) and in the establishment of a persistent state of infection (59). Its coding region is located at the 3′-end of the viral genome and partially overlaps with the 3′-LTR. Three G4 sequences have been identified in the most conserved region of the gene (48). G-rich sequences able to form G4s were reported in the HIV central DNA flap overlapping positive-strand and were found to protect the pre-integrated genome from nuclease degradation (60). Stabilization of HIV G4s by small molecules showed antiviral effects at different levels: G4 ligand binding to DNA LTR G4s decreased viral transcription, while binding to RNA LTR G4s inhibited the reverse transcription process, leading in both cases to strong antiviral effects (46,55,61).
G4 ligand-mediated stabilization of the nef G4s induced nef-dependent antiviral activity (48). Very recently, G4 stabilizing agents were also employed in cells infected with latent HIV-1, where their activity resulted in a strong antiviral effect, especially in combination with a DNA repair inhibitor, revealing new aspects of HIV-1 latent infection (62). The specific molecules that were used as anti-HIV-1 agents are discussed in the ‘Antiviral G4 ligands’ section of this review. Herpesviruses Herpesviridae is a large family of viruses with long linear double-stranded DNA genomes. Among the nine herpesvirus species that can infect humans, at least five are extremely widespread, i.e. herpes simplex virus 1 and 2 (HSV-1 and HSV-2), varicella zoster virus, Epstein–Barr virus (EBV) and cytomegalovirus, which cause orolabial and genital herpes (63), chickenpox and shingles (64), mononucleosis (65) and some cancers (66). More than 90% of adults have been infected with at least one of these (63). Herpesviruses also tend to display latent, recurring infections, with the virus remaining in some part of the infected organism and typically maintaining its genome as extrachromosomal nuclear episome (67). Recent genome-wide bioinformatics analysis revealed an impressively high density of putative G4-forming sequences in all herpesvirus species (68). Indeed, the presence of G4s has been experimentally reported for HSV-1, EBV, Kaposi’s sarcoma associated herpesvirus (KSHV) and human herpesvirus 6 (HHV-6). HSV-1 establishes life-long persistent infections with a viral lifecycle that involves latency and reactivation/lytic replication. More than half of the world population suffers from HSV infections, the outcome of which may become severe in immunocompromised patients. Anti-HSV-1 therapy can be very effective; however, the emergence of drug-resistant viral strains urges the discovery of anti-herpetic drugs with innovative mechanisms of action. 
The HSV-1 genome, characterized by 68% GC-content, was found to contain numerous and highly stable G4-forming sequences that are mainly located in the repeated regions (69). These HSV-1 G4s, visualized through a G4-specific antibody in infected cells at different time points post-infection, were shown to form in a virus cycle-dependent fashion: viral G4s form massively in the cell nucleus during viral replication, and localize in different cell compartments according to the viral genome movements (70). EBV is associated not only with the well-known infectious mononucleosis, but also with a wider spectrum of illnesses, including several lymphoid malignancies. Studies on the presence and role of G4s in EBV showed that the genome maintenance protein EBV-encoded nuclear antigen-1 (EBNA1) stimulates viral DNA replication by recruiting the cellular origin recognition complex through an interaction with RNA G4s (71). The EBNA1 mRNA itself is rich in G clusters able to fold into parallel G4s, which behave as cis-acting regulators of viral mRNA translation, producing ribosome dissociation. G4s in EBNA1 mRNA have been shown to modulate the endogenous presentation of EBNA1-specific CD8+ T-cell epitopes, which are involved in persistent infections (72). The cellular protein nucleolin counteracts this mechanism by interacting with EBNA1 mRNA G4s and thus downregulating EBNA1 protein expression and antigen presentation (73,74). G4s can also be observed in the mRNAs of other genome maintenance proteins that are known to regulate their self-synthesis, suggesting that G4s are exploited as structural regulatory elements by the virus (75). KSHV is the etiological agent of all forms of Kaposi’s sarcoma and numerous other lymphoproliferative disorders, which mostly concern AIDS patients, and at the moment, no treatments for the lytic or latent infections are available (76).
The KSHV genome is organized in a 137 kb long unique region, flanked by the terminal repeats, which are rich in G residues and able to form stable G4s, both in the forward and reverse strands (77). HHV-6 is a ubiquitous virus that infects almost 100% of the human population. The diseases associated with HHV-6 include the febrile illness roseola infantum, also known as the sixth childhood eruptive disease (78). Reactivation of HHV-6 in immunosuppressed individuals is associated with adverse clinical outcomes, comprising life-threatening encephalitis or graft rejection in transplant patients (79). The HHV-6 genome presents telomeric regions at its termini, which can integrate into the telomeres of human chromosomes: integration is considered one possible mode of latency (80). Since telomeres can fold into G4s, these structures may be involved in the mechanism of HHV-6 integration. Indeed, stabilization of telomeric G4s by a G4 ligand inhibited HHV-6 chromosomal integration (81). Stabilization of herpesvirus G4s by G4 ligands led to antiviral activity. In HSV-1, inhibition of DNA replication and reduction of late viral transcripts were observed (69,82). In EBV, a G4 ligand inhibited EBNA1-dependent stimulation of viral DNA replication (71) and EBNA1 synthesis (75). In contrast, another G4 ligand reduced nucleolin binding to EBNA1 mRNA (75), which in turn resulted in enhanced EBNA1 synthesis and antigen presentation (73,74). Treatment of latently infected cells with G4 stabilizing compounds proved to negatively regulate viral replication, leading to a reduction in the KSHV genome copies (77). G4 ligands used against herpesviruses are discussed in the ‘Antiviral G4 ligands’ section of this review. Other viruses DNA viruses The human papillomavirus (HPV) is a double-stranded DNA virus that can cause skin and genital warts and some types of cancer. 
Its genome displays several G-rich sequences: stable G4s form in only eight out of 120 identified HPV types; however, the G4-forming HPVs include some of the most high risk HPV types, responsible for the majority of cases of cervical cancer (83,84). The Hepatitis B virus (HBV) is a partially double-stranded DNA virus, the best known member of the Hepadnaviridae family. It causes the hepatitis B disease, which may lead to cirrhosis and hepatocellular carcinoma. A single putative G4-forming sequence was discovered in the promoter region of the preS2/S gene in HBV genotype B and was found to fold into an intramolecular hybrid G4 structure. Surprisingly, the G4 acted as a positive regulator of HBV transcription, as revealed by luciferase reporter assays (85). Adeno-associated viruses (AAV) are single-stranded DNA viruses of the Parvoviridae family. AAV are not currently linked to human diseases and have been used as delivery vectors for gene therapy. A recent study reported the presence of G4s in the AAV genome. The DNA binding protein nucleophosmin (NPM1), which is known to enhance AAV infectivity, directly interacts with G4s: 18 putative G4s were identified, located within the inverted terminal repeat region (86). RNA viruses Amongst RNA viruses, G4 putative sequences have been identified in three positive and single-stranded ones, namely the severe acute respiratory syndrome coronavirus (SARS-CoV), the Hepatitis C virus (HCV) and the Zika virus (ZIKV). The SARS-CoV belongs to the family of Coronaviridae; its genome is about 29.7 Kb, which is one of the largest among RNA viruses. It has been identified after a massive outbreak in 2003 and is considered one of the most pathogenic coronaviruses in humans. Within the non-structural protein 3, the so-called SARS unique domain (SUD), which plays an essential role in viral replication and transcription, was found to preferentially bind G4-forming oligonucleotides (87,88). 
These may be found in the 3′-non-translated regions of mRNAs coding for host-cell proteins involved in apoptosis or signal transduction; therefore, it has been proposed that SUD/G4 interaction may be involved in controlling the host cell’s response to the viral infection. The HCV belongs to the Flaviviridae family; it can cause both acute and chronic hepatitis, possibly leading to cirrhosis and liver cancer. Bioinformatics and biophysical analyses demonstrated the existence of two highly conserved G4 sequences in the C gene of HCV (89). The ZIKV is also included in the family of Flaviviridae. It is transmitted to humans by mosquito bites; while in adults it may cause mild symptoms or even be symptomless, it may be devastating in pregnant women, as it causes microcephaly in the unborn child. Several G4 sequences were discovered in the positive strand of the ZIKV genome: seven of these are conserved within more than 50 flavivirus genomes, suggesting an important role in the life cycle of these viruses. Furthermore, ZIKV presents an additional G4 in the unique 3′-UTR region, crucial for initial viral replication of the negative-sense strand (90). Finally, G4s have been investigated in the Ebola virus (EBOV) and Marburg virus (MARV), two negative and single-stranded RNA viruses belonging to the Filoviridae family. These are deadly pathogens that cause haemorrhagic fever in humans and primates (91). The presence of G4 sequences in the negative strand of EBOV and MARV was assessed by a fluorescent probe (92). Both ZIKV and EBOV went through massive outbreaks in the past three years, which makes them two of the most dangerous agents of viral epidemics of the current decade. In Figure 3, all the viruses in which G4s have been investigated are displayed. The stabilizing G4 ligands tested in some of these viruses are thoroughly described in the section below.
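Putative G4-forming sequences such as those reported above are typically first located by a sequence-pattern search before biophysical validation. The following is a minimal sketch of such a search, assuming the commonly used consensus of four runs of at least three guanines separated by loops of 1 to 7 nucleotides; this motif, and the helper names `normalize`, `reverse_complement` and `find_pqs`, are illustrative assumptions and not the specific algorithms used in the cited studies. RNA input is rewritten in the DNA alphabet, and the reverse complement is provided so that the opposite strand can be scanned as well.

```python
import re

# Assumed consensus motif (not taken from the cited studies): four G-runs of
# length >= 3 separated by loops of 1-7 nt. The lookahead lets finditer report
# a match at every start position, including overlapping ones.
_PQS = re.compile(r"(?=(G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}))")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalize(seq):
    """Upper-case the sequence and write RNA (U) in the DNA alphabet (T)."""
    return seq.upper().replace("U", "T")


def reverse_complement(seq):
    """Return the reverse complement so the opposite strand can be scanned."""
    return normalize(seq).translate(_COMPLEMENT)[::-1]


def find_pqs(seq):
    """Return (start, motif) for each start position where the motif matches."""
    return [(m.start(), m.group(1)) for m in _PQS.finditer(normalize(seq))]


if __name__ == "__main__":
    telomeric = "TTAGGGTTAGGGTTAGGGTTAGGG"  # four human telomeric repeats
    print(find_pqs(telomeric))
    # A C-rich strand is scanned through its reverse complement.
    print(find_pqs(reverse_complement("CCCTAACCCTAACCCTAACCC")))
```

A pattern match only flags a candidate: whether the sequence actually folds, and into which topology, still has to be established experimentally, for example by circular dichroism or polymerase stop assays as described for the individual viruses.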
ANTIVIRAL G4 LIGANDS: DEVELOPMENT, ANTIVIRAL ACTIVITY AND MECHANISM In the past few years much effort has been directed toward the design of small molecules able to target G4s, leading to very promising potential therapeutics, especially against cancer. Several updated reviews describe the use of G4 ligands that target telomeres and oncogenes to treat cancer (8,34,93–95). Despite the considerable achievements in antiviral research, viral infections still represent a major global threat for human health, causing significant morbidity and mortality. The recurrent onset of drug-resistant pathogens, combined with the fact that the majority of viruses still lack a specific vaccine, urges the development of novel therapeutic approaches for the management of viral diseases. To this end, G4 ligands provide both compounds with an innovative mechanism of action in antiviral treatment and valuable tools to better understand virus mechanisms. In the section below G4 ligands reported to exert antiviral activity have been grouped based on the chemical nature of their core. A description of their discovery, general G4 binding activity and biological effects in cells is initially provided. Antiviral properties, activity and selectivity are then discussed. BRACO-19 The N,N'-(9-((4-(dimethylamino)phenyl)amino)acridine-3,6-diyl)bis(3-(pyrrolidin-1-yl)propan-amide), labeled BRACO-19 (B19) (1, Figure 4), is to date one of the most studied G4 ligands. It is the outcome of a complex and thorough medicinal chemistry investigation that started with the introduction of an acridine moiety as a new chromophore in the research of G4 binders. Read and colleagues demonstrated that the acridine core was more active than the previously developed anthraquinone core (96,97), because of the presence of a nitrogen atom in the heterocyclic scaffold that could be protonated at physiological conditions. 
As a result, the electron deficiency in the chromophore was increased, with consequent enhancement of the G4 interaction (98). In-depth structure-activity relationship (SAR) analysis supported by molecular modeling techniques next led to the development of bi- and tri-substituted derivatives (99,100). These classes of compounds are characterized by a central planar pharmacophore that binds G-tetrads through π–π interactions (Figure 5A). Additionally, two side chains functionalized with a tertiary amine moiety are needed to interact with the grooves: the amine group is crucial for activity since it is protonated at physiological pH, while it disrupts the G4 when substituted with bulky residues (101). The 3,6,9-trisubstituted acridines emerged as the most potent compounds among all the possible regioisomeric series that have been evaluated: they proved to act as G4-mediated telomerase inhibitors. B19 showed telomerase inhibition at nanomolar concentration, with higher affinity for G4 with respect to duplex DNA, and lower cytotoxicity when compared to first generation acridines. It induced long-term growth arrest and replicative senescence in the 21NT breast carcinoma cell line and was the first G4 ligand to prove anticancer activity in vivo, against human tumor xenograft models (102,103). Figure 4. Chemical structures of reported G4 ligands with antiviral activity. Figure 5. Solved crystal structures for three of the compounds discussed in the text. (A) Crystal structure of the complex between B19 and the bimolecular human telomeric G4 (PDB ID: 3CE5): each quadruplex contains three planar stacked G-tetrads with the molecule stacking directly onto the 3′ end quartet (136). (B) Crystal structure of the complex between TMPyP4 and the bimolecular human telomeric G4 (PDB ID: 2HRI). 
The compound binds by stacking onto the TTA nucleotides, as part of the external loop or at the 5′ region of the stacked quadruplex (105). (C) Crystal structure of the complex between PhenDC3 and the human c-myc-promoter G4 (PDB ID: 2MGN). The ligand establishes an optimal interaction with the top G-tetrad, while the two N-methyl groups are positioned above the grooves and have minimal contact with the tetraplex (132). The use of B19 in a viral environment was first analyzed in EBV, to investigate the functional and biochemical characteristics of EBNA1. Results showed that B19 stabilized the viral RNA G4 and, during infection, was able to reduce EBV genome copy numbers in Raji cells. It was also found to induce a modest reduction of the transcription levels of EBNA2 and EBNA3A and inhibition of EBNA1-dependent DNA replication. These data indicate that G4-interacting molecules can block functions of EBNA1 that are critical for viral DNA replication (71). 
In the LTR promoter region of the HIV-1 proviral genome, B19 was able to significantly stabilize the naturally occurring G4s, LTR-II and LTR-III, and to induce an additional G4, LTR-IV. In the presence of increasing concentrations of B19, LTR promoter activity was decreased by almost 70% with respect to the untreated control, while no activity was detected in a mutated sequence unable to fold into G4s (46). These results confirmed a G4-mediated mechanism of action. The anti-HIV-1 activity of B19 (IC50 < 7.9 μM) was tested in various cell lines, against different viral strains and was demonstrated to be G4 mediated. Since G4 structures also formed in the pre-integration viral RNA (55), a dual mode of action both at the pre- and post-integration level was proposed (Figure 6). B19 antiviral activity was tested and confirmed in latent HIV-1 infected cells, where the acridine was able to reduce the viral titer to undetectable levels, also in long-term treatment (62). Figure 6. Proposed antiviral mechanism of the G4 ligand B19. In HIV-1 infected cells, B19 exerts a dual mechanism of action: it recognizes and stabilizes the DNA G4s in the 5′-LTR within the proviral genome in the cell nucleus, and it binds to the RNA G4s in the 5′-LTR and 3′-LTR of the viral genome in the cytoplasm: such interactions result in the inhibition of viral direct and reverse transcription. B19 exerted its G4 stabilizing activity also in the HSV-1 genome, where multiple G4s can form. 
Treatment with B19 led to a significant antiviral effect (IC50 = 8 μM), with reduction in viral DNA synthesis and late protein production (69). Moreover, B19 was used in HHV-6A infected cells to evaluate the ability of G4 ligands to impair viral integration in the telomeric region, through stabilization of telomeric G4s. Interestingly, in telomerase expressing cell lines, the frequency of chromosomal integration was reduced by up to 50% upon treatment. However, effects of G4 ligands on HHV-6 replication and gene expression are yet to be discovered (81). Recently, B19 was employed in a luciferase reporter assay to analyze the role of G4s in HBV, where it enhanced promoter activity, suggesting a positive regulatory role of G4s in HBV transcription (85). Despite its good solubility in aqueous solutions and strong G4 binding, poor permeability across biological barriers, which characterizes most G4 ligands, limits the pharmacological application of B19 (104). Nonetheless, B19 is still considered a reference compound in G4 research. TMPyP4 The cationic porphyrin compound 5,10,15,20-tetrakis-(N-methyl-4-pyridyl)porphine (TMPyP4, 2, Figure 4) was proposed as a G4 binder because of its suitable physical properties, such as molecular size, planar core, positive charges and hydrophobicity, favorable for stacking with the G tetrads (105) (Figure 5B). Biophysical analysis demonstrated that TMPyP4 was actually able to stack and stabilize both parallel and antiparallel G4s, with mild selectivity for quadruplex over duplex DNA (95). Since then, it has been widely employed as a tool to study G4s, especially because of the availability of a negative control compound, TMPyP2 (3, Figure 4), which is a structural isomer with N-methyl-2-pyridyl residues on the porphine core. Intriguingly, TMPyP2 is sterically hindered from external stacking on the G4 with respect to TMPyP4, producing no biological effects (106,107). 
In biological assays, TMPyP4 was shown to inhibit human telomerase (IC50 = 6.5 ± 1.4 μM) (108) and downregulate the proto-oncogene c-myc expression as well as several c-myc-regulated genes containing G4-forming sequences. Such modulation resulted in in vivo antitumor activity in different models where the porphyrin was able to decrease tumor growth and prolong survival (109). In viruses, TMPyP4 was shown to stabilize G4s in the HIV-1 nef coding region and to induce their formation within the double-helix conformation. Interestingly, in the TZM-bl reporter cell line, which supports nef-dependent HIV-1 replication, the porphyrin inhibited viral infectivity in a dose-dependent manner (48). In addition, TMPyP4 administration was able to block viral replication in two different Jurkat-derived T-cell lines with established HIV-1 latency. Bambara’s research group demonstrated that the antiviral activity was coupled with an increased rate of apoptosis/death when compared to untreated cells, and that this effect was enhanced by association with DNA damage repair inhibitors (62). In HCV, TMPyP4 was found to stabilize RNA G4s and inhibit HCV C gene expression through a G4-mediated mechanism of action confirmed by an enhanced green fluorescent protein reporter gene system. In addition, in an infectious HCV culture system, administration of the porphyrin led to a dose-dependent decrease of viral RNA levels (89). TMPyP4 was also employed to investigate the role of G4s in EBOV L gene. It exerted high stabilization of the target G4 RNA in circular dichroism and RNA stop assays. More importantly, after treatment with increasing concentrations of the compound, transcription of the L gene was gradually reduced. To confirm target selectivity, a mutant non-G4-forming sequence was used as a negative control, where TMPyP4 did not produce significant inhibition of transcription. 
In addition, the porphyrin was found to inhibit replication of EBOV mini-genome, a cell-based approach that uses firefly luciferase as reporter protein and thus can be used as an efficient antiviral screening system (91). It is worth noting that the low selectivity of TMPyP4 towards G4 structures versus duplex DNA (110) may suggest the antiviral activity to be ascribed to multiple mechanisms of action, limiting its biological and clinical application. Perylenes and naphthalene diimides Perylenes represent a well-known family of G4 ligands, containing a differently substituted, large fused aromatic ring system: they are characterized by a hydrophobic heptacyclic central core, which is responsible for the binding to G quartets through π–π interactions, and by up to four protonated side chains. Accurate SAR studies on this scaffold pointed out two crucial features for G4 binding: the basicity of the system, which prevents the compound from self-aggregation, and the distance between the aromatic central core and the quaternarized nitrogen residue in the side chain, which modulates ligand solubility and affects G4 recognition. The cationic amino moieties in the lateral substituents are thought to regulate specificity for G4 versus duplex DNA (111). PIPER, N,N’-bis[2-(1-piperidino)-ethyl]-3,4,9,10-perylenetetracarboxylic diimide (4, Figure 4) is the lead compound of this class; it was shown to induce and stabilize G4 structures in telomeres (112), leading to telomere shortening, reduction of cell proliferation and tumorigenicity, and senescence (113). In the effort to improve the physicochemical properties of the perylene scaffold, progressive surface reduction led to the more promising class of naphthalene diimide (NDI) derivatives. Indeed, it was demonstrated that the dimensions of the planar core modulate the ability of this class of compounds to recognize different DNA conformations. 
In particular, in the cyclic condensed system at least four rings are required to efficiently target G4s (114). In addition, the NDI planar core can accommodate up to four side chains to enhance G4 affinity. These compounds were found to inhibit telomerase activity in the low micromolar range and to produce short-term cell growth inhibition against MCF-7 and A549 cancer cell lines (115). To improve DNA G4 alkylating properties, further modifications were introduced on the NDI scaffold, which include quinone methides precursors (116,117). These ligands revealed both reversible and irreversible binding properties toward telomeric DNA, with promising duplex versus quadruplex selectivity (118), and were found to impair the growth of different telomerase-positive cancer cell lines following telomerase activity inhibition (119–121). Crystallographic analyses of various NDI–telomere complexes provided a turning point for rational optimization of this class of compounds (122,123). Neidle et al. reported that the tested ligands promoted a parallel G4 topology, forming a 1:1 complex with the oligonucleotide. This stoichiometry resulted from the combination of binding site affinity and direct groove interactions that are highly influenced by the protonated moiety in the side chains, which interacts with DNA phosphates in the grooves. Despite their high molecular weight, NDIs are highly versatile structures, suitable for further medicinal chemistry modifications to improve their pharmacological profile (124,125). In the antiviral field, PIPER induced and stabilized G4 structures in the nef coding region of the HIV-1 genome (48). However, the best results were obtained with core-extended NDI derivatives (c-exNDIs, 5, Figure 4). This series of compounds, endowed with exceptional solubility properties, has been obtained by fusing the NDI core with a 1,4-dihydroquinoxaline heterocycle. 
Interestingly, the newly developed ligands displayed greater in vitro binding and stabilization activity on viral HIV-1 LTR G4s than on the human telomeric sequence, used as a cellular reference G4. Most importantly, the c-exNDIs exhibited very promising antiviral activity in the low nanomolar range (IC50 < 25 nM) against different strains of HIV-1, with very low cytotoxicity, yielding a wide and encouraging therapeutic window. The G4-related mechanism of action was proved by combining time-related antiviral and reporter assays, using a non-G4-forming LTR-mutant sequence as control. It is reasonable that the higher antiviral activity depends on the selectivity toward the viral G4s, as, during the infection, LTR and telomeric G4s are likely the most abundant species in the cell (61). The most active c-exNDI was also analyzed in HSV-1 infection. In vitro CD and Taq-polymerase stop assays indicated that the compound was able to bind and stabilize various G4-forming sequences of the HSV-1 genome. Mass spectrometry competition analysis revealed a stronger preference for HIV-1 G4s over HSV-1 G4s, but generally, viral G4s were preferentially bound when compared to the telomeric G4. Indeed, c-exNDI showed remarkable antiviral activity (IC50 = 18.3 ± 1.4 nM). The anti-herpetic effect was ascribed to inhibition of viral DNA replication, as gathered by time-of-addition assay and flow cytometry analysis using acyclovir as reference compound (82). Since c-exNDI selectivity towards HSV-1 G4s in vitro proved to be good but not outstanding, the marked anti-HSV-1 activity was likely due also to the massive presence of viral G4s in the cell nucleus, which was demonstrated to occur during HSV-1 replication (70). Pyridostatin Pyridostatin (PDS, 6, Figure 4) has been rationally designed on the structural features shared by known G4-binding ligands, as it comprises a potentially planar electron-rich aromatic surface and the ability to participate in hydrogen bonding. 
Moreover, the rotatable bonds provide a degree of flexibility that allows PDS to adapt to the dynamic nature of G4s. PDS strongly stabilized telomeric G4 with no effect on double-stranded DNA: as a result, the shelterin complex integrity was altered, triggering a DNA-damage response at telomeres (126). Numerous modifications have been introduced in the PDS scaffold to further explore the role of this class in anticancer therapy. Indeed, the obtained analogues showed remarkable growth-inhibitory effects in cancer cell lines and a complete arrest after long-term exposure to the drug. These results emphasize the high potential of these compounds to fine-tune their biological activity (127,128). In antiviral research, PDS has been used to study the role of G4s in EBV EBNA1 mRNA, where it enhanced the stability of the G4-forming sequence, decreasing EBNA1 synthesis level in a concentration-dependent fashion, both in vitro and in vivo. As a consequence, EBV-infected cells were less efficiently recognized by virus-specific T cells, although the mechanism of action still needs to be clarified (75). In HBV, PDS was used to unravel the positive regulatory role of G4s within the preS2/S gene promoter (85). A PDS analogue, namely PDP (7, Figure 4), was employed in HCV G4 research, along with TMPyP4. The PDP-induced stabilization of G4 structures located in the HCV RNA downregulated C gene expression. In vivo, PDP inhibited intracellular replication of different HCV genotypes through a confirmed G4-related mechanism of action, resulting in antiviral activity in the low micromolar range (89). Bisquinolinium derivatives Bisquinolinium compounds are characterized by an aromatic nucleus substituted with two protonated quinoline moieties. The first reported compounds present a dicarboxamide-pyridine or -triazine ring as central core: the most promising of these ligands have been shown to increase G4 stability in telomeres, with great selectivity over duplex DNA (129). 
These compounds are able to adopt an intramolecular syn-syn H-bond, which was proposed to be critical for G4 recognition, likely because the consequent rigidity of the compound promotes G-quartet overlap. On these bases, the central core was expanded without disrupting the H-bonds, leading to a new disubstituted-1,10-phenanthroline series that displays exceptional selectivity for G4s (130), due to the crescent-like shape which prevents such compounds from intercalating into duplex DNA (Figure 5C) (131,132). PhenDC3 (8, Figure 4), the best representative of this class, is a potent telomeric G4 ligand able to reduce telomerase processivity (133). PhenDC3 was used in KSHV to evaluate its potential role in inhibiting latent viral replication. The ligand was found to elicit a stress response in infected BCBL-1 cells and to stall the replication machinery both in the leading and lagging strands of the KSHV genome. Furthermore, treatment with PhenDC3 resulted in a dramatic reduction (60%) of episome copy number, with no effect on cell growth and proliferation. These data represent the first use of G4 ligands in targeting latent viral infections (77). PhenDC3 was also used in EBV, where it prevented binding of nucleolin to EBNA1 mRNA G4 and increased the endogenous EBNA1 levels in EBV-infected B cells and in cells derived from a nasopharyngeal carcinoma. These results indicate that the nucleolin–EBNA1 mRNA interaction can also be targeted by antiviral G4-ligands (74). A summary of G4 ligands and the viruses against which they have been tested is reported in Figure 3. DISCUSSION AND FUTURE PERSPECTIVES In the last decades, research on the role of G4s in the human genome has been quite challenging and promising, leading to the awareness that these high-order structures play key regulatory roles in biological pathways such as transcription, replication, translation and telomere maintenance. 
The development of G4 binders with encouraging anti-cancer activity has prompted researchers to identify new ways to exploit G4 structures in human diseases, e.g. viral infections. Because G4s are present both in cell and virus genomes, the challenge in developing antiviral G4 ligands reasonably consists in overcoming selectivity toward viral versus cellular G4s. A major limitation of the so far described G4 ligands is their large flat aromatic core that stacks on the G tetrad, which reduces the chances to discriminate among different G4s. Moreover, they are generally characterized by high-molecular weights and protonated side chains, which are necessary for loops and grooves interaction, but, on the other hand, may affect cellular uptake. Indeed, because of the low selectivity profile and poor drug-like properties, no G4 ligand has advanced beyond Phase II in the drug discovery pathway. Quarfloxin, a fluoroquinolone derivative compound developed by Hurley’s research group (134), is to date the only G4 ligand that has reached Phase II clinical trials but was withdrawn due to bioavailability related problems (35). However, several data presented in the literature indicate that, in general, a certain degree of selectivity is achievable towards the viral G4 of interest in comparison to the telomeric G4, i.e. the most abundant cellular G4 (135). In the case of HIV-1 G4s and c-exNDI compounds, the higher affinity towards the viral structure is likely caused by the extension of the NDI core and thus by the interaction with the viral G4 loop region, which is unique for this G4 (61). In general, loop and groove regions characterize each G4 and thus are amenable for selective recognition. 
Structural studies on cellular G4/G4 ligand complexes indicated that most G4-binding molecules interact with G4s through quasi-external stacking, in which the heteroaromatic chromophore of the small molecules is π–π stacked onto the face of an external G-quartet (136) (Figure 5) and onto the side chains positioned in the G4 grooves (94). It is therefore conceivable that the reported antiviral activity of G4 ligands is mediated by an increased interaction, hence affinity, with the groove/loop moiety of the viral G4s. To date, only one viral G4 structure has been resolved through NMR spectroscopy (49); therefore, future NMR and crystallographic resolutions of viral G4s and G4/G4 ligand complexes are necessary to define the architecture of viral G4s. This could help researchers identify possible unique G4 structures, which could lead to the design and development of selective molecules. In other cases, G4 ligands did not show significant selectivity for the viral versus telomeric G4s, and the G4s present in oncogene promoters were usually strongly bound by the tested compounds (82). Nonetheless, the data so far presented on the antiviral use of G4 ligands have shown in general very promising activity against a wide range of virus species. One possible explanation is that the amount of the viral G4s in the infected cells largely surpasses that of the cellular G4s (70). Indeed, cells are usually exploited to function as factories in the production of new viral genomes that are eventually assembled into new mature virus particles (see Figure 2 for the viral infection cycle). It is thus conceivable that the viral G4s become far more abundant than the cellular G4s during virus replication. In at least one case this eventuality has been demonstrated: in HSV-1 there is a sharp increase in the number of viral G4s during viral DNA replication (70). 
Combining the abundance of G4s per genome and the number of new genomes, the amount of viral G4s could exceed that of cellular G4s by several logs per cell. In addition, the so far identified viral G4s are usually key regulatory elements of the virus life cycle and their stabilization/unfolding by G4 ligands can likely explain the resulting massive virus inhibition. If this behaviour is demonstrated also in other viruses, it would be possible to exploit G4 ligands that are not strictly selective for the viral G4s. This scenario would highly and rapidly expand the research and pharmacological application of G4 ligands as antiviral agents. A further point to be addressed is the necessity to standardize methods to study the antiviral activity of the G4 ligands. One starting point should be the detection of the inhibitory activity of the ligand on the virus life cycle. If an effect is obtained, further investigation into the mechanism of action has to be performed. In this regard, the time-of-addition method (137) can be of assistance, as it indicates the last viral step at which the compound is active and thus narrows the possible molecular targets. However, because of the complexity and uniqueness of each virus, the investigation of the target and mechanism of action at the molecular level may not be straightforward. For example, PDS inhibited EBNA1 synthesis in vitro but not in cells, while PhenDC3 in cells led to the exact opposite effect, i.e. enhanced EBNA1 synthesis (74,75). It is likely that multiple G4-mediated mechanisms are involved in the observed outcomes. Finally, targeting G4s in the viral genomes leads to the exciting possibility of affecting viruses that undergo latency. The life cycle of these viruses, such as HIV and the herpes and papilloma virus families, comprises an initial acute infection and a subsequent latent infection. The latter is characterized by the maintenance of the virus genome in the human host for the entire life of the host. 
The latent virus may reactivate from time to time to produce new mature virus. Current therapies that normally target viral proteins fail to remove the latent virus, i.e. the virus genome, from its host. Selectively targeting the viral genome in a G4-mediated approach would allow removing not only the replicating virus but also the latent one, therefore eradicating so far incurable infective agents. In this picture it is worth considering the virus-induced manipulation of host chromatin. In recent years, studies about the role of chromatin in viral infections showed dynamic virus–host chromatin interactions and chromatin machinery modulation by virus encoded proteins (138). For example, the HSV-1 epigenetic regulation of viral chromatin by viral gene products plays a key role in determining whether the virus develops a lytic or latent infection (139). Considering the recent evidence reported by Hänsel-Hertsch et al. that G4 formation reflects the suppressive role of heterochromatin and that it occurs only in highly transcribed regulatory nucleosome-depleted chromatin regions (13), it would be interesting to understand how the virus and its G4s affect and could be affected by such a complex mechanism. To conclude, all the data reported in this review indicate that: i) G4 structures are crucial elements in the regulation of viruses’ life cycle, both in lytic and latent states; ii) G4 ligands efficiently act as antiviral agents. This should encourage researchers to continue investigating G4-binding small molecules: as a matter of fact, although quarfloxin clinical evaluation did not progress, its success in Phase I clinical trials, i.e. optimal toxicity profile (35), suggests that improvements of G4 ligand pharmacological profiles will very likely lead to concrete clinical applications of these compounds. 
Therefore, research in the near future will need to improve i) the understanding of G4 activity and regulation at the viral level, ii) the selectivity of G4 ligands toward viral versus cellular G4s, and iii) the drug-like properties of the antiviral G4 ligands to be employed in in vivo studies. G4-mediated antiviral drugs may represent a significant turning point in the management of viral infections, especially for people who cannot access immunization, such as immunocompromised patients or elderly people. In addition, the G4-mediated antiviral effects reported in latent infections (62) may pave the way for cutting-edge therapeutic approaches in the treatment of fatal human diseases related to latent viruses, such as AIDS and herpes- and HPV-related cancers.
FUNDING
Bill and Melinda Gates Foundation (GCE) [OPP1035881, OPP1097238]; European Research Council (ERC Consolidator) [615879]. Funding for open access charge: Bill and Melinda Gates Foundation [OPP1097238].
Conflict of interest statement. None declared.
REFERENCES
1. Burge S., Parkinson G.N., Hazel P., Todd A.K., Neidle S. Quadruplex DNA: sequence, topology and structure. Nucleic Acids Res. 2006; 34: 5402–5415. 2. Fay M.M., Lyons S.M., Ivanov P. RNA G-quadruplexes in biology: principles and molecular mechanisms. J. Mol. Biol. 2017; 429: 2127–2147. 3. Kwok C.K., Merrick C.J. G-quadruplexes: prediction, characterization, and biological application. Trends Biotechnol. 2017; 35: 997–1013. 4. Parrotta L., Ortuso F., Moraca F., Rocca R., Costa G., Alcaro S., Artese A. Targeting unimolecular G-quadruplex nucleic acids: a new paradigm for the drug discovery? Expert Opin. Drug Discov. 2014; 9: 1167–1187. 5. Kudlicki A.S. G-quadruplexes involving both strands of genomic DNA are highly abundant and colocalize with functional sites in the human genome.
PLoS One. 2016; 11: e0146174. 6. Wu R.Y., Zheng K.W., Zhang J.Y., Hao Y.H., Tan Z. Formation of DNA:RNA hybrid G-quadruplex in bacterial cells and its dominance over the intramolecular DNA G-quadruplex in mediating transcription termination. Angew. Chem. Int. Ed. Engl. 2015; 54: 2447–2451. 7. Zheng K.W., Xiao S., Liu J.Q., Zhang J.Y., Hao Y.H., Tan Z. Co-transcriptional formation of DNA:RNA hybrid G-quadruplex and potential function as constitutional cis element for transcription control. Nucleic Acids Res. 2013; 41: 5533–5541. 8. Huppert J.L. Four-stranded nucleic acids: structure, function and targeting of G-quadruplexes. Chem. Soc. Rev. 2008; 37: 1375–1384. 9. Bedrat A., Lacroix L., Mergny J.L. Re-evaluation of G-quadruplex propensity with G4Hunter. Nucleic Acids Res. 2016; 44: 1746–1759. 10. Huppert J.L., Balasubramanian S. Prevalence of quadruplexes in the human genome. Nucleic Acids Res. 2005; 33: 2908–2916. 11. Eddy J., Maizels N. Gene function correlates with potential for G4 DNA formation in the human genome. Nucleic Acids Res. 2006; 34: 3887–3896. 12. Chambers V.S., Marsico G., Boutell J.M., Di Antonio M., Smith G.P., Balasubramanian S. High-throughput sequencing of DNA G-quadruplex structures in the human genome. Nat. Biotechnol. 2015; 33: 877–881. 13. Hänsel-Hertsch R., Beraldi D., Lensing S.V., Marsico G., Zyner K., Parry A., Di Antonio M., Pike J., Kimura H., Narita M. et al. G-quadruplex structures mark human regulatory chromatin. Nat. Genet. 2016; 48: 1267–1272. 14.
Kouzine F., Wojtowicz D., Baranello L., Yamane A., Nelson S., Resch W., Kieffer-Kwon K.R., Benham C.J., Casellas R., Przytycka T.M. et al. Permanganate/S1 nuclease footprinting reveals non-B DNA structures with regulatory potential across a mammalian genome. Cell Syst. 2017; 4: 344–356. 15. Rhodes D., Lipps H.J. G-quadruplexes and their regulatory roles in biology. Nucleic Acids Res. 2015; 43: 8627–8637. 16. Kerkour A., Marquevielle J., Ivashchenko S., Yatsunyk L.A., Mergny J.L., Salgado G.F. High-resolution three-dimensional NMR structure of the KRAS proto-oncogene promoter reveals key features of a G-quadruplex involved in transcriptional regulation. J. Biol. Chem. 2017; 292: 8082–8091. 17. Phan A.T., Kuryavyi V., Luu K.N., Patel D.J. Structure of two intramolecular G-quadruplexes formed by natural human telomere sequences in K+ solution. Nucleic Acids Res. 2007; 35: 6517–6525. 18. Parkinson G.N., Lee M.P., Neidle S. Crystal structure of parallel quadruplexes from human telomeric DNA. Nature. 2002; 417: 876–880. 19. Biffi G., Tannahill D., McCafferty J., Balasubramanian S. Quantitative visualization of DNA G-quadruplex structures in human cells. Nat. Chem. 2013; 5: 182–186. 20. Henderson A., Wu Y., Huang Y.C., Chavez E.A., Platt J., Johnson F.B., Brosh R.M. Jr, Sen D., Lansdorp P.M. Detection of G-quadruplex DNA in mammalian cells. Nucleic Acids Res. 2014; 42: 860–869. 21. Laguerre A., Hukezalie K., Winckler P., Katranji F., Chanteloup G., Pirrotta M., Perrier-Cornet J.M., Wong J.M., Monchaud D. Visualization of RNA-quadruplexes in live cells. J. Am. Chem. Soc. 2015; 137: 8521–8525. 22.
Biffi G., Tannahill D., Miller J., Howat W.J., Balasubramanian S. Elevated levels of G-quadruplex formation in human stomach and liver cancer tissues. PLoS One. 2014; 9: e102711. 23. Simone R., Balendra R., Moens T.G., Preza E., Wilson K.M., Heslegrave A., Woodling N.S., Niccoli T., Gilbert-Jaramillo J., Abdelkarim S. et al. G-quadruplex-binding small molecules ameliorate C9orf72 FTD/ALS pathology in vitro and in vivo. EMBO Mol. Med. 2018; 10: 22–31. 24. Zhang S., Wu Y., Zhang W. G-quadruplex structures and their interaction diversity with ligands. ChemMedChem. 2014; 9: 899–911. 25. Huppert J.L. Structure, location and interactions of G-quadruplexes. FEBS J. 2010; 277: 3452–3458. 26. Tan Z., Tang J., Kan Z.Y., Hao Y.H. Telomere G-quadruplex as a potential target to accelerate telomere shortening by expanding the incomplete end-replication of telomere DNA. Curr. Top. Med. Chem. 2015; 15: 1940–1946. 27. Sissi C., Palumbo M. Telomeric G-quadruplex architecture and interactions with potential drugs. Curr. Pharm. Des. 2014; 20: 6489–6509. 28. Neidle S. Human telomeric G-quadruplex: the current status of telomeric G-quadruplexes as therapeutic targets in human cancer. FEBS J. 2010; 277: 1118–1125. 29. Juranek S.A., Paeschke K. Cell cycle regulation of G-quadruplex DNA structures at telomeres. Curr. Pharm. Des. 2012; 18: 1867–1872. 30. Rigo R., Palumbo M., Sissi C. G-quadruplexes in human promoters: a challenge for therapeutic applications. Biochim. Biophys. Acta. 2017; 1861: 1399–1413. 31. Cogoi S., Xodo L.E. G4 DNA in ras genes and its potential in cancer therapy. Biochim. Biophys. Acta.
2016; 1859: 663–674. 32. Chen B.J., Wu Y.L., Tanaka Y., Zhang W. Small molecules targeting c-Myc oncogene: promising anti-cancer therapeutics. Int. J. Biol. Sci. 2014; 10: 1084–1096. 33. Brooks T.A., Kendrick S., Hurley L. Making sense of G-quadruplex and i-motif functions in oncogene promoters. FEBS J. 2010; 277: 3459–3469. 34. Bidzinska J., Cimino-Reale G., Zaffaroni N., Folini M. G-quadruplex structures in the human genome as novel therapeutic targets. Molecules. 2013; 18: 12368–12395. 35. Balasubramanian S., Hurley L.H., Neidle S. Targeting G-quadruplexes in gene promoters: a novel anticancer strategy? Nat. Rev. Drug Discov. 2011; 10: 261–275. 36. Verma A., Halder K., Halder R., Yadav V.K., Rawal P., Thakur R.K., Mohd F., Sharma A., Chowdhury S. Genome-wide computational and expression analyses reveal G-quadruplex DNA motifs as conserved cis-regulatory elements in human and related species. J. Med. Chem. 2008; 51: 5641–5649. 37. Hershman S.G., Chen Q., Lee J.Y., Kozak M.L., Yue P., Wang L.S., Johnson F.B. Genomic distribution and functional analyses of potential G-quadruplex-forming sequences in Saccharomyces cerevisiae. Nucleic Acids Res. 2008; 36: 144–156. 38. Smargiasso N., Gabelica V., Damblon C., Rosu F., De Pauw E., Teulade-Fichou M.P., Rowe J.A., Claessens A. Putative DNA G-quadruplex formation within the promoters of Plasmodium falciparum var genes. BMC Genomics. 2009; 10: 362. 39. Beaume N., Pathak R., Yadav V.K., Kota S., Misra H.S., Gautam H.K., Chowdhury S. Genome-wide study predicts promoter-G4 DNA motifs regulate selective functions in bacteria: radioresistance of D.
radiodurans involves G4 DNA-mediated regulation. Nucleic Acids Res. 2013; 41: 76–89. 40. Perrone R., Lavezzo E., Riello E., Manganelli R., Palu G., Toppo S., Provvedi R., Richter S.N. Mapping and characterization of G-quadruplexes in Mycobacterium tuberculosis gene promoter regions. Sci. Rep. 2017; 7: 5743. 41. Harris L.M., Merrick C.J. G-quadruplexes in pathogens: a common route to virulence control? PLoS Pathog. 2015; 11: e1004562. 42. Metifiot M., Amrane S., Litvak S., Andreola M.L. G-quadruplexes in viruses: function and potential therapeutic applications. Nucleic Acids Res. 2014; 42: 12352–12366. 43. Musumeci D., Riccardi C., Montesarchio D. G-quadruplex forming oligonucleotides as anti-HIV agents. Molecules. 2015; 20: 17511–17532. 44. Platella C., Riccardi C., Montesarchio D., Roviello G.N., Musumeci D. G-quadruplex-based aptamers against protein targets in therapy and diagnostics. Biochim. Biophys. Acta. 2017; 1861: 1429–1447. 45. Amrane S., Kerkour A., Bedrat A., Vialet B., Andreola M.L., Mergny J.L. Topology of a DNA G-quadruplex structure formed in the HIV-1 promoter: a potential target for anti-HIV drug development. J. Am. Chem. Soc. 2014; 136: 5249–5252. 46. Perrone R., Nadai M., Frasson I., Poe J.A., Butovskaya E., Smithgall T.E., Palumbo M., Palu G., Richter S.N. A dynamic G-quadruplex region regulates the HIV-1 long terminal repeat promoter. J. Med. Chem. 2013; 56: 6521–6530. 47. Shen W., Gorelick R.J., Bambara R.A. HIV-1 nucleocapsid protein increases strand transfer recombination by promoting dimeric G-quartet formation. J. Biol. Chem. 2011; 286: 29838–29847. 48.
Perrone R., Nadai M., Poe J.A., Frasson I., Palumbo M., Palu G., Smithgall T.E., Richter S.N. Formation of a unique cluster of G-quadruplex structures in the HIV-1 Nef coding region: implications for antiviral activity. PLoS One. 2013; 8: e73121. 49. De Nicola B., Lech C.J., Heddi B., Regmi S., Frasson I., Perrone R., Richter S.N., Phan A.T. Structure and possible function of a G-quadruplex in the long terminal repeat of the proviral HIV-1 genome. Nucleic Acids Res. 2016; 44: 6442–6451. 50. Tosoni E., Frasson I., Scalabrin M., Perrone R., Butovskaya E., Nadai M., Palu G., Fabris D., Richter S.N. Nucleolin stabilizes G-quadruplex structures folded by the LTR promoter and silences HIV-1 viral transcription. Nucleic Acids Res. 2015; 43: 8884–8897. 51. Scalabrin M., Frasson I., Ruggiero E., Perrone R., Tosoni E., Lago S., Tassinari M., Palu G., Richter S.N. The cellular protein hnRNP A2/B1 enhances HIV-1 transcription by unfolding LTR promoter G-quadruplexes. Sci. Rep. 2017; 7: 45244. 52. Raiber E.A., Kranaster R., Lam E., Nikan M., Balasubramanian S. A non-canonical DNA structure is a binding motif for the transcription factor SP1 in vitro. Nucleic Acids Res. 2011; 40: 1499–1508. 53. Todd A.K., Neidle S. The relationship of potential G-quadruplex sequences in cis-upstream regions of the human genome to SP1-binding elements. Nucleic Acids Res. 2008; 36: 2700–2704. 54. Perrone R., Lavezzo E., Palu G., Richter S.N. Conserved presence of G-quadruplex forming sequences in the long terminal repeat promoter of lentiviruses. Sci. Rep. 2017; 7: 2018. 55. Perrone R., Butovskaya E., Daelemans D., Palu G., Pannecouque C., Richter S.N.
Anti-HIV-1 activity of the G-quadruplex ligand BRACO-19. J. Antimicrob. Chemother. 2014; 69: 3248–3258. 56. Sundquist W.I., Heaphy S. Evidence for interstrand quadruplex formation in the dimerization of human immunodeficiency virus 1 genomic RNA. Proc. Natl. Acad. Sci. U.S.A. 1993; 90: 3393–3397. 57. Piekna-Przybylska D., Sharma G., Bambara R.A. Mechanism of HIV-1 RNA dimerization in the central region of the genome and significance for viral evolution. J. Biol. Chem. 2013; 288: 24140–24150. 58. Aiken C., Trono D. Nef stimulates human immunodeficiency virus type 1 proviral DNA synthesis. J. Virol. 1995; 69: 5048–5056. 59. Miller M.D., Warmerdam M.T., Gaston I., Greene W.C., Feinberg M.B. The human immunodeficiency virus-1 nef gene product: a positive factor for viral infection and replication in primary lymphocytes and macrophages. J. Exp. Med. 1994; 179: 101–113. 60. Lyonnais S., Hounsou C., Teulade-Fichou M.P., Jeusset J., Le Cam E., Mirambeau G. G-quartets assembly within a G-rich DNA flap. A possible event at the center of the HIV-1 genome. Nucleic Acids Res. 2002; 30: 5276–5283. 61. Perrone R., Doria F., Butovskaya E., Frasson I., Botti S., Scalabrin M., Lago S., Grande V., Nadai M., Freccero M. et al. Synthesis, binding and antiviral properties of potent core-extended naphthalene diimides targeting the HIV-1 long terminal repeat promoter G-quadruplexes. J. Med. Chem. 2015; 58: 9639–9652. 62. Piekna-Przybylska D., Sharma G., Maggirwar S.B., Bambara R.A. Deficiency in DNA damage response, a new characteristic of cells infected with latent HIV-1. Cell Cycle. 2017; 16: 968–978. 63. Thellman N.M., Triezenberg S.J.
Herpes simplex virus establishment, maintenance, and reactivation: in vitro modeling of latency. Pathogens. 2017; 6: 28. 64. Sauerbrei A. Diagnosis, antiviral therapy, and prophylaxis of varicella-zoster virus infections. Eur. J. Clin. Microbiol. Infect. Dis. 2016; 35: 723–734. 65. Dunmire S.K., Hogquist K.A., Balfour H.H. Infectious mononucleosis. Curr. Top. Microbiol. Immunol. 2015; 390: 211–240. 66. Everly D., Sharma-Walia N., Sadagopan S., Chandran B. Herpesviruses and cancer. In: Robertson E. (ed.) Cancer Associated Viruses. 2012; Boston: Springer, 133–167. 67. Gupta R., Warren T., Wald A. Genital herpes. Lancet. 2007; 370: 2127–2137. 68. Biswas B., Kandpal M., Jauhari U.K., Vivekanandan P. Genome-wide analysis of G-quadruplexes in herpesvirus genomes. BMC Genomics. 2016; 17: 949. 69. Artusi S., Nadai M., Perrone R., Biasolo M.A., Palu G., Flamand L., Calistri A., Richter S.N. The herpes simplex virus-1 genome contains multiple clusters of repeated G-quadruplex: Implications for the antiviral activity of a G-quadruplex ligand. Antiviral Res. 2015; 118: 123–131. 70. Artusi S., Perrone R., Lago S., Raffa P., Di Iorio E., Palu G., Richter S.N. Visualization of DNA G-quadruplexes in herpes simplex virus 1-infected cells. Nucleic Acids Res. 2016; 44: 10343–10353. 71. Norseen J., Johnson F.B., Lieberman P.M. Role for G-quadruplex RNA binding by Epstein-Barr virus nuclear antigen 1 in DNA replication and metaphase chromosome attachment. J. Virol. 2009; 83: 10336–10346. 72. Tellam J.T., Zhong J., Lekieffre L., Bhat P., Martinez M., Croft N.P., Kaplan W., Tellam R.L., Khanna R.
mRNA Structural constraints on EBNA1 synthesis impact on in vivo antigen presentation and early priming of CD8+ T cells. PLoS Pathog. 2014; 10: e1004423. 73. Lista M.J., Martins R.P., Angrand G., Quillevere A., Daskalogianni C., Voisset C., Teulade-Fichou M.P., Fahraeus R., Blondel M. A yeast model for the mechanism of the Epstein-Barr virus immune evasion identifies a new therapeutic target to interfere with the virus stealthiness. Microb. Cell. 2017; 4: 305–307. 74. Lista M.J., Martins R.P., Billant O., Contesse M.A., Findakly S., Pochard P., Daskalogianni C., Beauvineau C., Guetta C., Jamin C. et al. Nucleolin directly mediates Epstein-Barr virus immune evasion through binding to G-quadruplexes of EBNA1 mRNA. Nat. Commun. 2017; 8: 16043. 75. Murat P., Zhong J., Lekieffre L., Cowieson N.P., Clancy J.L., Preiss T., Balasubramanian S., Khanna R., Tellam J. G-quadruplexes regulate Epstein-Barr virus-encoded nuclear antigen 1 mRNA translation. Nat. Chem. Biol. 2014; 10: 358–364. 76. Goncalves P.H., Ziegelbauer J., Uldrick T.S., Yarchoan R. Kaposi sarcoma herpesvirus-associated cancers and related diseases. Curr. Opin. HIV AIDS. 2016; 12: 47–56. 77. Madireddy A., Purushothaman P., Loosbroock C.P., Robertson E.S., Schildkraut C.L., Verma S.C. G-quadruplex-interacting compounds alter latent DNA replication and episomal persistence of KSHV. Nucleic Acids Res. 2016; 44: 3675–3694. 78. Tesini B.L., Epstein L.G., Caserta M.T. Clinical impact of primary infection with roseoloviruses. Curr. Opin. Virol. 2014; 9: 91–96. 79. Hill J.A., Zerr D.M. Roseoloviruses in transplant recipients: clinical consequences and prospects for treatment and prevention trials. Curr. Opin. Virol.
2014; 9: 53–60. 80. Arbuckle J.H., Medveczky M.M., Luka J., Hadley S.H., Luegmayr A., Ablashi D., Lund T.C., Tolar J., De Meirleir K., Montoya J.G. et al. The latent human herpesvirus-6A genome specifically integrates in telomeres of human chromosomes in vivo and in vitro. Proc. Natl. Acad. Sci. U.S.A. 2010; 107: 5563–5568. 81. Gilbert-Girard S., Gravel A., Artusi S., Richter S.N., Wallaschek N., Kaufer B.B., Flamand L. Stabilization of telomere G-quadruplexes interferes with human herpesvirus 6A chromosomal integration. J. Virol. 2017; 91: e00402. 82. Callegaro S., Perrone R., Scalabrin M., Doria F., Palu G., Richter S.N. A core extended naphtalene diimide G-quadruplex ligand potently inhibits herpes simplex virus 1 replication. Sci. Rep. 2017; 7: 2341. 83. Marusic M., Hosnjak L., Krafcikova P., Poljak M., Viglasky V., Plavec J. The effect of single nucleotide polymorphisms in G-rich regions of high-risk human papillomaviruses on structural diversity of DNA. Biochim. Biophys. Acta. 2016; 1861: 1229–1236. 84. Tluckova K., Marusic M., Tothova P., Bauer L., Sket P., Plavec J., Viglasky V. Human papillomavirus G-quadruplexes. Biochemistry. 2013; 52: 7207–7216. 85. Biswas B., Kandpal M., Vivekanandan P. A G-quadruplex motif in an envelope gene promoter regulates transcription and virion secretion in HBV genotype B. Nucleic Acids Res. 2017; 45: 11268–11280. 86. Satkunanathan S., Thorpe R., Zhao Y. The function of DNA binding protein nucleophosmin in AAV replication. Virology. 2017; 510: 46–54. 87. Kusov Y., Tan J., Alvarez E., Enjuanes L., Hilgenfeld R.
A G-quadruplex-binding macrodomain within the “SARS-unique domain” is essential for the activity of the SARS-coronavirus replication-transcription complex. Virology. 2015; 484: 313–322. 88. Tan J., Vonrhein C., Smart O.S., Bricogne G., Bollati M., Kusov Y., Hansen G., Mesters J.R., Schmidt C.L., Hilgenfeld R. The SARS-unique domain (SUD) of SARS coronavirus contains two macrodomains that bind G-quadruplexes. PLoS Pathog. 2009; 5: e1000428. 89. Wang S.R., Min Y.Q., Wang J.Q., Liu C.X., Fu B.S., Wu F., Wu L.Y., Qiao Z.X., Song Y.Y., Xu G.H. et al. A highly conserved G-rich consensus sequence in hepatitis C virus core gene represents a new anti-hepatitis C target. Sci. Adv. 2016; 2: e1501535. 90. Fleming A.M., Ding Y., Alenko A., Burrows C.J. Zika Virus Genomic RNA Possesses Conserved G-Quadruplexes Characteristic of the Flaviviridae Family. ACS Infect. Dis. 2016; 2: 674–681. 91. Wang S.R., Zhang Q.Y., Wang J.Q., Ge X.Y., Song Y.Y., Wang Y.F., Li X.D., Fu B.S., Xu G.H., Shu B. et al. Chemical targeting of a G-Quadruplex RNA in the Ebola virus L gene. Cell Chem. Biol. 2016; 23: 1113–1122. 92. Krafcikova P., Demkovicova E., Viglasky V. Ebola virus derived G-quadruplexes: thiazole orange interaction. Biochim. Biophys. Acta. 2016; 1861: 1321–1328. 93. Islam M.K., Jackson P.J., Rahman K.M., Thurston D.E. Recent advances in targeting the telomeric G-quadruplex DNA sequence with small molecules as a strategy for anticancer therapies. Future Med. Chem. 2016; 8: 1259–1290. 94. Neidle S. Quadruplex nucleic acids as targets for anticancer therapeutics. Nat. Rev. Chem. 2017; 1: 0041. 95. Ou T.M., Lu Y.J., Tan J.H., Huang Z.S., Wong K.Y., Gu L.Q.
G-quadruplexes: targets in anticancer drug design. ChemMedChem. 2008; 3: 690–713. 96. Perry P.J., Reszka A.P., Wood A.A., Read M.A., Gowan S.M., Dosanjh H.S., Trent J.O., Jenkins T.C., Kelland L.R., Neidle S. Human telomerase inhibition by regioisomeric disubstituted amidoanthracene-9,10-diones. J. Med. Chem. 1998; 41: 4873–4884. 97. Sun D., Thompson B., Cathers B.E., Salazar M., Kerwin S.M., Trent J.O., Jenkins T.C., Neidle S., Hurley L.H. Inhibition of human telomerase by a G-quadruplex-interactive compound. J. Med. Chem. 1997; 40: 2113–2116. 98. Read M.A., Wood A.A., Harrison J.R., Gowan S.M., Kelland L.R., Dosanjh H.S., Neidle S. Molecular modeling studies on G-quadruplex complexes of telomerase inhibitors: structure-activity relationships. J. Med. Chem. 1999; 42: 4538–4546. 99. Harrison R.J., Gowan S.M., Kelland L.R., Neidle S. Human telomerase inhibition by substituted acridine derivatives. Bioorg. Med. Chem. Lett. 1999; 9: 2463–2468. 100. Read M., Harrison R.J., Romagnoli B., Tanious F.A., Gowan S.H., Reszka A.P., Wilson W.D., Kelland L.R., Neidle S. Structure-based design of selective and potent G quadruplex-mediated telomerase inhibitors. Proc. Natl. Acad. Sci. U.S.A. 2001; 98: 4844–4849. 101. Harrison R.J., Cuesta J., Chessari G., Read M.A., Basra S.K., Reszka A.P., Morrell J., Gowan S.M., Incles C.M., Tanious F.A. et al. Trisubstituted acridine derivatives as potent and selective telomerase inhibitors. J. Med. Chem. 2003; 46: 4463–4476. 102. Burger A.M., Dai F., Schultes C.M., Reszka A.P., Moore M.J., Double J.A., Neidle S.
The G-quadruplex-interactive molecule BRACO-19 inhibits tumor growth, consistent with telomere targeting and interference with telomerase function. Cancer Res. 2005; 65: 1489–1496. 103. Gowan S.M., Harrison J.R., Patterson L., Valenti M., Read M.A., Neidle S., Kelland L.R. A G-quadruplex-interactive potent small-molecule inhibitor of telomerase exhibiting in vitro and in vivo antitumor activity. Mol. Pharmacol. 2002; 61: 1154–1162. 104. Taetz S., Baldes C., Murdter T.E., Kleideiter E., Piotrowska K., Bock U., Haltner-Ukomadu E., Mueller J., Huwer H., Schaefer U.F. et al. Biopharmaceutical characterization of the telomerase inhibitor BRACO19. Pharm. Res. 2006; 23: 1031–1037. 105. Parkinson G.N., Ghosh R., Neidle S. Structural basis for binding of porphyrin to human telomeres. Biochemistry. 2007; 46: 2390–2397. 106. Han F.X., Wheelhouse R.T., Hurley L.H. Interactions of TMPyP4 and TMPyP2 with quadruplex DNA. Structural basis for the differential effects on telomerase inhibition. J. Am. Chem. Soc. 1999; 121: 3561–3570. 107. Wheelhouse R.T., Sun D., Han H., Han F.X., Hurley L.H. Cationic porphyrins as telomerase inhibitors: the interaction of tetra-(N-methyl-4-pyridyl)porphine with quadruplex DNA. J. Am. Chem. Soc. 1998; 120: 3261–3262. 108. Izbicka E., Wheelhouse R.T., Raymond E., Davidson K.K., Lawrence R.A., Sun D., Windle B.E., Hurley L.H., Von Hoff D.D. Effects of cationic porphyrins as G-quadruplex interactive agents in human tumor cells. Cancer Res. 1999; 59: 639–644. 109. Grand C.L., Han H., Munoz R.M., Weitman S., Von Hoff D.D., Hurley L.H., Bearss D.J. The cationic porphyrin TMPyP4 down-regulates c-MYC and human telomerase reverse transcriptase expression and inhibits tumor growth in vivo. Mol.
Cancer Ther. 2002; 1: 565–573. 110. Martino L., Pagano B., Fotticchia I., Neidle S., Giancola C. Shedding light on the interaction between TMPyP4 and human telomeric quadruplexes. J. Phys. Chem. B. 2009; 113: 14779–14786. 111. Samudrala R., Zhang X., Wadkins R.M., Mattern D.L. Synthesis of a non-cationic, water-soluble perylenetetracarboxylic diimide and its interactions with G-quadruplex-forming DNA. Bioorg. Med. Chem. 2007; 15: 186–193. 112. Fedoroff O.Y., Salazar M., Han H., Chemeris V.V., Kerwin S.M., Hurley L.H. NMR-Based model of a telomerase-inhibiting compound bound to G-quadruplex DNA. Biochemistry. 1998; 37: 12367–12374. 113. Taka T., Huang L., Wongnoppavich A., Tam-Chang S.W., Lee T.R., Tuntiwechapikul W. Telomere shortening and cell senescence induced by perylene derivatives in A549 human lung cancer cells. Bioorg. Med. Chem. 2013; 21: 883–890. 114. Sissi C., Lucatello L., Paul Krapcho A., Maloney D.J., Boxer M.B., Camarasa M.V., Pezzoni G., Menta E., Palumbo M. Tri-, tetra- and heptacyclic perylene analogues as new potential antineoplastic agents based on DNA telomerase inhibition. Bioorg. Med. Chem. 2007; 15: 555–562. 115. Cuenca F., Greciano O., Gunaratnam M., Haider S., Munnur D., Nanjunda R., Wilson W.D., Neidle S. Tri- and tetra-substituted naphthalene diimides as potent G-quadruplex ligands. Bioorg. Med. Chem. Lett. 2008; 18: 1668–1673. 116. Doria F., Richter S.N., Nadai M., Colloredo-Mels S., Mella M., Palumbo M., Freccero M. BINOL-amino acid conjugates as triggerable carriers of DNA-targeted potent photocytotoxic agents. J. Med. Chem. 2007; 50: 6570–6579. 117. Richter S.N., Maggi S., Mels S.C., Palumbo M., Freccero M.
Binol quinone methides as bisalkylating and DNA cross-linking agents. J. Am. Chem. Soc. 2004; 126: 13973–13979. 118. Di Antonio M., Doria F., Richter S.N., Bertipaglia C., Mella M., Sissi C., Palumbo M., Freccero M. Quinone methides tethered to naphthalene diimides as selective G-quadruplex alkylating agents. J. Am. Chem. Soc. 2009; 131: 13132–13141. 119. Doria F., Nadai M., Folini M., Di Antonio M., Germani L., Percivalle C., Sissi C., Zaffaroni N., Alcaro S., Artese A. et al. Hybrid ligand-alkylating agents targeting telomeric G-quadruplex structures. Org. Biomol. Chem. 2012; 10: 2798–2806. 120. Doria F., Nadai M., Folini M., Scalabrin M., Germani L., Sattin G., Mella M., Palumbo M., Zaffaroni N., Fabris D. et al. Targeting loop adenines in G-quadruplex by a selective oxirane. Chemistry. 2013; 19: 78–81. 121. Nadai M., Doria F., Di Antonio M., Sattin G., Germani L., Percivalle C., Palumbo M., Richter S.N., Freccero M. Naphthalene diimide scaffolds with dual reversible and covalent interaction properties towards G-quadruplex. Biochimie. 2011; 93: 1328–1340. 122. Collie G.W., Promontorio R., Hampel S.M., Micco M., Neidle S., Parkinson G.N. Structural basis for telomeric G-quadruplex targeting by naphthalene diimide ligands. J. Am. Chem. Soc. 2012; 134: 2723–2731. 123. Micco M., Collie G.W., Dale A.G., Ohnmacht S.A., Pazitna I., Gunaratnam M., Reszka A.P., Neidle S. Structure-based design and evaluation of naphthalene diimide G-quadruplex ligands as telomere targeting agents in pancreatic cancer cells. J. Med. Chem. 2013; 56: 2959–2974. 124. Marchetti C., Zyner K.G., Ohnmacht S.A., Robson M., Haider S.M., Morton J.P., Marsico G., Vo T., Laughlin-Toth S., Ahmed A.A. et al.
Targeting multiple effector pathways in pancreatic ductal adenocarcinoma with a G-quadruplex-binding small molecule. J. Med. Chem. 2018; doi:10.1021/acs.jmedchem.7b01781. 125. Ohnmacht S.A., Marchetti C., Gunaratnam M., Besser R.J., Haider S.M., Di Vita G., Lowe H.L., Mellinas-Gomez M., Diocou S., Robson M. et al. A G-quadruplex-binding compound showing anti-tumour activity in an in vivo model for pancreatic cancer. Sci. Rep. 2015; 5: 11385. 126. Rodriguez R., Muller S., Yeoman J.A., Trentesaux C., Riou J.F., Balasubramanian S. A novel small molecule that alters shelterin integrity and triggers a DNA-damage response at telomeres. J. Am. Chem. Soc. 2008; 130: 15758–15759. 127. Muller S., Kumari S., Rodriguez R., Balasubramanian S. Small-molecule-mediated G-quadruplex isolation from human cells. Nat. Chem. 2010; 2: 1095–1098. 128. Muller S., Sanders D.A., Di Antonio M., Matsis S., Riou J.F., Rodriguez R., Balasubramanian S. Pyridostatin analogues promote telomere dysfunction and long-term growth inhibition in human cancer cells. Org. Biomol. Chem. 2012; 10: 6537–6546. 129. Riou J.F., Guittat L., Mailliet P., Laoui A., Renou E., Petitgenet O., Megnin-Chanet F., Helene C., Mergny J.L. Cell senescence and telomere shortening induced by a new series of specific G-quadruplex DNA ligands. Proc. Natl. Acad. Sci. U.S.A. 2002; 99: 2672–2677. 130. De Cian A., Delemos E., Mergny J.L., Teulade-Fichou M.P., Monchaud D. Highly efficient G-quadruplex recognition by bisquinolinium compounds. J. Am. Chem. Soc. 2007; 129: 1856–1857. 131. Neidle S. Quadruplex nucleic acids as novel therapeutic targets. J. Med. Chem. 2016; 59: 5987–6011. 132.
Chung W.J., Heddi B., Hamon F., Teulade-Fichou M.P., Phan A.T. Solution structure of a G-quadruplex bound to the bisquinolinium compound Phen-DC(3). Angew. Chem., Int. Ed. Engl. 2014; 53: 999– 1002. Google Scholar CrossRef Search ADS 133. De Cian A., Cristofari G., Reichenbach P., De Lemos E., Monchaud D., Teulade-Fichou M.P., Shin-Ya K., Lacroix L., Lingner J., Mergny J.L. Reevaluation of telomerase inhibition by quadruplex ligands and their mechanisms of action. Proc. Natl. Acad. Sci. U.S.A. 2007; 104: 17347– 17352. Google Scholar CrossRef Search ADS PubMed 134. Duan W., Rangan A., Vankayalapati H., Kim M.Y., Zeng Q., Sun D., Han H., Fedoroff O.Y., Nishioka D., Rha S.Y.et al. Design and synthesis of fluoroquinophenoxazines that interact with human telomeric G-quadruplexes and their biological effects. Mol. Cancer Ther. 2001; 1: 103– 120. Google Scholar PubMed 135. Palm W., de Lange T. How shelterin protects mammalian telomeres. Annu. Rev. Genet. 2008; 42: 301– 334. Google Scholar CrossRef Search ADS PubMed 136. Campbell N.H., Parkinson G.N., Reszka A.P., Neidle S. Structural basis of DNA quadruplex recognition by an acridine drug. J. Am. Chem. Soc. 2008; 130: 6722– 6724. Google Scholar CrossRef Search ADS PubMed 137. Daelemans D., Pauwels R., De Clercq E., Pannecouque C. A time-of-drug addition approach to target identification of antiviral compounds. Nat. Protoc. 2011; 6: 925– 933. Google Scholar CrossRef Search ADS PubMed 138. Knipe D.M., Lieberman P.M., Jung J.U., McBride A.A., Morris K.V., Ott M., Margolis D., Nieto A., Nevels M., Parks R.J.et al. Snapshots: chromatin control of viral infection. Virology . 2013; 435: 141– 156. Google Scholar CrossRef Search ADS PubMed 139. Knipe D.M., Cliffe A. Chromatin control of herpes simplex virus lytic and latent infection. Nat. Rev. Microbiol. 2008; 6: 211– 221. Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
ZYH005, a novel DNA intercalator, overcomes all-trans retinoic acid resistance in acute promyelocytic leukemia
Tong, Qingyi; You, Huijuan; Chen, Xintao; Wang, Kongchao; Sun, Weiguang; Pei, Yufeng; Zhao, Xiaodan; Yuan, Ming; Zhu, Hucheng; Luo, Zengwei; Zhang, Yonghui
doi: 10.1093/nar/gky202 pmid: 29554366
Abstract Although all-trans retinoic acid (ATRA) has transformed acute promyelocytic leukemia (APL) from the most fatal to the most curable hematological cancer, a clinical challenge remains: many high-risk APL patients fail to achieve complete molecular remission, or relapse and become resistant to ATRA. Herein, we report that 5-(4-methoxyphenethyl)-[1,3]dioxolo[4,5-j]phenanthridin-6(5H)-one (ZYH005) exhibits specific anticancer effects on APL and ATRA-resistant APL in vitro and in vivo, while showing negligible cytotoxic effects on non-cancerous cell lines and peripheral blood mononuclear cells from healthy donors. Using single-molecule magnetic tweezers and molecular docking, we demonstrate that ZYH005 is a DNA intercalator. Further mechanistic studies show that ZYH005 triggers DNA damage and caspase-dependent degradation of the PML-RARα fusion protein. As a result, APL and ATRA-resistant APL cells undergo apoptosis upon ZYH005 treatment, and this apoptosis-inducing effect is even stronger than that of arsenic trioxide and anticancer agents including 5-fluorouracil, cisplatin and doxorubicin. Moreover, ZYH005 represses leukemia development in vivo and prolongs the survival of both APL and ATRA-resistant APL mice. To our knowledge, ZYH005 is the first synthetic phenanthridinone derivative that functions as a DNA intercalator and can serve as a potential candidate drug for APL, particularly for ATRA-resistant APL. INTRODUCTION Normally, cells are equipped with DNA damage response (DDR) pathways, so that damage to DNA is detected and repaired. However, most cancer cells have relaxed DDR pathways and, more importantly, are capable of ignoring DNA damage in order to achieve high proliferation rates. This increases their susceptibility to DNA-damaging drugs compared to that of normal cells, since replication of damaged DNA increases the possibility of cell death (1,2).
Consequently, the concept of targeting DNA in cancer therapy has inspired the development of numerous anticancer drugs, particularly DNA-binding drugs such as cisplatin, carboplatin, oxaliplatin, mitoxantrone, amsacrine, temozolomide and anthracyclines (3–5). Despite dose-limiting side effects, the extensive use of these DNA-binding drugs in clinical practice has revealed their utility, and they will continue to be a staple in anticancer regimens. Meanwhile, the discovery of new DNA-binding drugs with improved effects and a high specificity for cancer cells is of great importance. DNA-binding drugs include covalent binding ligands (alkylating agents) and non-covalent ligands (groove binders and intercalators) (5). DNA intercalators, which bind DNA by inserting aromatic moieties between adjacent DNA base pairs, have attracted considerable attention due to their potent anticancer activity. For example, several acridine and anthraquinone derivatives (i.e. anthracycline) are excellent DNA intercalators that are currently available on the market and widely used as anticancer agents (6,7). Acridine and anthraquinone represent two of the main frameworks of DNA intercalators, and the other well-known framework is phenanthridine (6). For many decades, phenanthridine derivatives have been recognized for their efficient DNA intercalative binding capability (8) and have been applied as gold-standard DNA/RNA-fluorescent markers (ethidium bromide, EB) and probes for DNA (propidium iodide, PI); however, they are also considered disadvantageous due to their potential genotoxic and mutagenic effects. In the past decade, Amaryllidaceae alkaloids with a phenanthridinone rather than phenanthridine skeleton, such as narciclasine, cis-dihydronarciclasine, 7-deoxypancratistatin, lycoricidine and pancratistatine, have been reported to have potent and selective anticancer effects (9–12). 
Importantly, narciclasine and other natural phenanthridinone alkaloids are considered potentially useful drug leads for the treatment of apoptosis-resistant cancers and metastatic cancers (13–15). In addition, synthetic phenanthridinone derivatives which exhibited greater anticancer activity than several standard chemotherapeutics (Taxol, doxorubicin, gemcitabine and cisplatin) also have been reported (16). And some phenanthridinone derivatives have been proven to be potent PARP inhibitors (6(5H)-phenanthridinone, PJ34) (17,18), topoisomerase I inhibitors (ARC-111 and its analogues) (19,20), all of which exhibit pronounced and targeted antitumor activity. Therefore, phenanthridinone alkaloids and their derivatives have been under intense scrutiny and are currently being pursued as clinical candidates for cancer treatment (12,14,21,22). These findings prompted us to develop novel phenanthridinone derivatives with selective anticancer effects and to explore whether these derivatives act as DNA intercalators in the same manner as phenanthridines. Acute promyelocytic leukemia (APL) is the M3 subtype of acute myeloid leukemia (AML), with 98% of patients harboring the t(15;17) chromosomal translocation, involving the fusion of the genes encoding PML (promyelocytic leukemia) and RARα (retinoic acid receptor alpha) (23–26). In addition, impaired homologous recombination (HR) (25,27,28), non-homologous end-joining (NHEJ) repair (25) and base excision repair (BER) (27) pathways have been found in APL, all of which are considered essential contributors to APL pathogenesis. All-trans retinoic acid (ATRA) and arsenic trioxide (ATO) are PML-RARα targeting drugs that bind to the RARα and PML moieties, respectively. These two drugs have transformed APL from the most fatal to the most curable hematological cancer (23,29,30). Despite the unprecedented success, many high-risk patients fail to achieve complete molecular remission or relapse and become resistant to ATRA (31,32). 
Therefore, alternative agents must be developed, particularly for relapsed APL with ATRA resistance. We previously isolated and identified 24 new Amaryllidaceae alkaloids (33–35), and demonstrated that N-methylhemeanthidine chloride, which has a phenanthridine core, exhibited prominent anticancer effects on pancreatic cancer and AML in vitro and in vivo (36,37). Later, we developed an environmentally friendly and inexpensive procedure in which the phenanthridinone skeleton was synthesized through Na₂S₂O₈-promoted decarboxylative cyclization of biaryl-2-oxamic acid (38). This is of great significance because the development of phenanthridinone alkaloids is hampered by a lack of efficient supply, since total syntheses are complex and biotechnological approaches are entirely lacking. In this study, we synthesized 20 phenanthridinone-based alkaloids using this method and found that ZYH005 exerts specific anticancer effects on APL and ATRA-resistant APL models. Mechanistically, ZYH005 exerts its effects through intercalative binding to DNA, which subsequently triggers DNA double-strand breaks and the accumulation of DNA damage, and causes G2/M cell cycle arrest and caspase-dependent degradation of PML-RARα. As a result, APL and ATRA-resistant APL cells underwent apoptosis upon ZYH005 treatment. Notably, this apoptosis-inducing effect of ZYH005 was found to be stronger than that of ATO and widely used anticancer agents (5-fluorouracil, cisplatin and doxorubicin). To the best of our knowledge, this is the first report of a synthesized phenanthridinone alkaloid that functions as a DNA intercalator and has potential therapeutic value for the treatment of both APL and relapsed APL with ATRA resistance. MATERIALS AND METHODS Synthesis of ZYH005 and its analogues Briefly, the phenanthridinone was synthesized from a ketoacid through transition metal-free decarboxylative cyclization (38).
Then, the appropriate side chain was introduced to obtain ZYH005 through a nucleophilic substitution reaction. The compounds ZYH001-ZYH004 and ZYH006-ZYH020 were obtained using the same procedure as that used to synthesize ZYH005, with different starting materials. A detailed description of the process by which these compounds were synthesized, as well as their 1H NMR data (13C NMR data for selected compounds), is included in the Supplementary Information. Reagents All-trans retinoic acid, nitroblue tetrazolium (NBT) and dimethylsulfoxide (DMSO) were purchased from Sigma-Aldrich (MO, USA). Arsenic trioxide was obtained from Tongji Hospital (Wuhan, China). Cremophor® EL was purchased from Aladdin Chemicals (Shanghai, China). The cell lysis buffer, BCA Protein Assay Kit and ACK buffer were purchased from Beyotime Biotechnology (Shanghai, China). Matrigel was purchased from BD Biosciences (CA, USA). The Wright-Giemsa Staining Kit was purchased from Jiancheng Bioengineering Institute (Nanjing, China). Z-VAD-FMK, 5-fluorouracil and cisplatin were purchased from Selleck Chemicals (Shanghai, China). Doxorubicin, idarubicin, chloroquine and MG-132 were purchased from MedChemExpress (NJ, USA). Antibodies against PARP, cleaved-caspase-3, Bcl-2, Bax, Bak, γH2AX, β-actin, GAPDH and the appropriate secondary antibodies were purchased from Cell Signaling Technology (MA, USA). Polyclonal antibodies against RARα (C-20, sc-551) and PKCδ (C20, sc-937) were purchased from Santa Cruz Biotechnology (CA, USA). Mice BALB/c-nu/nu mice were purchased from Beijing HFK Bioscience Co., Ltd. (Beijing, China), and FVB/N mice were purchased from Beijing Vital River Laboratory Animal Technology Co., Ltd. (Beijing, China). The mice were housed under specific pathogen-free (SPF) conditions and handled in accordance with the Guidelines for the Care and Use of Laboratory Animals of Tongji Medical College of Huazhong University of Science and Technology. 
Cell lines and cell culture The APL cell lines NB4, NB4-MR2 (39), NB4-LR2 (40), hPML-RARα and mutant hPML-RARα transgenic mouse-derived leukemic cells (41) were kind gifts from Professor Guoqiang Chen and Yingli Wu (Shanghai Jiao Tong University, Shanghai, China). The DND-41, KOPT-K1 and CUTLL1 cell lines were kindly provided by Dr Warren Pear (University of Pennsylvania, USA) and Professor Hudan Liu (Wuhan University, China) and maintained as previously described (42). The MOLT4, K562, Kg1a, THP-1, Kasumi, and HL60 cell lines were purchased from American Type Culture Collection (VA, USA). HPDE6-C7 and NCM460 cell lines were purchased from the Institute of Biochemistry and Cell Biology of the Chinese Academy of Sciences (Shanghai, China). Cells were incubated at 37°C in a humidified atmosphere of 5% CO2/95% air (v/v). The leukemia cells were cultured in RPMI 1640 medium (Hyclone, UT, USA), and the other cells were cultured in Dulbecco's modified Eagle's medium (Hyclone, UT, USA); both types of media were supplemented with 10% (v/v) heat-inactivated FBS. All cell lines were cultured for fewer than 6 months after resuscitation and tested for mycoplasma contamination using the MycoAlert Mycoplasma Detection kit (Lonza, Slough, UK). Cell viability assay Cell viability was measured with an MTS Kit (Promega, WI, USA). Briefly, the cells were seeded into 96-well plates at a density of 5000 cells/well and incubated with or without drugs. After 24 h, 20 μl of MTS was added to the wells. The cells were subsequently incubated in the dark for 3 h; then, the optical density values were measured at 490 nm. The 50% inhibitory concentration (IC50) for each drug was calculated using the SPSS software.
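The IC50 calculation mentioned above can be illustrated with a minimal sketch. The source states only that SPSS was used and does not specify the fitting model, so the log-linear interpolation below is an assumption chosen for simplicity, not the authors' procedure. The function name `ic50_log_interp` and the viability readings are hypothetical and are not data from the study.

```python
import math


def ic50_log_interp(concs, viability):
    """Estimate IC50 by linear interpolation on log10(concentration).

    concs: ascending drug concentrations (> 0), e.g. in uM.
    viability: percent viability relative to the untreated control,
               one value per concentration.
    Returns None if viability never crosses the 50% level.
    """
    points = list(zip(concs, viability))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        # Look for the first pair of doses that brackets 50% viability.
        if v_lo >= 50.0 >= v_hi and v_lo != v_hi:
            frac = (v_lo - 50.0) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + frac * (
                math.log10(c_hi) - math.log10(c_lo)
            )
            return 10 ** log_ic50
    return None


# Hypothetical dose-response readings (illustrative only).
concs = [0.01, 0.02, 0.04, 0.08, 0.16]   # uM
viab = [95.0, 80.0, 60.0, 40.0, 5.0]     # % of control
ic50 = ic50_log_interp(concs, viab)
print(f"Estimated IC50: {ic50:.4f} uM")
```

Interpolating on a logarithmic concentration axis reflects the usual two-fold dilution layout of such assays. A four-parameter logistic fit would be the more common choice when enough dose points are available.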
Isolation and culture of peripheral blood mononuclear cells (PBMCs) Experiments involving blood donated by healthy volunteers (a healthy 31-year-old male, a healthy 31-year-old female, and a healthy 27-year-old male) were conducted with prior approval of the Research Ethics Board of the Huazhong University of Science and Technology, with informed consent obtained from all subjects. Peripheral blood mononuclear cells (PBMCs) from healthy volunteers 1, 2 and 3 (PBMCs-V1, PBMCs-V2, PBMCs-V3) were collected and isolated by density gradient centrifugation (12). The isolated PBMCs were cultured in RPMI 1640 medium, supplemented and maintained in the same way as for the APL cell lines. PBMCs were treated within 1 h of collection with various doses of ZYH005 for up to 24 h; the cell viability was measured with an MTS Kit. Single-molecule magnetic tweezers assay The 6618 bp dsDNA (∼50% GC) was prepared by PCR using TaKaRa Ex Taq DNA Polymerase (TaKaRa, Dalian, China) on a lambda phage DNA template (Thermo Scientific, MA, USA). The primers (Sangon Biotech, Shanghai, China) were labeled with 5′-biotin and 5′-thiol, respectively. The DNA was tethered between a streptavidin-coated paramagnetic bead (Dynabeads M-280, Thermo Scientific, MA, USA) and a (3-aminopropyl)triethoxysilane (APTES, Sigma-Aldrich, MO, USA)-functionalized coverslip, through a biotin-streptavidin interaction and a covalent bond (sulfo-SMCC crosslinker, Thermo Scientific, MA, USA). The height of the bead above the coverslip surface was measured based on the bead image using single-molecule magnetic tweezers (43). Extension changes in dsDNA were measured based on force-jump measurements using single-molecule magnetic tweezers (43). At each force, the magnet was held for 5 s to measure extension changes at equilibrium from the average bead height.
The torsional constraint DNA for the supercoiling assay was prepared by ligating the 6573 bp DNA with a 510 bp dsDNA handle containing ∼50 biotinylated dUTP (Biotin-11-dUTP, Thermo Scientific, MA, USA) at one end and a 510 bp handle containing ∼50 digoxigenin-dUTP (Digoxigenin-11-dUTP, Roche, Basel, Switzerland) as reported in a previous study (44). dsDNA was tethered between a streptavidin coated 1-μm diameter paramagnetic bead (Dynabeads MyOne, Thermo Scientific, MA, USA) and an anti-digoxigenin antibody (Thermo Scientific, MA, USA) coated coverslip. The rotation of the magnetic bead was controlled by rotating a permanent magnet pair using a rotary motor as reported in a previous study (44). The rotation-extension curve data were recorded by measuring DNA extension for 20 s at different magnet turns. Molecular docking Compounds were subjected to molecular docking experiments using ICM 3.8.1 modeling software (45) on an Intel i7 4960 processor (MolSoft LLC, CA, USA). Molecules were built with Chemdraw and optimized at molecular mechanical and semiempirical levels using Open Babel GUI. The X-ray crystallographic structure of the DNA dodecamer d (CGTACG)2 with doxorubicin was selected from the Protein Data Bank (PDB code: 2GB9) for the docking study (46). Ligand-binding pocket residues were selected using graphical tools in the ICM software to create the boundaries of the docking search. In the docking calculation, potential energy maps of the receptor were calculated using default parameters. Compounds were imported into ICM and an index file was created. Conformational sampling was based on the Monte Carlo procedure, and finally the lowest-energy and the most favorable orientation of the ligand were selected. Immunofluorescence assay Cells treated with or without ZYH005 were placed on slides and fixed in ice-cold 4% paraformaldehyde for 15 min. 
After permeabilization with 0.1% (v/v) Triton X-100 in PBS and blocking with 2% (w/v) bovine serum albumin (BSA) in PBS, the cells were incubated with an antibody against γH2AX overnight at 4°C. Then, the cells were stained with the appropriate secondary antibody for 1 h, mounting medium with DAPI (Invitrogen, MA, USA) was added to the slides, and cover slips were mounted onto the slides in a manner that did not create bubbles. Fluorescence signals were detected with an LSM710 confocal microscope (ZEISS, Weimar, Germany). For the quantitative analysis, γH2AX foci were counted by eye during the microscopic examination and imaging process using a 100× objective. At least 100 cells per group were examined, and three independent experiments were performed. Cell cycle, apoptosis assay and immunoblot analysis Cell cycle, apoptosis assay and immunoblot analysis were performed as previously described (47,48). Xenograft tumor mouse model assay A total of 5 × 10⁶ NB4 cells in a PBS/Matrigel solution (1:1) were subcutaneously injected into the flanks of 5–6-week-old BALB/c-nu/nu mice. After the cells formed palpable tumors (100–150 mm³), the mice were randomly assigned to groups treated with vehicle (5% DMSO, 5% Cremophor® EL, 90% saline) or ZYH005 (10 mg/kg/day, intravenously) according to the preliminary toxicity test. Two weeks later, the animals were sacrificed; their tumors were removed, weighed and photographed. The visceral organs and tumors were collected and fixed in 10% formalin for hematoxylin and eosin (H&E) or immunohistochemistry (IHC) analysis. Transplantation of ATRA-sensitive/-resistant leukemic mice and treatment Female FVB/N mice (6–7 weeks old) were sublethally irradiated (4.5 Gy), then 2 × 10⁵ viable leukemic cells isolated from spleens of leukemic hPML-RARα or mutant hPML-RARα transgenic mice were intravenously injected via the tail vein (41,49,50).
Four days after transplantation, the mice were randomly assigned to groups treated with vehicle (5% DMSO, 5% Cremophor® EL, 90% saline) or ZYH005 (10 mg/kg/day, intravenously). Normal mice (irradiated but not transplanted, no treatment) and mice treated with ATO (5 mg/kg/day, intraperitoneally) or ATRA (15 mg/kg/day, intraperitoneally, for ATRA-resistant APL model) were used as the controls. Cytological and histological analyses were performed as reported (41). Statistical analysis Comparisons between groups were performed using a standard two-tailed Student's t-test or one-way analysis of variance (ANOVA) followed by Tukey's post hoc test and stated in the figure legends. All experiments were repeated at least three times. The data are presented as the mean ± S.D., and P values < 0.05 were considered significant. RESULTS Selection of ZYH005 for subsequent experiments Alkaloids with N-phenylethyl phenanthridinone exhibited more potent cytotoxic activity (33). Therefore, we synthesized compounds with methoxyl, benzyl, phenylethyl, phenylpropyl and (4-methoxylphenyl) ethyl substituents at the hetero nitrogen atom of the phenanthridinone ring (ZYH001-ZYH005) (Supplemental Figure S1A). We preliminarily assessed their anti-proliferation effects on five cancer cell lines (HL60, SMMC-7721, A549, MCF-7, SW480), and found that ZYH005 inhibits the proliferation of all cancer cell lines at low concentrations after 48 h of treatment, especially the proliferation of the AML cell line HL60 (IC50 = 0.037 μM). Moreover, ZYH005 was more effective than the other N-alkyl derivatives and the positive control cisplatin (DDP) (Supplemental Figure S1B). Then, we focused on N-(4-methoxylphenyl) ethyl derivatives and synthesized compounds ZYH006–ZYH020 (Supplemental Figure S2A). Further investigation of the anti-proliferation effects of ZYH005–ZYH020 on HL60 cells showed that compounds ZYH006–ZYH020 exerted weaker inhibitory effects than ZYH005 (Supplemental Figure S2B). 
Therefore, ZYH005 was selected for subsequent pharmaceutical research experiments (Figure 1A, Supplementary methods). The 1H and 13C NMR spectral data for ZYH005 are shown in Supplemental Figure S3. Figure 1. ZYH005 treatment specifically inhibits the proliferation of APL and ATRA-resistant APL cells. (A) The synthesis process of 5-(4-methoxyphenethyl)-[1,3]dioxolo[4,5-j]phenanthridin-6(5H)-one (ZYH005). (B–D) The viability of cell lines was evaluated using an MTS kit after treatment with drugs for 24 h. The effects of ZYH005 on two immortalized normal human epithelial cell lines (NCM460 and HPDE6-C7) and ten leukemia cell lines are shown in B. The effects of ZYH005 and ATRA on ATRA-resistant cell lines are shown in C. Effects of ZYH005 on peripheral blood mononuclear cells isolated from blood samples of 3 healthy volunteers (PBMCs-V1, PBMCs-V2 and PBMCs-V3) are shown in D. Data are shown as the mean ± S.D. of three independent experiments; an unpaired two-tailed Student's t-test was used for statistics, **P < 0.01 compared to the control group (DMSO < 0.1%).
ZYH005 treatment selectively inhibits the proliferation of APL and ATRA-resistant APL cells To explore the anti-leukemia potential of ZYH005, we treated ten leukemia cell lines and two immortalized normal human epithelial cell lines with ZYH005 (0–0.16 μM) and then assessed their viability. As shown in Figure 1B, even after treatment for only 24 h, ZYH005 exerted significantly greater anti-proliferation effects on NB4 and HL60 cell lines than on the other cell lines. Furthermore, ZYH005 exerted minimal effects on the viability of the normal cell lines NCM460 and HPDE6-C7. The 24 h IC50 values for the NB4 and HL60 cell lines were 0.041 and 0.053 μM, respectively. We further assessed the effects of ZYH005 on ATRA-resistant cell lines. After 24 h of treatment, high ATRA concentrations (12.5–50 μM) had almost no effect on the proliferation of the NB4-LR2 and NB4-MR2 cell lines. In contrast, ZYH005 at concentrations of 0.04–0.06 μM inhibited the proliferation of these cell lines (Figure 1C). The effects of ZYH005 on peripheral blood mononuclear cells isolated from blood samples of 3 healthy volunteers (PBMCs-V1, PBMCs-V2, PBMCs-V3) were also assessed. Interestingly, the viability of PBMCs in the ZYH005-treated groups was comparable to that in the non-treated groups, even at ZYH005 concentrations up to 0.64 μM (Figure 1D). Induction of promyelocytic leukemic cell differentiation plays an important role in the response of APL to both ATRA and ATO treatments (24). Therefore, we evaluated the differentiation-inducing effects of ZYH005. However, unlike ATRA and ATO, ZYH005 did not induce APL cell differentiation (Supplemental Figure S4A and B).
These data demonstrate that ZYH005 is a novel molecule that is selectively cytotoxic to both APL and ATRA-resistant APL cell lines, has a lesser effect on other leukemia cell lines, and has a negligible cytotoxic effect on non-cancerous cell lines as well as PBMCs from healthy donors. ZYH005 intercalates double-stranded DNA To determine whether ZYH005 could directly interact with double-stranded DNA (dsDNA), single-molecule magnetic tweezers were firstly used to investigate the effects of ZYH005 on dsDNA mechanical properties. The binding of small ligands to dsDNA alters its structure, causing lengthening or unwinding of the dsDNA (51,52). Figure 2A shows the single-molecule stretching experiment system for measuring the force-extension curves of a 6618 bp dsDNA tethered between a coverslip and a paramagnetic bead, as detailed in the methods section. Force was applied to the DNA molecule through the bead by a pair of magnets. Extension changes were measured based on the bead height using the diffraction pattern of the bead image. Figure 2B shows typical force-height curves of a dsDNA molecule in the presence of different concentrations of ZYH005 in buffer change experiments. The force-height curves of dsDNA showed significant elongation from 20 to 60 pN after adding ZYH005, which was concentration-dependent, suggesting that the binding of ZYH005 increased the length of dsDNA as is typically observed for DNA intercalators (51,52). In addition to lengthening dsDNA, the binding of ZYH005 also changed the behavior of the DNA overstretching transition. The naked dsDNA showed sudden elongation (Figure 2B, black curve) at a force of ∼65 pN, representing the transition of B-form DNA to S-DNA (stretched DNA), or the so-called B-to-S overstretching transition (53). At a low concentration of ZYH005 (1 μM), the overstretching transition force increased, suggesting that the binding of ZYH005 stabilizes B-form DNA. 
Increased overstretching transition forces and stabilization of B-form DNA have also been observed with other DNA intercalators such as EB (52). Taken together, the ligand-induced dsDNA elongation and the altered overstretching force at a low concentration both suggest that ZYH005 binds to dsDNA as an intercalator. Figure 2. ZYH005 intercalates double-stranded DNA. (A) Binding of ZYH005 with dsDNA measured based on dsDNA stretching experiments. Schematic diagram of DNA stretching with magnetic tweezers. A 6618 bp dsDNA was tethered between a 2.8 μm-diameter paramagnetic bead and coverslip. (B) Typical DNA stretching curves in the presence of various ZYH005 concentrations. (C) Titration curves of dsDNA elongation measured at different forces were fit using the Hill equation. The elongation was measured from three independent measurements. (D) Dissociation constant dependent on force. (E) Effect of ZYH005 on DNA rotation-extension curves at low force (0.4 and 1.0 pN) and saturated ZYH005 concentration (100 μM). The 6573 bp DNA is torsionally constrained; when rotation = 0, the DNA is in a relaxed conformation. The insets depict the states of supercoiled DNA under different tensional and torsional constraints. (F) Low-energy binding conformation of ZYH005 bound to DNA fragments generated by virtual ligand docking.
To quantify the binding affinity of ZYH005 to dsDNA, titration curves of dsDNA elongation at different forces were fit using the Hill equation (Figure 2C), $$x_c - x_{\mathrm{naked\ DNA}} = \left( x_{\mathrm{saturate}} - x_{\mathrm{naked\ DNA}} \right) \frac{c^n}{K_d^n(F) + c^n}$$, where $$x_c - x_{\mathrm{naked\ DNA}}$$ is the elongation of dsDNA at ligand concentration c, $$x_{\mathrm{saturate}} - x_{\mathrm{naked\ DNA}}$$ is the saturated dsDNA elongation, Kd(F) is the dissociation constant of ZYH005 with DNA at different forces, and n = 1 is the Hill coefficient. The dissociation constant for intercalation of ZYH005 with DNA at zero force, Kd0, is estimated to be 24 μM (binding constant Ka = 1/Kd0 = 4.1 × 10⁴ M⁻¹) by fitting the force-dependent dissociation constant (Figure 2D) with the exponential function $$K_d(F) = K_{d0} \times e^{-F\Delta x / k_B T}$$, where kB is the Boltzmann constant and T is temperature. The fitting parameter Δx = 0.1 nm indicates the dsDNA elongation upon intercalation of a single ZYH005 molecule. To further analyze the effects of ZYH005 on supercoiled dsDNA and to determine whether the binding of ZYH005 causes dsDNA unwinding, we performed a supercoiling assay using torsionally constrained dsDNA and magnetic tweezers (Figure 2E). The preparation of torsionally constrained dsDNA and twisting of DNA by magnetic tweezers is detailed in the methods section.
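The Hill equation and the exponential force dependence of the dissociation constant given above can be evaluated numerically as a quick consistency check. The sketch below uses the reported fit values (Kd0 = 24 μM, Δx = 0.1 nm, n = 1). The temperature of 298 K is an assumption, since the source does not state it, and the function names `hill_elongation` and `kd_at_force` are hypothetical.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K


def hill_elongation(c, x_sat, kd, n=1.0):
    """Elongation (x_c - x_naked) at ligand concentration c.

    x_sat is the saturated elongation (x_saturate - x_naked);
    c and kd must share the same concentration unit.
    """
    return x_sat * c**n / (kd**n + c**n)


def kd_at_force(force_pn, kd0, dx_nm, temp_k=298.0):
    """K_d(F) = K_d0 * exp(-F * dx / (k_B * T)), with F in pN and dx in nm.

    temp_k = 298 K is an assumed value, not taken from the source.
    """
    kbt_pn_nm = K_B * temp_k * 1e21  # convert J to pN*nm (1 J = 1e21 pN*nm)
    return kd0 * math.exp(-force_pn * dx_nm / kbt_pn_nm)


kd0_um = 24.0   # reported zero-force dissociation constant, uM
dx_nm = 0.1     # reported elongation per intercalation event, nm

# Binding constant K_a = 1 / K_d0, expressed in M^-1.
ka_per_molar = 1.0 / (kd0_um * 1e-6)
print(f"K_a = {ka_per_molar:.3g} M^-1")

# Dissociation constant at a few illustrative stretching forces.
for force in (0.0, 20.0, 40.0, 60.0):
    print(f"F = {force:4.0f} pN -> K_d = {kd_at_force(force, kd0_um, dx_nm):.2f} uM")
```

With n = 1 the Hill equation reduces to a simple one-site binding isotherm, so half of the saturated elongation is reached when the ligand concentration equals Kd(F). The exponential term lowers Kd as force increases, which matches the expectation that stretching favors the elongated, intercalated state.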
In the presence of a saturating concentration of ZYH005 (100 μM), the evident increase in DNA extension at 0.6 pN under negative rotations indicates that binding of ZYH005 stiffens DNA in the negatively supercoiled conformation, suggesting that ZYH005 binds to dsDNA in the low-force (∼0.6 pN) regime. The shift of the buckling transition at 1 pN also indicates that ZYH005 directly interacts with dsDNA. In contrast to other intercalative compounds such as YOYO-1 and EB, the peak of the supercoil curve did not show a significant shift, indicating that ZYH005 binds to dsDNA but does not induce significant unwinding of the dsDNA. Finally, molecular docking simulations were carried out under default conditions to elucidate the interaction of ZYH005 with DNA fragments. Docking predicted a favorable binding affinity for ZYH005, with an ICM score of –30.11. Furthermore, the lowest-energy binding conformation is shown in Figure 2F; the docked pose was stabilized electronically by π–π stacking between the aromatic group and the side chains of the DNA base pairs. The results obtained from molecular docking studies were consistent with the results of DNA binding studies, indicating that ZYH005 is a DNA intercalator. ZYH005 treatment induces DNA damage and cell cycle arrest in APL and ATRA-resistant APL cells DNA intercalators with anticancer effects lead to structural changes in DNA and subsequently cause DNA damage (54,55). Double-stranded DNA breaks (DSBs) are probably the most detrimental of the many types of DNA damage occurring within the cell (56). Accordingly, we detected γH2AX foci, which are considered indicative of an early response to DSBs (57). As shown in Figure 3A and B, NB4 and HL60 cells displayed significant levels of γH2AX foci after treatment with 0.05 μM ZYH005 for 6 h. By immunoblot analysis, we found that ZYH005 treatment increased the levels of γH2AX in both APL and ATRA-resistant APL cell lines (Figure 3C).
We further observed that ZYH005 treatment increased the expression levels of γH2AX in a time-dependent manner (Figure 3D). The cell cycle is a critical regulator of the processes of cell proliferation and division after DNA damage occurs. High levels of DNA damage induce cell-cycle arrest to prevent the transmission of damaged DNA during mitosis. Therefore, we analyzed the cell cycle distribution in cell lines treated with or without low-dose ZYH005 for 24 h and found that ZYH005 induced a significant G2/M arrest in both APL and ATRA-resistant APL cells (Figure 3E, Supplementary Figure S5A).

Figure 3. ZYH005 treatment induces DNA damage and cell cycle arrest in APL and ATRA-resistant APL cell lines. (A) Immunofluorescence microscopy analysis of γH2AX foci in NB4 and HL60 cells after treatment with 0.05 μM ZYH005 for 6 h. (B) Quantification of the percentage of cells with ≥ 6 γH2AX foci. At least 100 cells per group were examined. (C, D) Immunoblot analysis of γH2AX expression in cell lines after treatment with ZYH005 at the indicated concentrations and times. (E) ZYH005 induces G2/M arrest in APL cells. After treatment with DMSO (<0.1%) and ZYH005 for 24 h, cells were harvested and fixed, and cell cycle distribution was analyzed by PI staining. (A, C and D) The data are representative of three independent experiments. (B and E) The data are expressed as the mean ± S.D. of three independent experiments; **P < 0.01, ***P < 0.001, unpaired two-tailed Student's t-test. β-Actin was used as a loading control in immunoblot analysis.

ZYH005 treatment induces apoptotic cell death in APL and ATRA-resistant APL cells

Accumulation of damaged DNA and cell cycle arrest are activators of apoptotic signals (58). We also noted cell shrinkage and chromatin condensation in APL cells treated with ZYH005, and these findings prompted us to investigate whether ZYH005 could induce apoptosis in these cells. Therefore, we quantified apoptotic cells by flow cytometry. After 24 h of treatment, ZYH005 at 0.025 μM was sufficient to induce apoptosis in NB4 cells, and 0.05 μM ZYH005 induced apoptosis in 57.7% and 49.9% (on average) of NB4 and HL60 cells, respectively (Figure 4A, Supplementary Figure S5B). Next, we evaluated the effects of ZYH005 on the expression of marker proteins of apoptosis (59). As shown in Figure 4B, ZYH005 treatment for 24 h altered the expression levels of Bcl-2, Bax, Bak, cleaved-caspase-3 and PARP in a dose-dependent manner. In addition, treatment with 0.05 μM ZYH005 altered the expression levels of cleaved-caspase-3 and PARP within 6 h, indicating that ZYH005-induced apoptosis is also a time-dependent phenomenon (Figure 4C). We also found that ZYH005 treatment induced proteolytic cleavage of PKCδ, a cytoplasmic caspase-3 substrate, into a 41 kDa catalytic fragment in a time- and dose-dependent manner (Supplementary Figure S5C).
Moreover, 0.05 μM ZYH005 induced apoptosis in ATRA-resistant APL cell lines after 24 h of treatment, whereas treatment with a 200-fold higher concentration of ATRA (10 μM) did not induce apoptosis (Figure 4D). Meanwhile, immunoblot analysis showed that ZYH005 treatment altered the expression of apoptosis-related proteins in ATRA-resistant APL cell lines (Figure 4E). We compared the apoptosis-inducing activity of ZYH005 with that of ATO and other chemotherapeutics in NB4 and NB4-LR2 cells. As shown in Figure 4F, ATO, 5-fluorouracil, cisplatin and doxorubicin did not exhibit apoptosis-inducing activity at the indicated concentration and time (0.05 μM, 24 h). In contrast, the apoptosis-inducing effect of ZYH005 was close to that of idarubicin, the most effective anthracycline in both in vitro assays (60) and clinical trials (61). These results strongly support ZYH005 as an effective inducer of apoptotic cell death in both APL and ATRA-resistant APL cell lines.

Figure 4. ZYH005 treatment induces apoptotic cell death in APL and ATRA-resistant APL cell lines. (A) NB4 and HL60 cells were treated with ZYH005 for 24 h, and then cell apoptosis was determined. (B) Immunoblot analysis of the expression of apoptosis-related proteins in NB4 and HL60 cells after 24 h of ZYH005 treatment. (C) Immunoblot analysis of the expression of PARP and cleaved-caspase-3 in NB4 and HL60 cells after treatment with 0.05 μM ZYH005 for different times. (D) ATRA-resistant cells were treated with 0.05 μM ZYH005 and 10 μM ATRA for 24 h, and then cell apoptosis was determined. (E) Immunoblot analysis of apoptosis-related proteins in ATRA-resistant cells after ZYH005 treatment for 24 h. (F) NB4 and NB4-LR2 cells were treated with ZYH005, arsenic trioxide (ATO), 5-fluorouracil (5-Fu), cisplatin (DDP), doxorubicin (DOX) and idarubicin (IDA) at 0.05 μM for 24 h, and then cell apoptosis was determined.
(A, D and F) For DOX and IDA treatment, cell apoptosis was determined by an Annexin V-APC/7-AAD staining kit; for other treatments, cell apoptosis was determined by an Annexin V-FITC/PI staining kit. Data are expressed as the mean ± S.D. of three independent experiments. *P < 0.05, **P < 0.01, ***P < 0.001, ns, not significant compared to ZYH005-treated group; unpaired two-tailed Student's t-test in A and D; one-way ANOVA analysis followed by Tukey's post hoc test in F. β-Actin was used as a loading control in immunoblot analysis.

ZYH005 treatment induces caspase-dependent PML-RARα degradation

The PML-RARα fusion protein determines not only the phenotype of APL but also the response of APL to treatment (23), and several studies have shown that PML-RARα expression induces strong resistance to apoptosis (62). Therefore, we analyzed the effects of ZYH005 on PML-RARα. As shown in Figure 5A, the PML-RARα levels were significantly decreased in NB4, NB4-LR2 and NB4-MR2 cells after ZYH005 treatment for 24 h. Caspase family members have been recognized as key participants in apoptosis and play important roles in PML-RARα degradation (63,64). Therefore, we treated cells with ZYH005 in the presence or absence of the caspase inhibitor Z-VAD-FMK. As shown in Figure 5B, in both APL and ATRA-resistant APL cells, the ZYH005-induced apoptosis was significantly attenuated by the addition of Z-VAD-FMK. Additionally, the immunoblot results showed that Z-VAD-FMK blocked ZYH005-induced PML-RARα degradation and alterations of apoptosis-related proteins such as Bcl-2, Bax and cleaved-caspase-3; however, the changes in proteins related to the DNA damage response, such as PARP and γH2AX, were only partially blocked by Z-VAD-FMK (Figure 5C), indicating that ZYH005-induced DNA damage is an event that occurs earlier than PML-RARα degradation and apoptosis.

Figure 5. ZYH005 induces caspase-dependent PML-RARα degradation and apoptosis. (A) Immunoblot analysis of PML-RARα expression in NB4 and ATRA-resistant cells treated with ZYH005 for 24 h. (B and C) NB4 and ATRA-resistant cell lines were treated with 80 μM Z-VAD-FMK for 1 h, and then 0.05 μM ZYH005 was added. After 24 h, cell apoptosis was determined by an Annexin V-FITC and PI staining kit (B). Data are expressed as the mean ± S.D. of three independent experiments. *P < 0.05, **P < 0.01, ***P < 0.001, unpaired two-tailed Student's t-test. The expression levels of related proteins were evaluated by immunoblot analysis (C). GAPDH was used as a loading control.

A lysosome inhibitor (chloroquine), a proteasome inhibitor (MG132) and NB4 cells in which RNF4 (the E3 ubiquitin ligase of PML) was silenced by shRNA (65,66) were used to determine whether the autophagy-lysosome and ubiquitin-proteasome pathways contribute to ZYH005-induced PML-RARα degradation. None of these interventions attenuated the apoptosis and PML-RARα degradation induced by ZYH005 (Supplementary Figure S6A–E). Moreover, Z-VAD-FMK could block apoptosis and PML-RARα degradation in RNF4 shRNA-transfected NB4 cells (Supplementary Figure S6F and G). These results suggest that ZYH005-induced DNA damage activates caspase-3 and then induces caspase-dependent PML-RARα degradation and apoptosis.

ZYH005 treatment induces leukemia regression in APL and ATRA-resistant APL mice

To explore the anti-leukemia effect of ZYH005 in vivo, we first established NB4 xenograft mouse models and treated the mice with vehicle or ZYH005 (intravenously, 10 mg/kg/day according to the preliminary test).
Two weeks later, we noted significant differences in the tumor weights between the ZYH005- and vehicle-treated mice, with a tumor inhibition rate of 62.3% (P = 0.0048) (Supplementary Figure S7A). Expression of Ki67, which is an important marker of cell proliferation (67), was significantly lower in tumor tissues following ZYH005 treatment (Supplementary Figure S7B). Moreover, ZYH005 injection did not affect the body weights of the tested animals (Supplementary Figure S7C). H&E staining of visceral organs, including heart, liver, spleen, lung and kidney, was also comparable between ZYH005-treated and vehicle-treated mice (Supplementary Figure S7D), indicating minimal toxicity of ZYH005 in vivo, which is a precondition for a clinical candidate agent. Next, we transplanted leukemic cells isolated from spleens of leukemic hPML-RARα or mutant hPML-RARα transgenic mice intravenously into sub-lethally irradiated isogenic FVB/N recipients to generate APL mice or ATRA-resistant APL mice (41,49,50). On day 4 after transplantation, mice were treated intravenously with vehicle, ZYH005, ATO (for APL mice) or ATRA (for ATRA-resistant APL mice). All mice were sacrificed for histological analyses when the first vehicle-treated mouse was moribund. In both APL and ATRA-resistant APL leukemic mouse models, Wright-Giemsa staining showed that mice in the ZYH005-treated group had near-normal peripheral blood (PB), and leukemic blasts were rarely observed in their bone marrow (BM); histological examination with H&E staining revealed that leukemic cell infiltration into spleens and livers was inhibited by ZYH005 (Figure 6A and B). Moreover, vehicle-treated mice exhibited enlarged spleens and livers, which could be alleviated by ZYH005 treatment (Figure 6C and Supplementary Figure S8A). ZYH005 was also found to facilitate recovery of swollen spleens and livers, even in ATRA-resistant leukemic mice (Figure 6D and Supplementary Figure S8B).
Finally, the effect of ZYH005 on the survival of leukemic mice was also evaluated using the same treatment procedure as described above but using at least six mice for each treatment group. All vehicle-treated leukemic mice developed leukemia within 24–28 days post-transplantation and died within the following 4 days. When the first vehicle-treated mouse died, we stopped the treatments. Both APL and ATRA-resistant APL mice showed longer survival following treatment with ZYH005 (Figure 6E and F). These results indicate that ZYH005 has significant therapeutic efficacy against APL and ATRA-resistant APL in vivo.

Figure 6. Effects of ZYH005 on APL and ATRA-resistant APL mouse models. On day 4 after transplantation, APL and ATRA-resistant APL mice were treated with vehicle (5% DMSO, 5% Cremophor® EL, 90% Saline) or ZYH005 (10 mg/kg/day, intravenously). Normal mice (irradiated but not transplanted, no treatment) and mice treated with ATO (5 mg/kg/day, intraperitoneally) or ATRA (15 mg/kg/day, intraperitoneally, for the ATRA-resistant APL model) were used as the controls. (A–D) When the first vehicle-treated leukemic mouse was moribund, all mice were sacrificed and analyzed. Peripheral blood (PB) and bone marrow (BM) were subjected to Wright-Giemsa staining. Invasion of the spleen and liver by leukemic cells was detected by hematoxylin and eosin (H&E) staining; the arrowheads indicate infiltrating cells (A and B). The spleens and livers were weighed (C and D); *P < 0.05, **P < 0.01, ***P < 0.001, unpaired two-tailed Student's t-test. (E–F) Kaplan–Meier survival curves of APL (E) and ATRA-resistant APL (F) mice. The numbers of mice are indicated in parentheses, and ***P < 0.001 compared to the vehicle-treated mice. Every experiment was repeated at least three times with similar results. (G) Diagram of signaling pathways regulated by ZYH005 in APL and ATRA-resistant APL.

Myeloid cells positive for four markers (CD34+, c-kit+, FcγRIII/II+, Gr1int) in APL mice have been identified as leukemia initiating cells (LIC), and PML-RARα degradation triggers LIC clearance (68,69). Therefore, we analyzed these cells in APL mice and found that their numbers were decreased in the BM of ZYH005-treated mice, a result similar to that noted in ATO-treated mice (Supplementary Figure S8C), indicating that ZYH005 can eliminate LIC in leukemic mice, which mirrors PML-RARα degradation in vivo.
In addition, we conducted immunohistochemical analysis of cleaved caspase-3 expression in the spleens of APL and ATRA-resistant APL mice, and the results showed that ZYH005 treatment induced apoptosis in the APL mice similarly to ATO treatment (70) (Supplementary Figure S8D). We also performed immunoblot analyses using NB4 xenograft tumor lysate, and the results showed that ZYH005 treatment activated caspase-3 and induced PML-RARα degradation in vivo (Supplementary Figure S8E). Collectively, these data provide evidence for the mechanism of action of ZYH005: intercalative binding to DNA subsequently triggers DNA double-strand breaks and accumulation of DNA damage, induces G2/M cell cycle arrest and caspase-3 activation, and ultimately causes PML-RARα degradation and apoptosis in both APL and ATRA-resistant APL models (Figure 6G).

DISCUSSION

In this study, we report that ZYH005, a newly synthesized phenanthridinone alkaloid, has specific and effective cytotoxic effects in APL and ATRA-resistant APL models. In contrast, ZYH005 showed negligible toxic effects on PBMCs from healthy volunteers as well as on non-cancerous cells. Single-molecule force-spectroscopy experiments have been used to characterize the binding mode of drug-DNA interactions (51), the binding affinity (52), binding kinetics (71) and the effects of drugs on supercoiled dsDNA (72). Using single-molecule magnetic tweezers, we show that ZYH005 binding increases DNA extension and twists DNA, suggesting that ZYH005 is a DNA intercalator. It should be noted that ZYH005 is a moderate intercalator compared with classical intercalators such as YOYO-1 and EB, as ZYH005 causes only moderate changes in the extension and twist of DNA at low force. We further demonstrate that ZYH005 exerts its effects by triggering DNA damage, cell cycle arrest, caspase-dependent degradation of PML-RARα and apoptosis.
These findings highlight the potential of ZYH005 for APL therapy and for overcoming ATRA resistance. Cells respond to DNA damage through DDR signalling networks, which determine cell fate and promote not only DNA repair and cell survival but also cell cycle arrest and cell death, thus determining the outcome of cancer therapy with DNA-binding drugs (1). Therefore, defects in the DDR have been exploited therapeutically in the treatment of cancer with DNA-binding chemotherapies and PARP inhibitors (2,73). Compromised DDR has been found in APL (25,27,28), indicating that APL cells may be sensitive to DNA-binding drugs. Consistent with this, before the introduction of ATRA and ATO for APL, the first-line therapy for APL was DNA intercalators (daunorubicin, doxorubicin, idarubicin) plus the DNA synthesis inhibitor cytarabine (Ara-C), and the complete remission rate was as high as 75% (23,74). To date, ATRA in combination with these DNA-binding chemotherapies remains standard practice for APL treatment (23,31,74). Recently, Esposito et al. reported that APL is extremely sensitive to DDR-targeting drugs (PARP inhibitors) due to its compromised DDR (28). These findings suggest that impaired DDR not only determines APL pathogenesis but also plays an important role in the therapy response of APL. Moreover, the molecular mechanism underlying ATRA resistance is genetic mutation in the RARα ligand-binding domain (75), which is not related to the DDR, and impaired DDR is still observed in ATRA-resistant cells (26). ZYH005 was identified on the basis of its specific cytotoxic activity against APL and ATRA-resistant APL cells. Moreover, in ZYH005-treated APL cells, γH2AX was increased sharply as early as 1 h and cleaved PARP was detectable at 3 h, whereas caspase-3 was only weakly activated at 6 h, indicating that DNA damage is the earliest event in the cell after ZYH005 treatment.
These findings suggest that impairment of the DDR determines the response of APL and ATRA-resistant APL cells to ZYH005 and point to a new strategy to overcome ATRA resistance with DNA-binding drugs. Several therapeutics have been tested to overcome ATRA resistance, such as ATO, Am80 (76) and histone deacetylase inhibitors (77); among these, ATO is considered the most active agent (78). The PML-RARα fusion protein is clearly the central player driving APL, and it also induces strong resistance to apoptosis in APL cells (62). ATO directly targets the PML component of PML-RARα and induces its degradation, which leads to apoptosis of APL cells. Similarly, ZYH005 treatment induces PML-RARα degradation in APL and ATRA-resistant APL cells. The difference is that ATO degrades PML-RARα mainly through the ubiquitin-proteasome system (79,80), while ZYH005 induces caspase-dependent PML-RARα degradation. Moreover, ZYH005 induces apoptosis in both APL and ATRA-resistant APL cells, and its apoptosis-inducing effect is more potent than those of ATO and other widely used anticancer drugs such as 5-fluorouracil, cisplatin and doxorubicin. On the other hand, the caspase inhibitor Z-VAD-FMK blocked ZYH005-induced PML-RARα degradation and apoptosis, but the changes in proteins related to DNA damage, such as PARP and γH2AX, were only partially blocked by Z-VAD-FMK. Therefore, we conclude that ZYH005-induced DNA damage activates caspase-3, and that activated caspase-3 cleaves the PML-RARα fusion protein, which subsequently facilitates apoptosis of APL and ATRA-resistant APL cells. The precise mechanism of ZYH005-induced degradation of PML-RARα remains unclear; nevertheless, the apoptosis-inducing effect of ZYH005 offers a potent therapeutic advantage, and its role in overcoming ATRA resistance warrants further validation in patient-derived samples in future studies.
Although ATO is used as the best salvage therapy agent for ATRA-resistant patients, patients who subsequently become resistant to ATO have also been reported (23,81). Furthermore, ATRA is a polyene compound whose sensitivity to light and heat may render it unstable, and the use of ATO is associated with many acute adverse effects and clinical complications, such as differentiation syndrome (7–35%), pseudotumour cerebri, hyperleukocytosis, leukocytosis (32–73%) and hepatic toxicity (74,82). Notably, ZYH005 possesses some advantages over the above agents, as the drug has a simple, stable structure and an inexpensive synthesis procedure, and exerts negligible toxicity in vitro and in vivo. Additionally, we observed that of the series of N-(4-methoxylphenyl) ethyl derivatives, ZYH006 and ZYH007 showed activity similar to that of ZYH005 in the preliminary anticancer evaluation, indicating that the heteroatom substituent in the core may determine the anticancer activity of these compounds; these findings, as well as the potential for intercalation into double-stranded DNA, provide a valuable resource for the field of biologically active phenanthridinone alkaloids. In conclusion, our findings indicate that the novel compound ZYH005 may be a promising candidate drug for newly diagnosed APL and relapsed APL with ATRA resistance. The potential therapeutic value of ZYH005, as well as that of some of its analogues, in APL and other cancers warrants further study.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We are grateful to Professor Yingli Wu of Shanghai Jiao Tong University for generously providing cells for this study. We also thank Professor Hudan Liu of Wuhan University and Professor Jie Yan of National University of Singapore for discussions and comments.

FUNDING

National Natural Science Foundation of China [81503305 to Q.T.
and 21708009 to H.Y.]; China Postdoctoral Science Foundation funded project [2015M582227 to Q.T.]; Program for Changjiang Scholars of Ministry of Education of the People's Republic of China [T2016088 to Y.Z.]; National Natural Science Foundation for Distinguished Young Scholars [81725021 to Y.Z.]; Innovative Research Groups of the National Natural Science Foundation of China [81721005 to Y.Z.]; Academic Frontier Youth Team of HUST; Integrated Innovative Team for Major Human Diseases Program of Tongji Medical College, HUST; Fundamental Research Fund for the Central Universities [2017KFYXJJ153 to H.Y. and 2017KFYXJJ152 to Z.L.]; Singapore Ministry of Education Academic Research Fund Tier 3 [MOE 2012-T3-1-001 to J.Y.]. Funding for open access charge: National Natural Science Foundation of China [81503305 to Q.T.].

Conflict of interest statement. None declared.

REFERENCES

1. Roos W.P., Thomas A.D., Kaina B. DNA damage and the balance between survival and death in cancer biology. Nat. Rev. Cancer. 2016; 16: 20–33.
2. Pearl L.H., Schierz A.C., Ward S.E., Al-Lazikani B., Pearl F.M. Therapeutic opportunities within the DNA damage response. Nat. Rev. Cancer. 2015; 15: 166–180.
3. Puigvert J., Sanjiv K., Helleday T. Targeting DNA repair, DNA metabolism and replication stress as anti-cancer strategies. FEBS J. 2016; 283: 232–245.
4. Palchaudhuri R., Hergenrother P.J. DNA as a target for anticancer compounds: methods to determine the mode of binding and the mechanism of action. Curr. Opin. Biotechnol. 2007; 18: 497–503.
5. Sirajuddin M., Ali S., Badshah A. Drug–DNA interactions and their study by UV–visible, fluorescence spectroscopies and cyclic voltametry. J. Photochem. Photobiol. B: Biol. 2013; 124: 1–19.
6. Wheate N.J., Brodie C.R., Collins J.G., Kemp S., Aldrich-Wright J.R. DNA intercalators in cancer therapy: organic and inorganic drugs and their spectroscopic tools of analysis. Mini Rev. Med. Chem. 2007; 7: 627–648.
7. Ferguson L.R., Denny W.A. Genotoxicity of non-covalent interactions: DNA intercalators. Mut. Res./Fundam. Mol. Mech. Mutagen. 2007; 623: 14–23.
8. Tumir L.-M., Stojković M.R., Piantanida I. Come-back of phenanthridine and phenanthridinium derivatives in the 21st century. Beilstein J. Org. Chem. 2014; 10: 2930.
9. Nair J.J., van Staden J., Bastida J. Apoptosis-inducing effects of amaryllidaceae alkaloids. Curr. Med. Chem. 2016; 23: 161–185.
10. Wu Z.-p., Chen Y., Xia B., Wang M., Dong Y.-F., Feng X. Two novel ceramides with a phytosphingolipid and a tertiary amide structure from Zephyranthes candida. Lipids. 2009; 44: 63–70.
11. McNulty J., Thorat A., Vurgun N., Nair J.J., Makaji E., Crankshaw D.J., Holloway A.C., Pandey S. Human cytochrome P450 liability studies of trans-dihydronarciclasine: a readily available, potent, and selective cancer cell growth inhibitor. J. Nat. Prod. 2010; 74: 106–108.
12. Griffin C., Hamm C., McNulty J., Pandey S. Pancratistatin induces apoptosis in clinical leukemia samples with minimal effect on non-cancerous peripheral blood mononuclear cells. Cancer Cell Int. 2010; 10: 6.
13. Van Goietsenoven G., Andolfi A., Lallemand B., Cimmino A., Lamoral-Theys D., Gras T., Abou-Donia A., Dubois J., Lefranc F., Mathieu V. Amaryllidaceae alkaloids belonging to different structural subgroups display activity against apoptosis-resistant cancer cells. J. Nat. Prod. 2010; 73: 1223–1227.
14. Van Goietsenoven G., Mathieu V., Lefranc F., Kornienko A., Evidente A., Kiss R. Narciclasine as well as other amaryllidaceae isocarbostyrils are promising GTP-ase targeting agents against brain cancers. Med. Res. Rev. 2013; 33: 439–455.
15. Luchetti G., Johnston R., Mathieu V., Lefranc F., Hayden K., Andolfi A., Lamoral-Theys D., Reisenauer M.R., Champion C., Pelly S.C. Bulbispermine: a crinine-type amaryllidaceae alkaloid exhibiting cytostatic activity toward apoptosis-resistant glioma cells. ChemMedChem. 2012; 7: 815–822.
16. Ma D., Pignanelli C., Tarade D., Gilbert T., Noel M., Mansour F., Adams S., Dowhayko A., Stokes K., Vshyvenko S. Cancer cell mitochondria targeting by pancratistatin analogs is dependent on functional complex II and III. Scientific Rep. 2017; 7: 42957.
17. Yamamoto H., Mukoyoshi K., Hattori K. 2003; Google Patents.
18. Wahlberg E., Karlberg T., Kouznetsova E., Markova N., Macchiarulo A., Thorsell A.-G., Pol E., Frostell Å., Ekblad T., Öncü D. Family-wide chemical profiling and structural analysis of PARP and tankyrase inhibitors. Nat. Biotechnol. 2012; 30: 283–288.
19. Li T.-K., Houghton P.J., Desai S.D., Daroui P., Liu A.A., Hars E.S., Ruchelman A.L., LaVoie E.J., Liu L.F. Characterization of ARC-111 as a novel topoisomerase I-targeting anticancer drug. Cancer Res. 2003; 63: 8400–8407.
20. Zhu S., Ruchelman A.L., Zhou N., Liu A., Liu L.F., LaVoie E.J. 6-Substituted 6H-dibenzo [c, h][2, 6] naphthyridin-5-ones: reversed lactam analogues of ARC-111 with potent topoisomerase I-targeting activity and cytotoxicity. Bioorg. Med. Chem. 2006; 14: 3131–3143.
21. Kornienko A., Evidente A. Chemistry, biology, and medicinal potential of narciclasine and its congeners. Chem. Rev. 2008; 108: 1982–2014.
22. Griffin C., Karnik A., McNulty J., Pandey S. Pancratistatin selectively targets cancer cell mitochondria and reduces growth of human colon tumor xenografts. Mol. Cancer Ther. 2011; 10: 57–68.
23. Coombs C., Tavakkoli M., Tallman M. Acute promyelocytic leukemia: where did we start, where are we now, and the future. Blood Cancer J. 2015; 5: e304.
24. Li J., Zhu H., Hu J., Mi J., Chen S., Chen Z., Wang Z. Progress in the treatment of acute promyelocytic leukemia: optimization and obstruction. Int. J. Hematol. 2014; 100: 38–50.
25. Voisset E., Moravcsik E., Stratford E.W., Jaye A., Palgrave C.J., Hills R.K., Salomoni P., Kogan S.C., Solomon E., Grimwade D. Pml Nuclear Body Disruption Cooperates in APL Pathogenesis, Impacting DNA Damage Repair Pathways. Blood. 2016; 128: 742.
26. Di Masi A., Cilli D., Berardinelli F., Talarico A., Pallavicini I., Pennisi R., Leone S., Antoccia A., Noguera N., Lo-Coco F. PML nuclear body disruption impairs DNA double-strand break sensing and repair in APL. Cell Death Dis. 2016; 7: e2308.
27. Casorelli I., Tenedini E., Tagliafico E., Blasi M., Giuliani A., Crescenzi M., Pelosi E., Testa U., Peschle C., Mele L. Identification of a molecular signature for leukemic promyelocytes and their normal counterparts: Focus on DNA repair genes. Leukemia. 2006; 20: 1978.
28. Esposito M.T., Zhao L., Fung T.K., Rane J.K., Wilson A., Martin N., Gil J., Leung A.Y., Ashworth A., So C.W.E. Synthetic lethal targeting of oncogenic transcription factors in acute leukemia by PARP inhibitors. Nat. Med. 2015; 21: 1481–1490.
29. Griffin J.D. Blood's 70th anniversary: arsenic—from poison pill to magic bullet. Blood. 2016; 127: 1729.
30. Nichol J.N., Garnier N., Miller W.H. Triple A therapy: the molecular underpinnings of the unique sensitivity of leukemic promyelocytes to anthracyclines, all-trans-retinoic acid and arsenic trioxide. Best Pract. Res. Clin. Haematol. 2014; 27: 19–31.
31. Norsworthy K.J., Altman J.K. Optimal treatment strategies for high-risk acute promyelocytic leukemia. Curr. Opin. Hematol. 2016; 23: 127–136.
32. Altman J.K., Rademaker A., Cull E., Weitner B.B., Ofran Y., Rosenblat T.L., Haidau A., Park J.H., Ram S.L., Orsini J.M. Administration of ATRA to newly diagnosed patients with acute promyelocytic leukemia is delayed contributing to early hemorrhagic death. Leuk. Res. 2013; 37: 1004–1009.
33. Luo Z., Wang F., Zhang J., Li X., Zhang M., Hao X., Xue Y., Li Y., Horgen F.D., Yao G. Cytotoxic alkaloids from the whole plants of Zephyranthes candida. J. Nat. Prod. 2012; 75: 2113–2120.
34. Zhan G., Zhou J., Liu R., Liu T., Guo G., Wang J., Xiang M., Xue Y., Luo Z., Zhang Y. Galanthamine, plicamine, and secoplicamine alkaloids from Zephyranthes candida and their anti-acetylcholinesterase and anti-inflammatory activities. J. Nat. Prod. 2016; 79: 760–766.
35. Zhan G., Qu X., Liu J., Tong Q., Zhou J., Sun B., Yao G. Zephycandidine A, the first naturally occurring imidazo [1, 2-f] phenanthridine alkaloid from Zephyranthes candida, exhibits significant anti-tumor and anti-acetylcholinesterase activities. Scientific Rep. 2016; 6: 33990.
36. Guo G., Yao G., Zhan G., Hu Y., Yue M., Cheng L., Liu Y., Ye Q., Qing G., Zhang Y. N-methylhemeanthidine chloride, a novel Amaryllidaceae alkaloid, inhibits pancreatic cancer cell proliferation via down-regulating AKT activation. Toxicol. Appl. Pharmacol. 2014; 280: 475–483.
37. Ye Q., Jiang J., Zhan G., Yan W., Huang L., Hu Y., Su H., Tong Q., Yue M., Li H. Small molecule activation of NOTCH signaling inhibits acute myeloid leukemia. Scientific Rep. 2016; 6: 26510.
38. Yuan M., Chen L., Wang J., Chen S., Wang K., Xue Y., Yao G., Luo Z., Zhang Y. Transition-metal-free synthesis of phenanthridinones from biaryl-2-oxamic acid under radical conditions. Org. Lett. 2015; 17: 346–349.
39. Rosenauer A., Raelson J.V., Nervi C., Eydoux P., DeBlasio A., Miller W.J. Alterations in expression, binding to ligand and DNA, and transcriptional activity of rearranged and wild-type retinoid receptors in retinoid-resistant acute promyelocytic leukemia cell lines. Blood. 1996; 88: 2671–2682.
40. Gu Z.-M., Wu Y.-L., Zhou M.-Y., Liu C.-X., Xu H.-Z., Yan H., Zhao Y., Huang Y., Sun H.-D., Chen G.-Q. Pharicin B stabilizes retinoic acid receptor-α and presents synergistic differentiation induction with ATRA in myeloid leukemic cells. Blood. 2010; 116: 5289–5297.
41. Liu C.-X., Yin Q.-Q., Zhou H.-C., Wu Y.-L., Pu J.-X., Xia L., Liu W., Huang X., Jiang T., Wu M.-X. Adenanthin targets peroxiredoxin I and II to induce differentiation of leukemic cells. Nat. Chem. Biol. 2012; 8: 486–493.
42. Weng A.P., Millholland J.M., Yashiro-Ohtani Y., Arcangeli M.L., Lau A., Wai C., del Bianco C., Rodriguez C.G., Sai H., Tobias J. c-Myc is an important direct target of Notch1 in T-cell acute lymphoblastic leukemia/lymphoma. Genes Dev. 2006; 20: 2096–2109.
43. Chen H., Fu H., Zhu X., Cong P., Nakamura F., Yan J. Improved high-force magnetic tweezers for stretching and refolding of proteins and short DNA. Biophys. J. 2011; 100: 517–523.
44.
Zhao X., Peter S., Droge P., Yan J. Oncofetal HMGA2 effectively curbs unconstrained (+) and (-) DNA supercoiling. Scientific Rep. 2017; 7: 8440. Google Scholar CrossRef Search ADS 45. Totrov M., Abagyan R. Flexible protein–ligand docking by global energy optimization in internal coordinates. Proteins: Struct. Funct. Bioinformatics . 1997; 29: 215– 220. Google Scholar CrossRef Search ADS 46. Hopcroft N.H., Brogden A.L., Searcey M., Cardin C.J. X-ray crystallographic study of DNA duplex cross-linking: simultaneous binding to two d (CGTACG) 2 molecules by a bis (9-aminoacridine-4-carboxamide) derivative. Nucleic Acids Res. 2006; 34: 6663– 6672. Google Scholar CrossRef Search ADS PubMed 47. Zhu H., Chen C., Tong Q., Yang J., Wei G., Xue Y., Wang J., Luo Z., Zhang Y. Asperflavipine A: a cytochalasan heterotetramer uniquely defined by a highly complex tetradecacyclic ring system from Aspergillus flavipes QCS12. Angew. Chem. 2017; 129: 5326– 5330. Google Scholar CrossRef Search ADS 48. Zhu H., Chen C., Xue Y., Tong Q., Li X.N., Chen X., Wang J., Yao G., Luo Z., Zhang Y. Asperchalasine A, a cytochalasan dimer with an unprecedented decacyclic ring system, from Aspergillus flavipes. Angew. Chem. Int. Ed. 2015; 54: 13374– 13378. Google Scholar CrossRef Search ADS 49. Brown D., Kogan S., Lagasse E., Weissman I., Alcalay M., Pelicci P.G., Atwater S., Bishop J.M. A PMLRARα transgene initiates murine acute promyelocytic leukemia. Proc. Natl. Acad. Sci. U.S.A. 1997; 94: 2551– 2556. Google Scholar CrossRef Search ADS PubMed 50. Kogan S.C., Hong S.-h., Shultz D.B., Privalsky M.L., Bishop J.M. Leukemia initiated by PMLRARα: the PML domain plays a critical role while retinoic acid–mediated transactivation is dispensable. Blood . 2000; 95: 1541– 1550. Google Scholar PubMed 51. Eckel R., Ros R., Ros A., Wilking S.D., Sewald N., Anselmetti D. Identification of binding mechanisms in single molecule-DNA complexes. Biophys. J. 2003; 85: 1968– 1973. 
Google Scholar CrossRef Search ADS PubMed 52. Vladescu I.D., McCauley M.J., Nunez M.E., Rouzina I., Williams M.C. Quantifying force-dependent and zero-force DNA intercalation by single-molecule stretching. Nat. Methods . 2007; 4: 517– 522. Google Scholar CrossRef Search ADS PubMed 53. Cluzel P., Lebrun A., Heller C., Lavery R., Viovy J.L., Chatenay D., Caron F. DNA: an extensible molecule. Science . 1996; 271: 792– 794. Google Scholar CrossRef Search ADS PubMed 54. Ross W.E., Bradley M.O. DNA double-strand breaks in mammalian cells after exposure to intercalating agents. Biochim. Biophys. Acta (BBA)-Nucleic Acids Protein Synth. 1981; 654: 129– 134. Google Scholar CrossRef Search ADS 55. Yang F., Kemp C.J., Henikoff S. Anthracyclines induce double-strand DNA breaks at active gene promoters. Mut. Res./Fundam. Mol. Mech. Mutagen. 2015; 773: 9– 15. Google Scholar CrossRef Search ADS 56. Khanna K.K., Jackson S.P. DNA double-strand breaks: signaling, repair and the cancer connection. Nat. Genet. 2001; 27: 247– 254. Google Scholar CrossRef Search ADS PubMed 57. Mah L., El-Osta A., Karagiannis T. γH2AX: a sensitive molecular marker of DNA damage and repair. Leukemia . 2010; 24: 679– 686. Google Scholar CrossRef Search ADS PubMed 58. Rello S., Stockert J., Moreno V., Gamez A., Pacheco M., Juarranz A., Canete M., Villanueva A. Morphological criteria to distinguish cell death induced by apoptotic and necrotic treatments. Apoptosis . 2005; 10: 201– 208. Google Scholar CrossRef Search ADS PubMed 59. Elmore S. Apoptosis: a review of programmed cell death. Toxicol. Pathol. 2007; 35: 495– 516. Google Scholar CrossRef Search ADS PubMed 60. Pearlman M., Jendiroba D., Pagliaro L., Keyhani A., Liu B., Freireich E.J. Dexrazoxane in combination with anthracyclines lead to a synergistic cytotoxic response in acute myelogenous leukemia cell lines. Leuk. Res. 2003; 27: 617– 626. Google Scholar CrossRef Search ADS PubMed 61. Wouters K.A., Kremer L., Miller T.L., Herman E.H., Lipshultz S.E. 
Protecting against anthracycline‐induced myocardial damage: a review of the most promising strategies. Br. J. Haematol. 2005; 131: 561– 578. Google Scholar CrossRef Search ADS PubMed 62. Zhu J., Zhou J., Peres L., Riaucoux F., Honoré N., Kogan S., de Thé H. A sumoylation site in PML/RARA is essential for leukemic transformation. Cancer Cell . 2005; 7: 143– 153. Google Scholar CrossRef Search ADS PubMed 63. Nervi C., Ferrara F.F., Fanelli M., Rippo M.R., Tomassini B., Ferrucci P.F., Ruthardt M., Gelmetti V., Gambacorti-Passerini C., Diverio D. Caspases mediate retinoic acid–induced degradation of the acute promyelocytic leukemia PML/RARα fusion protein. Blood . 1998; 92: 2244– 2251. Google Scholar PubMed 64. Gianni M. In acute promyelocytic leukemia NB4 cells, the synthetic retinoid CD437 induces contemporaneously apoptosis, a caspase-3-mediated degradation of PML/RARα protein and the PML retargeting on PML-nuclear bodies. Leukemia (08876924) . 1999; 13: 739– 749. Google Scholar CrossRef Search ADS 65. Lallemand-Breitenbach V., Jeanne M., Benhenda S., Nasr R., Lei M., Peres L., Zhou J., Zhu J., Raught B., de Thé H. Arsenic degrades PML or PML–RARα through a SUMO-triggered RNF4/ubiquitin-mediated pathway. Nat. Cell Biol. 2008; 10: 547– 555. Google Scholar CrossRef Search ADS PubMed 66. Tatham M.H., Geoffroy M.-C., Shen L., Plechanovova A., Hattersley N., Jaffray E.G., Palvimo J.J., Hay R.T. RNF4 is a poly-SUMO-specific E3 ubiquitin ligase required for arsenic-induced PML degradation. Nat. Cell Biol. 2008; 10: 538– 546. Google Scholar CrossRef Search ADS PubMed 67. Saito Y., Uchida N., Tanaka S., Suzuki N., Tomizawa-Murasawa M., Sone A., Najima Y., Takagi S., Aoki Y., Wake A. Induction of cell cycle entry eliminates human leukemia stem cells in a mouse model of AML. Nat. Biotechnol. 2010; 28: 275– 280. Google Scholar CrossRef Search ADS PubMed 68. 
Guibal F.C., Alberich-Jorda M., Hirai H., Ebralidze A., Levantini E., Di Ruscio A., Zhang P., Santana-Lemos B.A., Neuberg D., Wagers A.J. Identification of a myeloid committed progenitor as the cancer-initiating cell in acute promyelocytic leukemia. Blood . 2009; 114: 5415– 5425. Google Scholar CrossRef Search ADS PubMed 69. Nasr R. Eradication of acute promyelocytic leukemia-initiating cells by PML/RARA-targeting. Int. J. Hematol. 2010; 91: 742– 747. Google Scholar CrossRef Search ADS PubMed 70. Lallemand-Breitenbach V., Guillemin M., Janin A., Daniel M., Degos L., Kogan S., Bishop J. Retinoic acid and arsenic synergize to eradicate leukemic cells in a mouse model of acute promyelocytic leukemia. J. Exp. Med. 1999; 189: 1043– 1052. Google Scholar CrossRef Search ADS PubMed 71. Paramanathan T., Vladescu I., McCauley M.J., Rouzina I., Williams M.C. Force spectroscopy reveals the DNA structural dynamics that govern the slow binding of Actinomycin D. Nucleic Acids Res. 2012; 40: 4925– 4932. Google Scholar CrossRef Search ADS PubMed 72. Lipfert J., Klijnhout S., Dekker N.H. Torsional sensing of small-molecule binding using magnetic tweezers. Nucleic Acids Res. 2010; 38: 7122– 7132. Google Scholar CrossRef Search ADS PubMed 73. Lord C.J., Ashworth A. PARP inhibitors: synthetic lethality in the clinic. Science . 2017; 355: 1152– 1158. Google Scholar CrossRef Search ADS PubMed 74. Li J., Zhu H., Hu J., Mi J., Chen S., Chen Z., Wang Z. Progress in the treatment of acute promyelocytic leukemia: optimization and obstruction. Int. J. Hematol. 2014; 100: 38– 50. Google Scholar CrossRef Search ADS PubMed 75. Tomita A., Kiyoi H., Naoe T. Mechanisms of action and resistance to all-trans retinoic acid (ATRA) and arsenic trioxide (As2O3) in acute promyelocytic leukemia. Int. J. Hematol. 2013; 97: 717– 725. Google Scholar CrossRef Search ADS PubMed 76. Tobita T., Takeshita A., Kitamura K., Ohnishi K., Yanagi M., Hiraoka A., Karasuno T., Takeuchi M., Miyawaki S., Ueda R. 
Treatment with a new synthetic retinoid, Am80, of acute promyelocytic leukemia relapsed from complete remission induced by all-trans retinoic acid. Blood . 1997; 90: 967– 973. Google Scholar PubMed 77. Ungewickell A., Medeiros B.C. Novel agents in acute myeloid leukemia. Int. J. Hematol. 2012; 96: 178– 185. Google Scholar CrossRef Search ADS PubMed 78. Goto E., Tomita A., Hayakawa F., Atsumi A., Kiyoi H., Naoe T. Missense mutations in PML-RARA are critical for the lack of responsiveness to arsenic trioxide treatment. Blood . 2011; 118: 1600– 1609. Google Scholar CrossRef Search ADS PubMed 79. Lallemand-Breitenbach V., Zhu J., Chen Z., de Thé H. Curing APL through PML/RARA degradation by As 2 O 3. Trends Mol. Med. 2012; 18: 36– 42. Google Scholar CrossRef Search ADS PubMed 80. Zhang X.-W., Yan X.-J., Zhou Z.-R., Yang F.-F., Wu Z.-Y., Sun H.-B., Liang W.-X., Song A.-X., Lallemand-Breitenbach V., Jeanne M. Arsenic trioxide controls the fate of the PML-RARα oncoprotein by directly binding PML. Science . 2010; 328: 240– 243. Google Scholar CrossRef Search ADS PubMed 81. Chendamarai E., Ganesan S., Alex A.A., Kamath V., Nair S.C., Nellickal A.J., Janet N.B., Srivastava V., Lakshmi K.M., Viswabandya A. Comparison of newly diagnosed and relapsed patients with acute promyelocytic leukemia treated with arsenic trioxide: insight into mechanisms of resistance. PLoS One . 2015; 10: e0121912. Google Scholar CrossRef Search ADS PubMed 82. Cicconi L., Lo-Coco F. Current management of newly diagnosed acute promyelocytic leukemia. Ann. Oncol. 2016; 27: 1474– 1481. Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Structural changes in DNA-binding proteins on complexation
Poddar, Sayan; Chakravarty, Devlina; Chakrabarti, Pinak
doi: 10.1093/nar/gky170 pmid: 29534202
Abstract
Characterization and prediction of the DNA-binding regions in proteins are essential for our understanding of how proteins recognize/bind DNA. We analyze the unbound (U) and the bound (B) forms of proteins from the protein–DNA docking benchmark that contains 66 binary protein–DNA complexes along with their unbound counterparts. Proteins binding DNA undergo greater structural changes on complexation (in particular, those in the enzyme category) than those involved in protein–protein interactions (PPI). While interface atoms involved in PPI exhibit an increase in their solvent-accessible surface area (ASA) in the bound form in the majority of the cases compared to the unbound interface, protein–DNA interactions show increases and decreases in equal measure. In 25% of the structures, the U form has missing residues which are located in the interface in the B form. The missing atoms contribute more toward the buried surface area compared to other interface atoms. Lys, Gly and Arg are prominent in disordered segments that get ordered in the interface on complexation. In going from U to B, there may be an increase in coil and helical content at the expense of turns and strands. Consideration of flexibility cannot distinguish the interface residues from the surface residues in the U form.
INTRODUCTION
The interactions between DNA and proteins play a pivotal role in almost every cellular process, such as regulation of gene expression, DNA replication, rearrangement, repair, chromatin formation and organization, etc. (1). DNA-binding proteins have evolved to have a specific or general affinity for either single- or double-stranded DNA (2). The most intensively studied of these are the various transcription factors, each of which binds to one particular set of DNA sequences and activates or inhibits the transcription of genes (3).
Generally, these proteins bind to DNA in the major groove due to the greater accessibility of the bases; however, there are also some proteins which bind in the minor groove (4). Protein–DNA interactions are mainly of two types, specific and non-specific (5,6). In non-specific interactions, the nucleotide sequence does not matter as far as binding is concerned. This is important in a variety of contexts related to DNA packaging and nucleoprotein complex formation (7), and the interactions occur between functional groups on the protein and the sugar-phosphate backbone of DNA. On the other hand, specific DNA–protein interactions depend not only on the specific sequence of bases but also on the orientation of the bases in the nucleotide (8). These DNA–protein interactions are strong and are mediated by various types of bonding, such as hydrogen bonding (direct, or indirect and mediated by water molecules), ionic interactions such as salt bridges, protein side chain–DNA backbone interactions, and others such as van der Waals and hydrophobic interactions. Protein–DNA interactions have been characterized by analyzing the interface formed between the protein and DNA of a large number of protein–DNA complexes (1,9–11). In addition to physicochemical features and the pattern of hydrogen bonding, conservation of residues, their clustering, etc. have been found to be important in distinguishing the DNA-binding patch from the rest of the protein surface (12). Many of these features are used to predict protein–DNA binding affinities (13–15), as well as to distinguish between single- and double-stranded DNA-binding proteins (16). These studies, however, share one bottleneck that makes them difficult to apply in the development of a general docking algorithm (17): they use the static, bound form of the protein as found in protein–DNA complexes.
However, it is well known that protein structure may undergo considerable changes while forming a complex, whether between protein molecules or between protein and DNA or RNA. Indeed, genome-wide analyses have indicated that many of the transcription factors are intrinsically disordered (18–20). Recently, we have compared the bound and unbound forms of proteins involved in protein–protein interactions and shown that there are distinct changes in structural features accompanying complex formation (21,22). We extend the study to proteins involved in DNA binding using the protein–DNA docking benchmark (23). We analyze the conformational changes that take place in the residues that constitute the interface on binding DNA, employing simple parameters such as accessible surface area (ASA) and root mean square deviations (RMSD). Missing segments as seen for the unbound protein and the secondary structure assumed on binding are also studied, as is the change in crystallographic temperature factors (B factors) of interface atoms on complex formation. Compared to protein–protein interactions, interface residues in DNA-binding proteins may exhibit greater conformational changes (disorder-to-order transitions, in particular) and the interface atoms a larger change in accessible surface area (either increase or decrease) on complexation. An attempt has been made to correlate the structural changes with the change in free energy of DNA binding of residues on mutation to Ala.
MATERIALS AND METHODS
We used the protein–DNA docking benchmark (23) that contains 66 binary protein–DNA complexes and their unbound counterparts. In the dataset, there are 16 NMR entries with multiple models for each structure; for our analysis only the first model was considered. In 41 structures a monomeric chain was bound to DNA, and in 25 cases a dimer.
For each protein, we designated the unbound structure as U and the bound structure (isolated from its partner DNA) as B; the bound structure (together with DNA) in the complex is C. For both bound (B) and unbound (U) structures the accessible surface area (ASA) values were calculated separately using the NACCESS program (24), which employs the Lee and Richards algorithm (25). We used EMBOSS software (26) to perform the local alignment (Smith–Waterman algorithm) and global alignment (Needleman–Wunsch algorithm) of the polypeptide chains constituting the U/B pairs. Of the 66 U/B pairs, 52 have sequence identity ≥98%. Based on the sequence alignment, the interface residues as seen in the complex were mapped to those in the unbound state using PROFIT (27) and Biopython (28). We have also identified the interface atoms in our analysis, which are the atoms losing more than 0.1 Å² of surface area upon complex formation (B to C) (29). Conformational changes were measured using both the backbone and interface RMSDs, which were calculated on equivalent Cα positions and on all atoms comprising the interface, respectively, after superposition (using all the non-hydrogen backbone atoms) with PROFIT. Residues showing large RMSDs were also identified. There are also some proteins that have modified residues (Supplementary Table S1A) which are tagged HETATM in PDB (instead of the usual ATOM); selenomethionine residues belonging to this category were manually edited to Met and tagged as ATOM. Supplementary Table S1B documents the structures where the residue name differs in U and B. As a result of disorder-to-order transitions between the two PDB files (U to B), some coordinates may be missing in the U state; information on these missing residues/atoms is provided in Supplementary Table S1C.
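The interface-atom criterion and the RMSD measure described above can be illustrated with a minimal sketch. It assumes per-atom ASA values have already been computed (for example by an ASA program) and that coordinates have already been superposed; the function names, the dictionary layout and the numerical values are hypothetical and are not taken from the paper's own scripts.

```python
import math

def interface_atoms(asa_bound, asa_complex, cutoff=0.1):
    """Return atom keys that lose more than `cutoff` Å² of ASA from B to C."""
    return {
        atom
        for atom, asa_b in asa_bound.items()
        if atom in asa_complex and asa_b - asa_complex[atom] > cutoff
    }

def rmsd(coords_u, coords_b):
    """RMSD over paired (x, y, z) tuples; assumes the structures are already superposed."""
    if len(coords_u) != len(coords_b) or not coords_u:
        raise ValueError("need two equal-length, non-empty coordinate lists")
    sq = sum((a - b) ** 2 for p, q in zip(coords_u, coords_b) for a, b in zip(p, q))
    return math.sqrt(sq / len(coords_u))

# Hypothetical per-atom ASA values (Å²), keyed by (chain, residue number, atom name).
asa_b = {("A", 30, "NH1"): 42.0, ("A", 30, "CA"): 3.0, ("A", 51, "CD1"): 20.05}
asa_c = {("A", 30, "NH1"): 5.5, ("A", 30, "CA"): 3.0, ("A", 51, "CD1"): 20.0}
iface = interface_atoms(asa_b, asa_c)  # only the atom losing more than 0.1 Å² is kept
```

The superposition step itself (done with PROFIT in the text) is deliberately left out; the `rmsd` helper only measures deviation between coordinate sets that are already in a common frame.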
The residues given in Supplementary Table S1B and those in Supplementary Table S1C for which all the atoms are missing were not included in the ASA calculations; however, when only some of the atoms are missing, those present and common to both U and B states were included. While calculating ASA(B) and ASA(U), we have used only the interface atoms common to both, as we had done for the previous analysis of protein–protein structures (21,22). The analysis of atoms mainly depends on matching labels that may often be assigned in an arbitrary fashion. For example, the corresponding atom in U might be labelled OD1 or OD2 for an atom OD1 of an interface residue Asp in a bound structure B. As a result, direct comparison of the ASA values of two atoms having the same label is not justified. Residues which have ambiguous atom label pairs are Asp (OD1/OD2), Arg (NH1/NH2), Asn (OD1/ND2), Glu (OE1/OE2), Gln (OE1/NE2), His (CE1/NE2, CD2/ND1), Phe (CD1/CD2, CE1/CE2), Leu (CD1/CD2), Val (CG1/CG2), Tyr (CD1/CD2, CE1/CE2). To circumvent the problem, atoms with both the labels were taken for calculation of ASA. This would mean that if OD1 is present in the interface, both OD1 and OD2 are taken to be a part of the interface. This increases the number of interface atoms by 7%. Based on the calculated ASA values for U and B, ΔASA and δA values have been calculated as follows.
\begin{equation*}
\Delta \mathrm{ASA} = \mathrm{ASA}(\mathrm{B}) - \mathrm{ASA}(\mathrm{U}),
\end{equation*}
where ASA(B) is the solvent accessible surface area of the interface atoms in the complex state, and ASA(U) is the accessible surface area of the equivalent mapped atoms in the isolated state.
\begin{equation*}
\delta \mathrm{A} = \Delta \mathrm{ASA} / \mathrm{ASA}(\mathrm{B}),
\end{equation*}
i.e. the difference in ASA relative to the total value in the complex state. It may be mentioned that ΔASA and δA are the average values for all the interface atoms in a given structure.
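As a rough illustration of the label-ambiguity handling and of the ΔASA and δA definitions, the following sketch expands an interface atom set with the partner labels listed above and then evaluates both quantities. It assumes totals over the interface atoms common to U and B (the exact averaging used by the authors is not spelled out here); the function names, data layout and ASA values are hypothetical.

```python
# Ambiguous label pairs listed in the text, keyed by residue name.
AMBIGUOUS = {
    "ASP": [("OD1", "OD2")], "ARG": [("NH1", "NH2")], "ASN": [("OD1", "ND2")],
    "GLU": [("OE1", "OE2")], "GLN": [("OE1", "NE2")],
    "HIS": [("CE1", "NE2"), ("CD2", "ND1")],
    "PHE": [("CD1", "CD2"), ("CE1", "CE2")], "LEU": [("CD1", "CD2")],
    "VAL": [("CG1", "CG2")], "TYR": [("CD1", "CD2"), ("CE1", "CE2")],
}

def expand_ambiguous(atoms):
    """Add the partner of every ambiguous atom label to the interface set."""
    expanded = set(atoms)
    for resid, resname, atom in atoms:
        for first, second in AMBIGUOUS.get(resname, []):
            if atom == first:
                expanded.add((resid, resname, second))
            elif atom == second:
                expanded.add((resid, resname, first))
    return expanded

def delta_asa(asa_b, asa_u, atoms):
    """Return (ΔASA, δA) using totals over atoms present in both B and U (an assumption)."""
    common = [key for key in atoms if key in asa_b and key in asa_u]
    total_b = sum(asa_b[key] for key in common)
    total_u = sum(asa_u[key] for key in common)
    diff = total_b - total_u
    return diff, (diff / total_b if total_b else float("nan"))

# Hypothetical values (Å²), keyed by (residue number, residue name, atom name).
expanded = expand_ambiguous({(335, "ASP", "OD1")})
asa_b = {(335, "ASP", "OD1"): 10.0, (335, "ASP", "OD2"): 5.0}
asa_u = {(335, "ASP", "OD1"): 20.0, (335, "ASP", "OD2"): 10.0}
d_asa, rel = delta_asa(asa_b, asa_u, expanded)
```

In this made-up case the residue is less exposed in B than in U, so both ΔASA and δA come out negative, which corresponds to the partner accommodation situation described in the text.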
In some places we have used ΔASA for a given residue, or calculated δA using all the atoms of the interface residue, or for the residues located in the protein surface; these have been mentioned explicitly. Additionally, δA was also calculated for surface and interface residues (considering all the residue atoms). The buried surface area, BSA = [ASA(B) – ASA(C)], is calculated using all the interface atoms (22). Secondary structure was calculated using DSSP (30) software. Changes in secondary structural composition were enumerated in terms of changes in helix, strand, turn and coil (ΔH, ΔS, ΔT and ΔC). To quantify the change in the percentage composition of the secondary structures for the interface residues between the bound (n_i^B) and the unbound (n_i^U) states the Euclidean distance was calculated as
\begin{equation*}
D = \sqrt{\sum_{i=1}^{m} \left(n_i^B - n_i^U\right)^2 / (m - 1)},
\end{equation*}
where m = 4 (different forms of secondary structure, i.e. helix, strand, turn and coil). B factors were analyzed to discern the flexibility of the interface and surface regions. The normalized values were used, defined as follows:
\begin{equation*}
b_{fr}^{\prime} = \left[b_{fr} - \mu(bf)\right] / \sigma(bf),
\end{equation*}
where bfr is the average B factor of C, Cα, O, N and Cβ of the residue r (Cβ cannot be considered when the residue is Gly), and μ(bf) and σ(bf) are the mean and the standard deviation of B factors for that chain, respectively. After scaling, the bfr′ values were used to derive the averages over the interface, surface and core and rim regions of the interface (29). The Euclidean metrics, Δb, for the B factors of residues in different states/structural regions were calculated in a similar way,
\begin{equation*}
\Delta b = \sqrt{\sum_{i=1}^{n} \left(bf^{(1)}_i - bf^{(2)}_i\right)^2 / (n - 1)},
\end{equation*}
where n represents the number of amino acid types, and bf^{(1)}_i and bf^{(2)}_i are the scaled B factors of residue type i in states 1 and 2, respectively. The states compared were interface, non-interface, bound and unbound.
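The two Euclidean measures (D and Δb) share one form, and the B-factor scaling is a per-chain z-score, so both can be sketched in a few lines. This is an illustrative sketch only: the function names and numbers are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics

def euclidean_change(values_1, values_2):
    """sqrt(sum_i (x_i - y_i)^2 / (k - 1)) over k paired categories (used for D and Δb)."""
    k = len(values_1)
    if k != len(values_2) or k < 2:
        raise ValueError("need two equal-length sequences with at least two entries")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(values_1, values_2)) / (k - 1))

def normalize_b_factors(residue_b):
    """Scale per-residue average B factors of one chain to z-scores.

    Assumption: population standard deviation; a chain with identical
    B factors (sigma = 0) is not handled in this sketch.
    """
    values = list(residue_b.values())
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return {res: (b - mu) / sigma for res, b in residue_b.items()}

# Hypothetical percentage composition in the order helix, strand, turn, coil.
bound = [40.0, 10.0, 20.0, 30.0]
unbound = [30.0, 20.0, 20.0, 30.0]
D = euclidean_change(bound, unbound)

# Hypothetical per-residue average B factors for a three-residue chain.
scaled = normalize_b_factors({1: 10.0, 2: 20.0, 3: 30.0})
```

For Δb the same `euclidean_change` helper would be applied to two lists of scaled B factors, one entry per amino acid type, for the two states being compared.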
RESULTS
Changes in accessible surface area (ASA) and root mean square deviations (RMSD) in going from U to B states
The distribution of δA (Figure 1A) is rather bimodal, with almost equal numbers of proteins having positive (34 cases) and negative (32) δA values, and an average of –0.15 ± 11%. (If all the models were used for NMR structures instead of only the first, the average value would be –0.24 ± 11%, not a significant change, as was observed earlier on using NMR models in the analysis of protein–protein interactions (22).) Two examples of large δA values (both positive and negative) are shown in Figure 2 (31–33). As a control we have plotted the distribution of δA values for surface residues, which also has a small average value (–0.003 ± 8.6%), but the distribution now is quite normal (Supplementary Figure S1). δA, when calculated based on the whole interface residue (Figure 1B), indicates a trend towards having a negative value (–4.4 ± 10%).
Figure 1. Distribution of δA values using interface (A) atoms, and (B) residues. The distribution of δA values using only interface atoms is bimodal, with both partner attraction and partner accommodation taking place (average = –0.15 ± 11%), while the distribution for interface residues shows that the partner accommodation effect prevails (average = –4.4 ± 10%).
Figure 2. Examples of local movements leading to large δA values, +ve (that makes interface residues more accessible in the bound state) in (A) and –ve in (B). The unbound protein is in pink, and the bound in green; DNA strands are in orange and the bases are shown as sticks. (A) Partner attraction effect: parts of two loops of Restriction Endonuclease (31) shift position on binding DNA. Interface residues, in stick representation, Ile51 and Arg30 are shown (in red) for the bound state (2odi), and (in blue) for the unbound state of the enzyme (2odh). ASA increases from 58.2 to 260 Å² for these two residues. (B) Partner accommodation effect: glucocorticoid receptor rearranges on binding DNA. Interface residues Arg489 and Ser459 are shown for the bound state (1r4o) (32), and for the unbound state, apo enzyme (1gdc) (33). ASA decreases from 155 to 76 Å² for these two residues.
We checked whether residues, depending on their type, favour +ve or –ve δA values in Figure 1A.
Supplementary Table S2 provides the number of +ve and –ve cases for all the 20 amino acids, which indicates that although overall there is not much distinction (P value = 0.89), there seems to be a slight excess of +ve values (P value = 0.28) for five hydrophobic residues (excluding Leu), whereas the negatively charged Asp has distinctly more –ve cases. If we consider all the atoms of the interface residues (corresponding to Figure 1B), the trend becomes more prominent, with all the charged residues showing an excess of –ve δA values. Conformational changes were also measured based on RMSD values. The scatter plot of interface and backbone RMSDs (Figure 3A) shows that the former is mostly greater than the latter (points above the diagonal line, except for five structures). The average backbone RMSD (only Cα) is 2.5 ± 1.9 Å and the average interface RMSD (using all interface atoms) is 3.7 ± 2.3 Å. There are 12 (out of 66) structures (18.2%) whose backbone RMSD is greater than 4 Å and 20 structures (30%) with interface RMSD > 4 Å. In comparison, for protein–protein complexes, the average RMSD is 1.4 ± 1.6 Å and 2.2 ± 1.5 Å for backbone and interface, respectively (Figure 3B). Histograms for both the backbone and interface RMSDs can be compared between protein–protein and protein–DNA datasets (Supplementary Figure S2), which shows that the changes in both the backbone and the interface are larger for the protein–DNA interaction.
Figure 3. The scatter plot of RMSD between unbound and bound states in (A) protein–DNA and (B) protein–protein complexes. The line of regression is drawn along with the Spearman's correlation coefficient; the diagonal line is shown as dashed.
To understand what might cause the occurrence of large ΔASA values (positive, as well as negative) we calculated the residue-wise RMSD values. A scatter plot of ΔASA versus RMSD for a few such residues is shown in Figure 4 (34–42), which shows that residues with high RMSD values tend to have high ΔASA values also. The residues undergoing such extreme changes are shown in Supplementary Figure S3. The structures are mostly of enzymes and some of the residues are discussed below.
Figure 4. Scatter plot of ΔASA versus RMSD for a few selected residues (having large ΔASA); symbols for five residue types are indicated. The PDB codes (U/B) and their residues are: 2hmy (34)/7mht (35) (ILE 86, ILE 249, ILE 258), 1bam (36)/3bam (37) (MET 198), 1ynm (38)/2fl3 (39) (PHE 15, PHE 91, PHE 137), 3v6p/3v6t (40) (ASP 335, ASP 369, ASP 573, ASP 641), 1jus (41)/1jt0 (42) (TYR 40, TYR 41). Molecular diagrams indicating changes in these structures are shown in Supplementary Figure S3.
In the HhaI methyltransferase oligonucleotide complex (34,35), there are mismatched bases in the substrate that are flipped out of the DNA helix and pushed into the active-site pocket of the enzyme. This results in the entry of residues, such as Ile86, into the helix, which shows a large change in RMSD and a positive ΔASA (Supplementary Figure S3A).
Type II restriction endonucleases, such as BamHI (36,37), recognize short (four to eight base pairs) palindromic DNA sequences and cleave both strands. They contain at least three residues, mostly acidic, that bind divalent cations, which are essential for activity. Interestingly, changes associated with DNA binding cause a large RMSD and a positive ΔASA for a hydrophobic residue, Met198 (Supplementary Figure S3B). Similarly, a monomeric endonuclease, BcnI (31), which introduces double-strand breaks by sequentially nicking individual DNA strands, exhibits large positive ΔASA values for Ile51 and Arg30 (Figure 2A).

The restriction endonuclease HinP1I cleaves the palindromic tetranucleotide sequence G↓CGC. It is a 2-fold related dimer with two active sites and two DNA duplexes bound on the outer surfaces of the dimer, facing away from each other (38,39). Phe91 intercalates into the duplex from the major groove, causing the DNA to be kinked by ∼60°. Upon binding to cognate DNA, the largest change in HinP1I occurs in the N-terminal 17 residues, which become part of a long helix that binds to the minor groove. Phe15, lying in this region, and Phe91, mentioned earlier, both have high RMSD values as well as large positive ΔASA values (Supplementary Figure S3C).

TAL (transcription activator-like) effectors (40) are major virulence factors secreted by bacteria that cause diseases in plants. They recognize the host DNA sequence through a central domain of tandem repeats, each comprising 33–35 conserved amino acids and targeting a specific base pair by using two hypervariable residues [known as repeat variable diresidues (RVD)] at positions 12 and 13. The structure of each repeat consists of two helices connected by a short RVD-containing loop, which contacts the DNA major groove. The 12th residue stabilizes the RVD loop, whereas the 13th makes a base-specific contact. The frequently occurring RVDs, His/Asp (HD), Asn/Gly (NG) and Asn/Ile (NI), recognize three distinct bases.
In the structure with HD as the RVD (Supplementary Figure S3D), the Asp residues exhibit the partner accommodation effect, resulting in large negative ΔASA as well as high RMSD values.

The Staphylococcus aureus multidrug binding protein QacR represses transcription of the qacA multidrug transporter gene and is induced by structurally diverse cationic lipophilic drugs. When QacR is bound to DNA (41,42), Tyr40 and Tyr41 form hydrogen bonds with phosphates and van der Waals contacts with sugar moieties, and the result is a decrease in ΔASA values (Supplementary Figure S3E).

Most of the residues with high RMSD and ΔASA values (positive or negative) belong to the enzyme category, such as restriction endonucleases. Overall, in the 66 complexes, the two major classes are enzymes (27) and helix-turn-helix proteins (20), the rest being divided among zinc-coordinating (3), other α-helix (6), β-sheet (4) and β hairpin/ribbon (6) structures (1). The maximum RMSD values for interface residues are observed for the enzyme category (4.3 ± 2.9 Å), followed by zinc-coordinating proteins (4.2 ± 2.6 Å); the helix-turn-helix proteins exhibit a value of 3.1 ± 1.6 Å. In a related study, conformational changes associated with DNA binding were categorized into six classes, and the members exhibiting the largest changes were found to be either endonucleases or polymerases (43).

Analysis of the missing residues in the unbound form

The segments that are missing in the unbound proteins but become structured upon binding DNA were also analysed. A missing segment is defined as one with three (or more) missing residues (lying in the interface or elsewhere). An example is presented in Figure 5, where a helical portion (58–63) of the C-terminal region of the structure and a loop in the N-terminal part (residues 0–6, Supplementary Table S3) are ordered in the complex (44,45).
Mostly, the interface and non-interface residues occur interspersed in the same stretch (Figure 6A) (46,47), but in the glucocorticoid receptor they occur in two separate stretches (Figure 6B) (32,33). Twenty-three structures have missing segments (Supplementary Table S3) in the unbound form of the protein. In 17 of these structures the segments contain interface residues. On average, the missing stretches constitute 6% of the residues of these 23 structures. Missing atoms constitute 7% of the total interface atoms in the whole dataset and 18% in 19 structures (with one or more residues missing entirely, Supplementary Table S1C). On average, the contribution to BSA from missing residues (9.9 ± 3.4 Å² per atom) is greater than that from non-missing residues (8.8 ± 0.9 Å² per atom, P value = 0.07). Of the polypeptide chains with missing stretches, 54% have segments from the chain termini (60% of which also constitute the interface), similar to what was observed in protein–protein interactions (22).

Figure 5. Two missing regions (encircled) in Aristaless Homeodomain (3a02) (left) (44) are ordered (in red) in the DNA bound complex (1fjl) (right) (45).

Figure 6. Two examples with missing regions in the unbound structure (green cartoon) that are seen in the B form (cyan); the protein region that gets ordered in the interface is indicated in blue, and the region that is not part of the interface is in red. (A) The interface residues are interspersed with the non-interface in the missing segment of DNA polymerase I (complexed with DNA, 4ktq (46) and unbound protein, 1ktq) (47). (B) The missing interface and non-interface residues form separate segments in glucocorticoid receptor (bound to DNA, 1r4o and the unbound protein, 1gdc) (32,33).

We had observed in our previous analysis of protein–protein interactions (22) that the number (197) of missing residues in the interface is greater than that (131) in the non-interface region, whereas for protein–DNA structures the opposite trend is seen (126 in interface vs. 258 in non-interface regions, considering those structures which have at least one missing residue in the interface) (Table 1). The interface residues found in greater number in these disordered stretches are Lys, Gly and Arg. Interestingly, Ala scores high if only non-interface regions are considered. As in protein–protein interactions (22), the secondary structures attained by the missing residues in the complex are mostly irregular, followed by helix and turn, strand being the least observed.

Table 1.
Statistics on residues missing in the U form and their secondary structure in the B form

| Residue | Number missing^a | % relative to total number of interface residues of the same type | % relative to total number of missing residues^b | H (%)^b | S (%)^b | T (%)^b | C (%)^b |
|---|---|---|---|---|---|---|---|
| Ala | 6 (25) | 5.6 | 4.8 (8.1) | 33.3 (67.7) | 0 | 16.7 (9.7) | 50 (22.6) |
| Arg | 14 (13) | 5.1 | 11.1 (7.0) | 35.7 (40.7) | 0 (3.7) | 28.6 (29.6) | 35.7 (25.9) |
| Asn | 4 (18) | 2.4 | 3.2 (5.7) | 0 (13.6) | 50 (13.6) | 25 (22.7) | 20 (50) |
| Asp | 3 (10) | 2.9 | 2.4 (3.4) | 33.3 (38.5) | 0 (7.7) | 33.3 (23.1) | 33.3 (30.8) |
| Cys | 0 (2) | 0.0 | 0 (0.5) | 0 (50) | 0 | 0 | 0 (50) |
| Gln | 3 (17) | 2.0 | 2.4 (5.2) | 33.3 (30) | 33.3 (15) | 0 (10) | 33.3 (45) |
| Glu | 4 (16) | 4.3 | 3.2 (5.2) | 0 (40) | 0 (10) | 75 (45) | 25 (5) |
| Gly | 20 (21) | 10.3 | 15.9 (10.7) | 20 (14.6) | 0 | 40 (51.2) | 40 (34.1) |
| His | 3 (5) | 4.2 | 2.4 (2.1) | 0 (37.5) | 0 | 33.3 (25) | 66.7 (37.5) |
| Ile | 2 (12) | 2.3 | 1.6 (3.6) | 0 (42.9) | 0 | 50 (28.6) | 50 (28.6) |
| Leu | 3 (35) | 2.7 | 2.4 (9.9) | 100 (52.6) | 0 (5.3) | 0 (13.2) | 0 (28.9) |
| Lys | 29 (20) | 8.5 | 23 (12.8) | 13.8 (24.5) | 6.9 (6.1) | 31 (34.7) | 48.5 (34.7) |
| Met | 0 (6) | 0 | 0 (1.6) | 0 | 0 | 0 | 0 (100) |
| Phe | 2 (2) | 2.9 | 1.6 (1.0) | 0 | 0 (25) | 0 | 100 (75) |
| Pro | 6 (11) | 7.7 | 4.8 (4.4) | 33.3 (29.4) | 0 | 16.7 (29.4) | 50 (41.2) |
| Ser | 13 (17) | 6.3 | 10.31 (7.8) | 23.1 (30) | 0 (3.3) | 23.1 (23.3) | 53.9 (43.3) |
| Thr | 8 (12) | 3.6 | 6.35 (5.2) | 25 (30) | 0 (10) | 37.5 (20) | 37.5 (40) |
| Trp | 2 (2) | 4.7 | 1.6 (1.0) | 0 | 0 | 0 (25) | 100 (75) |
| Tyr | 1 (2) | 0.9 | 0.8 (0.8) | 0 | 0 (33.3) | 100 (66.7) | 0 |
| Val | 3 (12) | 3.1 | 2.4 (3.9) | 66.7 (73.3) | 33.3 (6.7) | 0 (13.3) | 0 (6.7) |
| Total | 126 (258) | | | | | | |

Columns H, S, T and C give the secondary structure in the B form of the residues missing in U (%). If the two types (interface and non-interface) of missing stretches are represented by o-o-o-x-o-o and x-x-x-x-x (where o indicates a residue in the interface, and x a non-interface residue), the table gives statistics using all the residues of type ‘o’. Additionally, within the parentheses are values using all the residues (o + x) (footnote b below) or only the x residues (footnote a below). Six structures have missing residues (46 in number) only in the non-interface region.
^a The numbers in parentheses correspond to the non-interface residues of the missing stretches in Supplementary Table S3.
^b The numbers in parentheses are calculated considering all the residues of missing stretches in Supplementary Table S3.
Changes in secondary structure

The change in percentage composition of the secondary structural elements during the U-to-B transition was calculated. Fifty-eight (88%) structures showed some change. The Euclidean distance (D) between the compositions of the four structural elements in the two states was also calculated; the average was found to be 6.2 (±5.5) for all structures, but it increased to 8.0 (±5.1) for the 26 structures where regular secondary structural elements were formed at the expense of turns or coils.
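The Euclidean distance D between secondary-structure compositions can be sketched as follows. This is a minimal illustration, assuming the four structural elements are the classes H, S, T and C used in Table 1 and that each composition is expressed in percent; the dictionary layout and the numbers are hypothetical.

```python
import math

ELEMENTS = ("H", "S", "T", "C")  # assumed: the four classes used in Table 1

def composition_distance(comp_u, comp_b):
    """Euclidean distance between two percentage compositions
    (unbound vs bound) over the four secondary-structure classes."""
    return math.sqrt(sum((comp_b[e] - comp_u[e]) ** 2 for e in ELEMENTS))

# Hypothetical compositions (percent of residues in each class).
unbound = {"H": 40.0, "S": 20.0, "T": 15.0, "C": 25.0}
bound = {"H": 46.0, "S": 20.0, "T": 11.0, "C": 23.0}

d = composition_distance(unbound, bound)
print(f"D = {d:.2f}")
```

A pair like this one, with D above 6, would fall among the structural pairs examined for secondary-structure changes.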
To understand structural changes during complex formation we have used structural pairs with D > 6 (Supplementary Figure S4), where we can see that there is an increase in coil and helical content at the cost of turns and strands. In these 26 structures, extension of an existing helix or strand is more frequent than formation of a new helix or strand, and this is more so for helices than for strands (Figure 7A). It was also observed that helices are mostly extended at the N-terminal end, whereas for strands C-terminal extension is preferred (Figure 7B). Two examples, of helix extension and helix formation, are shown in Figure 8 (48–50).

Figure 7. Percentage composition of (A) helix/strand extension and formation, (B) extension of helix/strand in N or C terminal.

Figure 8. Examples showing change in secondary structural elements (left side, U; right side, B). Interface stretches have been marked in yellow. (A) Bacteriophage lambda cII protein (1zpq/1zs4) (48) in complex exhibits extension of its helix from Asp282 to Arg278 (which consisted of bends and coils in the U state) in N-terminal. (B) Wild type gene-regulating protein ARC (1arq) (49) in complex with the DNA (1bdt) (50) shows formation of helix from Met4 to Lys6.
Comparison of B factors

The relative vibrational motion in different parts of the protein structure is determined by the B factor, also known as the temperature factor. The parts of the molecule that are highly flexible have high B factors. B factors are generally used to assess the difference between the interface and the rest of the protein surface. Usually, interface residues involved in protein–protein interactions are less flexible and have lower B factors than those in the surface regions (51,22). A similar result was also obtained for protein–DNA interactions (52). While supporting this observation, another study that also compared the binding and the non-binding regions in the apo form did not find any clear trend in the B factors in the two regions (53). This necessitated a systematic analysis of B factors between the two forms, as well as between different regions of the structure in each form. We have taken the scaled mean B factor of the C, Cα, O, N and Cβ atoms as the representative for the whole residue, and the average values were calculated for each residue type for the interface and the surface regions in both U and B states. On average, the B factor in the bound form was found to be higher for the surface residues than for the interface residues (P value < 2.2 × 10⁻¹⁶, Supplementary Table S4). The normalized B factors for all the interface residues are found to be negative in the bound form, against mostly positive values in the unbound form (P value < 2.2 × 10⁻¹⁶). So we can conclude that during the unbound-to-bound transition the interface residues experience a decrease in their B factor. For surface residues, however, an overall opposite trend was observed, i.e. while going from the unbound to the bound state these residues experience an increase in their B factor (P value = 0.0057). The same trend was reported in protein–protein interactions (22). In the unbound form the B factors of the surface and interface regions are almost similar (P value = 0.2).
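A per-residue B factor of the kind described above could be computed roughly as follows. This is only a sketch: it assumes that "scaled" means a z-score normalization over all residues of a structure (mean subtracted, divided by the standard deviation), which is a common convention but is not spelled out in this section; the function names and input layout are hypothetical.

```python
from statistics import mean, pstdev

# Atoms averaged to represent a residue, as listed in the text.
REPRESENTATIVE_ATOMS = ("N", "CA", "C", "O", "CB")

def residue_b_factor(atom_b_factors):
    """Mean B factor over the representative atoms that are present
    (glycine, for instance, has no CB)."""
    values = [atom_b_factors[a] for a in REPRESENTATIVE_ATOMS if a in atom_b_factors]
    if not values:
        raise ValueError("no representative atoms found for this residue")
    return mean(values)

def normalize(residue_values):
    """Z-score normalization (assumed scaling) of per-residue B factors."""
    values = list(residue_values.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("all B factors are identical; cannot normalize")
    return {res: (b - mu) / sd for res, b in residue_values.items()}

# Hypothetical per-residue mean B factors for a three-residue toy chain.
raw = {"Lys12": 10.0, "Gly13": 20.0, "Arg14": 30.0}
scaled = normalize(raw)
print(scaled)
```

With such a normalization, negative values mark residues that are more rigid than the structure's average, which is how the negative interface values in the bound form can be read.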
Euclidean distances between the interface and surface residues (Figure 9A) were calculated for both the bound and the unbound structures, and it is found that the maximum change takes place between the surface and interface regions in the complex.

Figure 9. Euclidean distances involving B-factors between (A) interface and surface regions, and (B) core and rim region of the interface, in the U and B forms.

We have further divided the interface residues into core and rim regions (29), and the B factors of these two regions were also compared in the B and U forms (Figure 9B, Supplementary Table S5). From the Euclidean distance we can see that the decrease in flexibility between the two forms is more pronounced in the core region, while the rim residues show a smaller difference. This is also reflected in the P values (2.2 × 10⁻¹⁶ for the core and 6.117 × 10⁻⁷ for the rim). However, no significant difference was observed in the U form between the B factors of core and rim residues (P value = 0.4521).

Correlation between structural changes of residues on DNA binding and free energy of binding

To gain insight into binding affinity one has to understand thermodynamic data in terms of the structural changes that occur on binding. Alanine-scanning data are unavailable for protein–DNA interfaces. To circumvent this issue, we resorted to a data set of free-energy changes upon point mutations in a general context. Such a data set was compiled by Kumar et al. and presents a list of mutations in protein–DNA complexes for which experimental free-energy changes are available in ProNIT (54).
From these data, single point mutations for which ΔG values for complex formation were available in the wild type as well as the protein mutant were extracted in a previous study, to relate them to structural differences between unbound and bound forms (55). The authors computed the free-energy change (ΔΔG) upon mutation as ΔΔG = ΔG_mutant – ΔG_wild. A higher value of ΔΔG for a given mutation would indicate larger destabilization caused by the mutation. In that selected list, 11 of our 66 structures were found, but only 7 had ΔG_mutant and ΔG_wild values enabling us to calculate ΔΔG for the residues. Considering only the point mutations to Ala, and excluding two with multiple observations that matched rather poorly, we were left with eight entries (Supplementary Table S6). Supplementary Figure S5 shows a plot of absolute values of ΔASA (considering all atoms of the residue) versus ΔΔG, the correlation coefficient being 0.79. Although the correlation is heavily biased by a single outlier, there seems to be an indication that a mutation of an interface residue that undergoes a change in ASA on complex formation is destabilizing, as reflected by a positive ΔΔG. It may be pertinent to add that for these data points ΔΔG is negatively correlated with BSA (–0.75) (Supplementary Table S6).

DISCUSSION

δA indicates the relative percentage change of the accessible surface area of the interface atoms in going from the unbound to the bound state of the protein. The large values (Figure 1A) have been explained in terms of partner attraction (when δA is positive) and partner accommodation (when δA is negative) effects (21,22); when the interface residues of the protein are drawn towards DNA, the phenomenon is referred to as partner attraction, whereas partner accommodation is when protein residues move away from the DNA to accommodate it.
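The sign convention for δA can be made concrete with a small sketch. It assumes that δA is the change in interface-atom ASA expressed as a percentage of the unbound value, with both ASA values computed on the protein alone; the exact normalization is defined in the Methods rather than in this section, so the formula and the numbers below are illustrative only.

```python
def delta_a_percent(asa_unbound, asa_bound):
    """Relative percentage change in interface ASA from U to B
    (assumed normalization: relative to the unbound value)."""
    if asa_unbound <= 0:
        raise ValueError("unbound ASA must be positive")
    return 100.0 * (asa_bound - asa_unbound) / asa_unbound

def classify(delta_a):
    """Map the sign of delta-A to the effect named in the text."""
    if delta_a > 0:
        return "partner attraction"
    if delta_a < 0:
        return "partner accommodation"
    return "no change"

# Hypothetical interface ASA values in Å² for one protein.
value = delta_a_percent(asa_unbound=1000.0, asa_bound=956.0)
print(f"{value:.1f}% -> {classify(value)}")
```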
Although the distribution of δA is rather bimodal, when all the atoms of the interface residues are considered in the calculation of δA (–4.4 ± 10%, Figure 1B) it can be seen that the partner accommodation effect prevails. Protein–DNA interactions are mediated by both charged and hydrophobic residues. Asp, belonging to the former class, shows a reduction in ASA (Supplementary Table S2), while the latter residues, in general, tend to display an increase in ASA. It can be seen in Figure 4 that ΔASA is negative for Asp, whereas the values are positive for Ile, Met and Phe. On the other hand, protein–protein interactions (in particular the interfaces in homodimeric associations) are enriched in nonpolar contacts/residues (56), and the U-to-B transition is accompanied by an increase in ASA (22). The –ve δA value of Asp could be due to its moving away from the negatively charged DNA chain; the longer Glu side chain can accomplish this without much change in the solvent accessibility of its interface atoms (Supplementary Table S2). It is also of interest to understand how the change in accessible surface area of interface residues accompanying the U-to-B transition affects the free energy of binding. Based on a very limited amount of data on ΔΔG of binding on mutation of residues to Ala (Supplementary Figure S5 and Table S6), there appears to be a trend of ΔΔG increasing with the absolute value of ΔASA.

The results obtained from the analysis of protein–DNA complexes can be compared with the analysis of the protein–protein interaction affinity dataset (21,22). The distribution of δA values for protein–protein complexes showed that the partner attraction effect (δA = 3.3 ± 4.9%) prevails, and the number of structures undergoing large RMSD changes (≥4 Å), for both the backbone and the interface, is smaller than what is observed for protein–DNA interactions.
For the protein–protein dataset, there are 23 and 15 structures with interface and backbone RMSD values higher than 4 Å, constituting 8.2% and 5.34%, respectively, as opposed to 30% and 18.2% for interface and backbone RMSD, respectively, in protein–DNA interactions. This difference is due to a greater number of protein components undergoing larger structural changes on binding DNA. Figure 4 shows that some structures have residues with large |ΔASA| and RMSD values. These are not due to any crystallization artefacts, but have biological significance, many of them belonging to the enzyme category (restriction endonuclease and methyltransferase). It has been reported that highly specific and multi-specific DNA-binding domains exhibit large conformational changes upon DNA binding, and that Asp is enriched in specific DNA-binding proteins (57). It is interesting to observe that residues having large changes in RMSD and ΔASA, aspartates in particular (Figure 4), all belong to the specific category.

A comparison of the U and B forms of proteins allowed us to identify the residues that undergo a disorder-to-order transition on binding DNA. For structures having missing residues in the U form, the missing atoms constitute 18% of the interface atoms, somewhat greater than the 12% observed in protein–protein interactions (22). The missing atoms contribute more (9.9 ± 3.4 Å²) to the BSA in the bound state than the non-missing atoms (8.8 ± 0.9 Å²), similar to what was observed in protein–protein interactions (the corresponding values being 11.5 ± 6.8 and 9.4 ± 1.5 Å², respectively), indicating a greater degree of surface burial by missing atoms that can offset the entropic penalty associated with the U-to-B transition (22). Intrinsically disordered regions (IDRs) and proteins have a functional repertoire that complements that of ordered proteins, and there are attempts to predict and characterize such short regions (58,59).
Our analysis has identified the disordered segments that are involved in binding DNA (Supplementary Table S3), and though the examples are rather limited in number, some residues such as Lys, Gly and Arg have been shown to interact with DNA. Such information could be incorporated into high-throughput methods for predicting DNA-binding residues located in IDRs from protein sequence.

CONCLUSION

In this paper, we have compared the unbound and bound forms of DNA-binding proteins. While the interface atoms undergo an increase in ASA in going from the U to the B state in the majority of cases in protein–protein interactions, here increases and decreases are found to an equal extent. In general, residues exhibit greater RMSDs and changes in ASA values in protein–DNA interactions (PDIs) than in protein–protein interactions (PPIs). However, during the U-to-B transition both types of interactions bring about similar changes in secondary structure and in the flexibility of residues located in the interface and on the surface. In PPIs, 16% of the proteins have missing (disordered) residues in the U form which get ordered in the B form (22); the number is higher (25%) in PDIs.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We thank Prof. Shandar Ahmad for discussion on the ProNIT database. This work is dedicated to the Centenary of Bose Institute.

FUNDING

Department of Science and Technology, India (Research grant to P.C.) [SR/S2/JCB-12/2006]; Department of Biotechnology (for funding the Centre). Funding for open access charge: Department of Science and Technology, India.

Conflict of interest statement. None declared.

REFERENCES

1. Luscombe N.M., Austin S.E., Berman H.M., Thornton J.M. An overview of the structures of protein–DNA complexes. Genome Biol. 2000; 1: 1–37.
2. Berg O.G., von Hippel P.H. Selection of DNA binding sites by regulatory proteins. Trends Biochem. Sci. 1988; 13: 207–211.
3. Huffman J.L., Brennan R.G. Prokaryotic transcription regulators: more than just the helix-turn-helix motif. Curr. Opin. Struct. Biol. 2002; 12: 98–106.
4. Bewley C.A., Gronenborn A.M., Clore G.M. Minor groove-binding architectural proteins: structure, function, and DNA recognition. Annu. Rev. Biophys. Biomol. Struct. 1998; 27: 105–131.
5. Sarai A., Kono H. Protein-DNA recognition patterns and predictions. Annu. Rev. Biophys. Biomol. Struct. 2005; 34: 379–398.
6. Paillard G., Lavery R. Analyzing protein–DNA recognition mechanisms. Structure. 2004; 12: 113–122.
7. Iwahara J., Schwieters C.D., Clore G.M. Characterization of nonspecific protein–DNA interactions by 1H paramagnetic relaxation enhancement. J. Am. Chem. Soc. 2004; 126: 12800–12808.
8. Rohs R., Jin X., West S.M., Joshi R., Honig B., Mann R.S. Origins of specificity in protein–DNA recognition. Annu. Rev. Biochem. 2010; 79: 233.
9. Nadassy K., Wodak S.J., Janin J. Structural features of protein-nucleic acid recognition sites. Biochemistry. 1999; 38: 1999–2017.
10. Biswas S., Guharoy M., Chakrabarti P. Dissection, residue conservation, and structural classification of protein‐DNA interfaces. Proteins. 2009; 74: 643–654.
11. Sagendorf J.M., Berman H.M., Rohs R. DNAproDB: an interactive tool for structural analysis of DNA–protein complexes. Nucleic Acids Res. 2017; 45: W89–W97.
12. Dey S., Pal A., Guharoy M., Sonavane S., Chakrabarti P. Characterization and prediction of the binding site in DNA-binding proteins: improvement of accuracy by combining residue composition, evolutionary conservation and structural parameters. Nucleic Acids Res. 2012; 40: 7150–7161.
13. Morozov A.V., Havranek J.J., Baker D., Siggia E.D. Protein–DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005; 33: 5781–5798.
14. Liu R., Hu J. DNABind: a hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins. 2013; 81: 1885–1899.
15. Nagarajan R., Ahmad S., Michael Gromiha M. Novel approach for selecting the best predictor for identifying the binding sites in DNA binding proteins. Nucleic Acids Res. 2013; 41: 7606–7614.
16. Wang W., Liu J., Sun L. Surface shapes and surrounding environment analysis of single‐ and double‐stranded DNA‐binding proteins in protein‐DNA interface. Proteins. 2016; 84: 979–989.
17. Ding X.M., Pan X.Y., Xu C., Shen H.B. Computational prediction of DNA–protein interactions: a review. Curr. Computer-aided Drug Des. 2010; 6: 197–206.
18. Liu J., Perumal N.B., Oldfield C.J., Su E.W., Uversky V.N., Dunker A.K. Intrinsic disorder in transcription factors. Biochemistry. 2006; 45: 6873–6888.
19. Theillet F.X., Binolfi A., Frembgen-Kesner T., Hingorani K., Sarkar M., Kyne C., Li C., Crowley P.B., Gierasch L., Pielak G.J. et al. Physicochemical properties of cells and their effects on intrinsically disordered proteins (IDPs). Chem. Rev. 2014; 114: 6661–6714.
20. Tompa P. Intrinsically disordered proteins: a 10-year recap. Trends Biochem. Sci. 2012; 37: 509–517.
21. Chakravarty D., Guharoy M., Robert C.H., Chakrabarti P., Janin J. Reassessing buried surface areas in protein–protein complexes. Protein Sci. 2013; 22: 1453–1457.
22. Chakravarty D., Janin J., Robert C.H., Chakrabarti P. Changes in protein structure at the interface accompanying complex formation. IUCrJ. 2015; 2: 643–652.
23. van Dijk M., Bonvin A.M. A protein–DNA docking benchmark. Nucleic Acids Res. 2008; 36: e88.
24. Hubbard S.J. NACCESS: program for calculating accessibilities. 1992; Department of Biochemistry and Molecular Biology, University College of London.
25. Lee B., Richards F.M. The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 1971; 55: 379–400.
26. Rice P., Longden I., Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000; 16: 276–277.
27. McLachlan A.D. Rapid comparison of protein structures. Acta Crystallogr. A. 1982; 38: 871–873.
28. Cock P.J., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25: 1422–1423.
29. Chakrabarti P., Janin J. Dissecting protein–protein recognition sites. Proteins. 2002; 47: 334–343.
30. Kabsch W., Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983; 22: 2577–2637.
31. Sokolowska M., Kaus-Drobek M., Czapinska H., Tamulaitis G., Szczepanowski R.H., Urbanke C., Siksnys V., Bochtler M. Monomeric restriction endonuclease BcnI in the apo form and in an asymmetric complex with target DNA. J. Mol. Biol. 2007; 369: 722–734.
32. Luisi B.F., Xu W., Otwinowski Z., Freedman L.P., Yamamoto K.R., Sigler P.B. Crystallographic analysis of the interaction of the glucocorticoid receptor with DNA. Nature. 1991; 352: 497.
33. Baumann H., Paulsen K., Kovacs H., Berglund H., Wright A.P., Gustafsson J.A., Haerd T. Refined solution structure of the glucocorticoid receptor DNA-binding domain. Biochemistry. 1993; 32: 13463–13471.
34. O’Gara M., Zhang X., Roberts R.J., Cheng X. Structure of a binary complex of HhaI methyltransferase with S-adenosyl-L-methionine formed in the presence of a short non-specific DNA oligonucleotide. J. Mol. Biol. 1999; 287: 201–209.
35. O’Gara M., Horton J.R., Roberts R.J., Cheng X. Structures of HhaI methyltransferase complexed with substrates containing mismatches at the target base. Nat. Struct. Mol. Biol. 1998; 5: 872–877.
36. Newman M., Strzelecka T., Dorner L.F., Schildkraut I., Aggarwal A.K. Structure of restriction endonuclease BamHI phased at 1.95 Å resolution by MAD analysis. Structure. 1994; 2: 439–452.
37. Viadiu H., Aggarwal A.K. The role of metals in catalysis by the restriction endonuclease BamHI. Nat. Struct. Mol. Biol. 1998; 5: 910–916.
38. Yang Z., Horton J.R., Maunus R., Wilson G.G., Roberts R.J., Cheng X. Structure of HinP1I endonuclease reveals a striking similarity to the monomeric restriction enzyme MspI. Nucleic Acids Res. 2005; 33: 1892–1901.
39.
Horton J.R., Zhang X., Maunus R., Yang Z., Wilson G.G., Roberts R.J., Cheng X. DNA nicking by HinP1I endonuclease: bending, base flipping and minor groove expansion. Nucleic Acids Res. 2006; 34: 939– 948. Google Scholar CrossRef Search ADS PubMed 40. Deng D., Yan C., Pan X., Mahfouz M., Wang J., Zhu J.K., Shi Y., Yan N. Structural basis for sequence-specific recognition of DNA by TAL effectors. Science . 2012; 335: 720– 723. Google Scholar CrossRef Search ADS PubMed 41. Schumacher M.A., Miller M.C., Grkovic S., Brown M.H., Skurray R.A., Brennan R.G. Structural mechanisms of QacR induction and multidrug recognition. Science . 2001; 294: 2158– 2163. Google Scholar CrossRef Search ADS PubMed 42. Schumacher M.A., Miller M.C., Grkovic S., Brown M.H., Skurray R.A., Brennan R.G. Structural basis for cooperative DNA binding by two dimers of the multidrug‐binding protein QacR. EMBO J. 2002; 21: 1210– 1218. Google Scholar CrossRef Search ADS PubMed 43. Andrabi M., Mizuguchi K., Ahmad S. Conformational changes in DNA‐binding proteins: Relationships with precomplex features and contributions to specificity and stability. Proteins . 2014; 82: 841– 857. Google Scholar CrossRef Search ADS PubMed 44. Miyazono K.I., Zhi Y., Takamura Y., Nagata K., Saigo K., Kojima T., Tanokura M. Cooperative DNA‐binding and sequence‐recognition mechanism of aristaless and clawless. EMBO J. 2010; 29: 1613– 1623. Google Scholar CrossRef Search ADS PubMed 45. Wilson D.S., Guenther B., Desplan C., Kuriyan J. High resolution crystal structure of a paired (Pax) class cooperative homeodomain dimer on DNA. Cell . 1995; 82: 709– 719. Google Scholar CrossRef Search ADS PubMed 46. Li Y., Korolev S., Waksman G. Crystal structures of open and closed forms of binary and ternary complexes of the large fragment of Thermus aquaticus DNA polymerase I: structural basis for nucleotide incorporation. EMBO J. 1998; 17: 7514– 7525. Google Scholar CrossRef Search ADS PubMed 47. 
Korolev S., Nayal M., Barnes W.M., Di Cera E., Waksman G. Crystal structure of the large fragment of Thermus aquaticus DNA polymerase I at 2.5-A resolution: structural basis for thermostability. Proc. Natl. Acad. Sci. U.S.A. 1995; 92: 9264– 9268. Google Scholar CrossRef Search ADS PubMed 48. Jain D., Kim Y., Maxwell K.L., Beasley S., Zhang R., Gussin G.N., Edwards A.M., Darst S.A. Crystal structure of bacteriophage λcII and its DNA complex. Mol. Cell . 2005; 19: 259– 269. Google Scholar CrossRef Search ADS PubMed 49. Bonvin A.M., Vis H., Breg J.N., Burgering M.J., Boelens R., Kaptein R. Nuclear magnetic resonance solution structure of the Arc repressor using relaxation matrix calculations. J. Mol. Biol. 1994; 236: 328– 341. Google Scholar CrossRef Search ADS PubMed 50. Schildbach J.F., Karzai A.W., Raumann B.E., Sauer R.T. Origins of DNA-binding specificity: role of protein contacts with the DNA backbone. Proc. Natl. Acad. Sci. U.S.A. 1999; 96: 811– 817. Google Scholar CrossRef Search ADS PubMed 51. Jones S., Thornton J.M. protein–protein interactions: a review of protein dimer structures. Progress Biophys. Mol. Biol. 1995; 63: 3161– 5965. Google Scholar CrossRef Search ADS 52. Schneider B., Gelly J.C., de Brevern A.G., Černý J. Local dynamics of proteins and DNA evaluated from crystallographic B factors. Acta Crystallogr. D . 2014; 70: 2413– 2419. Google Scholar CrossRef Search ADS 53. Xiong Y., Liu J., Wei D.Q. An accurate feature‐based method for identifying DNA‐binding residues on protein surfaces. Proteins . 2011; 79: 509– 517. Google Scholar CrossRef Search ADS PubMed 54. Kumar M.D., Bava K.A., Gromiha M.M., Prabakaran P., Kitajima K., Uedaira H., Sarai A. ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res. 2006; 34: D204– D206. Google Scholar CrossRef Search ADS PubMed 55. Ahmad S., Keskin O., Sarai A., Nussinov R. 
Protein–DNA interactions: structural, thermodynamic and clustering patterns of conserved residues in DNA-binding proteins. Nucleic Acids Res. 2008; 36: 5922– 5932. Google Scholar CrossRef Search ADS PubMed 56. Janin J., Bahadur R.P., Chakrabarti P. protein–protein interaction and quaternary structure. Q. Rev. Biophys. 2008; 41: 133– 180. Google Scholar CrossRef Search ADS PubMed 57. Corona R.I., Guo J.T. Statistical analysis of structural determinants for protein–DNA‐binding specificity. Proteins . 2016; 84: 1147– 1161. Google Scholar CrossRef Search ADS PubMed 58. Peng Z., Kurgan L. High-throughput prediction of RNA, DNA and protein binding regions mediated by intrinsic disorder. Nuleic Acids Res. 2015; 43: e121. Google Scholar CrossRef Search ADS 59. Khan W., Duffy F., Pollastri G., Shields D.C., Mooney C. Predicting binding within disordered protein regions to structurally characterized peptide-binding domains. PLoS ONE . 2013; 8: e72838. Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]