Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Improved design and analysis of CRISPR knockout screens

Improved design and analysis of CRISPR knockout screens Motivation: Genome-wide clustered, regularly interspaced, short palindromic repeat (CRISPR)- Cas9 screen has been widely used to interrogate gene functions. However, the rules to design bet- ter libraries beg further refinement. Results: We found single guide RNA (sgRNA) outliers are characterized by higher G-nucleotide counts, especially in regions distal from the PAM motif and are associated with stronger off-target activities. Furthermore, using non-targeting sgRNAs as negative controls lead to strong bias, which can be mitigated by using sgRNAs targeting multiple ‘safe harbor’ regions. Custom-designed screens confirmed our findings and further revealed that 19 nt sgRNAs consistently gave the best signal-to-noise ratio. Collectively, our analysis motivated the design of a new genome-wide CRISPR/Cas9 screen library and uncovered some intriguing properties of the CRISPR-Cas9 system. Availability and implementation: The MAGeCK workflow is available open source at https://bit bucket.org/liulab/mageck_nest under the MIT license. Contact: [email protected] or [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction CRISPR-Cas9 loss-of-function screens can interrogate the functions The clustered, regularly interspaced, short palindromic repeat of coding genes (Koike-Yusa et al., 2014; Shalem et al., 2014; (CRISPR)-Cas9 system is a new genome editing technology that Wang et al., 2014; Zhou et al., 2014) and non-coding elements becomes prominent in many biomedical research areas. In this sys- (Canver et al., 2015; Korkmaz et al., 2016; Zhu et al., 2016), and tem, single guide RNAs (sgRNAs) direct Cas9 nucleases to induce generate hypotheses on cell dependency, drug response, and gene double-strand breaks at targeted genomic regions (Cong et al., regulation in a high-throughput and unbiased manner (Diao et al., 2013; Jinek et al., 2012; Mali et al., 2013). Based on this system, 2016; Hart et al., 2015; Parnas et al., 2015; Wang et al., 2015). V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] 4095 4096 C.-H.Chen et al. From a computational biology perspective, several algorithms have matrix with its element d ¼ 1 if sample j is affected by condition r jr been developed to characterize sgRNAs with high specificity and ef- and 0 otherwise. The objective function is a form of regularization: ficiency (Doench et al., 2016; Doench et al., 2014; Hsu et al., 2013; X * * Xu et al., 2015) that can be used in designing CRISPR screen libra- b ¼ argmax logf K ; l ðbÞ; a þ K b ðÞ 1 ij i gr NB ij ij ries. Despite these efforts, methods for designing CRISPR screens are still being refined from different aspects. First, sgRNA outliers, Where f is the probabilistic density function (PDF) of the NB dis- NB or sgRNAs with discrepant behaviors from other sgRNAs targeting tribution, and the same gene, are common in screen data, but their features and mechanisms remain poorly characterized. Second, it’s known that * gr KðbÞ¼ spacer length may vary in the CRISPR-Cas9 system (Fu et al., 2014; 2r r r Morgens et al., 2017), but the optimal length was only studied in The estimated SD, r , was calculated using the naive estimators single guide and single target. Furthermore, it remains unclear how r of b . spacer lengths affect signal-to-noise ratio (the extent of the fold gr changes of guides compared to their variances) in the screening settings. We studied both issues based on the MAGeCK-VISPR model we 2.2 Identifying sgRNA outliers previously developed (Jiang et al., 2015; Li et al., 2015). By examin- sgRNA outliers are those that have different behaviors compared ing published screens (Wang et al., 2014, 2015), we identified out- with other sgRNAs targeting the same gene. A single outlier that lier sgRNAs and uncovered their sequence features to inform future does not fit the assumed distributions can overly influence the esti- library design. We further showed stronger off-target cleavages con- mations of the beta score. Therefore, we tried to identify these out- tribute to the outlier behaviors. We also found a strong bias in liers using three-step approach: candidate outlier prediction, CRISPR screen when normalizing read counts with commonly used candidate outlier validation and outlier detection. non-targeting sgRNAs and proposed an alternative normalization to mitigate such bias. We performed custom-designed screens to valid- ate these findings, and further explored sgRNA design rules that can Step-1: Candidate outlier prediction improve the screening results, including the optimal spacer length A sgRNA is likely to be an outlier if its log fold is extremely different for higher cutting efficiencies and better signal-to-noise ratios. from other sgRNAs. Therefore, in the first step, candidate outlier Finally, we designed a genome-wide CRISPR/Cas9 screening library prediction, we identified the potential sgRNAs outliers by consider- based on these new rules and demonstrated its performance in iden- ing their LFCs. For each paired condition, we calculated the median tifying known essential genes in different cell types. and SD of the LFCs and defined the candidate outliers if their LFCs fall beyond median6 1.5 SD estimation (rÞ. Specifically, we fol- lowed the ‘quantile matching’ approach in DESeq2 (Love et al., 2014): r is chosen such that the (1–p) empirical quantile of the abso- lute values of LFC (Q ) matches the (1–p/2) theoretical quantile 2 Materials and methods jLFCj of N 0; r (Q ), where p is set as 0.32: 2.1 The MAGeCK and MAGeCK-VISPR model Our laboratory has previously developed algorithms MAGeCK and Q ðÞ 1  p jLFCj r ¼ MAGeCK-VISPR for identifying CRISPR screen hits in different Q 1 scenarios(Li et al., 2014, 2015). In two-condition comparisons, Note that for a distribution with a long tail, the traditional estima- MAGeCK uses a negative binomial (NB) model to assess the degree tion of SD will be distorted. Assuming that samples with beta scores of selections of individual sgRNAs and adopts robust rank aggrega- close to 0 follows normal distribution, we set a value of p ¼ 0.32 to tion (RRA) algorithm (Kolde et al., 2012) to aggregate multiple calculate SD using only the 68% of samples (samples within 1 SD) sgRNAs on a gene to evaluate gene selection. MAGeCK-VISPR (Li closes to zero. In this way, the samples with beta scores far from et al., 2015) further quantitatively estimates gene selections by opti- zero will not distort the estimation of SD. mizing a joint likelihood function of observing the read counts of different sgRNAs with varying behaviors in multiple conditions. The output of MAGeCK-VISPR is a ‘beta score’ for gene g in condi- Step-2: Candidate outlier in silico validation tion r, b , analogous to the ‘log fold change (LFC)’ in differential gr Noticing that a sgRNA outlier may significantly influence the beta gene expression analysis. More specifically, the read count of score estimation, a candidate outlier is validated if there is a signifi- sgRNA i in sample j,orK ; is modeled as: ij cant change of beta score; b , after removing the candidate outlier. gr K  NB l ; a ij i Therefore, in the second step, the candidate outlier in silico valid- ij ation, we calculated the beta score with and without the candidate Where l and a are the mean and over-dispersion factor of the NB ij outlier respectively using Equation (1). Define: distribution, respectively. The mean value l is further modeled as: ij raw b ¼ b , when all sgRNAs are used; gr b ¼ b , when sgRNA i is excluded. gr l ðbÞ¼ s exp d b j jr ij gr Then candidate outlier i is in silico validated if: raw i Where s is the size factor of sample j for adjusting sequencing log absðÞ b =abs b >ðÞ 5  0:2number of sgRNAs depths of the samples, and b is the vector of all beta scores for gene g. To deal with complex experimental settings, we included design With outlier removal, we could prevent the beta score estimation matrix (D). With J samples affected by R conditions, D is a binary from distortion by strong outliers. Improved design and analysis of CRISPR knockout screens 4097 Step-3: Outlier detection (a) (b)(c) With previous two steps, we could estimate the beta scores robustly. However, some moderate outliers cannot be identified if sufficient sgRNAs prevent the beta score from distortion by a single outlier. Therefore, with robust estimators of beta scores, in the final step, we re-defined a sgRNA as an outlier if the probability of observing its count conditioned on pre-calculated beta score falls below a cer- tain threshold. In other words, sgRNA i is an outlier if: (d)(e)(f) X * logf K ; l ðbÞ; a < T ij i NB ij where f is the PDF of the NB distribution. The threshold T was NB determined such that 90% of the validated outliers defined in Step 2 can be removed. Fig. 1. Identifying and characterizing stronger negative sgRNAs outliers. (a, b) 2.3 Extracting sequence features using elastic-net The LFCs of 10 sgRNAs targeting FARP1 and RPSA in 4 screens (KBM7, K562, regression Jiyoye and Raji). The red lines represent sgRNAs outliers, and the blue lines represent other sgRNAs. (c) Identifying and removing aberrantly stronger To identify the sequence features that associate with stronger negative outliers (red dots). Each row of dots represents the LFCs of sgRNAs sgRNA outliers, we applied Elastic-Net regression to extract the se- targeting the same gene. (d) The G-nucleotide counts of sgRNAs in three quence features as our previous work (Xu et al., 2015). Suppose X ¼ groups: stronger negative outliers (red), non-outliers (blue) and all sgRNAs fX ; X ; .. . ; X g is the set of encoded sequence vectors and 1 2 n (green). (e) The sequence features of stronger negative outliers versus non- Y ¼fY ; Y ; ... ; Y g is the set of outputs representing whether the 1 2 n outliers derived by elastic-net regression. The ‘Seed’ and ‘Non-seed’ regions sgRNAs are stronger outliers, where n is the number of sgRNAs are defined as a 10-nucleotide window proximal to and distal from the PAM motif, respectively. The data for Figure 1a–e is from a public screening data- samples for training. If sgRNA i is an outlier, the corresponding set (Wang et al., 2015). (f) The knockout of CD33 expression with different Y ¼ 1 and 0 otherwise. Let M be the length of the input vectors; groups of sgRNAs. The ‘Perfect Match’ are 65 perfect-match sgRNAs with an the Elastic-Net regression computes the parameters NGG PAM that produced effective CD33 knockout defined in (Doench et al., b ¼½ b ; b ; .. . ; b that minimize an object function E: 1 2 M 2016). The ‘Negative Controls’ are the same set of sgRNAs with non-NGG PAM. Those in ‘5 ‘G’s in Non-seed region’ and ‘<5 ‘G’ in Non-seed region’ T 2 1 2 E ¼jjY  b Xjj þ k ajjbjj þð1  aÞjjbjj are sgRNAs with an NGG PAM but 1-nt mismatch compared to the ‘Perfect Match’ sgRNAs Where a and k are parameters estimated using cross validation, jjbjj ¼ X X 2 2 jb j and jjbjj ¼ b .We usedglmnetinRpackagetoimple- i i ii i ment the Elastic-Net regression (Friedman et al., 2010). Based on the MAGeCK-VISPR model, we implemented an ap- proach to identify such outliers, which tests whether one sgRNA has 2.4 CRISPR screening design and experimental big effects on the gene-level beta score estimates or the probability procedure of observing the sgRNA conditioned on the gene-level beta score is We designed and performed a CRISPR screening experiment to low (Section 2). This outlier detection and removal approach did study the effects of different normalization methods and different identify sgRNAs with aberrant LFC on a gene (Fig. 1c). In published sgRNA lengths. The screening library has four types of sgRNAs: screens on four leukemia cell lines (Wang et al., 2015), 9000 out of sgRNAs targeting AAVS1 (a region whose disruption does not have 182 K sgRNAs on average were identified as outliers. Among them, any lethal phenotype), non-targeting sgRNAs, sgRNAs targeting 51 911 sgRNAs are outliers that are consistent in all four screens ribosomal genes and 503 cancer-related genes that are considered to (Supplementary Fig. S1a) and 80% of these outliers (729/911) are be lethal. The details of the library design and the experiment are in ‘strong negative outliers’ with stronger negative selection compared Supplementary Material. with other gRNAs on the same gene (as Fig. 1a). To rule out the pos- sibility that these sgRNAs knockout their intended targets with ex- tremely high efficiencies, we further limited our analysis to 564 3 Results outliers (Supplementary Table S1) that target known non-essential 3.1 sgRNAs outlier identification and characterization genes (Hart and Moffat, 2016), as inactivating these genes is unlike- Different sgRNAs targeting the same gene can lead to varying phe- ly to affect cell growth. notypes or selection levels in the screen due to different cleavage and Comparing the sequence features of these 564 ‘strong negative repair efficiencies, local chromatin structure, protein domains and outliers’ with all 18 000 sgRNAs in the library, we found that they potential off-target effects, etc. (Hsu et al., 2013; Knight et al., have higher G-nucleotide but lower C-nucleotide counts in the target 2015; Shi et al., 2015). Some sgRNAs with outlier phenotypes com- DNA sequence (Fig. 1d, Supplementary Fig. S1b–d). To identify po- pared with other sgRNAs on the same gene, regardless of the causes, tential sequence features that can distinguish outliers and non- behave consistently in multiple screen conditions (Wang et al., outliers, we trained an elastic net model (Friedman et al., 2010), a 2015; Fig. 1a and b), suggesting that the discrepant phenotypes regularized regression method that considers both the L1 and L2 could arise from intrinsic features of the sgRNA in addition to ran- penalties of the lasso and ridge methods. In the training dataset, the dom variances in the experiments. We are especially interested in predictor variable is a binary vector representing the presence or ab- ‘strong negative outliers’ (as Fig. 1a), which are defined as having sence of the nucleotides, and the response variable is a binary vari- much larger negative LFCs compared with other sgRNAs targeting able indicating whether the gRNA is an outlier. Our model showed the same gene and are more likely caused by off-target cleavages. that outliers tend to contain more G-nucleotides in the 10-nucleotide 4098 C.-H.Chen et al. non-seed region distal from the PAM motif (Fig. 1e). To exclude (a)(b) possible biases of a single library, we confirmed our finding using another screen dataset (Meyers et al., 2017; Supplementary Fig. S2a–b). We further tested our predictive model on other CRISPR- Cas9 knockout (Wang et al., 2014) or CRISPR-dCas9 inhibition screening (Horlbeck et al., 2016) datasets. The output of the model is an ‘outlier score’, indicating how likely the input sgRNA is an out- lier. We found that ‘strong negative outliers’ in both datasets have significantly higher outlier scores than non-outliers (Supplementary (c)(d) Fig. S2c and d), suggesting outlier features we found are consistent across different datasets. These findings also suggest that a better CRISPR sgRNA design should at least avoid extreme G content in the non-seed region in case of potential off-target effects. Considering that strong off-target activities can lead to ‘strong negative outliers’, we re-analyzed a previous study that measured the off-target activities between mismatched sgRNA: DNA pairs, defined as the decrease of CD33 protein level by sgRNAs with 1 nu- cleotide mismatch compared to the target DNA in CD33 locus (Doench et al., 2016). Instead of modeling off-target activities as Fig. 2. Normalizing read counts using sgRNAs targeting non-essential genes functions of mismatched nucleotide pair and position as in (Doench or AAVS1. (a-b) The distribution of beta scores in public dataset (Wang et al., et al., 2016), we tested how the nucleotide compositions in the Non- 2015) using non-targeting sgRNAs (a) and sgRNAs targeting non-essential seed region affect the off-target activities. SgRNAs with more ‘G’s genes (b) for normalization. (c) The LFC distribution of 349 non-targeting (5) in Non-seed region have significantly higher off-target activ- sgRNAs, 467 non-essential genes-targeting sgRNAs, 133 AAVS1-targeting ities than those with fewer ‘G’s (<5) (Fig. 1f). In contrast, there is no sgRNAs and 725 essential genes-targeting sgRNAs. P-values were calculated difference in off-target activities between sgRNAs with more (5) using two-sided Student’s t-test. (d) The distribution of beta score using all sgRNAs (black), non-essential genes-targeting sgRNAs (green), AAVS1- and fewer (<5) ‘C’s (Supplementary Fig. S2e). These findings sug- targeting sgRNAs (red) and non-targeting sgRNAs (blue) for normalizing read gest sgRNAs targeting sequences with high G-content in the non- counts, respectively seed region have stronger off-target activities, which can lead to strong outlier phenotypes. harbor’ site preferred for gene knock-ins (DeKelver et al., 2010; 3.2 SgRNAs targeting multiple non-essential genes as Sadelain et al., 2011). This region appears to be epigenetically open for efficient cleavage, yet cutting or modification at this site results negative controls reduce false positives in the screen in no phenotypic changes (Ogata et al., 2003). To test whether Correct interpretations of genome-wide screens require proper read sgRNAs targeting AAVS1 could serve as good negative controls, we count normalization. Since most sgRNAs should generate knockouts first designed a genome-wide screen library containing 134 AAVS1- without causing phenotype, a straightforward approach is to nor- targeting sgRNAs, 349 non-targeting sgRNAs, as well as five malize based on the total read counts of all sgRNAs (Love et al., sgRNAs per gene in the human genome and performed screening in 2014; ‘total normalization’). Alternatively, many screen libraries in- a prostate cancer LNCaP-abl cell line. SgRNAs targeting AAVS1 or clude ‘non-targeting’ negative control sgRNAs, which match no- non-essential genes induced similar LFCs that are stronger than non- where in the genome, for normalization (‘non-targeting sgRNA targeting sgRNAs, confirming the existence of cleavage toxicity in normalization’). In public datasets (Wang et al., 2014, 2015), ‘total non-essential regions (Fig. 2c). Also, by comparing normalization normalization’ resulted in a beta-score distribution centered on zero methods using different sets of sgRNAs (all, non-targeting, AAVS1- (Supplementary Fig. S3a), while ‘non-targeting sgRNA normaliza- targeting and non-essential-gene-targeting sgRNAs, respectively), tion’ led to a skewed distribution of beta scores where most of the we found normalization using the AAVS1- and non-essential-genes genes appear as negatively selected (Fig. 2a). The bias of ‘non-target- targeting sgRNAs result in almost identical distributions of beta ing sgRNA normalization’ is introduced when sgRNAs targeting scores (Fig. 2d). Moreover, both ‘all sgRNA normalization’ and non-essential genes impede cell growth from genome cleavage tox- ‘non-targeting sgRNA normalization’ lead to biases, though to dif- icity (Aguirre et al., 2016; Munoz et al., 2016), regardless of the ferent degrees (Fig. 2d). Since normalization using control guides is gene knockout effects. Therefore, a more appropriate choice of an essential step in many computational methods including negative controls is a set of sgRNAs targeting non-essential DNA MAGeCK-VISPR and CRISPR Score (CS; Wang et al., 2014), the regions. These sgRNAs have already been included in recent library results of these methods will also be affected by the choice of nega- design (Wang et al., 2017). Indeed, when normalizing read counts tive controls (Supplementary Fig. S3b). While methods that only using sgRNAs targeting the ‘gold standard’ 927 non-essential genes rely on gRNA ranks such as MAGeCK-RRA (Li et al., 2014) will previously derived from pooled shRNA screens (Hart et al., 2014), not be affected, the rankings could not clearly distinguish genes that the beta score distribution is centered on zero (Fig. 2b). are negatively, positively, or not selected, which are important when In genome-wide screens, normalizations using either sgRNAs comparing screens over multiple conditions. targeting non-essential genes or all genes lead to similar results To evaluate the normalization methods in a focused screen, we (Fig. 2b, Supplementary Fig. S3a), as the majority of the genes are also designed a small screening library that targets 600 genes, assumed to be non-essential. Such assumption may fail in focused including ribosomal genes and well-known cancer-related genes (or custom) screens where many targeted genes may be under selec- (Section 2, Supplementary Tables S2 and S3). The library also tion, which necessitates the selection of better negative control includes the same set of AAVS1-targeting and non-targeting sgRNAs. sgRNAs. AAVS1 (adeno-associated virus integration site 1) is a ‘safe Improved design and analysis of CRISPR knockout screens 4099 Similar to genome-wide screens, AAVS1-targeting sgRNAs induced sgRNAs, as well as five sgRNAs targeting each gene in the human stronger negative selections compared with non-targeting sgRNAs genome. After removing sgRNAs that are enriched in G-nucleotide (Supplementary Fig. S3c). Furthermore, using AAVS1-targeting (>40%) and have perfect matches to other coding regions, we pri- sgRNAs as negative controls in our MAGeCK algorithm substantially oritized the remaining sgRNAs based on their predicted cleavage increases the sensitivity of the screen, while keeping the same level of efficiencies(Xu et al., 2015) and the number of perfect matches in false positives (Supplementary Fig. S3d). These results validated the the whole genome (see Methods). We conducted screens in LNCaP, applicability of including AAVS1-targeting sgRNAs in genome-wide, abl and T47D cell lines using the H1/H2 library and compared to and more importantly in focused screen libraries. other genome-wide screen datasets, including Brunello library (Doench et al., 2016), TKO library (Hart et al., 2015) and Ong li- brary (Ong et al., 2017). We found H1/H2 is among the libraries 3.3 19 nt spacers give rise to higher cutting efficiencies with fewest outlier sgRNA rates (Supplementary Fig. S4b). and better signal-to-noise ratio Assuming that a good library should be able to rank known essential In spCas9 gene editing systems, truncated sgRNAs have been genes as most negatively selected ones, we found that H1/H2, reported to have a better cleavage specificity compared with full- Brunello and Ong libraries outperformed GeCKOv2 and TKO in length sgRNAs (Fu et al., 2014). However, the performances of identifying known essential genes (Supplementary Fig. S4c–d). truncated sgRNAs in screens compared with full-length sgRNAs, as These results provide support for our refined CRISPR screen library well as the optimal length of truncated sgRNAs, have yet to been design rules. fully determined. Therefore, in our small screening library, we designed sgRNAs with 20 nt spacers for each ribosomal gene and AAVS1-targeting sgRNAs and then truncated them to 19 nt, 18 nt and 17 nt (Section 2). We found that 19 nt sgRNAs give significantly 4 Discussion stronger LFCs in ribosomal genes, reflecting higher cleavage efficien- The CRISPR-cas9 knockout screen has been used to interrogate the cies (Fig. 3a). If we use the difference between positive-control functions of coding genes and non-coding elements systemically, but sgRNAs (sgRNAs targeting ribosomal genes) and negative-control library design is still in their early stage. We first applied MAGeCK- sgRNA (AAVS1-targeting sgRNAs) as a metric for signal-to-noise, VISPR to public genome-wide screen data and identified a set of 19 nt spacers on average give the best performance (Supplementary ‘strong negative outlier’ sgRNAs and their sequence characteristics: Fig. S4a) in 11 of 12 screens. Moreover, for each ribosomal gene, higher G-nucleotide counts especially in regions distal from PAM 19 nt sgRNAs gave lower relative SD (i.e. SD divided by mean; motif. Unexpectedly, the effect of the outliers is independent of the Supplementary Material.) of LFCs, indicating a more stable behav- count of C-nucleotide, different from previous studies that suggest ior (and potentially less off-target cleavages) of gene knockout the role of ‘GC’ content in determining cleavage efficiencies effects (Fig. 3b). (Doench et al., 2014; Haeussler et al., 2016; Wang et al., 2014). Since G-C hybridization strengths in DNA-RNA and RNA-DNA 3.4 A new genome-wide library improved screen hybrids are similar, the distinct effect of G- and C-nucleotides sug- performance gests a more crucial role of DNA-endonuclease rather than DNA- Using the rules we uncovered in this study and our previous work RNA interaction in determining outlier effects. Moreover, sgRNAs (Xu et al., 2015), we designed two sub-libraries that target 18, 493 with higher G-contents in regions distal from PAM motif have stron- human coding genes (named ‘H1’ and ‘H2’; Supplementary Tables ger off-target activities. It is worth noting that the off-target activity S4, S5). Each sub-library includes sgRNAs with 19 nt-long spacers of each sgRNA in Figure 1f was measured between one sgRNA- and contains 134 AAVS1-targeting sgRNAs, 349 non-targeting DNA pair, and the seemly minor difference between sgRNAs with high and low G-contents will be multiplied by the enormous mis- matched sgRNA-DNA pairs in the genome and lead to sgRNA out- (a)(b) liers in screens. Although toxicity from CRISPR cutting has been reported, using non-targeting control for normalization is still a common practice in published literature (Aguirre et al., 2016; Wang et al., 2014). We found that normalization using non-targeting sgRNAs, as compared to using all sgRNAs or sgRNAs targeting non-essential genes, could lead to higher false positives (Supplementary Fig. S3d) in calling essential genes. The reason might be because cleavages in non- essential regions can still induce toxicity in cell growth, in consist- ency with two recent studies showing false positive hits from highly amplified regions in cancer genomes (Aguirre et al., 2016; Munoz et al., 2016). Through CRISPR screening experiments, we confirmed Fig. 3. Comparing cleavage efficiencies and signal-to-noise ratios between that sgRNAs targeting non-essential genes or safe-harbor regions different lengths of sgRNA spacers. (a) The LFCs of sgRNAs with spacer could serve as better negative controls and result in fewer false posi- lengths ranging from 17- to 20-nts, including non-targeting sgRNAs and tives compared with non-targeting sgRNAs. Since a single chroma- sgRNAs targeting ribosomal genes. For each spacer length, there are 100 tin region may be subject to copy number variations in different cell non-targeting sgRNAs and 1020 ribosomal genes-targeting sgRNAs. P-values types, sgRNAs targeting multiple non-essential regions will serve as were calculated using two-sided Student’s t-test. (b) The relative SD of LFCs more robust negative controls. For instance, only 5% (57/1, 043) of sgRNAs targeting ribosomal genes with spacer lengths ranging from 17- to CCLE cell lines have copy number gains in AAVS1 locus, such as 20-nts. There are 612 data points (51 ribosomal genes repeated in 12 screens) for each spacer length. P-values were calculated using two-sided Student’s HCC1937 and MDAMB157, suggesting that though chance is low, t-test caution should be used when using single region as negative 4100 C.-H.Chen et al. DeKelver,R.C. et al. (2010) Functional genomics, proteomics, and regulatory controls. Including correct negative controls is also necessary for DNA analysis in isogenic settings using zinc finger nuclease-driven transgen- custom-designed screens where genes are pre-selected and normal- esis into a safe harbor locus in the human genome. Genome Res., 20, ization using total read counts is inappropriate. We proposed a solu- 1133–1142. tion to reduce the biases by using either multiple non-essential genes Diao,Y. et al. (2016) A new class of temporarily phenotypic enhancers identi- or AAVS1-targeting guides. fied by CRISPR/Cas9-mediated genetic screening. Genome Res., 26, Finally, sgRNAs with shorter lengths have been shown to be po- 397–405. tent in efficiency and specificity (Fu et al., 2014), but the optimal Doench,J.G. et al. (2016) Optimized sgRNA design to maximize activity performance of truncated sgRNAs with different lengths has not and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol., 34, been systematically investigated in screen setting. We discovered 184–191. that 19 nt sgRNAs consistently provide better cleavage efficiencies Doench,J.G. et al. (2014) Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol., 32, 1262–1267. and signal-to-noise separations compared with other lengths (17, Friedman,J. et al. (2010) Regularization paths for generalized linear models 18, 20 nt). Therefore, using 19 nt sgRNAs in either low-throughput via coordinate descent. J. Stat. Softw., 33, 1–22. experiments or high-throughput screens may give rise to a more ac- Fu,Y. et al. (2014) Improving CRISPR-Cas nuclease specificity using truncated curate inference of gene knockout effects. guide RNAs. Nat. Biotechnol., 32, 279–284. We demonstrated that H1/H2 libraries have improved perform- Haeussler,M. et al. (2016) Evaluation of off-target and on-target scoring algo- ance in identifying known essential genes with less outlier sgRNAs. rithms and integration into the guide RNA selection tool CRISPOR. However, the fact that comparisons were not performed in the same Genome Biol., 17, 148. cellular context might contribute to the observed differences. Also, Hart,T. et al. (2014) Measuring error rates in genomic perturbation screens: since different libraries used distinct approaches to improve screen gold standards for human functional genomics. Mol. Syst. Biol., 10, 733. performance, integrating their respective advantages might further Hart,T. et al. (2015) High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell, 163, 1515–1526. improve the next generation library design. Hart,T. and Moffat,J. (2016) BAGEL: a computational framework for identi- Although we characterized multiple features of CRISPR screens fying essential genes from pooled library screens. BMC Bioinformatics, using computational approaches, the exact mechanisms behind 17, 164. these findings remain unknown. First, it is unclear how sgRNAs Horlbeck,M.A. et al. (2016) Compact and highly active next-generation with higher G-nucleotide content are associated with stronger out- libraries for CRISPR-mediated gene repression and activation. Elife, 5, liers. We suspected that outlier gRNAs with high G-nucleotides e19760. have promiscuous off-target binding and cutting at many CpG Hsu,P.D. et al. (2013) DNA targeting specificity of RNA-guided Cas9 nucle- islands in the genome. Existing experimental approaches to detect ases. Nat. Biotechnol., 31, 827–832. off-target cleavages (Tsai et al.,2015) may be limited to study Jiang,P. et al. (2015) Network analysis of gene essentiality in functional gen- these gRNAs, as the cleavages in each binding site may be low. omics experiments. Genome Biol., 16, 239. Jinek,M. et al. (2012) A programmable dual-RNA-guided DNA endonuclease Second, although we have shown the advantages of using 19 bp in adaptive bacterial immunity. Science, 337, 816–821. sgRNA spacers from statistical perspectives, how different lengths Knight,S.C. et al. (2015) Dynamics of CRISPR-Cas9 genome interrogation in of sgRNA spacers give rise to various cleavage strengths and living cells. Science, 350, 823–826. off-targets remain to be determined. Last but not least, all the Koike-Yusa,H. et al. (2014) Genome-wide recessive genetic screening in mam- above findings are derived in the SpCas9 system, and the rules in malian cells with a lentiviral CRISPR-guide RNA library. Nat. Biotechnol., different RNA-guided DNA endonuclease systems require further 32, 267–273. investigations. Kolde,R. et al. (2012) Robust rank aggregation for gene list integration and Collectively, our study provided novel insights into the meta-analysis. Bioinformatics, 28, 573–580. properties of CRISPR and the design of both high- and low Korkmaz,G. et al. (2016) Functional genetic screens for enhancer elements throughput CRISPR experiments. We designed two genome-wide in the human genome using CRISPR-Cas9. Nat. Biotechnol., 34, 192–198. libraries and showed the improved performance using the rules we Li,W. et al. (2015) Quality control, modeling, and visualization of CRISPR uncovered. The characterized features and design rules, as well screens with MAGeCK-VISPR. Genome Biol., 16, 281. as the libraries, will benefit and expedite the application of Li,W. et al. (2014) MAGeCK enables robust identification of essential genes CRISPR techniques. from genome-scale CRISPR/Cas9 knockout screens. Genome Biol., 15, Love,M.I. et al. (2014) Moderated estimation of fold change and dispersion Funding for RNA-seq data with DESeq2. Genome Biol., 15, 550. Mali,P. et al. (2013) CAS9 transcriptional activators for target specificity This work was supported by the NIH R01 HG008728, NIH R01 HG008927 screening and paired nickases for cooperative genome engineering. Nat. and the Breast Cancer Research Foundation (BCRF). Funding for open access Biotechnol., 31, 833–838. charge: National Institutes of Health. WL is supported in part by funds of the Center for Genetic Medicine Research and the Gilbert Family Meyers,R.M. et al. (2017) Computational correction of copy number effect Neurofibromatosis Institute at Children’s National Health System. improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet., 49, 1779–1784. Conflict of Interest: none declared. Morgens,D.W. et al. (2017) Genome-scale measurement of off-target activity using Cas9 toxicity in high-throughput screens. Nat. Commun., 8, 15178. Munoz,D.M. et al. (2016) CRISPR screens provide a comprehensive assess- References ment of cancer vulnerabilities but generate false-positive hits for highly Aguirre,A.J. et al. (2016) Genomic copy number dictates a gene-independent amplified genomic regions. Cancer Discov., 6, 900. cell response to CRISPR-Cas9 targeting. Cancer Discov., 6, 914. Ogata,T. et al. (2003) Identification of an insulator in AAVS1, a preferred Canver,M.C. et al. (2015) BCL11A enhancer dissection by Cas9-mediated in region for integration of adeno-associated virus DNA. J. Virol., 77, situ saturating mutagenesis. Nature, 527, 192–197. 9000–9007. Cong,L. et al. (2013) Multiplex genome engineering using CRISPR/Cas sys- Ong,S.H. et al. (2017) Optimised metrics for CRISPR-KO screens with tems. Science, 339, 819–823. second-generation gRNA libraries. Sci. Rep., 7, 7384. Improved design and analysis of CRISPR knockout screens 4101 Parnas,O. et al. (2015) A genome-wide CRISPR screen in primary immune Wang,T. et al. (2014) Genetic screens in human cells using the CRISPR-Cas9 cells to dissect regulatory networks. Cell, 162, 675–686. system. Science, 343, 80–84. Sadelain,M. et al. (2011) Safe harbours for the integration of new DNA in the Wang,T. et al. (2017) Gene essentiality profiling reveals gene networks human genome. Nat. Rev. Cancer, 12, 51–58. and synthetic lethal interactions with oncogenic Ras. Cell, 168, 890–903, Shalem,O. et al. (2014) Genome-scale CRISPR-Cas9 knockout screening in e815. human cells. Science, 343, 84–87. Xu,H. et al. (2015) Sequence determinants of improved CRISPR sgRNA de- Shi,J. et al. (2015) Discovery of cancer drug targets by CRISPR-Cas9 screening sign. Genome Res., 25, 1147–1157. of protein domains. Nat. Biotechnol., 33, 661–667. Zhou,Y. et al. (2014) High-throughput screening of a CRISPR/Cas9 library Tsai,S.Q. et al. (2015) GUIDE-seq enables genome-wide profiling of off-target for functional genomics in human cells. Nature, 509, 487–491. cleavage by CRISPR-Cas nucleases. Nat. Biotechnol., 33, 187–197. Zhu,S. et al. (2016) Genome-scale deletion screening of human long Wang,T. et al. (2015) Identification and characterization of essential genes in non-coding RNAs using a paired-guide RNA CRISPR-Cas9 library. Nat. the human genome. Science, 350, 1096–1101. Biotechnol., 34, 1279–1286. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Improved design and analysis of CRISPR knockout screens

Loading next page...
 
/lp/ou_press/improved-design-and-analysis-of-crispr-knockout-screens-Ai01Nc80u0

References (45)

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected]
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bty450
Publisher site
See Article on Publisher Site

Abstract

Motivation: Genome-wide clustered, regularly interspaced, short palindromic repeat (CRISPR)- Cas9 screen has been widely used to interrogate gene functions. However, the rules to design bet- ter libraries beg further refinement. Results: We found single guide RNA (sgRNA) outliers are characterized by higher G-nucleotide counts, especially in regions distal from the PAM motif and are associated with stronger off-target activities. Furthermore, using non-targeting sgRNAs as negative controls lead to strong bias, which can be mitigated by using sgRNAs targeting multiple ‘safe harbor’ regions. Custom-designed screens confirmed our findings and further revealed that 19 nt sgRNAs consistently gave the best signal-to-noise ratio. Collectively, our analysis motivated the design of a new genome-wide CRISPR/Cas9 screen library and uncovered some intriguing properties of the CRISPR-Cas9 system. Availability and implementation: The MAGeCK workflow is available open source at https://bit bucket.org/liulab/mageck_nest under the MIT license. Contact: [email protected] or [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction CRISPR-Cas9 loss-of-function screens can interrogate the functions The clustered, regularly interspaced, short palindromic repeat of coding genes (Koike-Yusa et al., 2014; Shalem et al., 2014; (CRISPR)-Cas9 system is a new genome editing technology that Wang et al., 2014; Zhou et al., 2014) and non-coding elements becomes prominent in many biomedical research areas. In this sys- (Canver et al., 2015; Korkmaz et al., 2016; Zhu et al., 2016), and tem, single guide RNAs (sgRNAs) direct Cas9 nucleases to induce generate hypotheses on cell dependency, drug response, and gene double-strand breaks at targeted genomic regions (Cong et al., regulation in a high-throughput and unbiased manner (Diao et al., 2013; Jinek et al., 2012; Mali et al., 2013). Based on this system, 2016; Hart et al., 2015; Parnas et al., 2015; Wang et al., 2015). V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] 4095 4096 C.-H.Chen et al. From a computational biology perspective, several algorithms have matrix with its element d ¼ 1 if sample j is affected by condition r jr been developed to characterize sgRNAs with high specificity and ef- and 0 otherwise. The objective function is a form of regularization: ficiency (Doench et al., 2016; Doench et al., 2014; Hsu et al., 2013; X * * Xu et al., 2015) that can be used in designing CRISPR screen libra- b ¼ argmax logf K ; l ðbÞ; a þ K b ðÞ 1 ij i gr NB ij ij ries. Despite these efforts, methods for designing CRISPR screens are still being refined from different aspects. First, sgRNA outliers, Where f is the probabilistic density function (PDF) of the NB dis- NB or sgRNAs with discrepant behaviors from other sgRNAs targeting tribution, and the same gene, are common in screen data, but their features and mechanisms remain poorly characterized. Second, it’s known that * gr KðbÞ¼ spacer length may vary in the CRISPR-Cas9 system (Fu et al., 2014; 2r r r Morgens et al., 2017), but the optimal length was only studied in The estimated SD, r , was calculated using the naive estimators single guide and single target. Furthermore, it remains unclear how r of b . spacer lengths affect signal-to-noise ratio (the extent of the fold gr changes of guides compared to their variances) in the screening settings. We studied both issues based on the MAGeCK-VISPR model we 2.2 Identifying sgRNA outliers previously developed (Jiang et al., 2015; Li et al., 2015). By examin- sgRNA outliers are those that have different behaviors compared ing published screens (Wang et al., 2014, 2015), we identified out- with other sgRNAs targeting the same gene. A single outlier that lier sgRNAs and uncovered their sequence features to inform future does not fit the assumed distributions can overly influence the esti- library design. We further showed stronger off-target cleavages con- mations of the beta score. Therefore, we tried to identify these out- tribute to the outlier behaviors. We also found a strong bias in liers using three-step approach: candidate outlier prediction, CRISPR screen when normalizing read counts with commonly used candidate outlier validation and outlier detection. non-targeting sgRNAs and proposed an alternative normalization to mitigate such bias. We performed custom-designed screens to valid- ate these findings, and further explored sgRNA design rules that can Step-1: Candidate outlier prediction improve the screening results, including the optimal spacer length A sgRNA is likely to be an outlier if its log fold is extremely different for higher cutting efficiencies and better signal-to-noise ratios. from other sgRNAs. Therefore, in the first step, candidate outlier Finally, we designed a genome-wide CRISPR/Cas9 screening library prediction, we identified the potential sgRNAs outliers by consider- based on these new rules and demonstrated its performance in iden- ing their LFCs. For each paired condition, we calculated the median tifying known essential genes in different cell types. and SD of the LFCs and defined the candidate outliers if their LFCs fall beyond median6 1.5 SD estimation (rÞ. Specifically, we fol- lowed the ‘quantile matching’ approach in DESeq2 (Love et al., 2014): r is chosen such that the (1–p) empirical quantile of the abso- lute values of LFC (Q ) matches the (1–p/2) theoretical quantile 2 Materials and methods jLFCj of N 0; r (Q ), where p is set as 0.32: 2.1 The MAGeCK and MAGeCK-VISPR model Our laboratory has previously developed algorithms MAGeCK and Q ðÞ 1  p jLFCj r ¼ MAGeCK-VISPR for identifying CRISPR screen hits in different Q 1 scenarios(Li et al., 2014, 2015). In two-condition comparisons, Note that for a distribution with a long tail, the traditional estima- MAGeCK uses a negative binomial (NB) model to assess the degree tion of SD will be distorted. Assuming that samples with beta scores of selections of individual sgRNAs and adopts robust rank aggrega- close to 0 follows normal distribution, we set a value of p ¼ 0.32 to tion (RRA) algorithm (Kolde et al., 2012) to aggregate multiple calculate SD using only the 68% of samples (samples within 1 SD) sgRNAs on a gene to evaluate gene selection. MAGeCK-VISPR (Li closes to zero. In this way, the samples with beta scores far from et al., 2015) further quantitatively estimates gene selections by opti- zero will not distort the estimation of SD. mizing a joint likelihood function of observing the read counts of different sgRNAs with varying behaviors in multiple conditions. The output of MAGeCK-VISPR is a ‘beta score’ for gene g in condi- Step-2: Candidate outlier in silico validation tion r, b , analogous to the ‘log fold change (LFC)’ in differential gr Noticing that a sgRNA outlier may significantly influence the beta gene expression analysis. More specifically, the read count of score estimation, a candidate outlier is validated if there is a signifi- sgRNA i in sample j,orK ; is modeled as: ij cant change of beta score; b , after removing the candidate outlier. gr K  NB l ; a ij i Therefore, in the second step, the candidate outlier in silico valid- ij ation, we calculated the beta score with and without the candidate Where l and a are the mean and over-dispersion factor of the NB ij outlier respectively using Equation (1). Define: distribution, respectively. The mean value l is further modeled as: ij raw b ¼ b , when all sgRNAs are used; gr b ¼ b , when sgRNA i is excluded. gr l ðbÞ¼ s exp d b j jr ij gr Then candidate outlier i is in silico validated if: raw i Where s is the size factor of sample j for adjusting sequencing log absðÞ b =abs b >ðÞ 5  0:2number of sgRNAs depths of the samples, and b is the vector of all beta scores for gene g. To deal with complex experimental settings, we included design With outlier removal, we could prevent the beta score estimation matrix (D). With J samples affected by R conditions, D is a binary from distortion by strong outliers. Improved design and analysis of CRISPR knockout screens 4097 Step-3: Outlier detection (a) (b)(c) With previous two steps, we could estimate the beta scores robustly. However, some moderate outliers cannot be identified if sufficient sgRNAs prevent the beta score from distortion by a single outlier. Therefore, with robust estimators of beta scores, in the final step, we re-defined a sgRNA as an outlier if the probability of observing its count conditioned on pre-calculated beta score falls below a cer- tain threshold. In other words, sgRNA i is an outlier if: (d)(e)(f) X * logf K ; l ðbÞ; a < T ij i NB ij where f is the PDF of the NB distribution. The threshold T was NB determined such that 90% of the validated outliers defined in Step 2 can be removed. Fig. 1. Identifying and characterizing stronger negative sgRNAs outliers. (a, b) 2.3 Extracting sequence features using elastic-net The LFCs of 10 sgRNAs targeting FARP1 and RPSA in 4 screens (KBM7, K562, regression Jiyoye and Raji). The red lines represent sgRNAs outliers, and the blue lines represent other sgRNAs. (c) Identifying and removing aberrantly stronger To identify the sequence features that associate with stronger negative outliers (red dots). Each row of dots represents the LFCs of sgRNAs sgRNA outliers, we applied Elastic-Net regression to extract the se- targeting the same gene. (d) The G-nucleotide counts of sgRNAs in three quence features as our previous work (Xu et al., 2015). Suppose X ¼ groups: stronger negative outliers (red), non-outliers (blue) and all sgRNAs fX ; X ; .. . ; X g is the set of encoded sequence vectors and 1 2 n (green). (e) The sequence features of stronger negative outliers versus non- Y ¼fY ; Y ; ... ; Y g is the set of outputs representing whether the 1 2 n outliers derived by elastic-net regression. The ‘Seed’ and ‘Non-seed’ regions sgRNAs are stronger outliers, where n is the number of sgRNAs are defined as a 10-nucleotide window proximal to and distal from the PAM motif, respectively. The data for Figure 1a–e is from a public screening data- samples for training. If sgRNA i is an outlier, the corresponding set (Wang et al., 2015). (f) The knockout of CD33 expression with different Y ¼ 1 and 0 otherwise. Let M be the length of the input vectors; groups of sgRNAs. The ‘Perfect Match’ are 65 perfect-match sgRNAs with an the Elastic-Net regression computes the parameters NGG PAM that produced effective CD33 knockout defined in (Doench et al., b ¼½ b ; b ; .. . ; b that minimize an object function E: 1 2 M 2016). The ‘Negative Controls’ are the same set of sgRNAs with non-NGG PAM. Those in ‘5 ‘G’s in Non-seed region’ and ‘<5 ‘G’ in Non-seed region’ T 2 1 2 E ¼jjY  b Xjj þ k ajjbjj þð1  aÞjjbjj are sgRNAs with an NGG PAM but 1-nt mismatch compared to the ‘Perfect Match’ sgRNAs Where a and k are parameters estimated using cross validation, jjbjj ¼ X X 2 2 jb j and jjbjj ¼ b .We usedglmnetinRpackagetoimple- i i ii i ment the Elastic-Net regression (Friedman et al., 2010). Based on the MAGeCK-VISPR model, we implemented an ap- proach to identify such outliers, which tests whether one sgRNA has 2.4 CRISPR screening design and experimental big effects on the gene-level beta score estimates or the probability procedure of observing the sgRNA conditioned on the gene-level beta score is We designed and performed a CRISPR screening experiment to low (Section 2). This outlier detection and removal approach did study the effects of different normalization methods and different identify sgRNAs with aberrant LFC on a gene (Fig. 1c). In published sgRNA lengths. The screening library has four types of sgRNAs: screens on four leukemia cell lines (Wang et al., 2015), 9000 out of sgRNAs targeting AAVS1 (a region whose disruption does not have 182 K sgRNAs on average were identified as outliers. Among them, any lethal phenotype), non-targeting sgRNAs, sgRNAs targeting 51 911 sgRNAs are outliers that are consistent in all four screens ribosomal genes and 503 cancer-related genes that are considered to (Supplementary Fig. S1a) and 80% of these outliers (729/911) are be lethal. The details of the library design and the experiment are in ‘strong negative outliers’ with stronger negative selection compared Supplementary Material. with other gRNAs on the same gene (as Fig. 1a). To rule out the pos- sibility that these sgRNAs knockout their intended targets with ex- tremely high efficiencies, we further limited our analysis to 564 3 Results outliers (Supplementary Table S1) that target known non-essential 3.1 sgRNAs outlier identification and characterization genes (Hart and Moffat, 2016), as inactivating these genes is unlike- Different sgRNAs targeting the same gene can lead to varying phe- ly to affect cell growth. notypes or selection levels in the screen due to different cleavage and Comparing the sequence features of these 564 ‘strong negative repair efficiencies, local chromatin structure, protein domains and outliers’ with all 18 000 sgRNAs in the library, we found that they potential off-target effects, etc. (Hsu et al., 2013; Knight et al., have higher G-nucleotide but lower C-nucleotide counts in the target 2015; Shi et al., 2015). Some sgRNAs with outlier phenotypes com- DNA sequence (Fig. 1d, Supplementary Fig. S1b–d). To identify po- pared with other sgRNAs on the same gene, regardless of the causes, tential sequence features that can distinguish outliers and non- behave consistently in multiple screen conditions (Wang et al., outliers, we trained an elastic net model (Friedman et al., 2010), a 2015; Fig. 1a and b), suggesting that the discrepant phenotypes regularized regression method that considers both the L1 and L2 could arise from intrinsic features of the sgRNA in addition to ran- penalties of the lasso and ridge methods. In the training dataset, the dom variances in the experiments. We are especially interested in predictor variable is a binary vector representing the presence or ab- ‘strong negative outliers’ (as Fig. 1a), which are defined as having sence of the nucleotides, and the response variable is a binary vari- much larger negative LFCs compared with other sgRNAs targeting able indicating whether the gRNA is an outlier. Our model showed the same gene and are more likely caused by off-target cleavages. that outliers tend to contain more G-nucleotides in the 10-nucleotide 4098 C.-H.Chen et al. non-seed region distal from the PAM motif (Fig. 1e). To exclude (a)(b) possible biases of a single library, we confirmed our finding using another screen dataset (Meyers et al., 2017; Supplementary Fig. S2a–b). We further tested our predictive model on other CRISPR- Cas9 knockout (Wang et al., 2014) or CRISPR-dCas9 inhibition screening (Horlbeck et al., 2016) datasets. The output of the model is an ‘outlier score’, indicating how likely the input sgRNA is an out- lier. We found that ‘strong negative outliers’ in both datasets have significantly higher outlier scores than non-outliers (Supplementary (c)(d) Fig. S2c and d), suggesting outlier features we found are consistent across different datasets. These findings also suggest that a better CRISPR sgRNA design should at least avoid extreme G content in the non-seed region in case of potential off-target effects. Considering that strong off-target activities can lead to ‘strong negative outliers’, we re-analyzed a previous study that measured the off-target activities between mismatched sgRNA: DNA pairs, defined as the decrease of CD33 protein level by sgRNAs with 1 nu- cleotide mismatch compared to the target DNA in CD33 locus (Doench et al., 2016). Instead of modeling off-target activities as Fig. 2. Normalizing read counts using sgRNAs targeting non-essential genes functions of mismatched nucleotide pair and position as in (Doench or AAVS1. (a-b) The distribution of beta scores in public dataset (Wang et al., et al., 2016), we tested how the nucleotide compositions in the Non- 2015) using non-targeting sgRNAs (a) and sgRNAs targeting non-essential seed region affect the off-target activities. SgRNAs with more ‘G’s genes (b) for normalization. (c) The LFC distribution of 349 non-targeting (5) in Non-seed region have significantly higher off-target activ- sgRNAs, 467 non-essential genes-targeting sgRNAs, 133 AAVS1-targeting ities than those with fewer ‘G’s (<5) (Fig. 1f). In contrast, there is no sgRNAs and 725 essential genes-targeting sgRNAs. P-values were calculated difference in off-target activities between sgRNAs with more (5) using two-sided Student’s t-test. (d) The distribution of beta score using all sgRNAs (black), non-essential genes-targeting sgRNAs (green), AAVS1- and fewer (<5) ‘C’s (Supplementary Fig. S2e). These findings sug- targeting sgRNAs (red) and non-targeting sgRNAs (blue) for normalizing read gest sgRNAs targeting sequences with high G-content in the non- counts, respectively seed region have stronger off-target activities, which can lead to strong outlier phenotypes. harbor’ site preferred for gene knock-ins (DeKelver et al., 2010; 3.2 SgRNAs targeting multiple non-essential genes as Sadelain et al., 2011). This region appears to be epigenetically open for efficient cleavage, yet cutting or modification at this site results negative controls reduce false positives in the screen in no phenotypic changes (Ogata et al., 2003). To test whether Correct interpretations of genome-wide screens require proper read sgRNAs targeting AAVS1 could serve as good negative controls, we count normalization. Since most sgRNAs should generate knockouts first designed a genome-wide screen library containing 134 AAVS1- without causing phenotype, a straightforward approach is to nor- targeting sgRNAs, 349 non-targeting sgRNAs, as well as five malize based on the total read counts of all sgRNAs (Love et al., sgRNAs per gene in the human genome and performed screening in 2014; ‘total normalization’). Alternatively, many screen libraries in- a prostate cancer LNCaP-abl cell line. SgRNAs targeting AAVS1 or clude ‘non-targeting’ negative control sgRNAs, which match no- non-essential genes induced similar LFCs that are stronger than non- where in the genome, for normalization (‘non-targeting sgRNA targeting sgRNAs, confirming the existence of cleavage toxicity in normalization’). In public datasets (Wang et al., 2014, 2015), ‘total non-essential regions (Fig. 2c). Also, by comparing normalization normalization’ resulted in a beta-score distribution centered on zero methods using different sets of sgRNAs (all, non-targeting, AAVS1- (Supplementary Fig. S3a), while ‘non-targeting sgRNA normaliza- targeting and non-essential-gene-targeting sgRNAs, respectively), tion’ led to a skewed distribution of beta scores where most of the we found normalization using the AAVS1- and non-essential-genes genes appear as negatively selected (Fig. 2a). The bias of ‘non-target- targeting sgRNAs result in almost identical distributions of beta ing sgRNA normalization’ is introduced when sgRNAs targeting scores (Fig. 2d). Moreover, both ‘all sgRNA normalization’ and non-essential genes impede cell growth from genome cleavage tox- ‘non-targeting sgRNA normalization’ lead to biases, though to dif- icity (Aguirre et al., 2016; Munoz et al., 2016), regardless of the ferent degrees (Fig. 2d). Since normalization using control guides is gene knockout effects. Therefore, a more appropriate choice of an essential step in many computational methods including negative controls is a set of sgRNAs targeting non-essential DNA MAGeCK-VISPR and CRISPR Score (CS; Wang et al., 2014), the regions. These sgRNAs have already been included in recent library results of these methods will also be affected by the choice of nega- design (Wang et al., 2017). Indeed, when normalizing read counts tive controls (Supplementary Fig. S3b). While methods that only using sgRNAs targeting the ‘gold standard’ 927 non-essential genes rely on gRNA ranks such as MAGeCK-RRA (Li et al., 2014) will previously derived from pooled shRNA screens (Hart et al., 2014), not be affected, the rankings could not clearly distinguish genes that the beta score distribution is centered on zero (Fig. 2b). are negatively, positively, or not selected, which are important when In genome-wide screens, normalizations using either sgRNAs comparing screens over multiple conditions. targeting non-essential genes or all genes lead to similar results To evaluate the normalization methods in a focused screen, we (Fig. 2b, Supplementary Fig. S3a), as the majority of the genes are also designed a small screening library that targets 600 genes, assumed to be non-essential. Such assumption may fail in focused including ribosomal genes and well-known cancer-related genes (or custom) screens where many targeted genes may be under selec- (Section 2, Supplementary Tables S2 and S3). The library also tion, which necessitates the selection of better negative control includes the same set of AAVS1-targeting and non-targeting sgRNAs. sgRNAs. AAVS1 (adeno-associated virus integration site 1) is a ‘safe Improved design and analysis of CRISPR knockout screens 4099 Similar to genome-wide screens, AAVS1-targeting sgRNAs induced sgRNAs, as well as five sgRNAs targeting each gene in the human stronger negative selections compared with non-targeting sgRNAs genome. After removing sgRNAs that are enriched in G-nucleotide (Supplementary Fig. S3c). Furthermore, using AAVS1-targeting (>40%) and have perfect matches to other coding regions, we pri- sgRNAs as negative controls in our MAGeCK algorithm substantially oritized the remaining sgRNAs based on their predicted cleavage increases the sensitivity of the screen, while keeping the same level of efficiencies(Xu et al., 2015) and the number of perfect matches in false positives (Supplementary Fig. S3d). These results validated the the whole genome (see Methods). We conducted screens in LNCaP, applicability of including AAVS1-targeting sgRNAs in genome-wide, abl and T47D cell lines using the H1/H2 library and compared to and more importantly in focused screen libraries. other genome-wide screen datasets, including Brunello library (Doench et al., 2016), TKO library (Hart et al., 2015) and Ong li- brary (Ong et al., 2017). We found H1/H2 is among the libraries 3.3 19 nt spacers give rise to higher cutting efficiencies with fewest outlier sgRNA rates (Supplementary Fig. S4b). and better signal-to-noise ratio Assuming that a good library should be able to rank known essential In spCas9 gene editing systems, truncated sgRNAs have been genes as most negatively selected ones, we found that H1/H2, reported to have a better cleavage specificity compared with full- Brunello and Ong libraries outperformed GeCKOv2 and TKO in length sgRNAs (Fu et al., 2014). However, the performances of identifying known essential genes (Supplementary Fig. S4c–d). truncated sgRNAs in screens compared with full-length sgRNAs, as These results provide support for our refined CRISPR screen library well as the optimal length of truncated sgRNAs, have yet to been design rules. fully determined. Therefore, in our small screening library, we designed sgRNAs with 20 nt spacers for each ribosomal gene and AAVS1-targeting sgRNAs and then truncated them to 19 nt, 18 nt and 17 nt (Section 2). We found that 19 nt sgRNAs give significantly 4 Discussion stronger LFCs in ribosomal genes, reflecting higher cleavage efficien- The CRISPR-cas9 knockout screen has been used to interrogate the cies (Fig. 3a). If we use the difference between positive-control functions of coding genes and non-coding elements systemically, but sgRNAs (sgRNAs targeting ribosomal genes) and negative-control library design is still in their early stage. We first applied MAGeCK- sgRNA (AAVS1-targeting sgRNAs) as a metric for signal-to-noise, VISPR to public genome-wide screen data and identified a set of 19 nt spacers on average give the best performance (Supplementary ‘strong negative outlier’ sgRNAs and their sequence characteristics: Fig. S4a) in 11 of 12 screens. Moreover, for each ribosomal gene, higher G-nucleotide counts especially in regions distal from PAM 19 nt sgRNAs gave lower relative SD (i.e. SD divided by mean; motif. Unexpectedly, the effect of the outliers is independent of the Supplementary Material.) of LFCs, indicating a more stable behav- count of C-nucleotide, different from previous studies that suggest ior (and potentially less off-target cleavages) of gene knockout the role of ‘GC’ content in determining cleavage efficiencies effects (Fig. 3b). (Doench et al., 2014; Haeussler et al., 2016; Wang et al., 2014). Since G-C hybridization strengths in DNA-RNA and RNA-DNA 3.4 A new genome-wide library improved screen hybrids are similar, the distinct effect of G- and C-nucleotides sug- performance gests a more crucial role of DNA-endonuclease rather than DNA- Using the rules we uncovered in this study and our previous work RNA interaction in determining outlier effects. Moreover, sgRNAs (Xu et al., 2015), we designed two sub-libraries that target 18, 493 with higher G-contents in regions distal from PAM motif have stron- human coding genes (named ‘H1’ and ‘H2’; Supplementary Tables ger off-target activities. It is worth noting that the off-target activity S4, S5). Each sub-library includes sgRNAs with 19 nt-long spacers of each sgRNA in Figure 1f was measured between one sgRNA- and contains 134 AAVS1-targeting sgRNAs, 349 non-targeting DNA pair, and the seemly minor difference between sgRNAs with high and low G-contents will be multiplied by the enormous mis- matched sgRNA-DNA pairs in the genome and lead to sgRNA out- (a)(b) liers in screens. Although toxicity from CRISPR cutting has been reported, using non-targeting control for normalization is still a common practice in published literature (Aguirre et al., 2016; Wang et al., 2014). We found that normalization using non-targeting sgRNAs, as compared to using all sgRNAs or sgRNAs targeting non-essential genes, could lead to higher false positives (Supplementary Fig. S3d) in calling essential genes. The reason might be because cleavages in non- essential regions can still induce toxicity in cell growth, in consist- ency with two recent studies showing false positive hits from highly amplified regions in cancer genomes (Aguirre et al., 2016; Munoz et al., 2016). Through CRISPR screening experiments, we confirmed Fig. 3. Comparing cleavage efficiencies and signal-to-noise ratios between that sgRNAs targeting non-essential genes or safe-harbor regions different lengths of sgRNA spacers. (a) The LFCs of sgRNAs with spacer could serve as better negative controls and result in fewer false posi- lengths ranging from 17- to 20-nts, including non-targeting sgRNAs and tives compared with non-targeting sgRNAs. Since a single chroma- sgRNAs targeting ribosomal genes. For each spacer length, there are 100 tin region may be subject to copy number variations in different cell non-targeting sgRNAs and 1020 ribosomal genes-targeting sgRNAs. P-values types, sgRNAs targeting multiple non-essential regions will serve as were calculated using two-sided Student’s t-test. (b) The relative SD of LFCs more robust negative controls. For instance, only 5% (57/1, 043) of sgRNAs targeting ribosomal genes with spacer lengths ranging from 17- to CCLE cell lines have copy number gains in AAVS1 locus, such as 20-nts. There are 612 data points (51 ribosomal genes repeated in 12 screens) for each spacer length. P-values were calculated using two-sided Student’s HCC1937 and MDAMB157, suggesting that though chance is low, t-test caution should be used when using single region as negative 4100 C.-H.Chen et al. DeKelver,R.C. et al. (2010) Functional genomics, proteomics, and regulatory controls. Including correct negative controls is also necessary for DNA analysis in isogenic settings using zinc finger nuclease-driven transgen- custom-designed screens where genes are pre-selected and normal- esis into a safe harbor locus in the human genome. Genome Res., 20, ization using total read counts is inappropriate. We proposed a solu- 1133–1142. tion to reduce the biases by using either multiple non-essential genes Diao,Y. et al. (2016) A new class of temporarily phenotypic enhancers identi- or AAVS1-targeting guides. fied by CRISPR/Cas9-mediated genetic screening. Genome Res., 26, Finally, sgRNAs with shorter lengths have been shown to be po- 397–405. tent in efficiency and specificity (Fu et al., 2014), but the optimal Doench,J.G. et al. (2016) Optimized sgRNA design to maximize activity performance of truncated sgRNAs with different lengths has not and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol., 34, been systematically investigated in screen setting. We discovered 184–191. that 19 nt sgRNAs consistently provide better cleavage efficiencies Doench,J.G. et al. (2014) Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol., 32, 1262–1267. and signal-to-noise separations compared with other lengths (17, Friedman,J. et al. (2010) Regularization paths for generalized linear models 18, 20 nt). Therefore, using 19 nt sgRNAs in either low-throughput via coordinate descent. J. Stat. Softw., 33, 1–22. experiments or high-throughput screens may give rise to a more ac- Fu,Y. et al. (2014) Improving CRISPR-Cas nuclease specificity using truncated curate inference of gene knockout effects. guide RNAs. Nat. Biotechnol., 32, 279–284. We demonstrated that H1/H2 libraries have improved perform- Haeussler,M. et al. (2016) Evaluation of off-target and on-target scoring algo- ance in identifying known essential genes with less outlier sgRNAs. rithms and integration into the guide RNA selection tool CRISPOR. However, the fact that comparisons were not performed in the same Genome Biol., 17, 148. cellular context might contribute to the observed differences. Also, Hart,T. et al. (2014) Measuring error rates in genomic perturbation screens: since different libraries used distinct approaches to improve screen gold standards for human functional genomics. Mol. Syst. Biol., 10, 733. performance, integrating their respective advantages might further Hart,T. et al. (2015) High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell, 163, 1515–1526. improve the next generation library design. Hart,T. and Moffat,J. (2016) BAGEL: a computational framework for identi- Although we characterized multiple features of CRISPR screens fying essential genes from pooled library screens. BMC Bioinformatics, using computational approaches, the exact mechanisms behind 17, 164. these findings remain unknown. First, it is unclear how sgRNAs Horlbeck,M.A. et al. (2016) Compact and highly active next-generation with higher G-nucleotide content are associated with stronger out- libraries for CRISPR-mediated gene repression and activation. Elife, 5, liers. We suspected that outlier gRNAs with high G-nucleotides e19760. have promiscuous off-target binding and cutting at many CpG Hsu,P.D. et al. (2013) DNA targeting specificity of RNA-guided Cas9 nucle- islands in the genome. Existing experimental approaches to detect ases. Nat. Biotechnol., 31, 827–832. off-target cleavages (Tsai et al.,2015) may be limited to study Jiang,P. et al. (2015) Network analysis of gene essentiality in functional gen- these gRNAs, as the cleavages in each binding site may be low. omics experiments. Genome Biol., 16, 239. Jinek,M. et al. (2012) A programmable dual-RNA-guided DNA endonuclease Second, although we have shown the advantages of using 19 bp in adaptive bacterial immunity. Science, 337, 816–821. sgRNA spacers from statistical perspectives, how different lengths Knight,S.C. et al. (2015) Dynamics of CRISPR-Cas9 genome interrogation in of sgRNA spacers give rise to various cleavage strengths and living cells. Science, 350, 823–826. off-targets remain to be determined. Last but not least, all the Koike-Yusa,H. et al. (2014) Genome-wide recessive genetic screening in mam- above findings are derived in the SpCas9 system, and the rules in malian cells with a lentiviral CRISPR-guide RNA library. Nat. Biotechnol., different RNA-guided DNA endonuclease systems require further 32, 267–273. investigations. Kolde,R. et al. (2012) Robust rank aggregation for gene list integration and Collectively, our study provided novel insights into the meta-analysis. Bioinformatics, 28, 573–580. properties of CRISPR and the design of both high- and low Korkmaz,G. et al. (2016) Functional genetic screens for enhancer elements throughput CRISPR experiments. We designed two genome-wide in the human genome using CRISPR-Cas9. Nat. Biotechnol., 34, 192–198. libraries and showed the improved performance using the rules we Li,W. et al. (2015) Quality control, modeling, and visualization of CRISPR uncovered. The characterized features and design rules, as well screens with MAGeCK-VISPR. Genome Biol., 16, 281. as the libraries, will benefit and expedite the application of Li,W. et al. (2014) MAGeCK enables robust identification of essential genes CRISPR techniques. from genome-scale CRISPR/Cas9 knockout screens. Genome Biol., 15, Love,M.I. et al. (2014) Moderated estimation of fold change and dispersion Funding for RNA-seq data with DESeq2. Genome Biol., 15, 550. Mali,P. et al. (2013) CAS9 transcriptional activators for target specificity This work was supported by the NIH R01 HG008728, NIH R01 HG008927 screening and paired nickases for cooperative genome engineering. Nat. and the Breast Cancer Research Foundation (BCRF). Funding for open access Biotechnol., 31, 833–838. charge: National Institutes of Health. WL is supported in part by funds of the Center for Genetic Medicine Research and the Gilbert Family Meyers,R.M. et al. (2017) Computational correction of copy number effect Neurofibromatosis Institute at Children’s National Health System. improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet., 49, 1779–1784. Conflict of Interest: none declared. Morgens,D.W. et al. (2017) Genome-scale measurement of off-target activity using Cas9 toxicity in high-throughput screens. Nat. Commun., 8, 15178. Munoz,D.M. et al. (2016) CRISPR screens provide a comprehensive assess- References ment of cancer vulnerabilities but generate false-positive hits for highly Aguirre,A.J. et al. (2016) Genomic copy number dictates a gene-independent amplified genomic regions. Cancer Discov., 6, 900. cell response to CRISPR-Cas9 targeting. Cancer Discov., 6, 914. Ogata,T. et al. (2003) Identification of an insulator in AAVS1, a preferred Canver,M.C. et al. (2015) BCL11A enhancer dissection by Cas9-mediated in region for integration of adeno-associated virus DNA. J. Virol., 77, situ saturating mutagenesis. Nature, 527, 192–197. 9000–9007. Cong,L. et al. (2013) Multiplex genome engineering using CRISPR/Cas sys- Ong,S.H. et al. (2017) Optimised metrics for CRISPR-KO screens with tems. Science, 339, 819–823. second-generation gRNA libraries. Sci. Rep., 7, 7384. Improved design and analysis of CRISPR knockout screens 4101 Parnas,O. et al. (2015) A genome-wide CRISPR screen in primary immune Wang,T. et al. (2014) Genetic screens in human cells using the CRISPR-Cas9 cells to dissect regulatory networks. Cell, 162, 675–686. system. Science, 343, 80–84. Sadelain,M. et al. (2011) Safe harbours for the integration of new DNA in the Wang,T. et al. (2017) Gene essentiality profiling reveals gene networks human genome. Nat. Rev. Cancer, 12, 51–58. and synthetic lethal interactions with oncogenic Ras. Cell, 168, 890–903, Shalem,O. et al. (2014) Genome-scale CRISPR-Cas9 knockout screening in e815. human cells. Science, 343, 84–87. Xu,H. et al. (2015) Sequence determinants of improved CRISPR sgRNA de- Shi,J. et al. (2015) Discovery of cancer drug targets by CRISPR-Cas9 screening sign. Genome Res., 25, 1147–1157. of protein domains. Nat. Biotechnol., 33, 661–667. Zhou,Y. et al. (2014) High-throughput screening of a CRISPR/Cas9 library Tsai,S.Q. et al. (2015) GUIDE-seq enables genome-wide profiling of off-target for functional genomics in human cells. Nature, 509, 487–491. cleavage by CRISPR-Cas nucleases. Nat. Biotechnol., 33, 187–197. Zhu,S. et al. (2016) Genome-scale deletion screening of human long Wang,T. et al. (2015) Identification and characterization of essential genes in non-coding RNAs using a paired-guide RNA CRISPR-Cas9 library. Nat. the human genome. Science, 350, 1096–1101. Biotechnol., 34, 1279–1286.

Journal

BioinformaticsOxford University Press

Published: Jun 1, 2018

There are no references for this article.