SV-plaudit: A cloud-based framework for manually curating thousands of structural variants

GigaScience, 7, 2018, 1–7
doi: 10.1093/gigascience/giy064
Advance Access Publication Date: 31 May 2018
Research

Jonathan R. Belyeu (1,2), Thomas J. Nicholas (1,2), Brent S. Pedersen (1,2), Thomas A. Sasani (1,2), James M. Havrilla (1,2), Stephanie N. Kravitz (1,2), Megan E. Conway (1), Brian K. Lohman (1,2), Aaron R. Quinlan (1,2,3,*) and Ryan M. Layer (1,2,*)

1 Department of Human Genetics, University of Utah, 15 S 2030 E, Salt Lake City, UT, USA; 2 USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, UT, USA; 3 Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA

*Correspondence address. Ryan M. Layer, Department of Human Genetics, University of Utah, 15 S 2030 E, Salt Lake City, UT 84112, USA. E-mail: ryan.layer@gmail.com; or Aaron R. Quinlan, Department of Human Genetics, University of Utah, 15 S 2030 E, Salt Lake City, UT 84112, USA. E-mail: aaron.quinlan@utah.edu (ORCID: http://orcid.org/0000-0002-5823-3232)

Abstract

SV-plaudit is a framework for rapidly curating structural variant (SV) predictions. For each SV, we generate an image that visualizes the coverage and alignment signals from a set of samples. Images are uploaded to our cloud framework, where users assess the quality of each image using a client-side web application. Reports can then be generated as a tab-delimited file or an annotated Variant Call Format (VCF) file. As a proof of principle, nine researchers collaborated for 1 hour to evaluate 1,350 SVs each. We anticipate that SV-plaudit will become a standard step in variant calling pipelines and in the crowd-sourced curation of other biological results.
Code available at https://github.com/jbelyeu/SV-plaudit
Demonstration video available at https://www.youtube.com/watch?v=ono8kHMKxDs
Keywords: structural variants; visualization; manual curation

Received: 20 March 2018; Revised: 25 April 2018; Accepted: 25 May 2018
© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Large genomic rearrangements, or structural variants (SVs), are an abundant form of genetic variation within the human genome [1, 2], and they play an important role in both species evolution [3, 4] and human disease phenotypes [5–9]. While many methods have been developed to identify SVs from whole-genome sequencing (WGS) data [10–14], the accuracy of SV prediction remains far below that of single-nucleotide and insertion-deletion variants [1]. Improvements to SV detection algorithms have, in part, been limited by the availability and applicability of high-quality truth sets. While the Genome in a Bottle [15] consortium has made considerable progress toward a gold-standard variant truth set, the exceptionally high quality of the data underlying this project (300x and PCR free) calls into question the generality of the accuracy obtained in typical-quality WGS datasets (30x with PCR amplification).

Given the high false positive rate of SV calls from genome and exome sequencing, manual inspection is a critical quality control step, especially in clinical cases. Scrutiny of the evidence supporting an SV is considered to be a reliable "dry bench" validation technique, as the human eye can rapidly distinguish a true SV signal from alignment artifacts. In principle, we could improve the accuracy of SV call sets by visually validating every variant. In practice, however, current genomic data visualization methods [16–21] were designed primarily for spot checking a small number of variants and are difficult to scale to the thousands of SVs in typical call sets. Therefore, a curated set of SVs requires a new framework that scales to thousands of SVs, minimizes the time needed to adjudicate individual variants, and manages the collective judgment of large and often geographically dispersed teams.

Here we present SV-plaudit, a fast, highly scalable framework enabling teams of any size to collaborate on the rapid, web-based curation of thousands of SVs. In the web interface, users consider a curation question for a series of pre-computed images (Fig. 1, Supplementary Fig. S1) that contain the coverage, paired-end alignments, and split-read alignments for the region surrounding a candidate SV for a set of relevant samples (e.g., tumor and matched normal samples). The curation question is defined by the researcher to match the larger experimental design (e.g., a cancer study may ask whether the variant is a somatic variant, a germline variant, or a false positive). Responses are collected and returned as a report that can be used to identify high-quality variants.

While a team of curators is not required, collecting multiple opinions for each SV allows SV-plaudit to report the consensus view (i.e., a "curation score") of each variant. This consensus is less susceptible to human error and does not require expert users to score variants. With SV-plaudit, it is practical to inspect and score every variant in a call set, thereby improving the accuracy of SV predictions in individual genomes and allowing curation of high-quality truth sets for SV method tuning.

Results

To assess SV-plaudit's utility for curating SVs, nine researchers in the Quinlan laboratory at the University of Utah manually inspected and scored the 1,350 SVs (1,310 deletions, 8 duplications, 4 insertions, and 28 inversions) that the 1000 Genomes Project [1] identified in the NA12878 genome (Supplemental File 1). Since we expect trio analysis to be a common use case of SV-plaudit, we included alignments from NA12878 and her parents (NA12891 and NA12892), and participants considered the curation question "The SV in the top sample (NA12878) is:" with the answers "GOOD," "BAD," or "DE NOVO." In total, the full experiment took less than 2 hours, with Amazon costs totaling less than $0.05. The images (Supplemental File 2) were generated in 3 minutes (20 threads, 2.7 seconds per image), and uploading to S3 required 5 minutes (full command list in Supplemental File 3). The mean time to score all images was 60.1 minutes (2.67 seconds per image) (Fig. 2A; reports in Supplemental Files 4 and 5). In the scoring process, no de novo variants were identified. Forty images did not render correctly due to issues in the alignment files (e.g., coverage gaps) and were removed from the subsequent analysis (Supplemental File 6).

For this experiment, we used a curation score that mapped "GOOD" and "DE NOVO" to the value one, "BAD" to the value zero, and the mean as the aggregation function (Fig. 2B). Most (70.5%) variants were scored unanimously, with 67.1% being unanimously "GOOD" (score = 1.0; e.g., Fig. 1A) and 3.4% being unanimously "BAD" (score = 0.0; e.g., Fig. 1B). Since we had nine scores for each variant, we expanded our definition of "unambiguous" variants to be those with at most one dissenting vote (score <0.2 or >0.8), which accounted for 87.1% of the variants. The 12.9% of SVs that were "ambiguous" (more than one dissenting vote, 0.2 <= score <= 0.8) were generally small (median size of 310.5 bp vs 899.5 bp for all variants, Fig. 2C) or contained conflicting evidence (e.g., paired-end and split-read evidence indicated an inversion while the read-depth evidence indicated a deletion; e.g., Fig. 1C).

Other methods, such as SVTYPER [22] and CNVNATOR [23], can independently assess the validity of SV calls. SVTYPER genotypes SVs for a given sample by comparing the number of discordant paired-end alignments and split-read alignments that support the SV to the number of pairs and reads that support the reference allele. CNVNATOR uses sequence coverage to estimate copy number for the region affected by the SV. Both of these methods confirm the voting results (Fig. 2D). Considering the set of "unambiguous" deletions, SVTYPER and CNVNATOR agree with the SV-plaudit curation score in 92.3% and 81.7% of cases, respectively. Here, agreement means that unambiguous false SVs (curation score <0.2) have a CNVNATOR copy number near 2 (between 1.4 and 2.4) or an SVTYPER genotype of homozygous reference. Unambiguous true SVs (curation score >0.8) have a CNVNATOR copy number near 1 or 0 (<1.4) or an SVTYPER genotype of nonreference (heterozygous or homozygous alternate).

Despite this consistency, using either SVTYPER or CNVNATOR to validate SVs can lead to false positives or false negatives. For example, CNVNATOR reported a copy number loss for 44.2% of the deletions that were scored as unanimously BAD, and SVTYPER called 30.7% of the deletions that were unanimously GOOD as homozygous reference. Conversely, CNVNATOR had few false negatives (2.4% of unanimously GOOD deletions were called as copy neutral), and SVTYPER had few false positives (0.2% of nonreference variants were unanimously BAD). This comparison is meant to demonstrate that different methods have distinct strengths and weaknesses and should not be taken as a direct comparison between SVTYPER and CNVNATOR, since CNVNATOR was one of nine methods used by the 1000 Genomes Project while SVTYPER was not.

These results demonstrate that, with SV-plaudit, manual curation can be a cost-effective and robust part of the SV detection process. While we anticipate that automated SV detection methods will continue to improve, due in part to the improved truth sets that SV-plaudit will provide, directly viewing SVs will remain an essential validation technique. By extending this validation to full call sets, SV-plaudit not only improves specificity but can also enhance sensitivity by allowing users to relax quality filters and rapidly screen large sets of calls. Beyond demonstrating SV-plaudit's utility, our curation of SVs for NA12878 is useful as a high-quality truth set for method development and tuning. A Variant Call Format (VCF) file of these variants annotated with their curation scores is available in Supplementary File 5.

Figure 1: Example samplot images of putative deletion calls that were scored as (A) unanimously GOOD, (B) unanimously BAD, and (C) ambiguous with a mix of GOOD and BAD scores with respect to the top sample (NA12878) in each plot. The black bar at the top of the figure indicates the genomic position of the predicted SV, and the following subfigures visualize the alignments and sequence coverage of each sample. Subplots report paired-end (square ends connected by a solid line, annotated as concordant and discordant paired-end reads in A) and split-read (circle ends connected by a dashed line, annotated in A) alignments by their genomic position (x-axis) and the distance between mapped ends (insert size, left y-axis). Colors indicate the type of event the alignment supports (black for deletion, red for duplication, and blue and green for inversion), and intensity indicates the concentration of alignments. The grey filled shapes report the sequence coverage distribution in the locus for each sample (right y-axis, annotated in A). The samples in each panel are a trio of father (NA12891), mother (NA12892), and daughter (NA12878).

Figure 2: (A) The distribution of the time elapsed from when an image was presented to when it was scored. (B) The distribution of curation scores. (C) The SV size distribution for all, unanimous (score 0 or 1), unambiguous (score <0.2 or >0.8), and ambiguous (score >= 0.2 and <= 0.8) variants. (D) A comparison of predictions for deletions between CNVNATOR copy number calls (y-axis), SVTYPER genotypes (color; "Ref." is homozygous reference and "Non-ref." is heterozygous or homozygous alternate), and curation scores (x-axis). This demonstrates a general agreement between all methods, with a concentration of reference genotypes and copy number 2 (no evidence for a deletion) at curation score <0.2, and nonreference genotypes and copy number 1 or 0 (evidence for a deletion) at curation score >0.8. There are also false positives for CNVNATOR (copy number <2 at score = 0) and false negatives for SVTYPER (reference genotype at score = 1).

Discussion

SV-plaudit is an efficient, scalable, and flexible framework for the manual curation of large-scale SV call sets. Backed by Amazon S3 and DynamoDB, SV-plaudit is easy to deploy and scales to teams of any size. Each instantiation of SV-plaudit is completely independent and can be deployed locally for private or sensitive datasets or be distributed publicly to maximize participation. By rapidly providing a direct view of the raw data underlying candidate SVs, SV-plaudit delivers the infrastructure to manually inspect full SV call sets. SV-plaudit also allows researchers to specify the questions and answers that users consider to ensure that the curation outcome supports the larger experimental design. This functionality is vital to a wide range of WGS experiments, from method development to the interpretation of disease genomes. We are actively working on machine learning methods that will leverage the curation scores for thousands of SV predictions as training data.

Conclusions

SV-plaudit was designed to judge how well the data in an alignment file corroborate a candidate SV. The question of whether a particular SV is a false positive due to artifacts from sequencing or alignment is a broader issue that must be answered in the context of other data sources, such as mappability and repeat annotations. While this second level of analysis is crucial, it is beyond the scope of this paper, and we argue that this analysis need be performed only for those SVs that are fully supported by the alignment data. While SV-plaudit combines samplot and PlotCritic to enable the curation of structural variant images, we emphasize that the PlotCritic framework can be used to score images of any type. Therefore, we anticipate that this framework will facilitate "crowd-sourced" curation of many other biological images.

Methods

Overview

SV-plaudit (Fig. 3) is based on two software packages: samplot for SV image generation and PlotCritic for staging the Amazon cloud environment and managing user input. Once the environment is staged, users log into the system and are presented with a series of SV images in either a random or predetermined order. For each image, the user answers the curation question, and responses are logged. Reports on the progress of a project can be quickly generated at any point in the process.

Samplot

Samplot is a Python program that uses pysam [24] to extract alignment data from a set of BAM or CRAM files and matplotlib [25] to visualize the raw data for the genomic region surrounding a candidate SV (Fig. 3A). For each alignment file, samplot renders the depth of sequencing coverage, paired-end alignments, and split-read alignments, where paired-end and split-read alignments are color-coded based on the type of SV they support (e.g., black for deletion, red for duplication, etc.) (Fig. 1; Supplementary Fig. S2, which considers variants at different sequencing coverages; and Supplementary Fig. S3, which depicts variants supported by long-read sequencing) [26, 27]. Alignments are positioned along the x-axis by genomic location and along the left y-axis by the distance between the ends (insert size), which helps users to differentiate normal alignments from discordant alignments that support an SV. Depth of sequencing coverage is also displayed on the right y-axis to allow users to inspect whether putative copy number changes are supported by the expected changes in coverage. To improve performance for large events, we downsample "normal" paired-end alignments (a +/- orientation and an insert size within Z standard deviations from the mean; by default, Z = 4). Plots for each alignment file are stacked and share a common x-axis that reports the chromosomal position. By convention, the sample of interest (e.g., proband or tumor) is displayed as the top track, followed by the set of related reference genome tracks (e.g., parents and siblings, or a matched normal sample). Users may specify the exact order by using command line parameters to samplot. A visualization of genome annotations and genes and exons within the locus is displayed below the alignment plots to provide context for assessing the SV's relevance to phenotypes. Rendering time depends on the number of samples, the sequence coverage, and the size of the SV, but most images require less than 5 seconds, and samplot rendering can be parallelized by SV call.

PlotCritic

PlotCritic (Fig. 3B) provides a simple web interface for scoring images and viewing reports that summarize the results from multiple users and SV images. PlotCritic is both highly scalable and easy to deploy. Images are stored on Amazon Web Services (AWS) S3, and DynamoDB tables store project configuration metadata and user responses. These AWS services allow PlotCritic to dynamically scale to any number of users. They also preclude the need for hosting a dedicated server, thereby facilitating deployment.

After samplot generates the SV images, PlotCritic manages their transfer to S3 and configures tables in DynamoDB based on a JSON configuration file (config.json file in Fig. 3B). In this configuration file, one defines the curation questions posed to reviewers as well as the allowed answers and associated keyboard bindings to allow faster responses (curationQandA field in Fig. 3B). In turn, these dictate the text and buttons that appear on the resulting web interface. As such, the interface can be easily customized to support a wide variety of curation scenarios. For example, a cancer experiment may display a tumor sample and a matched normal sample and ask users whether the SV appears in both samples (i.e., a germline variant) or just in the tumor sample (i.e., a somatic variant). To accomplish this, the curation question (question field in Fig. 3B) could be "In which samples does the SV appear?", and the answer options (answers field in Fig. 3B) could be "TUMOR," "BOTH," "NORMAL," or "NEITHER." Alternatively, in the case of a rare disease, the interface could display a proband and parents and ask whether the SV is only in the proband (i.e., de novo) or whether it is also in a parent (i.e., inherited). Since there is no limit to the length of a question or the number of answer options, PlotCritic can support more complex experimental scenarios.

Once results are collected, PlotCritic can generate a tab-delimited report or an annotated VCF that, for each SV image, details the number of times the image was scored and the full set of answers it received. Additionally, a curation score can be calculated for each image by providing a value for each answer option and an aggregation function (e.g., mean, median, mode, standard deviation, min, max). For example, consider the cancer example from above, where the values 3, 2, 1, and 0 mapped to the answers "TUMOR," "BOTH," "NORMAL," and "NEITHER," respectively. If "mode" were selected as the curation function, then the curation score would reflect the opinion of a plurality of users. The mean would reflect the consensus among all users, and the standard deviation would capture the level of disagreement about each image. While we expect mean, median, mode, standard deviation, min, and max to satisfy most use cases, users can implement custom scores by operating on the tab-delimited report.

Each PlotCritic project is protected by AWS Cognito user authentication, which securely restricts access to the project website to authenticated users. A project manager is the only authorized user at startup and can authenticate other users using Cognito's secure services. The website can be further secured using HTTPS, and additional controls, such as IP restrictions, can be put in place by configuring AWS IAM access controls directly for S3 and DynamoDB.

Figure 3: The SV-plaudit process. (A) Samplot generates an image for each SV from a VCF, considering a set of alignment (BAM or CRAM) files. (B) PlotCritic uploads the images to an Amazon S3 bucket and prepares DynamoDB tables. Users select a curation answer ("GOOD," "BAD," or "DE NOVO") for each SV image. DynamoDB logs user responses and generates reports. Within a report, a curation score function can be specified by mapping answer options to values and selecting an aggregation function. Here, "GOOD" and "DE NOVO" were mapped to 1, "BAD" to 0, and the mean was used. An especially useful output option is a VCF annotated with the curation scores (shown here in bold as SVP).

Availability of source code and requirements

Project name: SV-plaudit
Project home page: https://github.com/jbelyeu/SV-plaudit
Operating systems: Mac OS and Linux
Programming language: Python, bash
License: MIT
Research Resource Identification Initiative ID: SCR_016285

Availability of supporting data and material

The datasets generated and/or analyzed during the current study are available in the 1000 Genomes Project repository, ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/. All test data used or generated during this study, and a snapshot of the code, are available in the GigaScience GigaDB repository [28].

Additional files

Supplemental Figure 1. Plots for different structural variant types shown in sample NA12878. (A) A region is shown where a duplication event was called. (B) A region is shown where an inversion event was called.
Supplemental Figure 2. A deletion call for sample NA12878 using different sequencing data to compare variant plots from high, medium, and low coverage levels. Mean sequencing depth of the BAM files used was (A) 58x (1000 Genomes Project, high coverage), (B) 33x (Genome in a Bottle Consortium), and (C) 5x (1000 Genomes Project, low coverage).
Supplemental Figure 3. A selection of structural variant visualizations from the Genome in a Bottle (A and B), "LongReadHomRef" in (C), and "NoConsensusGT" in (D).
Supplemental File 1 (.vcf)
Supplemental File 3 (.sh)
Supplemental File 4 (.csv)
Supplemental File 5 (.vcf)
Supplemental File 6 (.txt)

Abbreviations

SV: structural variant; VCF: Variant Call Format; WGS: whole-genome sequencing.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Funding

This research was supported by US National Human Genome Research Institute awards to R.M.L. (NIH K99HG009532) and A.R.Q. (NIH R01HG006693 and NIH R01GM124355), as well as a US National Cancer Institute award to A.R.Q. (NIH U24CA209999).

Authors' contributions

J.R.B. and R.M.L. developed the software. J.R.B., T.J.N., B.S.P., T.A.S., J.M.H., S.N.K., M.E.C., B.K.L., and R.M.L. scored variants for the experiment. J.R.B., A.R.Q., and R.M.L. wrote the manuscript. A.R.Q. and R.M.L. conceived the study.

References

1. Sudmant PH, Rausch T, Gardner EJ et al. An integrated map of structural variation in 2,504 human genomes. Nature 2015;526:75–81.
2. Redon R, Ishikawa S, Fitch KR et al. Global variation in copy number in the human genome. Nature 2006;444:444–54.
3. Newman TL, Tuzun E, Morrison VA et al. A genome-wide survey of structural variation between human and chimpanzee. Genome Res 2005;15:1344–56.
4. Bailey JA, Eichler EE. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat Rev Genet 2006;7:552–64.
5. Payer LM, Steranka JP, Yang WR et al. Structural variants caused by Alu insertions are associated with risks for many human diseases. Proc Natl Acad Sci U S A 2017;114:E3984–92.
6. Schubert C. The genomic basis of the Williams-Beuren syndrome. Cell Mol Life Sci 2009;66:1178–97.
7. Pleasance ED, Cheetham RK, Stephens PJ et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 2010;463:191–96.
8. Venkitaraman AR. Cancer susceptibility and the functions of BRCA1 and BRCA2. Cell 2002;108:171–82.
9. Zhang F, Gu W, Hurles ME et al. Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 2009;10:451–81.
10. Ye K, Schulz MH, Long Q et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25:2865–71.
11. Rausch T, Zichner T, Schlattl A et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012;28:i333–39.
12. Handsaker RE, Korn JM, Nemesh J et al. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet 2011;43:269–76.
13. Kronenberg ZN, Osborne EJ, Cone KR et al. Wham: identifying structural variants of biological consequence. PLoS Comput Biol 2015;11:e1004572.
14. Layer RM, Chiang C, Quinlan AR et al. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol 2014;15:R84.
15. Zook JM, Chapman B, Wang J et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 2014;32:246–51.
16. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14:178–92.
17. Fiume M, Williams V, Brook A et al. Savant: genome browser for high-throughput sequencing data. Bioinformatics 2010;26:1938–44.
18. Munro JE, Dunwoodie SL, Giannoulatou E. SVPV: a structural variant prediction viewer for paired-end sequencing datasets. Bioinformatics 2017;33:2032–3.
19. O'Brien TM, Ritz AM, Raphael BJ et al. Gremlin: an interactive visualization model for analyzing genomic rearrangements. IEEE Trans Vis Comput Graph 2010;16:918–26.
20. Wyczalkowski MA, Wylie KM, Cao S et al. BreakPoint Surveyor: a pipeline for structural variant visualization. Bioinformatics 2017;33:3121–2.
21. Spies N, Zook JM, Salit M et al. svviz: a read viewer for validating structural variants. Bioinformatics 2015;31:3994–6.
22. Chiang C, Layer RM, Faust GG et al. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods 2015;12:966–8.
23. Abyzov A, Urban AE, Snyder M et al. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 2011;21:974–84.
24. pysam documentation. https://github.com/pysam-developers/pysam
25. Hunter JD. Matplotlib: a 2D graphics environment. Computing in Science & Engineering 2007;9:90–95.
26. Zook JM, Catoe D, McDaniel J et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data 2016;3:160025.
27. Kortschak RD, Pedersen BS, Adelson DL. bíogo/hts: high throughput sequence handling for the Go language. JOSS 2017;2:168.
28. Belyeu JR, Nicholas TJ, Pedersen BS et al. Supporting data for "SV-plaudit: A cloud-based framework for manually curating thousands of structural variants." GigaScience Database 2018. http://dx.doi.org/10.5524/100450.
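As a brief illustration of the scoring scheme used in the Results (map "GOOD" and "DE NOVO" to one, "BAD" to zero, aggregate by mean, and call scores below 0.2 or above 0.8 "unambiguous"), the arithmetic can be sketched in a few lines of Python. This is a sketch of the described scheme only; the function names are hypothetical and this is not PlotCritic's actual implementation.

```python
# Sketch of the consensus "curation score" described in the Results.
# Assumption: each reviewer's vote is one of "GOOD", "BAD", or "DE NOVO".
from statistics import mean

# Answer-to-value mapping used in the NA12878 experiment.
ANSWER_VALUES = {"GOOD": 1, "DE NOVO": 1, "BAD": 0}

def curation_score(votes):
    """Aggregate one SV image's votes with the mean, as in the paper."""
    return mean(ANSWER_VALUES[v] for v in votes)

def classify(score, low=0.2, high=0.8):
    """Bin a score using the paper's thresholds: <0.2 is an unambiguous
    false call, >0.8 an unambiguous true call, anything between is
    ambiguous (i.e., more than one dissenting vote among nine)."""
    if score < low:
        return "unambiguous BAD"
    if score > high:
        return "unambiguous GOOD"
    return "ambiguous"

# Nine reviewers with a single dissenter: 8/9 > 0.8, still unambiguous.
votes = ["GOOD"] * 8 + ["BAD"]
label = classify(curation_score(votes))
```

With nine reviewers, one dissenting vote yields a score of 8/9, which stays above the 0.8 threshold; this is exactly why the paper's "unambiguous" definition tolerates at most one dissenter.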

SV-plaudit: A cloud-based framework for manually curating thousands of structural variants

GigaScience , Volume Advance Article (7) – May 31, 2018
Free
7 pages

Loading next page...
 
/lp/ou_press/sv-plaudit-a-cloud-based-framework-for-manually-curating-thousands-of-dqO8NYl2Qj
Publisher
BGI
Copyright
© The Author(s) 2018. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/giy064
Publisher site
See Article on Publisher Site

Abstract

GigaScience, 7, 2018, 1–7 doi: 10.1093/gigascience/giy064 Advance Access Publication Date: 31 June 2018 Research RESEARCH SV-plaudit: A cloud-based framework for manually curating thousands of structural variants 1,2 1,2 1,2 Jonathan R. Belyeu , Thomas J. Nicholas , Brent S. Pedersen , Thomas 1,2 1,2 1,2 1 A. Sasani , James M. Havrilla , Stephanie N. Kravitz , Megan E. Conway , 1,2 1,2,3,* 1,2,* Brian K. Lohman , Aaron R. Quinlan and Ryan M. Layer 1 2 Department of Human Genetics, University of Utah, 15 S 2030 E, Salt Lake City, UT, USA;, USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, UT, USA; and Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA Correspondence address. Ryan M. Layer. Department of Human Genetics, University of Utah, 15 S 2030 E, Salt Lake City, UT, 84112, USA, E-mail: ryan.layer@gmail.com or Aaron R. Quinlan. Department of Human Genetics, University of Utah, 15 S 2030 E, Salt Lake City, UT, 84112, USA, E-mail: aaron.quinlan@utah.edu http://orcid.org/0000-0002-5823-3232 ABSTRACT SV-plaudit is a framework for rapidly curating structural variant (SV) predictions. For each SV, we generate an image that visualizes the coverage and alignment signals from a set of samples. Images are uploaded to our cloud framework where users assess the quality of each image using a client-side web application. Reports can then be generated as a tab-delimited file or annotated Variant Call Format (VCF) file. As a proof of principle, nine researchers collaborated for 1 hour to evaluate 1,350 SVs each. We anticipate that SV-plaudit will become a standard step in variant calling pipelines and the crowd-sourced curation of other biological results. 
Code available at https://github.com/jbelyeu/SV-plaudit
Demonstration video available at https://www.youtube.com/watch?v=ono8kHMKxDs

Keywords: structural variants; visualization; manual curation

Received: 20 March 2018; Revised: 25 April 2018; Accepted: 25 May 2018

© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Large genomic rearrangements, or structural variants (SVs), are an abundant form of genetic variation within the human genome [1, 2], and they play an important role in both species evolution [3, 4] and human disease phenotypes [5-9]. While many methods have been developed to identify SVs from whole-genome sequencing (WGS) data [10-14], the accuracy of SV prediction remains far below that of single-nucleotide and insertion-deletion variants [1]. Improvements to SV detection algorithms have, in part, been limited by the availability and applicability of high-quality truth sets. While the Genome in a Bottle [15] consortium has made considerable progress toward a gold-standard variant truth set, the exceptionally high quality of the data underlying this project (300x coverage, PCR free) calls into question how well that accuracy generalizes to typical-quality WGS datasets (30x with PCR amplification).

Given the high false-positive rate of SV calls from genome and exome sequencing, manual inspection is a critical quality-control step, especially in clinical cases. Scrutiny of the evidence supporting an SV is considered a reliable "dry bench" validation technique, as the human eye can rapidly distinguish a true SV signal from alignment artifacts. In principle, we could improve the accuracy of SV call sets by visually validating every variant. In practice, however, current genomic data visualization methods [16-21] were designed primarily for spot-checking a small number of variants and are difficult to scale to the thousands of SVs in typical call sets. Manual curation of SVs therefore requires a new framework that scales to thousands of variants, minimizes the time needed to adjudicate individual variants, and manages the collective judgment of large and often geographically dispersed teams.

Here we present SV-plaudit, a fast, highly scalable framework enabling teams of any size to collaborate on the rapid, web-based curation of thousands of SVs. In the web interface, users consider a curation question for a series of pre-computed images (Fig. 1, Supplementary Fig. S1) that contain the coverage, paired-end alignments, and split-read alignments for the region surrounding a candidate SV for a set of relevant samples (e.g., tumor and matched normal samples). The curation question is defined by the researcher to match the larger experimental design (e.g., a cancer study may ask whether the variant is a somatic variant, a germline variant, or a false positive). Responses are collected and returned as a report that can be used to identify high-quality variants.

While a team of curators is not required, collecting multiple opinions for each SV allows SV-plaudit to report the consensus view (i.e., a "curation score") of each variant. This consensus is less susceptible to human error and does not require expert users to score variants. With SV-plaudit, it is practical to inspect and score every variant in a call set, thereby improving the accuracy of SV predictions in individual genomes and allowing curation of high-quality truth sets for SV method tuning.

Results

To assess SV-plaudit's utility for curating SVs, nine researchers in the Quinlan laboratory at the University of Utah manually inspected and scored the 1,350 SVs (1,310 deletions, 8 duplications, 4 insertions, and 28 inversions) that the 1000 Genomes Project [1] identified in the NA12878 genome (Supplemental File 1). Since we expect trio analysis to be a common use case of SV-plaudit, we included alignments from NA12878 and her parents (NA12891 and NA12892), and participants considered the curation question "The SV in the top sample (NA12878) is:" and answers "GOOD," "BAD," or "DE NOVO." In total, the full experiment took less than 2 hours, with Amazon costs totaling less than $0.05. The images (Supplemental File 2) were generated in 3 minutes (20 threads, 2.7 seconds per image), and uploading to S3 required 5 minutes (full command list in Supplemental File 3). The mean time to score all images was 60.1 minutes (2.67 seconds per image) (Fig. 2A, reports in Supplemental Files 4, 5). In the scoring process, no de novo variants were identified. Forty images did not render correctly due to issues in the alignment files (e.g., coverage gaps) and were removed from the subsequent analysis (Supplemental File 6).

For this experiment, we used a curation score that mapped "GOOD" and "DE NOVO" to the value one, "BAD" to the value zero, and the mean as the aggregation function (Fig. 2B). Most (70.5%) variants were scored unanimously, with 67.1% being unanimously "GOOD" (score = 1.0, e.g., Fig. 1A) and 3.4% being unanimously "BAD" (score = 0.0, e.g., Fig. 1B). Since we had nine scores for each variant, we expanded our definition of "unambiguous" variants to be those with at most one dissenting vote (score <0.2 or >0.8), which accounted for 87.1% of the variants. The 12.9% of SVs that were "ambiguous" (more than one dissenting vote, 0.2 <= score <= 0.8) were generally small (median size of 310.5 bp vs. 899.5 bp for all variants, Fig. 2C) or contained conflicting evidence (e.g., paired-end and split-read evidence indicated an inversion while the read-depth evidence indicated a deletion, e.g., Fig. 1C).

Other methods, such as SVTYPER [22] and CNVNATOR [23], can independently assess the validity of SV calls. SVTYPER genotypes SVs for a given sample by comparing the number of discordant paired-end alignments and split-read alignments that support the SV to the number of pairs and reads that support the reference allele. CNVNATOR uses sequence coverage to estimate copy number for the region affected by the SV. Both of these methods confirm the voting results (Fig. 2D). Considering the set of "unambiguous" deletions, SVTYPER and CNVNATOR agree with the SV-plaudit curation score in 92.3% and 81.7% of cases, respectively. Here, agreement means that unambiguous false SVs (curation score <0.2) have a CNVNATOR copy number near 2 (between 1.4 and 2.4) or an SVTYPER genotype of homozygous reference. Unambiguous true SVs (curation score >0.8) have a CNVNATOR copy number near 1 or 0 (<1.4), or an SVTYPER genotype of nonreference (heterozygous or homozygous alternate).

Despite this consistency, using either SVTYPER or CNVNATOR to validate SVs can lead to false positives or false negatives. For example, CNVNATOR reported a copy number loss for 44.2% of the deletions that were scored as unanimously BAD, and SVTYPER called 30.7% of the deletions that were unanimously GOOD as homozygous reference. Conversely, CNVNATOR had few false negatives (2.4% of unanimously GOOD deletions were called as copy neutral), and SVTYPER had few false positives (0.2% of nonreference variants were unanimously BAD). This comparison is meant to demonstrate that different methods have distinct strengths and weaknesses and should not be taken as a direct comparison between SVTYPER and CNVNATOR, since CNVNATOR was one of nine methods used by the 1000 Genomes Project while SVTYPER was not.

These results demonstrate that, with SV-plaudit, manual curation can be a cost-effective and robust part of the SV detection process. While we anticipate that automated SV detection methods will continue to improve, due in part to the improved truth sets that SV-plaudit will provide, directly viewing SVs will remain an essential validation technique. By extending this validation to full call sets, SV-plaudit not only improves specificity but can also enhance sensitivity by allowing users to relax quality filters and rapidly screen large sets of calls. Beyond demonstrating SV-plaudit's utility, our curation of SVs for NA12878 is useful as a high-quality truth set for method development and tuning. A Variant Call Format (VCF) file of these variants annotated with their curation score is available in Supplementary File 5.

Discussion

SV-plaudit is an efficient, scalable, and flexible framework for the manual curation of large-scale SV call sets. Backed by Amazon S3 and DynamoDB, SV-plaudit is easy to deploy and scales to teams of any size. Each instantiation of SV-plaudit is completely independent and can be deployed locally for private or sensitive datasets or be distributed publicly to maximize participation. By rapidly providing a direct view of the raw data underlying candidate SVs, SV-plaudit delivers the infrastructure to manually inspect full SV call sets. SV-plaudit also allows researchers to specify the questions and answers that users consider to ensure that the curation outcome supports the larger experimental design. This functionality is vital to a wide range of WGS experiments, from method development to the interpretation of disease genomes. We are actively working on machine learning methods that will leverage the curation scores for thousands of SV predictions as training data.

Figure 1: Example samplot images of putative deletion calls that were scored as (A) unanimously GOOD, (B) unanimously BAD, and (C) ambiguous with a mix of GOOD and BAD scores with respect to the top sample (NA12878) in each plot. The black bar at the top of the figure indicates the genomic position of the predicted SV, and the following subfigures visualize the alignments and sequence coverage of each sample. Subplots report paired-end (square ends connected by a solid line, annotated as concordant and discordant paired-end reads in A) and split-read (circle ends connected by a dashed line, annotated in A) alignments by their genomic position (x-axis) and the distance between mapped ends (insert size, left y-axis). Colors indicate the type of event the alignment supports (black for deletion, red for duplication, and blue and green for inversion), and intensity indicates the concentration of alignments. The grey filled shapes report the sequence coverage distribution in the locus for each sample (right y-axis, annotated in A). The samples in each panel are a trio of father (NA12891), mother (NA12892), and daughter (NA12878).

Conclusions

SV-plaudit was designed to judge how well the data in an alignment file corroborate a candidate SV. The question of whether a particular SV is a false positive due to artifacts from sequencing or alignment is a broader issue that must be answered in the context of other data sources such as mappability and repeat annotations. While this second level of analysis is crucial, it is beyond the scope of this paper, and we argue this analysis should be performed only for those SVs that are fully supported by the alignment data. While SV-plaudit combines samplot and PlotCritic to enable the curation of structural variant images, we emphasize that the PlotCritic framework can be used to score images of any type. Therefore, we anticipate that this framework will facilitate "crowd-sourced" curation of many other biological images.

Methods

Overview

SV-plaudit (Fig. 3) is based on two software packages: samplot for SV image generation and PlotCritic for staging the Amazon cloud environment and managing user input. Once the environment is staged, users log into the system and are presented with a series of SV images in either a random or predetermined order. For each image, the user answers the curation question, and responses are logged. Reports on the progress of a project can be quickly generated at any point in the process.

Figure 2: (A) The distribution of the time elapsed from when an image was presented to when it was scored. (B) The distribution of curation scores. (C) The SV size distribution for all, unanimous (score 0 or 1), unambiguous (score <0.2 or >0.8), and ambiguous (score >= 0.2 and <= 0.8) variants. (D) A comparison of predictions for deletions between CNVNATOR copy number calls (y-axis), SVTYPER genotypes (color, "Ref." is homozygous reference and "Non-ref." is heterozygous or homozygous alternate), and curation scores (x-axis). This demonstrates a general agreement between all methods, with a concentration of reference genotypes and copy number 2 (no evidence for a deletion) at curation score <0.2, and nonreference genotypes and copy number one or zero events (evidence for a deletion) at curation score >0.8. There are also false positives for CNVNATOR (copy number <2 at score = 0) and false negatives for SVTYPER (reference genotype at score = 1).

Samplot

Samplot is a Python program that uses pysam [24] to extract alignment data from a set of BAM or CRAM files and matplotlib [25] to visualize the raw data for the genomic region surrounding a candidate SV (Fig. 3A). For each alignment file, samplot renders the depth of sequencing coverage, paired-end alignments, and split-read alignments, where paired-end and split-read alignments are color-coded based on the type of SV they support (e.g., black for a deletion, red for a duplication) (Fig. 1; Supplementary Fig. S2, which considers variants at different sequencing coverages; and Supplementary Fig. S3, which depicts variants supported by long-read sequencing) [26, 27]. Alignments are positioned along the x-axis by genomic location and along the left y-axis by the distance between the ends (insert size), which helps users to differentiate normal alignments from discordant alignments that support an SV. Depth of sequencing coverage is also displayed on the right y-axis to allow users to inspect whether putative copy number changes are supported by the expected changes in coverage. To improve performance for large events, we downsample "normal" paired-end alignments (a +/- orientation and an insert size that is within Z standard deviations of the mean; by default, Z = 4). Plots for each alignment file are stacked and share a common x-axis that reports the chromosomal position. By convention, the sample of interest (e.g., proband or tumor) is displayed as the top track, followed by the set of related reference genome tracks (e.g., parents and siblings, or a matched normal sample). Users may specify the exact order using command-line parameters to samplot. A visualization of genome annotations and genes and exons within the locus is displayed below the alignment plots to provide context for assessing the SV's relevance to phenotypes. Rendering time depends on the number of samples, the sequence coverage, and the size of the SV, but most images require less than 5 seconds, and samplot rendering can be parallelized by SV call.

Figure 3: The SV-plaudit process. (A) Samplot generates an image for each SV from a VCF, considering a set of alignment (BAM or CRAM) files. (B) PlotCritic uploads the images to an Amazon S3 bucket and prepares DynamoDB tables. Users select a curation answer ("GOOD," "BAD," or "DE NOVO") for each SV image. DynamoDB logs user responses and generates reports. Within a report, a curation score function can be specified by mapping answer options to values and selecting an aggregation function. Here "GOOD" and "DE NOVO" were mapped to 1, "BAD" to 0, and the mean was used. An especially useful output option is a VCF annotated with the curation scores (shown here in bold as SVP).

PlotCritic

PlotCritic (Fig. 3B) provides a simple web interface for scoring images and viewing reports that summarize the results from multiple users and SV images. PlotCritic is both highly scalable and easy to deploy. Images are stored on Amazon Web Services (AWS) S3, and DynamoDB tables store project configuration metadata and user responses. These AWS services allow PlotCritic to dynamically scale to any number of users and preclude the need for hosting a dedicated server, thereby facilitating deployment.

After samplot generates the SV images, PlotCritic manages their transfer to S3 and configures tables in DynamoDB based on a JSON configuration file (config.json file in Fig. 3B). In this configuration file, one defines the curation questions posed to reviewers as well as the allowed answers and associated keyboard bindings to allow faster responses (curationQandA field in Fig. 3B). In turn, these dictate the text and buttons that appear on the resulting web interface. As such, the interface can be easily customized to support a wide variety of curation scenarios. For example, a cancer experiment may display a tumor sample and matched normal sample and ask users if the SV appears in both samples (i.e., a germline variant) or just in the tumor sample (i.e., a somatic variant). To accomplish this, the curation question (question field in Fig. 3B) could be "In which samples does the SV appear?", and the answer options (answers field in Fig. 3B) could be "TUMOR," "BOTH," "NORMAL," or "NEITHER." Alternatively, in the case of a rare disease, the interface could display a proband and parents and ask if the SV is only in the proband (i.e., de novo) or if it is also in a parent (i.e., inherited). Since there is no limit to the length of a question or the number of answer options, PlotCritic can support more complex experimental scenarios.

Once results are collected, PlotCritic can generate a tab-delimited report or annotated VCF that, for each SV image, details the number of times the image was scored and the full set of answers it received. Additionally, a curation score can be calculated for each image by providing a value for each answer option and an aggregation function (e.g., mean, median, mode, standard deviation, min, max). For example, consider the cancer example from above, where the values 3, 2, 1, and 0 mapped to the answers "TUMOR," "BOTH," "NORMAL," and "NEITHER," respectively. If "mode" were selected as the curation function, then the curation score would reflect the opinion of a plurality of users. The mean would reflect the consensus among all users, and the standard deviation would capture the level of disagreement about each image. While we expect mean, median, mode, standard deviation, min, and max to satisfy most use cases, users can implement custom scores by operating on the tab-delimited report.

Each PlotCritic project is protected by AWS Cognito user authentication, which securely restricts access to the project website to authenticated users. A project manager is the only authorized user at startup and can authenticate other users using Cognito's secure services. The website can be further secured using HTTPS, and additional controls, such as IP restrictions, can be put in place by configuring AWS IAM access controls directly for S3 and DynamoDB.

Availability of source code and requirements

Project name: SV-plaudit
Project home page: https://github.com/jbelyeu/SV-plaudit
Operating systems: Mac OS and Linux
Programming language: Python, bash
License: MIT
Research Resource Identification Initiative ID: SCR 016285

Availability of supporting data and material

The datasets generated and/or analyzed during the current study are available in the 1000 Genomes Project repository, ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/. All test data used or generated during this study, and a snapshot of the code, are available in the GigaScience GigaDB repository [28].

Additional files

Supplemental Figure 1. Plots for different structural variant types shown in sample NA12878. (A) A region is shown where a duplication event was called. (B) A region is shown where an inversion event was called.
Supplemental Figure 2. A deletion call for sample NA12878 using different sequencing data to compare variant plots from high, medium, and low coverage levels. Mean sequencing depth of the BAM files used was (A) 58x (1000 Genomes Project, high coverage), (B) 33x (Genome in a Bottle Consortium), and (C) 5x (1000 Genomes Project, low coverage).
Supplemental Figure 3. A selection of structural variant visualizations from the Genome in a Bottle call set: (A and B), "LongReadHomRef" in (C), and "NoConsensusGT" in (D).
Supplemental File 1.vcf
Supplemental File 3.sh
Supplemental File 4.csv
Supplemental File 5.vcf
Supplemental File 6.txt

Abbreviations

SV: structural variant; VCF: Variant Call Format; WGS: whole-genome sequencing.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Funding

This research was supported by US National Human Genome Research Institute awards to R.M.L. (NIH K99HG009532) and A.R.Q. (NIH R01HG006693 and NIH R01GM124355), as well as a US National Cancer Institute award to A.R.Q. (NIH U24CA209999).

Authors' contributions

J.R.B. and R.M.L. developed the software. J.R.B., T.J.N., B.S.P., T.A.S., J.M.H., S.N.K., M.E.C., B.K.L., and R.M.L. scored variants for the experiment. J.R.B., A.R.Q., and R.M.L. wrote the manuscript. A.R.Q. and R.M.L. conceived the study.

References

1. Sudmant PH, Rausch T, Gardner EJ, et al. An integrated map of structural variation in 2,504 human genomes. Nature 2015;526:75-81.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444:444-54.
3. Newman TL, Tuzun E, Morrison VA, et al. A genome-wide survey of structural variation between human and chimpanzee. Genome Res 2005;15:1344-56.
4. Bailey JA, Eichler EE. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat Rev Genet 2006;7:552-64.
5. Payer LM, Steranka JP, Yang WR, et al. Structural variants caused by Alu insertions are associated with risks for many human diseases. Proc Natl Acad Sci U S A 2017;114:E3984-92.
6. Schubert C. The genomic basis of the Williams-Beuren syndrome. Cell Mol Life Sci 2009;66:1178-97.
7. Pleasance ED, Cheetham RK, Stephens PJ, et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 2010;463:191-96.
8. Venkitaraman AR. Cancer susceptibility and the functions of BRCA1 and BRCA2. Cell 2002;108:171-82.
9. Zhang F, Gu W, Hurles ME, et al. Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 2009;10:451-81.
10. Ye K, Schulz MH, Long Q, et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25:2865-71.
11. Rausch T, Zichner T, Schlattl A, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012;28:i333-39.
12. Handsaker RE, Korn JM, Nemesh J, et al. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet 2011;43:269-76.
13. Kronenberg ZN, Osborne EJ, Cone KR, et al. Wham: identifying structural variants of biological consequence. PLoS Comput Biol 2015;11:e1004572.
14. Layer RM, Chiang C, Quinlan AR, et al. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol 2014;15:R84.
15. Zook JM, Chapman B, Wang J, et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 2014;32:246-51.
16. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14:178-92.
17. Fiume M, Williams V, Brook A, et al. Savant: genome browser for high-throughput sequencing data. Bioinformatics 2010;26:1938-44.
18. Munro JE, Dunwoodie SL, Giannoulatou E. SVPV: a structural variant prediction viewer for paired-end sequencing datasets. Bioinformatics 2017;33:2032-3.
19. O'Brien TM, Ritz AM, Raphael BJ, et al. Gremlin: an interactive visualization model for analyzing genomic rearrangements. IEEE Trans Vis Comput Graph 2010;16:918-26.
20. Wyczalkowski MA, Wylie KM, Cao S, et al. BreakPoint Surveyor: a pipeline for structural variant visualization. Bioinformatics 2017;33:3121-2.
21. Spies N, Zook JM, Salit M, et al. svviz: a read viewer for validating structural variants. Bioinformatics 2015;31:3994-6.
22. Chiang C, Layer RM, Faust GG, et al. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods 2015;12:966-8.
23. Abyzov A, Urban AE, Snyder M, et al. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 2011;21:974-84.
24. pysam documentation. https://github.com/pysam-developers/pysam
25. Hunter JD. Matplotlib: a 2D graphics environment. Computing in Science & Engineering 2007;9:90-95.
26. Zook JM, Catoe D, McDaniel J, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data 2016;3:160025.
27. Kortschak RD, Pedersen BS, Adelson DL. bíogo/hts: high throughput sequence handling for the Go language. JOSS 2017;2:168.
28. Belyeu JR, Nicholas TJ, Pedersen BS, et al. Supporting data for "SV-plaudit: A cloud-based framework for manually curating thousands of structural variants." GigaScience Database 2018. http://dx.doi.org/10.5524/100450.
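The agreement criteria described above can be written out explicitly. The sketch below is an illustrative encoding of the thresholds stated in the text, not code from SV-plaudit; the genotype labels ("hom_ref," "het," "hom_alt") are hypothetical names for this example.

```python
# Illustrative encoding of the agreement criteria in the text (not SV-plaudit
# code). Curation scores use the GOOD/DE NOVO = 1, BAD = 0 mapping with a
# mean aggregation, so <0.2 and >0.8 bound the "unambiguous" variants.

def cnvnator_agrees(curation_score, copy_number):
    """CNVNATOR agrees when its copy number matches the curation verdict."""
    if curation_score < 0.2:                  # unambiguous false SV
        return 1.4 <= copy_number <= 2.4      # copy number near 2
    if curation_score > 0.8:                  # unambiguous true deletion
        return copy_number < 1.4              # copy number near 1 or 0
    return None                               # ambiguous: not assessed

def svtyper_agrees(curation_score, genotype):
    """SVTYPER agrees when its genotype matches the curation verdict."""
    if curation_score < 0.2:
        return genotype == "hom_ref"          # homozygous reference
    if curation_score > 0.8:
        return genotype in ("het", "hom_alt") # nonreference
    return None
```

Note that ambiguous variants (0.2 <= score <= 0.8) are deliberately excluded, matching the paper's restriction of the comparison to unambiguous deletions.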
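The curation score used in this experiment can be sketched in a few lines. This is a minimal illustration of the scheme described above (answers mapped to values, aggregated by the mean, thresholded at 0.2 and 0.8), not PlotCritic's internal implementation.

```python
# Sketch of the experiment's curation score (illustrative, not PlotCritic
# code): map each answer to a value, take the mean across scorers, and
# classify by the <0.2 / >0.8 "at most one dissenting vote" thresholds.
ANSWER_VALUES = {"GOOD": 1.0, "DE NOVO": 1.0, "BAD": 0.0}

def curation_score(answers):
    """Mean of the mapped answer values for one SV."""
    values = [ANSWER_VALUES[a] for a in answers]
    return sum(values) / len(values)

def classify(score):
    """With nine scorers, these bounds allow at most one dissenting vote."""
    if score < 0.2:
        return "unambiguous BAD"
    if score > 0.8:
        return "unambiguous GOOD"
    return "ambiguous"

# Eight GOOD votes and one BAD vote: score = 8/9, an unambiguous GOOD.
votes = ["GOOD"] * 8 + ["BAD"]
verdict = classify(curation_score(votes))
```

With nine votes, a single dissenter yields 8/9 ≈ 0.89 > 0.8, so the variant still counts as unambiguous, while two dissenters (7/9 ≈ 0.78) push it into the ambiguous band.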
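The color-coding of paired-end alignments described above rests on standard orientation and insert-size signatures. The function below is an illustrative sketch of that classification logic under common conventions (it is not samplot's actual implementation, and the parameter names are ours).

```python
# Illustrative sketch (not samplot's code) of how a paired-end alignment's
# orientation and insert size suggest an SV type, which drives the colors
# described in the text (black = deletion, red = duplication,
# blue/green = inversion).
def paired_end_sv_type(read_is_reverse, mate_is_reverse,
                       read_pos, mate_pos, insert_size, max_normal_insert):
    """Return the SV type a discordant pair supports, or None if concordant."""
    # Same-orientation pairs (+/+ or -/-) indicate an inversion.
    if read_is_reverse == mate_is_reverse:
        return "inversion"
    # An everted pair, where the leftmost read is reversed (-/+),
    # indicates a tandem duplication.
    leftmost_is_reverse = (read_is_reverse if read_pos < mate_pos
                           else mate_is_reverse)
    if leftmost_is_reverse:
        return "duplication"
    # A +/- pair with an oversized insert indicates a deletion.
    if insert_size > max_normal_insert:
        return "deletion"
    return None  # concordant ("normal") pair
```

In pysam terms, the orientation flags correspond to `AlignedSegment.is_reverse` and `mate_is_reverse`, and the insert size to the absolute template length.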
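The downsampling rule above (keep all discordant pairs, thin out "normal" +/- pairs whose insert size is within Z standard deviations of the mean) can be sketched as follows. This is a simplified illustration, not samplot's code; the `keep_fraction` and `seed` parameters are assumptions for the example.

```python
import random
import statistics

# Simplified sketch of downsampling "normal" pairs (illustrative only):
# pairs within z standard deviations of the mean insert size are thinned,
# while every discordant pair is kept.
def downsample_normal_pairs(pairs, z=4, keep_fraction=0.1, seed=0):
    """pairs: list of (insert_size, is_plus_minus_orientation) tuples."""
    sizes = [size for size, _ in pairs]
    mean, sd = statistics.mean(sizes), statistics.pstdev(sizes)
    normal, discordant = [], []
    for size, plus_minus in pairs:
        # "Normal": +/- orientation and insert size within z std devs.
        if plus_minus and abs(size - mean) <= z * sd:
            normal.append((size, plus_minus))
        else:
            discordant.append((size, plus_minus))
    random.seed(seed)
    kept = random.sample(normal, int(len(normal) * keep_fraction))
    return discordant + kept
```

For a large deletion spanned by thousands of concordant pairs, this keeps the plot legible without discarding any of the discordant evidence that actually supports the event.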
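A configuration for the cancer scenario above might look like the following sketch. Only the `question`, `answers`, and `curationQandA` fields are named in the text (and in Fig. 3B); the exact file layout, the `projectName` field, and the use of number keys as keyboard bindings are illustrative assumptions, and a real config.json includes additional deployment settings.

```json
{
  "projectName": "cancer-sv-curation",
  "curationQandA": {
    "question": "In which samples does the SV appear?",
    "answers": {
      "1": "TUMOR",
      "2": "BOTH",
      "3": "NORMAL",
      "4": "NEITHER"
    }
  }
}
```

Here each answer is keyed by the keyboard shortcut a reviewer presses, which is what makes per-image scoring take only a couple of seconds.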
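The report-based scoring described above can be sketched directly from Python's statistics module. The value mapping follows the cancer example in the text (TUMOR = 3, BOTH = 2, NORMAL = 1, NEITHER = 0); the function itself is an illustration of operating on the tab-delimited report, not PlotCritic code.

```python
import statistics

# Sketch of custom scoring over PlotCritic's report (illustrative).
# Values follow the cancer example: TUMOR=3, BOTH=2, NORMAL=1, NEITHER=0.
VALUES = {"TUMOR": 3, "BOTH": 2, "NORMAL": 1, "NEITHER": 0}

def aggregate(answers, how):
    """Aggregate one image's answers with the chosen curation function."""
    vals = [VALUES[a] for a in answers]
    if how == "mode":
        return statistics.mode(vals)    # plurality opinion
    if how == "mean":
        return statistics.mean(vals)    # consensus across all users
    if how == "stdev":
        return statistics.pstdev(vals)  # level of disagreement
    raise ValueError(f"unknown aggregation: {how}")

answers = ["TUMOR", "TUMOR", "BOTH", "TUMOR", "NORMAL"]
# mode -> 3 (TUMOR wins the plurality); mean -> 2.4
```

A standard deviation of zero would indicate unanimous agreement, making it a convenient filter for pulling out the ambiguous images that deserve a second look.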
