Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
Center for Integrative Bioinformatics Vienna, Max F. Perutz
Laboratories, University of Vienna, Medical University of Vienna, Vienna, Austria.
Simons Center for Quantitative Biology, Cold Spring Harbor
Laboratory, Cold Spring Harbor, NY, USA.
Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria.
Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA.
These authors contributed equally: Fritz J. Sedlazeck,
Philipp Rescheneder. *e-mail: firstname.lastname@example.org; email@example.com
Structural variations (SVs), including insertions, deletions,
duplications, inversions, and translocations at least 50 bp in
size, account for the greatest number of divergent base pairs
across human genomes. SVs contribute to polymorphic variation,
pathogenic conditions, large-scale chromosome evolution, and
human diseases such as cancer and Alzheimer’s. SVs also
affect phenotypes in many other organisms. In one of the first
reports of SV prevalence, published in 2004, Sebat et al.
discovered in a microarray study that large-scale copy-number
polymorphisms are common across healthy human genomes. Today, SV
detection most often uses short paired-end reads. Copy-number
variations are observed as decreases (deletions) or increases
(amplifications) in aligned read coverage, and other types of SVs
are identified by the arrangement of paired-end reads or
split-read alignments. Short-read approaches, however, have been
reported to lack sensitivity (only 10% of SVs detected),
exhibit very high false positive rates (up to 89%), and
misinterpret complex or nested SVs.
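The coverage signal described above can be illustrated with a minimal sketch. It assumes read depth has already been summarized into fixed-size windows, and it uses hypothetical thresholds (`low_ratio`, `high_ratio`) relative to the median depth; neither the function name nor the thresholds come from any published caller, and real tools additionally correct for GC content, mappability, and noise.

```python
from statistics import median


def classify_coverage_windows(window_coverage, low_ratio=0.5, high_ratio=1.5):
    """Label each window by its depth relative to the median depth.

    low_ratio and high_ratio are illustrative cut-offs (assumptions),
    not values taken from the paper or from any specific tool.
    """
    baseline = median(window_coverage)
    if baseline <= 0:
        raise ValueError("median coverage must be positive")

    calls = []
    for depth in window_coverage:
        ratio = depth / baseline
        if ratio <= low_ratio:
            calls.append("deletion")        # coverage drop
        elif ratio >= high_ratio:
            calls.append("amplification")   # coverage gain
        else:
            calls.append("neutral")
    return calls


# Toy per-window depths: a dip in the middle and a gain near the end.
depths = [30, 31, 29, 14, 15, 30, 62, 60, 30]
calls = classify_coverage_windows(depths)
```

A depth-only signal of this kind cannot say where or in which orientation extra copies are placed, which is one reason paired-end and split-read evidence is used for the other SV types.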
Long-read single-molecule sequencing has the potential to
substantially increase the reliability and resolution of SV detection.
With average read lengths of 10 kbp or higher, the reads can be
more confidently aligned to repetitive sequences that often
mediate the formation of SVs. Long reads are also more likely to span
SV breakpoints with high-confidence alignments. Despite these
advantages, long reads introduce new challenges. Most
important, they have a high sequencing error rate, currently 10–15%
for Pacific Biosciences (PacBio) and 5–20% for Oxford Nanopore,
which necessitates new methods. A few aligners are
available, including LAST and minimap2. Only one stand-alone method
is available to detect all types of SVs from long-read
data, although others such as SMRT-SV have been proposed for
a subset of SV types.
To address these challenges, we introduce two open-source
algorithms, NGMLR and Sniffles, for comprehensive long-read
alignment and SV detection (Fig. 1). NGMLR is a fast and accurate
aligner for long reads based on extension of our previous short-read
aligner, with a new convex gap-cost scoring model to align
long reads across SV breakpoints. Sniffles successively scans the
alignments to identify all types of SVs. Its SV-scoring scheme
evaluates candidate SVs on the basis of their size, position, type,
coverage, and breakpoint consistency, and thus overcomes the high
insertion/deletion (indel) error rates in long-read sequencing.
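To make the idea of a convex gap cost concrete, the following sketch contrasts it with a conventional affine penalty. It is an illustration under stated assumptions, not NGMLR's actual scoring function: the function names and the parameters `gap_open`, `extend_max`, `extend_min`, and `decay` are hypothetical. The defining property shown is that each additional gap base costs no more than the previous one, so a single long gap becomes much cheaper than under an affine model, while short gaps cost about the same.

```python
def marginal_extension_cost(position, extend_max=5.0, extend_min=1.0, decay=0.15):
    """Cost of the `position`-th extension base of a gap (1-based).

    The cost shrinks linearly with gap length until it reaches a floor.
    All parameter values are illustrative assumptions.
    """
    return max(extend_min, extend_max - decay * position)


def convex_gap_penalty(length, gap_open=5.0):
    """Total penalty of one gap with non-increasing per-base extension cost."""
    if length <= 0:
        return 0.0
    return gap_open + sum(marginal_extension_cost(i) for i in range(1, length))


def affine_gap_penalty(length, gap_open=5.0, gap_extend=5.0):
    """Classic affine penalty: every extension base costs the same."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# Short gaps (typical sequencing errors) are penalized almost identically,
# whereas a long gap (a candidate genuine indel) is far cheaper when convex.
short_convex, short_affine = convex_gap_penalty(3), affine_gap_penalty(3)
long_convex, long_affine = convex_gap_penalty(100), affine_gap_penalty(100)
```

Under such a scheme an aligner has an incentive to merge the evidence for a genuine indel into one contiguous gap rather than scatter it across many small gaps, which is the behavior the text attributes to the convex model.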
We applied our methods to simulated and genuine datasets
for Arabidopsis, healthy human genomes, and a cancerous human
genome to demonstrate their increased accuracy compared with
that of alternate short- and long-read callers. A particularly
innovative feature of Sniffles is its ability to detect nested SVs, such as
inverted tandem duplications (INVDUPs) and inversions flanked
by indels (INVDELs). These are poorly studied classes of SVs;
although both have been previously associated with genomic
disorders, they could not be routinely detected, and so their full
significance is currently unknown. Finally, we show that our methods
reduce the sequencing and computational costs required per
sample, and thus make the application of long reads to large numbers of
samples increasingly feasible.
Accurate mapping and detection of SVs with long reads. Unlike
most aligners, NGMLR uses a convex gap-scoring model to accurately
align reads spanning genuine indels in the presence of small
observed indels (1–10 bp) that commonly occur as sequencing errors
(Fig. 2, Methods, and Supplementary Note 1). Larger or more complex
SVs are captured through split-read alignments. To achieve both high
performance and accuracy, NGMLR first partitions the long reads into
256-bp subsegments and aligns them independently to the reference
Accurate detection of complex structural variations using single-molecule sequencing
Fritz J. Sedlazeck*, Philipp Rescheneder, Moritz Smolka, Han Fang, Maria Nattestad,
Arndt von Haeseler and Michael C. Schatz
Structural variations are the greatest source of genetic variation, but they remain poorly understood because of
technological limitations. Single-molecule long-read sequencing has the potential to dramatically advance the field,
although high error rates are a challenge with existing methods. Addressing this need, we introduce open-source
methods for long-read alignment (NGMLR; https://github.com/philres/ngmlr) and structural variant identification
(Sniffles; https://github.com/fritzsedlazeck/Sniffles) that provide unprecedented sensitivity and precision for
variant detection, even in repeat-rich regions and for complex nested events that can have substantial effects on
human health. In several long-read datasets, including healthy and cancerous human genomes, we discovered thousands
of novel variants and categorized systematic errors in short-read approaches. NGMLR and Sniffles can automatically
filter false events and operate on low-coverage data, thereby reducing the high costs that have hindered the
application of long reads in clinical and research settings.
NATURE METHODS | VOL 15 | JUNE 2018 | 461–468 | www.nature.com/naturemethods
© 2018 Nature America Inc., part of Springer Nature. All rights reserved.