Genome U-Plot: a whole genome visualization

Abstract

Motivation: The ability to produce and analyze whole genome sequencing (WGS) data from samples with structural variations (SV) has created the need to visualize such abnormalities in simplified plots. Conventional two-dimensional representations of WGS data frequently use either circular or linear layouts. Each representation has its own advantages, but their major disadvantage is that they do not use the two-dimensional space very efficiently. We propose a layout, termed the Genome U-Plot, which spreads the chromosomes on a two-dimensional surface and essentially quadruples the spatial resolution. We present the Genome U-Plot for producing clear and intuitive graphs that allow researchers to generate novel insights and hypotheses by visualizing SVs such as deletions, amplifications, and chromoanagenesis events. The main features of the Genome U-Plot are its layered layout, its high spatial resolution and its improved aesthetic qualities. We compare conventional visualization schemas with the Genome U-Plot using visualization metrics such as number of line crossings and crossing angle resolution measures. Based on our metrics, we improve the readability of the resulting graph by at least 2-fold, making important features apparent and important genomic changes easy to identify.

Results: A whole genome visualization tool with high spatial resolution and improved aesthetic qualities.

Availability and implementation: An implementation and documentation of the Genome U-Plot is publicly available at https://github.com/gaitat/GenomeUPlot.

Contact: vasmatzis.george@mayo.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Visualization is an important aspect of genome-wide analysis as it yields new insights into the genomic data. It provides a fast inspection of the data and identifies areas that require further investigation.
Precisely identifying genome-wide variations is very important for patient outcomes. A chromosome rearrangement is a missing, extra, or irregular portion of chromosomal DNA. The DNA structure can be altered when the total amount of genetic information is decreased (i.e. deletions), increased (i.e. duplications, insertions) or rearranged (i.e. inversions, translocations). Copy number variations (CNVs) are gains and losses of a genomic sequence and are also important forms of genetic variation. A breakpoint is the location left or right of a break in the genome. A junction is the reunion of a break where two distal breakpoints are now adjacent.

Current visualization approaches of genomic data use either a linear or a circular display of the data. The linear layouts represent the chromosomes in a linear fashion end to end. The circular layouts also represent the chromosomes end to end, but the start of the first chromosome and the end of the last chromosome are brought together. Furthermore, there are the karyogram-type layouts where each chromosome occupies one line (or column). None of these types of layouts provides the resolution required to perform investigative research.

There are many applications that use linear layouts to visualize genomic data: IGV (Robinson et al., 2011; Thorvaldsdóttir et al., 2013), a desktop application and a JavaScript package; pileup.js (Vanderkam et al., 2016), a JavaScript package; GenVisR (Skidmore et al., 2016), an R package; SVGenes (Etherington and MacLean, 2013), a Ruby package; ChromoZoom (Pak and Roth, 2013), a JavaScript package; Dalliance (Down et al., 2011), a JavaScript package; and Sushi.R (Phanstiel et al., 2014), an R package, are just a few of them. Hardly any of these packages is capable of displaying whole-genome data; rather, they display data for one chromosome at a time or for even smaller genomic ranges.
Ideogram (Weitz, 2015, https://github.com/eweitz/ideogram) is a karyogram-type viewer whose drawback is that half of the plot space remains empty. The linear layout packages all have in common the notion of a track; one track is used for each genomic data-type element and the tracks are stacked on top of each other, genomically aligned. Depending on how many tracks are used, a researcher is commonly required to scroll up and down the page in order to view and make sense of all of his/her data.

Some applications that produce circular layouts of genomic data include Circos (Krzywinski et al., 2009), a Perl package; BioCircos.js (Cui et al., 2016), a JavaScript & D3 package; GenomeD3Plot (Laird et al., 2015), a JavaScript & D3 package; OmicCircos (Hu et al., 2014), an R package; JCircos (An et al., 2015), a Java package; and RCircos (Zhang et al., 2013), an R package. The R package ggbio (Yin et al., 2012) can produce both linear and circular representations. On the positive side, circular layouts depict multi-dimensional data in a compact way. On the negative side, they also share the notion of a track, but as the layout is circular, only a finite number of tracks can be added before the data become illegible. Linear or scatter data are added radially inward to the circular layout. As a result, the data resolution is reduced, because the arc length of each chromosome diminishes as data move towards the center of the circle. Furthermore, junctions, which are depicted as circular arcs, very often overlap each other as there is only limited radial pixel space per genomic position. Dense graphs with numerous line crossings are virtually impossible to interpret. Circular layouts provide just an overview of the data without giving the investigator the ability to drill into the data.

We present a novel alternative, the Genome U-Plot, which provides a U-shape layout of genomic data. The proposed method addresses the issue of graph “aesthetics”.
Specifically, in order to optimize a graph, de-tangle complex lines to produce an easily interpretable graph and maximize readability, it is essential that we minimize the number of line crossings, while at the same time maximizing the crossing angle resolution, i.e. increasing the minimum average angle formed by any two crossing lines, as proposed by Didimo et al., 2009, Purchase, 2002 and Bennett et al., 2007. We will show that the Genome U-Plot reduces the number of crossing lines >4-fold and increases the crossing angle resolution by at least 2-fold.

2 Materials and methods

The Genome U-Plot (see Fig. 1) addresses the issue of resolution. The whole-genome view incorporates an overview of all the chromosomes, maximizes media (i.e. screen or paper) space usage and at the same time provides the necessary resolution for investigative research, as it takes advantage of the entire two-dimensional space of the media. The chromosomes are laid out in a U-shape pattern so as to maximize use of the rectangular plane with very little space going to waste. The large chromosomes 1–12 and X are placed on the left side of the view while the smaller chromosomes 13–22 and Y are placed on the right side of the view, filling in the space that the large chromosomes leave behind. As a result, in a 1000 pixel window about 249 000 genomic positions (the size of chromosome 1 divided by 1000) are condensed to 1 pixel; a 4-fold improvement over the circular layouts.

Fig. 1. Whole Genome U-Plot. Visible are the 24 human chromosomes arranged in a U-shape, the cytobands, the chromosome junctions and the copy number variations (CNVs). The axes at the bottom right of the graph apply to the chromosomes on the right side of the plot.

The U-shaped plot is a visualization of multivariate genomic data in which all data are layered on top of each other with clear and precise representations. The approach selected for the visualization of multi-dimensional data was the use of layers, one for each data type, that can be turned on or off. Overlaying different data types that are all based on genomic positions helps the user keep a common context of information instead of scrolling up and down to see data that correspond to the same genomic regions. Cytobands, structural alterations and copy number variations are just some of its layers.

The plot incorporates views for the interpretation of detected structural rearrangements of chromosomes, such as chromosomal alterations and copy number variations (CNVs). Each magenta line in the plot represents a junction, connecting the breakpoints of the junction. The endpoints of the line point to breakpoint A and breakpoint B. The thickness of the line is relative to the number of supporting fragments. In many samples, though, a large number of lines overlap, as the events are intra-chromosomal and cover small genomic ranges close to each other. There are also samples where inter-chromosomal junctions overlap. We therefore opted for circular arcs to connect the breakpoints of junctions; by modifying the magnitude of the curvature of the circular arc we produce graphs that minimize the overlap of junctions. Neutral copy number is shown as a gray line above the ideogram of each chromosome. The line height indicates the copy number level. Lines are colored when a statistical deviation from neutral is detected: red for losses, blue for gains.
The Genome U-Plot is a tool that gives researchers and clinicians the ability to explore large sample-based datasets (DNA, RNA, Exome) to identify which genetic processes are likely implicated in disease for a particular patient or sample. Features of the Genome U-Plot include the visualization of structural alterations as arcs instead of lines, which makes it possible to differentiate between overlapping alterations because the curvature magnitude of the circular arc is user defined. Copy number variations are color coded to represent not only gains and losses but also multiple levels of amplifications or deletions. In addition, the user can filter the displayed data by the number of supporting fragments of each junction. Finally, the application allows for mouse-centric infinite zooming. When the zoom level increases, data representations move around the mouse pointer so that context is not lost in the plethora of information. With an extremely dynamic viewport displaying multi-dimensional data visualizations, bioinformatics specialists will use this tool to both analyze and report on potentially pathogenic events within a patient’s genome, specifically structural alterations and copy number variations (CNVs) during the first stage, followed by gene variants, indel variants, LOH variants, single nucleotide variants (SNVs), gene fusions, mutations, gene expression and exon expression at a later point.

The application is designed and developed using the latest web technologies such as the D3.js API (Bostock, 2012). D3.js extends the capability of Scalable Vector Graphics (SVG; Consortium et al., 2000) to render 2D visualizations in any compatible web browser without the use of any plug-ins. This ensures the operation of the software not only on multiple device types such as desktops, tablets and mobile devices but also on multiple operating systems.
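The mouse-centric zooming described above can be illustrated with a minimal sketch. It is not taken from the application's source: the function name `zoom_about` and the one-axis model (screen position = offset + scale × data position) are assumptions made only for illustration. The idea is that the offset is recomputed on every zoom step so that the data position under the pointer stays under the pointer.

```python
def zoom_about(pointer, offset, scale, new_scale):
    """Return the new offset for one axis after changing the zoom level.

    Assumed model (hypothetical, for illustration only):
        screen_position = offset + scale * data_position
    The data position currently under `pointer` must map back to `pointer`
    once the scale becomes `new_scale`.
    """
    if scale == 0:
        raise ValueError("scale must be non-zero")
    # Data position that is currently drawn under the mouse pointer.
    data_position = (pointer - offset) / scale
    # Choose the offset so that this data position stays under the pointer.
    return pointer - new_scale * data_position


def to_screen(data_position, offset, scale):
    """Map a data position to a screen position under the assumed model."""
    return offset + scale * data_position


# Zoom in 2x around the pointer at screen x = 100.
new_offset = zoom_about(pointer=100.0, offset=0.0, scale=1.0, new_scale=2.0)
print(new_offset, to_screen(100.0, new_offset, 2.0))
```

Applying the same update independently to the x and y axes keeps a two-dimensional view anchored at the pointer.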
To quantify the issue of graph “aesthetics” and determine the number of junction crossings and the crossing angle resolution, as discussed by Didimo et al., 2009, Purchase, 2002 and Bennett et al., 2007, we need to be able to compute the intersection point of two circular arcs and the angles that the two arcs form around that point.

2.1 Arc intersection computation

To calculate the intersection between two circular arcs A and B (see Fig. 2), A being defined by the end points (x1, y1) and (x2, y2), and B by the end points (x3, y3) and (x4, y4), we first compute the circle that is defined by each circular arc. We define the radius of the circle of the circular arc A to be:

$$ r_A = c \cdot \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \qquad (1) $$

where c is the magnitude of the curvature of the circular arc.

Fig. 2. Arc intersection computation.

Using the Pythagorean theorem, we calculate the vertex VA of the isosceles triangle that is formed by the chord of the circular arc A [i.e. the points (x1, y1) and (x2, y2)] and the two radii rA which connect VA with the two endpoints of the circular arc A. VA is the center (cAx, cAy) of the unique circle that contains the circular arc A. Similarly, we compute the center (cBx, cBy) of the circle that contains the circular arc B and whose radius is rB. We proceed to calculate the intersection points of the two circles (cAx, cAy, rA) and (cBx, cBy, rB) (Bourke, 1997, http://paulbourke.net/geometry/circlesphere/). We then determine whether each intersection point of the two circles belongs to the circular arcs A and B; a point that belongs to both arcs is used as the circular arc intersection point. In order to calculate whether an intersection point (ix, iy) of the two circles belongs to one of the arcs, e.g. arc A, we compute the angles θAs, θAe and θAi.
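The circle construction and the circle-circle intersection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `side` parameter (which of the two possible centers to use) are assumptions. The sketch also assumes that the radius from Equation (1) is at least half the chord length (c ≥ 0.5), since otherwise no circle of that radius passes through both endpoints; how smaller curvature magnitudes are handled is not specified in the text and is not modelled here.

```python
import math


def arc_circle(x1, y1, x2, y2, c, side=1):
    """Center and radius of the circle through two arc endpoints.

    The radius follows Equation (1): r = c * chord length.
    `side` (+1 or -1) is a hypothetical parameter selecting on which side
    of the chord the center lies.
    """
    chord = math.hypot(x1 - x2, y1 - y2)
    r = c * chord
    half = chord / 2.0
    if chord == 0 or r < half:
        raise ValueError("no circle of this radius passes through both endpoints")
    # Pythagorean theorem on the isosceles triangle (chord, r, r):
    # distance from the chord midpoint to the center.
    h = math.sqrt(r * r - half * half)
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Unit vector perpendicular to the chord.
    nx, ny = -(y2 - y1) / chord, (x2 - x1) / chord
    return mx + side * h * nx, my + side * h * ny, r


def circle_intersections(ax, ay, ar, bx, by, br):
    """Intersection points of two circles (standard two-circle construction)."""
    d = math.hypot(bx - ax, by - ay)
    # Coincident centers, circles too far apart, or one inside the other.
    if d == 0 or d > ar + br or d < abs(ar - br):
        return []
    # Distance from center A to the line through the intersection points.
    a = (ar * ar - br * br + d * d) / (2.0 * d)
    h = math.sqrt(max(ar * ar - a * a, 0.0))
    px = ax + a * (bx - ax) / d
    py = ay + a * (by - ay) / d
    if h == 0:
        return [(px, py)]  # tangent circles touch at a single point
    ox = h * (by - ay) / d
    oy = h * (bx - ax) / d
    return [(px + ox, py - oy), (px - ox, py + oy)]
```

Each point returned by `circle_intersections` is only a candidate: it still has to be tested against both arcs, because the circles may cross outside the drawn arc segments.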
θAs is the angle formed by the x-axis and the line segment defined by the start of the arc (x1, y1) and the center of the circle (cAx, cAy). Similarly, θAe is the angle formed by the x-axis and the line segment defined by the end of the arc (x2, y2) and the center of the circle (cAx, cAy). Finally, θAi is the angle formed by the x-axis and the line segment defined by the intersection point (ix, iy) and the center of the circle (cAx, cAy). If θAs ≤ θAi ≤ θAe, then the intersection point (ix, iy) lies within the circular arc A.

In order to calculate the angle between the two intersecting arcs, we calculate the slopes (mA, mB) of the tangents to the arcs at the intersection point using the formulas:

$$ m_A = -\cot(\theta_{Ai}), \qquad m_B = -\cot(\theta_{Bi}) \qquad (2) $$

Then we calculate the angle between the two tangents:

$$ \theta_{AB} = \arctan\left(\frac{m_A - m_B}{1 + m_A \cdot m_B}\right) \qquad (3) $$

Finally, the angle between the two arcs A and B is defined as min(θAB, 180° − θAB) (see Fig. 2).

3 Results

We ran a simulation, minimizing the number of intersecting lines and maximizing the crossing angle resolution, in order to determine the best curvature magnitude of the circular arc. We show the results for one particular sample that has several crossing junctions as well as several overlapping junctions.

In Figure 3 (and in Supplementary Figs S9, S13, S17, S21, S25, S29, S33, S37, S41, S45), we show the sample without any optimization. In this state, it has 76 crossing lines and the crossing angle resolution is 25.46°. It is clear that the intra-chromosomal junctions on chromosomes 6 & 7 are not at all discernible. It is also very hard to determine the inter-chromosomal junctions between chromosomes 6 & 7. There exists an additional ambiguity between chromosomes 1 & 6 and 1 & 12, as the junction lines overlap since they have the same slope.

Fig. 3. The sample junctions depicted as magenta lines.

In Figure 4 (and in Supplementary Figs S10, S14, S18, S22, S26, S30, S34, S38, S42, S46), we show the sample junctions as magenta circular arcs after setting the curvature magnitude to the optimized value of 0.28 (as determined by the simulation; see Fig. 5), for which the number of crossing lines is 64 and the crossing angle resolution is 36.44°.

Fig. 4. The sample junctions depicted as magenta circular arcs after optimizing the magnitude of the curvature.

Fig. 5. Curvature magnitude simulation plot.

In Figure 5 (and in Supplementary Figs S11, S15, S19, S23, S27, S31, S35, S39, S43, S47), we show a plot of the optimization score for all curvature magnitudes that we ran. The highest score is selected and used for the Genome U-Plot (see Fig. 4).

In Figure 6 (and in Supplementary Figs S12, S16, S20, S24, S28, S32, S36, S40, S44, S48), we show the circular plot for the same sample, which has 407 crossing circular arcs and a crossing angle resolution of 13.1°. It is clear that the interactions between chromosomes 6 & 7 are not at all discernible. In this case, the junctions between chromosomes 1 & 6 and 1 & 12 are actually discernible, but they are only a small proportion of all junctions, most of which are not distinct.

Fig. 6. Corresponding circular plot using CircosJS (Girault, 2015, https://github.com/nicgirault/circosJS).

Then we ran the same simulation for many samples.
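The per-crossing quantities behind these counts and angles, namely the arc membership test and the crossing angle of Section 2.1, can be sketched as follows. This is a hedged illustration rather than the authors' code, and the function names are hypothetical. Two choices are assumptions of this sketch: the membership test treats the arc as a counter-clockwise sweep from its start angle to its end angle and handles wrap-around at 360°, and the crossing angle is computed from the radial angles directly instead of from the slopes of Equation (2). Because each tangent is perpendicular to its radius, the angle between the two tangents equals the angle between the two radii, and this form avoids the undefined slope of a vertical tangent.

```python
import math


def angle_about(cx, cy, px, py):
    """Angle in degrees, in [0, 360), between the x-axis and the segment
    from the circle center (cx, cy) to the point (px, py)."""
    return math.degrees(math.atan2(py - cy, px - cx)) % 360.0


def on_arc(theta_s, theta_e, theta_i):
    """True if theta_i lies on the arc from theta_s to theta_e.

    Assumption of this sketch: the arc is swept counter-clockwise from
    theta_s to theta_e, so angles are compared modulo 360 degrees.
    """
    return (theta_i - theta_s) % 360.0 <= (theta_e - theta_s) % 360.0


def crossing_angle(theta_ai, theta_bi):
    """Angle in degrees, in [0, 90], between two arcs at a shared point.

    theta_ai and theta_bi are the radial angles of the intersection point
    on circles A and B. The result corresponds to min(theta_AB, 180 - theta_AB)
    with theta_AB taken as a non-negative angle.
    """
    d = abs(theta_ai - theta_bi) % 180.0
    return min(d, 180.0 - d)


# Radii at 30 and 60 degrees: the tangents cross at 30 degrees.
print(crossing_angle(30.0, 60.0))
```

A point counts as a crossing only when `on_arc` holds for both arcs; the crossing angle resolution of a plot is then derived from the `crossing_angle` values of all such points.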
From a pool of about 3000 samples, we randomly picked 45 samples with a bias towards samples that had a significant number of chromosomal rearrangements. In Figure 7, we show a plot (in log scale) of the number of circular arc crossings on the Genome U-Plot versus the number of circular arc crossings on the circular plot. We observe that for any given sample, there are approximately four times more circular arc crossings for the circular plot than for the Genome U-Plot.

Fig. 7. Number of circular arc crossings (log10) over multiple samples.

In Figure 8, we show a plot of the crossing angle resolution on the Genome U-Plot versus the crossing angle resolution on the circular plot. Again we observe that for any given sample, the crossing angle resolution in the Genome U-Plot is about two times higher than the one in the circular plot.

Fig. 8. Crossing angle resolution over multiple samples.

4 Discussion

As we have shown, the proposed layout improvements result in optimization of the aesthetic constraints of a graph, leading to a less tangled, more readable graph with a 4-fold improvement in spatial resolution. In the case of the circular layout, the available linear space to lay out the genomic data is the circumference of a circle of length 2·π·r (where r is the outer radius of the circular layout). On the other hand, the corresponding linear space of the Genome U-Plot is 13 rows each of length 2·r. Therefore, with our suggested layout of chromosomes and pertinent data-elements in a U-shape representation, we attain an improved use of the two-dimensional space by a factor of 13/π. To elaborate more on the data resolution of circular layouts, given a circle of radius equal to 500 pixels the circumference of the circle would be about 3141 pixels.
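The layout arithmetic of this comparison can be checked with a few lines of code. This is only a worked example; the function names are hypothetical, and the genome size of about 3 × 10^9 base pairs is the round figure used in this discussion.

```python
import math


def circular_linear_space(r):
    """Linear space of a circular layout: the circumference 2*pi*r."""
    return 2.0 * math.pi * r


def uplot_linear_space(r, rows=13):
    """Linear space of the Genome U-Plot: `rows` rows, each of length 2*r."""
    return rows * 2.0 * r


def bases_per_pixel(genome_bp, pixels):
    """Number of genomic positions condensed into one pixel."""
    return genome_bp / pixels


r = 500.0
circumference = circular_linear_space(r)            # about 3141.6 pixels
gain = uplot_linear_space(r) / circumference        # 13/pi, about 4.14
per_pixel = bases_per_pixel(3e9, circumference)     # on the order of 10^6

print(f"{circumference:.1f} px, gain {gain:.2f}x, {per_pixel:.3g} positions/px")
```

The ratio 13/π is independent of r, so the gain of roughly 4-fold holds for any plot size.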
As the entire human genome is about 3 × 10^9 base pairs, about 10^6 genomic positions are condensed into just 1 pixel. As more tracks of data are added at a continuously decreasing circle radius, more and more genomic positions, and therefore data, are compacted into a single pixel, making it possible to view only exaggerated genomic variations.

While circular layouts are visually pleasing, interpretation of the data is cumbersome due to the changing orientations of the layered information. Particularly for gains and losses, indentations due to copy number variations at the top of the plot (see Fig. 6, chromosome 1) are at opposite locations, with respect to the normal values, compared to where they appear at the bottom of the plot (see Fig. 6, chromosome 9). The Genome U-Plot offers left to right reading of the data, which is more intuitive and natural and lends itself to zooming operations. A track-based approach to the presentation of data requires the user to scroll up and down the page to get an entire view of his/her data. The layered design of the Genome U-Plot allows the user to concentrate on a particular genomic position and know that all relevant data will be in proximity and in consistent orientations. Furthermore, as the Genome U-Plot has much more visual space, names of important genes can be presented much more clearly, whereas in circular plots such representations would be much more crowded.

Ideas for future work include a tabular view that will create a connection between the chromosome junctions and the tabular rows; when hovering over a junction in the plot, the corresponding row in the table will be highlighted. Furthermore, multiple gene annotation representations that are DNA strand specific will be possible, such as the RefSeq database (Brister et al., 2015; O’Leary et al., 2015; Tatusova et al., 2016) and the Ensembl database (Yates et al., 2016).
Moreover, searching for particular data given a gene name, a cytoband or a genomic position will enhance the functionality of the application. Finally, a continuous feedback of genomic location (i.e. chromosome, coordinates and locus) will keep the user oriented. In cases where data overflow the specific pixel dimensions of the plot, visible clues will be given so that a user can pay more attention to those specific regions.

Funding

This study was partially funded by the Mayo Clinic Center for Individualized Medicine and the Mayo Clinic Genomics Systems Unit.

Conflict of Interest: AG, SHJ, JBS and GV hold stock in WholeGenome LLC, which is a company that produces visualization tools for whole genome analysis and are the makers of the Genomoscope™ application.

References

An J. et al. (2015) J-circos: an interactive circos plotter. Bioinformatics, 31, 1463–1465.
Bennett C. et al. (2007) The aesthetics of graph visualization. Comput. Aesthetics, 57–64.
Bostock M. (2012) D3.js. Data Driven Docum., 492, 701.
Bourke P. (1997) Intersection of two circles. http://paulbourke.net/geometry/circlesphere/ (20 July 2017, date last accessed).
Brister J.R. et al. (2015) NCBI viral genomes resource. Nucleic Acids Res., 43, D571–D577.
Consortium W.W.W. et al. (2000) Scalable vector graphics (svg) 1.1 specification. W3C Candidate Recommendation, 2.
Cui Y. et al. (2016) Biocircos.js: an interactive circos javascript library for biological data visualization on web applications. Bioinformatics, 32, 1740–1742.
Didimo W. et al. (2009) Drawing graphs with right angle crossings. In: Workshop on Algorithms and Data Structures. Springer, pp. 206–217.
Down T.A. et al. (2011) Dalliance: interactive genome viewing on the web. Bioinformatics, 27, 889–890.
Etherington G.J., MacLean D. (2013) Svgenes: a library for rendering genomic features in scalable vector graphic format. Bioinformatics, 29, 1890–1892.
Girault N. (2015) Circosjs. doi: 10.1016/j.electacta.2015.07.153.
Hu Y. et al. (2014) Omiccircos: a simple-to-use r package for the circular visualization of multidimensional omics data. Cancer Informatics, 13, 13.
Krzywinski M. et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res., 19, 1639–1645.
Laird M.R. et al. (2015) Genomed3plot: a library for rich, interactive visualizations of genomic data in web applications. Bioinformatics, 31, 3348–3349.
O’Leary N.A. et al. (2015) Reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–D745.
Pak T.R., Roth F.P. (2013) Chromozoom: a flexible, fluid, web-based genome browser. Bioinformatics, 29, 384–386.
Phanstiel D.H. et al. (2014) Sushi.r: flexible, quantitative and integrative genomic visualizations for publication-quality multi-panel figures. Bioinformatics, 30, 2808–2810.
Purchase H.C. (2002) Metrics for graph drawing aesthetics. J. Vis. Lang. Comput., 13, 501–516.
Robinson J.T. et al. (2011) Integrative genomics viewer. Nat. Biotechnol., 29, 24–26.
Skidmore Z.L. et al. (2016) Genvisr: genomic visualizations in r. Bioinformatics, 32, 3012–3014.
Tatusova T. et al. (2016) NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res., 44, 6614–6624.
Thorvaldsdóttir H. et al. (2013) Integrative genomics viewer (igv): high-performance genomics data visualization and exploration. Brief. Bioinformatics, 14, 178–192.
Vanderkam D. et al. (2016) pileup.js: a javascript library for interactive and in-browser visualization of genomic data. Bioinformatics, 32, 2378–2379.
Weitz E. (2015) Ideogram. https://github.com/eweitz/ideogram (20 July 2017, date last accessed).
Yates A. et al. (2016) Ensembl 2016. Nucleic Acids Res., 44, D710–D716.
Yin T. et al. (2012) ggbio: an r package for extending the grammar of graphics for genomic data. Genome Biol., 13, R77.
Zhang H. et al. (2013) Rcircos: an r package for circos 2d track plots. BMC Bioinformatics, 14, 244.

© The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).

Bioinformatics, Oxford University Press

Publisher
Oxford University Press
Copyright
© The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/btx829

Abstract

Abstract Motivation The ability to produce and analyze whole genome sequencing (WGS) data from samples with structural variations (SV) generated the need to visualize such abnormalities in simplified plots. Conventional two-dimensional representations of WGS data frequently use either circular or linear layouts. There are several diverse advantages regarding both these representations, but their major disadvantage is that they do not use the two-dimensional space very efficiently. We propose a layout, termed the Genome U-Plot, which spreads the chromosomes on a two-dimensional surface and essentially quadruples the spatial resolution. We present the Genome U-Plot for producing clear and intuitive graphs that allows researchers to generate novel insights and hypotheses by visualizing SVs such as deletions, amplifications, and chromoanagenesis events. The main features of the Genome U-Plot are its layered layout, its high spatial resolution and its improved aesthetic qualities. We compare conventional visualization schemas with the Genome U-Plot using visualization metrics such as number of line crossings and crossing angle resolution measures. Based on our metrics, we improve the readability of the resulting graph by at least 2-fold, making apparent important features and making it easy to identify important genomic changes. Results A whole genome visualization tool with high spatial resolution and improved aesthetic qualities. Availability and implementation An implementation and documentation of the Genome U-Plot is publicly available at https://github.com/gaitat/GenomeUPlot. Contact vasmatzis.george@mayo.edu Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Visualization is an important aspect of genome-wide analysis as it yields new insights into the genomic data. It provides a fast inspection of the data and identifies areas that require further investigation. 
Precisely identifying genome wide variations is very important in patient outcomes. A chromosome rearrangement is a missing, extra, or irregular portion of chromosomal DNA. The DNA structure can be altered when the total amount of genetic information is decreased (i.e. deletions), increased (i.e. duplications, insertions) or rearranged (i.e. inversions, translocations). Copy number variations (CNVs) are gains and losses of a genomic sequence and are also important forms of genetic variation. A breakpoint is the location left or right of a break in the genome. A junction is the reunion of a break where two distal breakpoints are now adjacent. Current visualization approaches of genomic data use either linear or circular display of the data. The linear layouts represent the chromosomes in a linear fashion end to end. The circular layouts also represent the chromosomes end to end but the start of the first chromosome and the end of the last chromosome are brought together. Furthermore, there are the karyogram type layouts where each chromosome occupies one line (or column). None of these types of layouts provide the resolution required to perform investigative research. There are many applications that use linear layouts to visualize genomic data; IGV (Robinson et al., 2011; Thorvaldsdóttir et al., 2013) a desktop application and a JavaScript package, pileup.js (Vanderkam et al., 2016) a JavaScript package, GenVisR (Skidmore et al., 2016) an R package, SVGenes (Etherington and MacLean, 2013) a Ruby package, ChromoZoom (Pak and Roth, 2013) a Javascript package, Dalliance (Down et al., 2011) a JavaScript package, Sushi.R (Phanstiel et al., 2014) an R package, are just a few of them. Hardly any of these packages are capable of displaying whole-genome data but rather display data for one chromosome at a time or much even smaller genomic ranges. 
Ideogram (Weitz, 2015, https://github.com/eweitz/ideogram) is a karyogram type viewer whose drawback is that half of the plot space remains empty. These linear layout packages have all in common the notion of a track; one track is used for each genomic data-type element and tracks are all stacked on top of each other genomically aligned. Depending on how many tracks are used, it is common that a researcher would be required to scroll up and down the page in order to view and make sense of all of his/hers data. Some applications that produce circular layouts of genomic data include Circos (Krzywinski et al., 2009) a Perl package, BioCircos.js (Cui et al., 2016) a JavaScript & D3 package, GenomeD3Plot (Laird et al., 2015) a JavaScript & D3 package, OmicCircos (Hu et al., 2014) an R package, JCircos (An et al., 2015) a Java package, RCircos (Zhang et al., 2013) an R package. The R package ggbio (Yin et al., 2012) can produce both linear and circular representations. On the positive side, circular layouts depict multi-dimensional data in a compact way. On the negative side, they also share the notion of a track but as the layout is circular, there is a finite amount of tracks that can be added before the data becomes illegible. Linear or scatter data are added radially inward to the circular layout. As a result the data resolution is reduced as the arc length of each chromosome is diminished when data moves towards the center of the circle. Furthermore, junctions, which are depicted as circular arcs, very often overlap each other as there is only limited radial pixel space per genomic position. Dense graphs with numerous line crossings are virtually impossible to interpret. Circular layouts provide just an overview of the data without giving the ability to the investigator to drill into the data. We present a novel alternative, the Genome U-Plot, which provides a U-shape layout of genomic data. The proposed method addresses the issue of graph “aesthetics”. 
Specifically, in order to optimize a graph, de-tangle complex lines into an easily interpretable graph and maximize readability, it is essential that we minimize the number of line crossings while at the same time maximizing the crossing angle resolution, i.e. increasing the minimum average angle formed by any two crossing lines, as proposed by Didimo et al. (2009), Purchase (2002) and Bennett et al. (2007). We will show that the Genome U-Plot reduces the number of crossing lines >4-fold and increases the crossing angle resolution by at least 2-fold.

2 Materials and methods

The Genome U-Plot (see Fig. 1) addresses the issue of resolution. The whole-genome view incorporates an overview of all the chromosomes, maximizes the use of media (i.e. screen or paper) space and at the same time provides the necessary resolution for investigative research, as it takes advantage of the entire two-dimensional space of the media. The chromosomes are laid out in a U-shape pattern so as to maximize use of the rectangular plane, with very little space going to waste. The large chromosomes 1–12 and X are placed on the left side of the view, while the smaller chromosomes 13–22 and Y are placed on the right side of the view, filling in the space that the large chromosomes leave behind. As a result, in a 1000 pixel window about 249 000 genomic positions (the size of chromosome 1 divided by 1000) are condensed to 1 pixel; a 4-fold improvement over the circular layouts.

Fig. 1. Whole Genome U-Plot. Visible are the 24 human chromosomes arranged in a U-shape, the cytobands, the chromosome junctions and the copy number variations (CNVs). The axes at the bottom right of the graph are respectively for the chromosomes on the right side of the plot.

The U-shaped plot is a visualization of multivariate genomic data, all layered on top of each other with clear and precise representations. The approach selected for the visualization of multi-dimensional data was the use of layers, one for each data type, that can be turned on or off. Overlaying different data types that are all based on genomic positions helps the user keep a common context of information, instead of scrolling up and down to see data that correspond to the same genomic regions. Cytobands, structural alterations and copy number variations are just some of its layers. The plot incorporates views for the interpretation of detected structural rearrangements of chromosomes, such as chromosomal alterations and copy number variations (CNVs).

Each magenta line in the plot represents a junction, connecting the breakpoints of the junction. The endpoints of the line point to breakpoint A and breakpoint B. The thickness of the line is relative to the number of supporting fragments. There are many samples, though, in which a large number of lines overlap, as the events are intra-chromosomal and cover small genomic ranges close to each other. There are also samples in which inter-chromosomal junctions overlap. We therefore opted to use circular arcs to connect the breakpoints of junctions; by modifying the magnitude of the curvature of the circular arc, we produce graphs that minimize the overlap of junctions.

Neutral copy number is shown as a gray line above the ideogram of each chromosome. The line height indicates the copy number level. Lines are colored when a statistical deviation from neutral is detected: red for losses, blue for gains.
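As a rough check of the resolution figures quoted for the layout, the following Python sketch compares the number of genomic positions per pixel in a 1000 pixel wide U-Plot window with that of a circular layout drawn in the same window. The constants are assumptions: chromosome 1 is taken as about 249 million base pairs (consistent with the 249 000 positions per pixel quoted above), the genome as about 3 × 10^9 base pairs and the circle radius as 500 pixels (the values used in the Discussion). The function names are illustrative and are not part of the Genome U-Plot code.

```python
import math

# Assumed, approximate constants (see the lead-in above).
CHR1_BP = 249_000_000       # approximate length of chromosome 1
GENOME_BP = 3_000_000_000   # approximate size of the human genome
WINDOW_PX = 1000            # width of the plotting window in pixels
CIRCLE_RADIUS_PX = 500      # largest circle that fits in the same window


def bp_per_pixel_uplot(longest_chr_bp=CHR1_BP, width_px=WINDOW_PX):
    """Genomic positions per pixel when the longest chromosome spans the window width."""
    return longest_chr_bp / width_px


def bp_per_pixel_circular(genome_bp=GENOME_BP, radius_px=CIRCLE_RADIUS_PX):
    """Genomic positions per pixel when the whole genome is laid out on the circumference."""
    return genome_bp / (2 * math.pi * radius_px)


if __name__ == "__main__":
    u_plot = bp_per_pixel_uplot()
    circular = bp_per_pixel_circular()
    print(f"U-Plot:   {u_plot:,.0f} positions per pixel")
    print(f"Circular: {circular:,.0f} positions per pixel")
    print(f"Ratio:    {circular / u_plot:.2f}")
```

Under these assumptions the ratio comes out slightly below 4, in line with the roughly 4-fold improvement stated in the text.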
The Genome U-Plot is a tool that gives researchers and clinicians the ability to explore large sample-based datasets (DNA, RNA, Exome) to identify which genetic processes are likely implicated in disease for a particular patient or sample. Features of the Genome U-Plot include the visualization of structural alterations as arcs instead of lines, which makes it possible to differentiate between overlapping alterations, as the curvature magnitude of the circular arc is user defined. Copy number variations are color coded to represent not only gains and losses but also multiple levels of amplifications or deletions. In addition, the user can filter the displayed data by the number of supporting fragments of each junction. Finally, the application allows for mouse-centric infinite zooming: when the zoom level increases, data representations move around the mouse pointer so that context is not lost in the plethora of information.

With an extremely dynamic viewport displaying multi-dimensional data visualizations, bioinformatics specialists will use this tool to both analyze and report on potentially pathogenic events within a patient’s genome, specifically structural alterations and copy number variations (CNVs) during the first stage, followed by gene variants, indel variants, LOH variants, single nucleotide variants (SNVs), gene fusions, mutations, gene expression and exon expression at a later point.

The application is designed and developed using the latest web technologies, such as the D3.js API (Bostock, 2012). D3.js builds on Scalable Vector Graphics (SVG; Consortium et al., 2000) to render 2D visualizations in any compatible web browser without the use of any plug-ins. This ensures the operation of the software not only on multiple device types, such as desktops, tablets and mobile devices, but also on multiple operating systems.
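The mouse-centric zooming mentioned among the features can be illustrated with a small Python sketch. It assumes the common view model in which a screen coordinate is obtained as data · scale + translate. The function name and parameters are hypothetical, and the actual application (built on D3.js) may implement zooming differently.

```python
def zoom_about_pointer(translate, scale, pointer, factor):
    """Return (new_translate, new_scale) after zooming by `factor` about `pointer`.

    Assumed view model: screen = data * scale + translate (per axis).
    The data point currently under the pointer stays under the pointer.
    """
    tx, ty = translate
    px, py = pointer
    new_scale = scale * factor
    # The data point under the pointer is (pointer - translate) / scale.
    # Solving pointer = data * new_scale + new_translate for new_translate gives:
    new_tx = px - (px - tx) * factor
    new_ty = py - (py - ty) * factor
    return (new_tx, new_ty), new_scale


if __name__ == "__main__":
    # Zoom in 2x with the pointer at screen position (100, 50).
    new_translate, new_scale = zoom_about_pointer((0.0, 0.0), 1.0, (100.0, 50.0), 2.0)
    print(new_translate, new_scale)
```

Applying the same function repeatedly keeps the region of interest under the pointer at every zoom level, which is the behavior the text describes as not losing context.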
To quantify the issue of graph “aesthetics” and determine the number of junction crossings and the crossing angle resolution, as discussed by Didimo et al. (2009), Purchase (2002) and Bennett et al. (2007), we need to be able to compute the intersection point of two circular arcs and the angles that the two arcs form around that point.

2.1 Arc intersection computation

To calculate the intersection between two circular arcs A and B (see Fig. 2), A being defined by the end points $(x_1, y_1)$ and $(x_2, y_2)$, and B by the end points $(x_3, y_3)$ and $(x_4, y_4)$, we first compute the circle that is defined by each circular arc. We define the radius of the circle of the circular arc A to be:

$$r_A = c \cdot \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{1}$$

where $c$ is the magnitude of the curvature of the circular arc.

Fig. 2. Arc intersection computation

Using the Pythagorean theorem, we calculate the vertex $V_A$ of the isosceles triangle that is formed by the chord of the circular arc A [i.e. the points $(x_1, y_1)$ and $(x_2, y_2)$] and the two radii $r_A$ which connect $V_A$ with the two endpoints of the circular arc A. $V_A$ is the center $(c_{Ax}, c_{Ay})$ of the unique circle that contains the circular arc A. Similarly, we compute the center $(c_{Bx}, c_{By})$ of the circle that contains the circular arc B and whose radius is $r_B$. We proceed to calculate the intersection points of the two circles $(c_{Ax}, c_{Ay}, r_A)$ and $(c_{Bx}, c_{By}, r_B)$ (Bourke, 1997, http://paulbourke.net/geometry/circlesphere/). We then need to determine whether the point(s) of intersection of the two circles belong(s) to the circular arcs A and B; if an intersection point belongs to both arcs, we use it as the circular arc intersection point.

In order to calculate whether an intersection point $(i_x, i_y)$ of the two circles belongs to one of the arcs, e.g. arc A, we compute the angles $\theta_{As}$, $\theta_{Ae}$ and $\theta_{Ai}$. $\theta_{As}$ is the angle formed by the x-axis and the line segment defined by the start of the arc $(x_1, y_1)$ and the center of the circle $(c_{Ax}, c_{Ay})$. Similarly, $\theta_{Ae}$ is the angle formed by the x-axis and the line segment defined by the end of the arc $(x_2, y_2)$ and the center of the circle $(c_{Ax}, c_{Ay})$. Finally, $\theta_{Ai}$ is the angle formed by the x-axis and the line segment defined by the intersection point $(i_x, i_y)$ and the center of the circle $(c_{Ax}, c_{Ay})$. If $\theta_{As} \le \theta_{Ai} \le \theta_{Ae}$, then the intersection point $(i_x, i_y)$ lies within the circular arc A.

In order to calculate the angle between the two intersecting arcs, we calculate the slopes $(m_A, m_B)$ of the tangents to the arcs at the intersection point using the formulas:

$$m_A = -\cot(\theta_{Ai}), \quad m_B = -\cot(\theta_{Bi}) \tag{2}$$

where $\theta_{Bi}$ is the corresponding angle for arc B. Then we calculate the angle between the two slopes:

$$\theta_{AB} = \arctan\left(\frac{m_A - m_B}{1 + m_A \cdot m_B}\right) \tag{3}$$

Finally, the angle between the two arcs A and B is defined as $\min(\theta_{AB}, 180^\circ - \theta_{AB})$ (see Fig. 2).

3 Results

We ran a simulation, minimizing the number of intersecting lines and maximizing the crossing angle resolution, in order to determine the best curvature magnitude of the circular arc. We show the results for one particular sample that has several crossing junctions as well as several overlapping junctions.

In Figure 3 (and in Supplementary Figs S9, S13, S17, S21, S25, S29, S33, S37, S41, S45), we show the sample without any optimization. In this state, it has 76 crossing lines and the crossing angle resolution is 25.46°. It is clear that the intra-chromosomal junctions on chromosomes 6 & 7 are not at all discernible. It is also very hard to determine the inter-chromosomal junctions between chromosomes 6 & 7. There is an additional ambiguity between chromosomes 1 & 6 and 1 & 12, as the junction lines overlap because they have the same slope.

Fig. 3. The sample junctions depicted as magenta lines
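Crossing counts and crossing angles of this kind rest on the arc intersection computation of Section 2.1, which can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each arc is given by its two endpoints and a radius (a circle through both endpoints exists only when the radius is at least half the chord length), the center is assumed to lie to the left of the direction from start to end, and the arc is assumed to be the minor arc. The arc membership test compares angles modulo 2π, and the crossing angle is obtained from the radius directions rather than from the slopes of Equation (2), which gives the same acute angle while avoiding infinite slopes. All function names are illustrative.

```python
import math

TWO_PI = 2 * math.pi


def arc_center(p1, p2, r):
    """Center of the circle of radius r through p1 and p2.

    Assumption: of the two possible centers, the one to the left of the
    direction p1 -> p2 is chosen (the text does not say which side is used).
    """
    (x1, y1), (x2, y2) = p1, p2
    d = math.hypot(x2 - x1, y2 - y1)               # chord length
    if d == 0 or r < d / 2:
        raise ValueError("no circle of radius r passes through both endpoints")
    h = math.sqrt(max(r * r - (d / 2) ** 2, 0.0))  # Pythagorean theorem
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2          # chord midpoint
    return (mx - h * (y2 - y1) / d, my + h * (x2 - x1) / d)


def circle_intersections(c0, r0, c1, r1):
    """Intersection points of two circles (construction described by Bourke, 1997)."""
    (x0, y0), (x1, y1) = c0, c1
    d = math.hypot(x1 - x0, y1 - y0)
    if d == 0 or d > r0 + r1 or d < abs(r0 - r1):
        return []  # coincident centers, separate circles or nested circles
    a = (r0 * r0 - r1 * r1 + d * d) / (2 * d)
    h = math.sqrt(max(r0 * r0 - a * a, 0.0))
    px, py = x0 + a * (x1 - x0) / d, y0 + a * (y1 - y0) / d
    ox, oy = h * (y1 - y0) / d, h * (x1 - x0) / d
    return [(px + ox, py - oy), (px - ox, py + oy)]


def angle_at(center, p):
    """Angle between the x-axis and the segment from center to p."""
    return math.atan2(p[1] - center[1], p[0] - center[0])


def on_arc(pt, center, p1, p2, eps=1e-9):
    """True if pt (already on the circle) lies on the minor arc from p1 to p2.

    Wrap-around-safe form of the test theta_s <= theta_i <= theta_e.
    """
    start = angle_at(center, p1)
    sweep = (angle_at(center, p2) - start) % TWO_PI
    delta = (angle_at(center, pt) - start) % TWO_PI
    return delta <= sweep + eps or delta >= TWO_PI - eps


def crossing_angle(pt, center_a, center_b):
    """Acute angle in degrees between the tangents of the two circles at pt.

    Each tangent is perpendicular to its radius, so the angle between the
    tangents equals the angle between the radii.
    """
    diff = abs(angle_at(center_a, pt) - angle_at(center_b, pt)) % math.pi
    return math.degrees(min(diff, math.pi - diff))


def arc_crossings(arc_a, arc_b):
    """Crossings of two arcs, each given as (start, end, radius).

    Returns a list of (point, angle_in_degrees) pairs.
    """
    (a1, a2, ra), (b1, b2, rb) = arc_a, arc_b
    ca, cb = arc_center(a1, a2, ra), arc_center(b1, b2, rb)
    return [(pt, crossing_angle(pt, ca, cb))
            for pt in circle_intersections(ca, ra, cb, rb)
            if on_arc(pt, ca, a1, a2) and on_arc(pt, cb, b1, b2)]
```

For example, calling `arc_crossings` on a semicircular arc from (-1, 0) to (1, 0) with radius 1 and a quarter arc of the unit circle centered at (1, -1) yields a single crossing near (0, -1), where the two arcs meet at a right angle. Counting such crossings over all junction pairs, and collecting their angles, gives the two quantities used as readability metrics in the text.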
In Figure 4 (and in Supplementary Figs S10, S14, S18, S22, S26, S30, S34, S38, S42, S46), we show the sample junctions as magenta circular arcs after setting the curvature magnitude to the optimized value of 0.28 (as determined by the simulation; see Fig. 5), for which the number of crossing lines is 64 and the crossing angle resolution is 36.44°.

Fig. 4. The sample junctions depicted as magenta circular arcs after optimizing the magnitude of the curvature

Fig. 5. Curvature magnitude simulation plot

In Figure 5 (and in Supplementary Figs S11, S15, S19, S23, S27, S31, S35, S39, S43, S47), we show a plot of the optimization score for all curvature magnitudes that we ran. The highest score is selected and used for the Genome U-Plot (see Fig. 4).

In Figure 6 (and in Supplementary Figs S12, S16, S20, S24, S28, S32, S36, S40, S44, S48), we show the circular plot for the same sample, which has 407 crossing circular arcs and a crossing angle resolution of 13.1°. It is clear that the interactions between chromosomes 6 & 7 are not at all discernible. In this case, the junctions between chromosomes 1 & 6 and 1 & 12 are actually discernible, but they are only a small proportion of the total, most of which are not distinct.

Fig. 6. Corresponding circular plot using CircosJS (Girault, 2015, https://github.com/nicgirault/circosJS)

Then we ran the same simulation for many samples.
From a pool of about 3000 samples, we randomly picked 45 samples, with a bias towards samples that had a significant number of chromosomal rearrangements. In Figure 7, we show a plot (in log scale) of the number of circular arc crossings on the Genome U-Plot versus the number of circular arc crossings on the circular plot. We observe that for any given sample, there are approximately four times more circular arc crossings in the circular plot than in the Genome U-Plot.

Fig. 7. Number of circular arc crossings (log10) over multiple samples

In Figure 8, we show a plot of the crossing angle resolution on the Genome U-Plot versus the crossing angle resolution on the circular plot. Again we observe that for any given sample, the crossing angle resolution in the Genome U-Plot is about two times higher than that in the circular plot.

Fig. 8. Crossing angle resolution over multiple samples

4 Discussion

As we have shown, the proposed layout improvements result in optimization of the aesthetic constraints of a graph, leading to a less tangled, more readable graph with a 4-fold improvement in spatial resolution. In the case of the circular layout, the available linear space to lay out the genomic data is the circumference of a circle, of length 2·π·r (where r is the outer radius of the circular layout). On the other hand, the corresponding linear space of the Genome U-Plot is 13 rows, each of length 2·r. Therefore, with our suggested layout of chromosomes and pertinent data elements in a U-shape representation, we attain an improved use of the two-dimensional space by a factor of (13·2·r)/(2·π·r) = 13/π ≈ 4.1.

To elaborate more on the data resolution of circular layouts, given a circle of radius equal to 500 pixels, the circumference of the circle would be about 3141 pixels.
As the entire human genome is about $3 \times 10^9$ base pairs, about $10^6$ genomic positions are condensed into just 1 pixel. As more tracks of data are added at a continuously decreasing circle radius, more and more genomic positions, and therefore data, are compacted into a single pixel, making it possible to view only exaggerated genomic variations.

While circular layouts are visually pleasing, interpretation of the data is cumbersome due to the changing orientations of the layered information. Particularly for gains and losses, at the top of the plot indentations due to copy number variations (see Fig. 6, chromosome 1) are at opposite locations, with respect to the normal values, compared to where they appear at the bottom of the plot (see Fig. 6, chromosome 9). The Genome U-Plot offers left-to-right reading of the data, which is more intuitive and natural and lends itself to zooming operations.

A track-based approach to the presentation of data requires the user to scroll up and down the page to get an entire view of his/her data. The layered design of the Genome U-Plot allows the user to concentrate on a particular genomic position and know that all relevant data will be in proximity and in consistent orientations. Furthermore, as the Genome U-Plot has much more visual space, names of important genes can be presented much more clearly, whereas in circular plots such representations would be much more crowded.

Ideas for future work include a tabular view that will create a connection between the chromosome junctions and the tabular rows; when hovering over a junction in the plot, the corresponding row in the table will be highlighted. Furthermore, multiple gene annotation representations that are DNA strand specific will be possible, like the RefSeq database (Brister et al., 2015; O’Leary et al., 2015; Tatusova et al., 2016) and the Ensembl database (Yates et al., 2016).
Moreover, searching for particular data given a gene name, a cytoband or a genomic position will enhance the functionality of the application. Finally, continuous feedback of the genomic location (i.e. chromosome, coordinates and locus) will keep the user oriented. In cases where data overflow the specific pixel dimensions of the plot, visual cues will be given so that a user can pay more attention to those specific regions.

Funding

This study was partially funded by the Mayo Clinic Center for Individualized Medicine and the Mayo Clinic Genomics Systems Unit.

Conflict of Interest: AG, SHJ, JBS and GV hold stock in WholeGenome LLC, which is a company that produces visualization tools for whole genome analysis and is the maker of the Genomoscope™ application.

References

An J. et al. (2015) J-circos: an interactive circos plotter. Bioinformatics, 31, 1463–1465.
Bennett C. et al. (2007) The aesthetics of graph visualization. Comput. Aesthetics, 57–64.
Bostock M. (2012) D3.js. Data Driven Docum., 492, 701.
Bourke P. (1997) Intersection of two circles. http://paulbourke.net/geometry/circlesphere/ (20 July 2017, date last accessed).
Brister J.R. et al. (2015) NCBI viral genomes resource. Nucleic Acids Res., 43, D571–D577.
Consortium W.W.W. et al. (2000) Scalable vector graphics (svg) 1.1 specification. W3C Candidate Recommendation, 2.
Cui Y. et al. (2016) Biocircos.js: an interactive circos javascript library for biological data visualization on web applications. Bioinformatics, 32, 1740–1742.
Didimo W. et al. (2009) Drawing graphs with right angle crossings. In: Workshop on Algorithms and Data Structures. Springer, pp. 206–217.
Down T.A. et al. (2011) Dalliance: interactive genome viewing on the web. Bioinformatics, 27, 889–890.
Etherington G.J., MacLean D. (2013) Svgenes: a library for rendering genomic features in scalable vector graphic format. Bioinformatics, 29, 1890–1892.
Girault N. (2015) Circosjs. doi: 10.1016/j.electacta.2015.07.153.
Hu Y. et al. (2014) Omiccircos: a simple-to-use r package for the circular visualization of multidimensional omics data. Cancer Informatics, 13, 13.
Krzywinski M. et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res., 19, 1639–1645.
Laird M.R. et al. (2015) Genomed3plot: a library for rich, interactive visualizations of genomic data in web applications. Bioinformatics, 31, 3348–3349.
O’Leary N.A. et al. (2015) Reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–D745.
Pak T.R., Roth F.P. (2013) Chromozoom: a flexible, fluid, web-based genome browser. Bioinformatics, 29, 384–386.
Phanstiel D.H. et al. (2014) Sushi.r: flexible, quantitative and integrative genomic visualizations for publication-quality multi-panel figures. Bioinformatics, 30, 2808–2810.
Purchase H.C. (2002) Metrics for graph drawing aesthetics. J. Vis. Lang. Comput., 13, 501–516.
Robinson J.T. et al. (2011) Integrative genomics viewer. Nat. Biotechnol., 29, 24–26.
Skidmore Z.L. et al. (2016) Genvisr: genomic visualizations in r. Bioinformatics, 32, 3012–3014.
Tatusova T. et al. (2016) NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res., 44, 6614–6624.
Thorvaldsdóttir H. et al. (2013) Integrative genomics viewer (igv): high-performance genomics data visualization and exploration. Brief. Bioinformatics, 14, 178–192.
Vanderkam D. et al. (2016) pileup.js: a javascript library for interactive and in-browser visualization of genomic data. Bioinformatics, 32, 2378–2379.
Weitz E. (2015) Ideogram. https://github.com/eweitz/ideogram (20 July 2017, date last accessed).
Yates A. et al. (2016) Ensembl 2016. Nucleic Acids Res., 44, D710–D716.
Yin T. et al. (2012) ggbio: an r package for extending the grammar of graphics for genomic data. Genome Biol., 13, R77.
Zhang H. et al. (2013) Rcircos: an r package for circos 2d track plots. BMC Bioinformatics, 14, 244.

© The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Bioinformatics, Oxford University Press

Published: Dec 21, 2017
