Full-length single-cell RNA-seq applied to a viral human cancer: applications to HPV expression and splicing analysis in HeLa S3 cells
Wu, Liang; Zhang, Xiaolong; Zhao, Zhikun; Wang, Ling; Li, Bo; Li, Guibo; Dean, Michael; Yu, Qichao; Wang, Yanhui; Lin, Xinxin; Rao, Weijian; Mei, Zhanlong; Li, Yang; Jiang, Runze; Yang, Huan; Li, Fuqiang; Xie, Guoyun; Xu, Liqin; Wu, Kui; Zhang, Jie; Chen, Jianghao; Wang, Ting; Kristiansen, Karsten; Zhang, Xiuqing; Li, Yingrui; Yang, Huanming; Wang, Jian; Hou, Yong; Xu, Xun
doi: 10.1186/s13742-015-0091-4
pmid: 26550473
Background: Viral infection causes multiple forms of human cancer, and HPV infection is the primary factor in cervical carcinomas. Recent single-cell RNA-seq studies highlight the tumor heterogeneity present in most cancers, but virally induced tumors have not been studied. HeLa is a well-characterized HPV+ cervical cancer cell line.
Result: We developed a new high-throughput platform to prepare single-cell RNA on a nanoliter scale based on a customized microwell chip. Using this method, we successfully amplified full-length transcripts of 669 single HeLa S3 cells, 40 of which were randomly selected for single-cell RNA sequencing. Based on these data, we obtained a comprehensive understanding of the heterogeneity of HeLa S3 cells in gene expression, alternative splicing and fusions. Furthermore, we identified a high diversity of HPV-18 expression and splicing at the single-cell level. By co-expression analysis we identified 283 genes co-regulated with E6 and E7, including CDC25, PCNA, PLK4, BUB1B and IRF1, which are known to interact with HPV viral proteins.
Conclusion: Our results reveal the heterogeneity of a virus-infected cell line. This work not only provides a transcriptome characterization of HeLa S3 cells at the single-cell level, but also demonstrates the power of single-cell RNA-seq analysis of virally infected cells and cancers.
Investigation into the annotation of protocol sequencing steps in the sequence read archive
Alnasir, Jamie; Shanahan, Hugh P
doi: 10.1186/s13742-015-0064-7
pmid: 25960871
Background: The workflow for the production of high-throughput sequencing data from nucleic acid samples is complex. A series of protocol steps must be followed in the preparation of samples for next-generation sequencing. The bias introduced by a number of these steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be quantified.
Results: We examined the experimental metadata of the public repository Sequence Read Archive (SRA) to ascertain the level of annotation of important sequencing steps in submissions to the database. Using SQL queries against the SRAdb SQLite database (generated by the Bioconductor consortium), we searched for keywords commonly occurring in key preparatory protocol steps, partitioned over studies. We found that 7.10%, 5.84% and 7.57% of all records had at least one keyword corresponding to fragmentation, ligation and enrichment, respectively. Only 4.06% of all records, partitioned over studies, had keywords for all three steps in the protocol (5.58% of all SRA records).
Conclusions: The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream from this, meta-analyses and comparative studies based on these data will have a source of bias that cannot be quantified at present.
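The keyword search described in this abstract can be illustrated with a minimal sketch using Python's built-in sqlite3 module. Everything specific here is an assumption made for illustration: the keyword lists are hypothetical (the study's actual search terms are not given in the abstract), the table and column names (`experiment`, `library_construction_protocol`) are meant to resemble the SRAdb schema but should be checked against the real database, and the toy rows are invented. The sketch also counts records directly and does not reproduce the per-study partitioning the authors describe.

```python
import sqlite3

# Hypothetical keyword stems per protocol step (not the study's actual terms).
STEP_KEYWORDS = {
    "fragmentation": ["fragment", "shear", "sonicat", "nebuliz"],
    "ligation": ["ligat", "adapter", "adaptor"],
    "enrichment": ["enrich", "pcr", "amplif"],
}


def step_condition(column, keywords):
    """Return a parenthesised OR of LIKE clauses and the matching parameters."""
    clause = " OR ".join(f"LOWER({column}) LIKE ?" for _ in keywords)
    return f"({clause})", [f"%{k}%" for k in keywords]


def annotation_rates(conn, table="experiment",
                     column="library_construction_protocol"):
    """Percentage of records mentioning each step, and all three steps."""
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    rates, clauses, all_params = {}, [], []
    for step, keywords in STEP_KEYWORDS.items():
        clause, params = step_condition(column, keywords)
        hits = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {clause}", params
        ).fetchone()[0]
        rates[step] = 100.0 * hits / total
        clauses.append(clause)
        all_params.extend(params)
    # A record counts for "all_three" only if every step has a keyword hit.
    hits = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {' AND '.join(clauses)}",
        all_params,
    ).fetchone()[0]
    rates["all_three"] = 100.0 * hits / total
    return rates


# Invented toy records standing in for SRA experiment metadata.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment ("
    "experiment_accession TEXT, library_construction_protocol TEXT)"
)
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?)",
    [
        ("TOY1", "DNA was sheared by sonication, adapters were ligated, "
                 "then PCR enriched."),
        ("TOY2", "Fragmented with Covaris."),
        ("TOY3", None),  # missing protocol text never matches a keyword
        ("TOY4", "Standard Illumina protocol."),
    ],
)
rates = annotation_rates(conn)
for step, pct in rates.items():
    print(f"{step}: {pct:.2f}% of records")
```

Records with no protocol text are kept in the denominator, so sparse annotation lowers the reported percentages rather than being silently dropped.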
An image database of Drosophila melanogaster wings for phenomic and biometric analysis
Sonnenschein, Anne; VanderZee, David; Pitchers, William R; Chari, Sudarshan; Dworkin, Ian
doi: 10.1186/s13742-015-0065-6
pmid: 27390931
Background: Extracting important descriptors and features from images of biological specimens is an ongoing challenge. Features are often defined using landmarks and semi-landmarks that are determined a priori based on criteria such as homology or some other measure of biological significance. An alternative, widely used strategy uses computational pattern recognition, in which features are acquired from the image de novo. Subsets of these features are then selected based on objective criteria. Computational pattern recognition has been extensively developed primarily for the classification of samples into groups, whereas landmark methods have been broadly applied to biological inference.
Results: To compare these approaches and to provide a general community resource, we have constructed an image database of Drosophila melanogaster wings - individually identifiable and organized by sex, genotype and replicate imaging system - for the development and testing of measurement and classification tools for biological images. We have used this database to evaluate the relative performance of current classification strategies. Several supervised parametric and nonparametric machine learning algorithms were used on principal components extracted from geometric morphometric shape data (landmarks and semi-landmarks). For comparison, we also classified phenotypes based on de novo features extracted from wing images using several computer vision and pattern recognition methods as implemented in the Bioimage Classification and Annotation Tool (BioCAT).
Conclusions: Because we were able to thoroughly evaluate these strategies using the publicly available Drosophila wing database, we believe that this resource will facilitate the development and testing of new tools for the measurement and classification of complex biological phenotypes.
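To make the landmark-based route concrete, here is a deliberately simplified sketch of supervised classification from landmark coordinates. It is not the authors' pipeline: it substitutes a nearest-centroid classifier for the algorithms they evaluated, and it only removes position and size from each landmark configuration, skipping the rotation alignment of a full Procrustes fit and the principal component step. The labels, coordinates and function names are hypothetical.

```python
import math


def normalize(landmarks):
    """Center (x, y) landmarks and scale to unit centroid size; return flat list."""
    n = len(landmarks)
    cx = sum(x for x, _ in landmarks) / n
    cy = sum(y for _, y in landmarks) / n
    centered = [(x - cx, y - cy) for x, y in landmarks]
    size = math.sqrt(sum(x * x + y * y for x, y in centered))
    return [c / size for point in centered for c in point]


def class_means(training):
    """Mean normalized shape vector per label from (label, landmarks) pairs."""
    sums, counts = {}, {}
    for label, landmarks in training:
        vec = normalize(landmarks)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, c in enumerate(vec):
            acc[i] += c
        counts[label] = counts.get(label, 0) + 1
    return {label: [c / counts[label] for c in acc]
            for label, acc in sums.items()}


def classify(landmarks, means):
    """Assign the label whose mean shape is closest in Euclidean distance."""
    vec = normalize(landmarks)
    return min(means, key=lambda label: math.dist(vec, means[label]))


# Invented three-landmark "wings": one group is wide, the other tall.
training = [
    ("wide", [(0, 0), (4, 0), (2, 1)]),
    ("wide", [(0, 0), (8, 0), (4, 2.2)]),
    ("tall", [(0, 0), (1, 0), (0.5, 3)]),
    ("tall", [(0, 0), (2, 0), (1, 6.5)]),
]
means = class_means(training)

# Queries are shifted and rescaled copies of each shape type.
predicted_wide = classify([(10, 10), (16, 10), (13, 11.4)], means)
predicted_tall = classify([(5, 5), (6, 5), (5.5, 8.1)], means)
print(predicted_wide, predicted_tall)
```

Because each configuration is centered and scaled before comparison, the queries are classified by shape alone even though they sit at different positions and sizes than the training examples.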
A spectrum of sharing: maximization of information content for brain imaging data
Calhoun, Vince
doi: 10.1186/s13742-014-0042-5
pmid: 25653850
Efforts to expand sharing of neuroimaging data have been growing exponentially in recent years. There are several different types of data sharing, which can be considered to fall along a spectrum ranging from simpler and less informative to more complex and more informative. In this paper we consider this spectrum for three domains: data capture, data density, and data analysis. Here the focus is on the right end of the spectrum, that is, how to maximize the information content while addressing the challenges. This review summarizes the associated challenges and possible solutions, including: 1) a discussion of tools to monitor the quality of data as it is collected and to encourage adoption of data mapping standards; 2) sharing of time-series data (not just summary maps or regions); and 3) the use of analytic approaches that maximize sharing potential as much as possible. Examples of existing solutions for each of these points, which we developed in our lab, are also discussed, including the use of a comprehensive beginning-to-end neuroinformatics platform and the use of flexible analytic approaches, such as independent component analysis, and multivariate classification approaches, such as deep learning.
Improving functional magnetic resonance imaging reproducibility
Pernet, Cyril; Poline, Jean-Baptiste
doi: 10.1186/s13742-015-0055-8
pmid: 25830019
Background: The ability to replicate an entire experiment is crucial to the scientific method. With the development of more and more complex paradigms, and the variety of analysis techniques available, fMRI studies are becoming harder to reproduce.
Results: In this article, we aim to provide practical advice to fMRI researchers not versed in computing, in order to make studies more reproducible. All of these steps require researchers to move towards a more open science, in which all aspects of the experimental method are documented and shared.
Conclusion: Only by sharing experiments, data, metadata, derived data and analysis workflows will neuroimaging establish itself as a true data science.