Systematic comparative study of computational methods for T-cell receptor sequencing data analysis

Systematic comparative study of computational methods for T-cell receptor sequencing data analysis Abstract High-throughput sequencing technologies have exposed the possibilities for the in-depth evaluation of T-cell receptor (TCR) repertoires. These studies are highly relevant to gain insights into human adaptive immunity and to decipher the composition and diversity of antigen receptors in physiological and disease conditions. The major objective of TCR sequencing data analysis is the identification of V, D and J gene segments, complementarity-determining region 3 (CDR3) sequence extraction and clonality analysis. With the advancement in sequencing technologies, new TCR analysis approaches and programs have been developed. However, there is still a deficit of systematic comparative studies to assist in the selection of an optimal analysis approach. Here, we present a detailed comparison of 10 state-of-the-art TCR analysis tools on samples with different complexities by taking into account many aspects such as clonotype detection [unique V(D)J combination], CDR3 identification or accuracy in error correction. We used our in silico and experimental data sets with known clonalities enabling the identification of potential tool biases. We also established a new strategy, named clonal plane, which allows quantifying and comparing the clonality of multiple samples. Our results provide new insights into the effect of method selection on analysis results, and it will assist users in the selection of an appropriate analysis method. T-cell receptor, high-throughput sequencing, computational analysis, immunology, immunosequencing, benchmarking Introduction The T-cell receptor (TCR), present on the surface of T lymphocytes, is responsible for the recognition of antigens presented within the major histocompatibility complex. Therefore, the TCR population must reflect the high diversity of the antigenic epitopes that will be displayed during a subject’s lifetime. The complementarity-determining region 3 (CDR3) accounts for most of this diversity, as it constitutes the site of V(D)J recombination [1, 2]. In addition to the recombination of the V, D and J gene segments, the random insertion and deletion of nucleotides in junctions further contribute to the diversity in the immune repertoire [1–5]. Consequently, the investigation of TCR diversity and clonality, specifically CDR3 analysis, is critical for unraveling the complexity of the adaptive immune system. Besides, changes in the clonality of specific T cells also provide significant information in certain pathological conditions [6–8]. With the advent of next-generation sequencing, the vast amount of immunogenomics data allows qualitative and quantitative assessment of lymphocyte repertoire and its dynamics at a higher resolution. These studies have implications for medical and clinical research including autoimmune diseases, malignancies, vaccination and transplantation processes [5, 9–11]. In the past decade, many strategies were released to analyze TCR sequence data, from assigning raw reads to the correct V(D)J regions (primary tools class), determining the clinical meaning of that precise CDR3 region (secondary tools class) [12–14]. Recently, Heather et al. published a paper focused on a generalized review of TCR analysis tools belonging to both categories. The primary objective was to provide a brief outline of the steps involved in TCR analysis, as well as the challenges associated with the study [15]. As a further step toward comprehensive TCR analysis, the present work focuses on the dissection of tools belonging to the first class providing a systematic benchmarking of available analysis algorithms. Thorough and comprehensive guidelines allowing general users to choose a reliable method for analysis are still needed. We performed an independent systematic comparison between 10 published tools, MiTCR [16], MiGEC/MiXCR [17, 18], LymAnalyzer [19], IMSEQ [20], TCRklass [21], IMonitor [22], RTCR [23], Vidjil [24], Decombinator [25] and Trig [26]. Other relevant immunosequencing analysis tools V-QUEST/High V-QUEST [27] and IgBlast [28] are also available. The first is a Web-based tool with restrictions on the amount of input data and is not suitable for large-scale and Illumina-based sequencing data analysis. IgBlast is available as a web/stand-alone version, but it is also computationally expensive for large data sets. Additionally, these tools require external postprocessing to obtain clustered CDR3 clonotype outputs. However, we have tested one of these two tools that is IMGT/High V-Quest, on small subsets of the data sets. In our benchmarking analysis, we used the clonality of controlled experimental samples to compare the performances of the methods under investigation. Numerous diversity measures are used to quantify the repertoire clonality of TCR samples (Supplementary Section S1). The concept of diversity was founded in ecological studies, where richness and evenness are used for estimating the clonality of TCR repertoires. Ecological diversity could be schematized as a function of the number of species present at a particular site (richness) and the relative distribution of individuals (evenness) among these species [29–31]. This focus either on the evenness or the richness may lead to misleading conclusions. Shannon–Wiener [32] and Simpson’s [33] indices are based both on richness and evenness, but their sensitivity varies toward rare species. Shannon–Wiener index is more biased toward richness, while Simpson’s index highlights the evenness as the predominant component of diversity [34, 35]. Hulbert showed this opposite trend on an ideal data set and Peet [36] described the reason for the divergence between these two indices. Tóthmérész [37] suggested representing the diversity as a profile of a collection of indices such as the Rényi index family (Rényi entropies). This family is defined by the following equation:   Hα=11−αlog ∑i=1Spiα, Where α ∈[0,+∞], S is the number of species and pi is the probability of the species i. It is easy to prove that Shannon–Wiener index = Hα=1 = H1  and Simpson’s index = Hα=2 = H2. An intersection of the Rényi profiles of two samples within the interval α∈[1, 2] indicates that Sample 1 has a larger diversity as Sample 2 using Shannon and the opposite using Simpson, or vice versa. The significant monotonic property of Rényi entropies can be directed to compare the diversity of different sites. When α=0 is set in the Renyi equation, the formula can be simplified to H0 =log S, obtaining the GR of the sample. With α=+∞, the equation collapse to H∞=-log p' and generalizing the evenness (GE; p' is the maximal proportional abundance among all the species; see the Supplementary Section S1 for details). These two values allow defining an area, the Clonal Plane (Supplementary Figure S1), an area to plot the GR (H0) versus the GE (H∞) of one or more samples. It is remarkable to observe that: (1) only combinations of (GR, GE) where GR≥GE are possible; (2) closer is GE to GR, more even is the sample; (3) if GE ≫ GR, the sample contains a dominant species and the sample is closer to monoclonality (Supplementary Sections S2 and S3). Visually, this translates to the observation of the position of the sample points with respect to the bisector of the first quadrant (maximal theoretical polyclonality) and the x-axis (minimal theoretical monoclonality). Plotting the two extreme Rényi entropies of a sample set in the clonal plane permits to estimate and compare the samples clonality examining at the same time the generalized richness (GR), the generalized evenness and the distance from the maximal theoretical polyclonality/monoclonality (Supplementary Sections S3 and S4). Methods Tools for immunosequencing data analysis For comprehensive details, the user is referred to the original publication and documentation. MiTCR maps the query sequence to a reference by first searching seed n-mers in the query followed by an extension of the alignment in both directions in the sequence. It detects V, D and J segments and performs clonotyping with polymerase chain reaction (PCR) and sequencing error correction. It provides several adjustable parameters for error correction along with quality consideration and also rescues low-quality sequencing reads. MiGEC method is designed for data containing unique molecular identifiers (UMIs). It uses a seed-and-extend algorithm combined with a Smith–Waterman [38] strategy and a modified BLAST version [39]. MiXCR implements a particular version of K-mer algorithm referred to as KAligner [40]. In the case of paired-end reads, it combines pairing information and takes into account indels as well as mismatched information. It identifies V, D, J and C gene segments, performs clonotyping and retrieves information from low-quality data. In the clustering step, a multilayer heuristic approach is used and the resultant clonal hits are aligned by the Smith–Waterman algorithm. MiXCR can be used, along with MiGEC to analyze UMI-based data. LymAnalyzer allows novel single-nucleotide polymorphism and lineage mutation trees analysis for somatic hypermutation detection. It incorporates a fast-tag-searching algorithm for mapping the input sequences to reference gene segments, followed by extraction of CDR3 region. Obtained sequences are classified and clustered to obtain frequency per clonotype. It does not provide configurable parameters for quality control. IMSEQ generates short substrings from the reference gene segments corresponding to the conserved residues Phe/Cys and maps them to the query sequence. It processes selected reads by SWIFT filters [41] and Myers bit-vector algorithm (MBVA) [42], followed by CDR3 identification. In the repertoire generation step, the clonotypes are clustered to correct PCR and sequencing errors followed by identification of ambiguous V or J segments. TCRklass incorporates an algorithm based on k-string to identify CDR3. Initially, it generates two sub-data sets (assembled and unassembled) from the input data. The indexed reference database is required, and the position of the reference k-strings is used to find the CRD3 region in the input data. High- and low-quality reads are then processed in subsequent steps followed by evaluation of PCR, sequencing and mRNA synthesis errors. IMonitor performs trimming and quality control at the first step. Sequences are then aligned to reference by BLAST [43] followed by a second alignment step to identify corresponding genes with higher accuracy. In subsequent steps, the CDR3 region of the query sequence is identified and PCR or sequencing error correction is performed, followed by generation of statistical and graphical output. It allows only FASTA input format for single-end data. RTCR performs mapping of the germline V and J segments with the raw TCR data sequencing reads using Bowtie2 [44]. It identifies and extracts CDR3 regions from sequences and annotate them accordingly with aligned V and J segments. To deal with sequencing and PCR errors, it implements a statistical model to estimate the sequences with errors and percentage of errors in each sequence. RTCR allows analysis with and without UMI containing sequences. Vidjil performs a VDJ recombination analysis by using a seed-based method. This algorithm allows a time-efficient analysis by performing clustering on the V(D)J junction (of 50 bp sequence) in the first stage. It is followed by the assignment of particular V, D and J regions. It further allows downstream analysis including diversity estimation and visualization. Decombinator implements a modified version of Aho–Corasick pattern matching algorithm for detection of V and J segments. It searches for the substrings of V/J regions referred as ‘tag’ (i.e. the nucleotide sequence of 20 bp) by an exhaustive search approach to identify specific gene segments. However, this older version of Decombinator is not updated, and a new unpublished version v3 (used in our study) is available that has been specifically designed to work with UMI-based data mainly. Trig tool performs detection of proper and aberrant TCR and B-cell receptor (BCR) sequences. Trig performs mapping of sequencing reads to complete reference sequence and implements a heuristic strategy to distinguish ambiguous and nonambiguous regions. It allows FASTA input file format only. In silico data sets In silico data sets were generated to perform the comparative assessment of selected methods. Data sets were produced with four different clonality values and varying error rates (Table 1). The first data set (DA) consists of 100% Clone 1, whereas other three data sets (DB, DC and DD) comprised 10, 1 and 0.1% Clone 1, respectively. Clone 2 is present in a constant amount (2%) in each of these three samples, whereas the rest of the reads in each data set is composed of mixed TCR sequences with a uniform distribution. To generate several TCR combinations, we used only the first module of IgSimulator [45], creating the initial base sequence file of equally probable gene segments. It comprised recombinant VDJ strings, where somatic deletions of gene segments (start and end of V and J region, respectively) are simulated along with the addition of random nucleotides at junction regions within gene segments. Reference VDJ human TCR sequences were obtained from IMGT database [46]. These strings were then used to generate preliminary data sets with a different clonal composition of Clones 1 and 2. The read simulator wgsim [47] was used to obtain the final set of synthetic reads. We used three levels of error profiles, a zero, a low and a high error rate. To estimate our lower bound for error frequency, we also evaluated the expected amount of errors that were introduced only by the PCR step (Supplementary Section S5). The calculated error frequency (0.02%) was then scaled, arbitrarily, to 0.1% (one error in 1000 bases) to define the final low error rate. A frequency error rate >10 times (1%), defines the high error rate. We found that errors outside this range fall into the error rate regions that are outside the dynamic range of most of the tools. The number of potential clones present in all DA, DB, DC and DD was 1, 4402, 4852 and 4897, respectively. It is highly important to clarify that IgSimulator produces more nonfunctional than functional sequences. For this reason, the number of functional clones, the true richness of the samples, is expected to be smaller. Each data set among these 12 data sets comprised total 1 000 000 single-end reads with a length of 250 bp. These data sets are available at PowerFolder (https://powerfolder.dkfz.de/dl/fiCka39jmibXokiJyVUFkjBB/InSilicoSamples_BIB.zip). Table 1. Twelve in silico data sets were generated with different clonality composition and error rates (0, 0.001 and 0.01) Data sets  Sequences with error rate 1 (0%)  Sequences with error rate 2 (0.1%) or 0.001  Sequences with error rate 3 (1%) or 0.01  Data set 1  Data set 2  Data set 3  Data set A (100%)  DA1: 1 000 000 Clone 1  DA2: 1 000 000 Clone 1  DA3: 1 000 000 Clone 1  Data set B (10%+2%+Poly)  DB1: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB2: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB3: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  Data set C (1%+2%+Poly)  DC1: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC2: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC3: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  Data set D (0.1%+2%+Poly)  DD1: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD2: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD3: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  Data sets  Sequences with error rate 1 (0%)  Sequences with error rate 2 (0.1%) or 0.001  Sequences with error rate 3 (1%) or 0.01  Data set 1  Data set 2  Data set 3  Data set A (100%)  DA1: 1 000 000 Clone 1  DA2: 1 000 000 Clone 1  DA3: 1 000 000 Clone 1  Data set B (10%+2%+Poly)  DB1: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB2: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB3: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  Data set C (1%+2%+Poly)  DC1: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC2: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC3: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  Data set D (0.1%+2%+Poly)  DD1: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD2: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD3: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  Note: Data sets A, B, C and D comprised 100, 10, 1 and 0.1% Clone 1, respectively, and 2% Clone 2 was present in all data sets except DA. Table 1. Twelve in silico data sets were generated with different clonality composition and error rates (0, 0.001 and 0.01) Data sets  Sequences with error rate 1 (0%)  Sequences with error rate 2 (0.1%) or 0.001  Sequences with error rate 3 (1%) or 0.01  Data set 1  Data set 2  Data set 3  Data set A (100%)  DA1: 1 000 000 Clone 1  DA2: 1 000 000 Clone 1  DA3: 1 000 000 Clone 1  Data set B (10%+2%+Poly)  DB1: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB2: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB3: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  Data set C (1%+2%+Poly)  DC1: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC2: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC3: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  Data set D (0.1%+2%+Poly)  DD1: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD2: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD3: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  Data sets  Sequences with error rate 1 (0%)  Sequences with error rate 2 (0.1%) or 0.001  Sequences with error rate 3 (1%) or 0.01  Data set 1  Data set 2  Data set 3  Data set A (100%)  DA1: 1 000 000 Clone 1  DA2: 1 000 000 Clone 1  DA3: 1 000 000 Clone 1  Data set B (10%+2%+Poly)  DB1: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB2: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB3: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  Data set C (1%+2%+Poly)  DC1: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC2: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC3: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  Data set D (0.1%+2%+Poly)  DD1: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD2: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD3: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  Note: Data sets A, B, C and D comprised 100, 10, 1 and 0.1% Clone 1, respectively, and 2% Clone 2 was present in all data sets except DA. To estimate the effect of different fragment lengths, three distinct in silico data sets were generated. The data set B with 0% error rate was selected as template and reproduced with varying fragment lengths, i.e. 75, 100 and 150 bp each comprising 1 000 000 reads. It is worthwhile to mention that in silico data set file and quality formats were adjusted according to the requirements of each tool, where needed. In this study, all the tools have been used by using default settings in all cases. However, for data sets with highest error rates, the quality values were adjusted to zero, wherever possible. In certain tools, as MiTCR, MiXCR, RTCR, LymAnalyzer, Decombinator and Vidjil, it is possible to estimate the number of clustered clonotypes before and after removal of out-of-frame and stop codon-containing sequences. For tools like IMSEQ and TCRklass, the generated clustered result files containing VJ segment, CDR3 nucleotide sequence and sequence count information do not contain sequences with out-of-frame and stop codon. In the case of IMonitor, the clustered nucleotide sequence file with abundance information does not filter out-of-frame and stop codon sequences. Trig tool does not produce grouped clonotype result files and do not generate CDR3 amino acid-translated sequences. Where possible the statistics were composed by using filtered files (without out-of-frame/stop codon sequences). The expected percentages of Clones 1 and 2 in all samples were corrected for the total number of sequences that are removed by the tools because of producing nonfunctional CDR3 regions. The correction factor applied is given by the total number of reads divided by the number of refined reads (in-frame) detected per tool (except IMonitor and Trig, where the total number of reads were divided by the number of identified reads). In case of Decombinator tool, we have used two main modules for non-UMI-based evaluation. The nucleotide/amino acid sequence information present in the clustered output file was used to estimate the clonal composition. IMonitor outputs two clustered clonotype files: one only based on CDR3 nucleotide sequence with sequence count and other VJ pairing information with sequence count. The first sequence clustered file has been used in this study. In the case of Trig, no clustered file is generated; therefore, for analysis purpose, the output file containing nucleotide sequence and VJ information was manually clustered based on the sequences. The principal aim of this article is to compare the performance of high-throughput sequencing analysis tools that are capable of large-scale time-efficient analysis. The IMGT/High V-Quest cannot practically analyze data sets consisting each of 1.0 M reads. However, to estimate the TCR discrimination performance, we have generated random subsets of our data sets (DB1, DB2 and DB3) comprising 10 000 reads each and evaluated the performance measures by IMGT/High V-Quest. All the comparisons (except RTCR) were conducted on Ubuntu 10.04 LTE 64 bit operating system, with the specifications of 48 GB of RAM and a Xeon CPU 2.40 GHz with eight cores. Only the RTCR analysis was performed on an OpenStack virtual instance with the same hardware specification and CentOS Linux release 7.2. RTCR remained halted in the initial alignment step at aforementioned Linux system. Experimental data sets We generated four samples with controlled clonality to verify the ten methods under evaluation. Human T cells were isolated from healthy donor peripheral blood mononuclear cells with the EasySep™ Human T Cell Isolation Kit (StemCell Technologies), and purity was assessed to be >95% by flow cytometry. Total RNA was isolated from T cells, Jurkat, CCRF-CEM and HeLa. RNA quality was assessed with the Agilent Bioanalyzer. Isolated RNAs were mixed as indicated in Table 3, and each mix was used for both Rapid Amplification of cDNA Ends (RACE) PCR, as previously described with minor modifications [48], and for quantitative PCR (qPCR). Table 3. Four experimental samples generated with different clonalities by varying Jurkat and CCRF-CEM percentages Samples  Jurkat %  CCRF-CEM %  T cells %  Sample 1  10  2  88  Sample 2  1  2  97  Sample 3  0.1  2  97.9  Sample 4  0.01  2  97.99  Samples  Jurkat %  CCRF-CEM %  T cells %  Sample 1  10  2  88  Sample 2  1  2  97  Sample 3  0.1  2  97.9  Sample 4  0.01  2  97.99  Note: Jurkat percentage was decreased from Samples 1 to 4, whereas CCRF-CEM was kept constant. Table 3. Four experimental samples generated with different clonalities by varying Jurkat and CCRF-CEM percentages Samples  Jurkat %  CCRF-CEM %  T cells %  Sample 1  10  2  88  Sample 2  1  2  97  Sample 3  0.1  2  97.9  Sample 4  0.01  2  97.99  Samples  Jurkat %  CCRF-CEM %  T cells %  Sample 1  10  2  88  Sample 2  1  2  97  Sample 3  0.1  2  97.9  Sample 4  0.01  2  97.99  Note: Jurkat percentage was decreased from Samples 1 to 4, whereas CCRF-CEM was kept constant. Table 4. Subjective performance scores of each tool on in silico data sets are presented based on the clonal percentage detection Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMonitor  IMSEQ  TCRklass  Vidjil  Decombinator  Trig  DA1 in silico  10  10  10  10  10  10  10  10  10  –  DA2 in silico  9  10  8  5  3  3  5  7  6  –  DA3 in silico  8  9  10  5  3  3  5  7  6  –  DB1 in silico  10  10  10  10  10  1  10  10  2  10  DB2 in silico  10  10  10  6  6  1  6  7  2  6  DB3 in silico  9  8  10  6  6  1  6  7  6  6  DC1 in silico  10  10  10  10  10  1  10  10  2  10  DC2 in silico  10  10  10  6  6  1  6  10  2  6  DC3 in silico  8  7  10  6  6  1  6  10  6  6  DD1 in silico  10  10  10  10  10  1  10  10  2  10  DD2 in silico  10  10  10  6  6  1  6  10  6  6  DD3 in silico  8  7  10  6  6  1  6  10  6  6  Performance score  112/120  111/120  118/120  86/120  82/120  25/120  86/120  108/120  56/120  66/120  Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMonitor  IMSEQ  TCRklass  Vidjil  Decombinator  Trig  DA1 in silico  10  10  10  10  10  10  10  10  10  –  DA2 in silico  9  10  8  5  3  3  5  7  6  –  DA3 in silico  8  9  10  5  3  3  5  7  6  –  DB1 in silico  10  10  10  10  10  1  10  10  2  10  DB2 in silico  10  10  10  6  6  1  6  7  2  6  DB3 in silico  9  8  10  6  6  1  6  7  6  6  DC1 in silico  10  10  10  10  10  1  10  10  2  10  DC2 in silico  10  10  10  6  6  1  6  10  2  6  DC3 in silico  8  7  10  6  6  1  6  10  6  6  DD1 in silico  10  10  10  10  10  1  10  10  2  10  DD2 in silico  10  10  10  6  6  1  6  10  6  6  DD3 in silico  8  7  10  6  6  1  6  10  6  6  Performance score  112/120  111/120  118/120  86/120  82/120  25/120  86/120  108/120  56/120  66/120  Note: RTCR, MiXCR, Vidjil and MiTCR obtained highest score. Table 4. Subjective performance scores of each tool on in silico data sets are presented based on the clonal percentage detection Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMonitor  IMSEQ  TCRklass  Vidjil  Decombinator  Trig  DA1 in silico  10  10  10  10  10  10  10  10  10  –  DA2 in silico  9  10  8  5  3  3  5  7  6  –  DA3 in silico  8  9  10  5  3  3  5  7  6  –  DB1 in silico  10  10  10  10  10  1  10  10  2  10  DB2 in silico  10  10  10  6  6  1  6  7  2  6  DB3 in silico  9  8  10  6  6  1  6  7  6  6  DC1 in silico  10  10  10  10  10  1  10  10  2  10  DC2 in silico  10  10  10  6  6  1  6  10  2  6  DC3 in silico  8  7  10  6  6  1  6  10  6  6  DD1 in silico  10  10  10  10  10  1  10  10  2  10  DD2 in silico  10  10  10  6  6  1  6  10  6  6  DD3 in silico  8  7  10  6  6  1  6  10  6  6  Performance score  112/120  111/120  118/120  86/120  82/120  25/120  86/120  108/120  56/120  66/120  Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMonitor  IMSEQ  TCRklass  Vidjil  Decombinator  Trig  DA1 in silico  10  10  10  10  10  10  10  10  10  –  DA2 in silico  9  10  8  5  3  3  5  7  6  –  DA3 in silico  8  9  10  5  3  3  5  7  6  –  DB1 in silico  10  10  10  10  10  1  10  10  2  10  DB2 in silico  10  10  10  6  6  1  6  7  2  6  DB3 in silico  9  8  10  6  6  1  6  7  6  6  DC1 in silico  10  10  10  10  10  1  10  10  2  10  DC2 in silico  10  10  10  6  6  1  6  10  2  6  DC3 in silico  8  7  10  6  6  1  6  10  6  6  DD1 in silico  10  10  10  10  10  1  10  10  2  10  DD2 in silico  10  10  10  6  6  1  6  10  6  6  DD3 in silico  8  7  10  6  6  1  6  10  6  6  Performance score  112/120  111/120  118/120  86/120  82/120  25/120  86/120  108/120  56/120  66/120  Note: RTCR, MiXCR, Vidjil and MiTCR obtained highest score. Briefly, RACE PCR was performed on 250 ng RNA through first-strand complementary DNA (cDNA) synthesis using the SMARTScribe reverse transcriptase (Clontech) by RNA incubation with the primer annealing to TCRbeta constant region at 72 °C for 4 min and at 42 °C for 2 min. The template-switching oligo contained both a 12 bp UMI and a 12 bp sample barcode, and template switch was carried out at 42 °C for 60 min followed by incubation with uracil-DNA glycosylase (5 U/reaction, New England Biolabs) at 37 °C for 40 min. Two nested 18 cycles of exponential PCRs were subsequently performed using 5 μl of the single-stranded cDNA or 2 μl of the first exponential PCR, respectively. In the second exponential PCR, barcoded primers were added to enable library preparation. Libraries were purified with AMPure beads and pooled for 400+100 bp paired-end sequencing in the MiSeq platform. For qPCR, cDNA was synthesized from 250 ng RNA using the ProtoScript® II First Strand cDNA Synthesis Kit (NEB) and the NEBNext® Second Strand Synthesis Buffer and Enzyme Mix (NEB) according to manufacturer’s instructions. Real-time amplification was performed with 50 ng cDNA as input in a LightCycler ® 480 Instrument (Roche) using the LightCycler ® 480 SYBR Green I Master. Relative quantification of CCRF-CEM and Jurkat CDR3 and TRBC has been performed using glyceraldehyde 3-phosphate dehydrogenase as housekeeping gene. Primer sequences are listed in Supplementary Table S2. Scoring system The approach in the ranking of 10 tools is based on a simple scoring model composed of two classes of features. One class contains general features and includes installation, usage, output, BCR analysis, paired/single end, quality, error correction and the reference genome flexibility. The second class scores the tools’ performances on our 12 in silico data sets. We have not prioritized any particular feature. For each characteristic, a numeric score s is assigned to the ten software by the performance or presence/absence of the feature. The score s is a numeric value comprised from 0 to 10, where 10 is the best and 0 is the worst value. If n tools have the same performance, the score assigned is the s, and the next score assigned is then s-n. A method that misses a feature or does not return a result is marked with 0. The resulting scores are then added up. The tools are arranged from the highest to the lowest value. Clonal plane A set of R functions were developed to compute and visualize the clonal values according to the Renyi entropies definitions. The code, sufficient to generate the eight clonal planes in this manuscript, and the four in silico data sets are provided as a Supplementary Material. We used an RStudio playbook as the container for the code along with descriptions of its usage (https://github.com/G100DKFZ/ManuscriptNote). Results and discussion General features We have compared the general features for the ten selected tools, which mainly include the installation, usage, customization, reference genome flexibility and generated output to provide a brief overview (Table 2). The installation of MiTCR, MiXCR, LymAnalyzer and Vidjil requires Java. RTCR installation depends on a few Python packages and Bowtie2. IMonitor depends on BLAST (blastall routine) for alignment; however, it is provided by authors within the Linux package. For IMonitor, indexed nucleotide reference sequence files are needed, which requires alignment to find the start position of CDR3 region in V and end of CDR3 in J reference followed by indexing. TCRklass installation also depends on some external programs and libraries, and its use requires index reference database for nucleotide and amino acid V and J reference sequences. An alignment is needed for amino acid references to locate the position of the conserved residue ‘C’ in V and ‘F’ or ‘W’ in J followed by indexing. Installation of Decombinator depends on certain external dependencies, and individual modules need to be executed for stepwise analysis. Trig pipeline depends on an external alignment tool MUMMER. Table 2. General features of 10 tools under investigation are represented Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMSEQ  TCRklass  IMonitor  Vidjil  Decombinator  Trig  Version  1.0.3  2.1.2  1.0  1.2.1  1.1.0  0.7.2  2.0  0.03  3.1 (Unpublished)  1.0  Year published  2013  2015  2016  2015  2015  2015  2015  2016  2017  2016  Operating System  Linux/Windows  Linux/Windows  Linux  Linux/Windows  Linux  Linux  Linux  Linux/Web  Linux  Linux  Linux installation and external tools  Java  Java  Bowtie 2, Python packages  Java  C ++, Zlib  HTQC package, CMake  Perl, R, BLAST (included local version)  C ++, Java/Python  Acora, Biopython, Regex, Python-Levenshtein  MUMMER  Ease of usage and customizability  10  10  6  5  8  5  5  8  1  5  Wide range of adjustable features  Wide range of adjustable features  Certain adjustable features  Few adjustable features  Medium range of adjustable features  Preprocess reference, medium range of adjustable features  Preprocess reference, medium range of adjustable features  Medium range of adjustable features  Individual scripts, few adjustable features  Few adjustable features  aOutput  7  10  9  6  6  6  6  9  2  2  Paired-end  No  10 (Yes)  No  No  10 (Yes)  10 (Yes)  10 (Yes)  No  No  No  SE FASTA file only  PE only for UMI data  SE FASTA file only  Quality  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  No  Only for PE  SE FASTA file only  Error correction  10 (Yes)  10 (Yes)  10 (Yes)  3 (Partial)  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  –  UMI allowed  No  10 (Yes, with MiGEC)  10 (Yes)  No  7 (Considers number of characters, not structure)  No  No  No  10 (Yes)  No  (Specific for a protocol in unpublished version)  TCR gamma/delta  Not mentioned  10 (Yes)  10 (Yes)  Not mentioned  Not mentioned  Not mentioned  Not mentioned  10 (Yes)  10 (Yes)  10 (Yes)  BCR  No  10 (Yes)  No  10 (Yes)  10 (Yes)  10 (Yes)Stated that can be used  10 (Yes)  10 (Yes)  No  10 (Yes)  Reference genome flexibility  No  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  –  –  Human/mouse  Human/mouse  Human/mouse  Human, Mouse, etc.  Human or mouse  Human or mouse  Additional features  NA  NA  Basic statistics  SNP, lineage mutation tree  NA  VJ profiling and statistics  Statistics, profiles and plots  Unexpected recombination  NA  Nonregular TCR detection  License  GNU GPL V3  Restricted or commercial  GNU GPLV3.0  Open source  GPL V2 open source  GNU GPL V3.0  Open source  GNU GPLv3 (For research use only)  Open source  GNU GPLv3  Feature score  37/90  90/90  65/90  34/90  71/90  61/90  61/90  57/90  43/90  27/90  Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMSEQ  TCRklass  IMonitor  Vidjil  Decombinator  Trig  Version  1.0.3  2.1.2  1.0  1.2.1  1.1.0  0.7.2  2.0  0.03  3.1 (Unpublished)  1.0  Year published  2013  2015  2016  2015  2015  2015  2015  2016  2017  2016  Operating System  Linux/Windows  Linux/Windows  Linux  Linux/Windows  Linux  Linux  Linux  Linux/Web  Linux  Linux  Linux installation and external tools  Java  Java  Bowtie 2, Python packages  Java  C ++, Zlib  HTQC package, CMake  Perl, R, BLAST (included local version)  C ++, Java/Python  Acora, Biopython, Regex, Python-Levenshtein  MUMMER  Ease of usage and customizability  10  10  6  5  8  5  5  8  1  5  Wide range of adjustable features  Wide range of adjustable features  Certain adjustable features  Few adjustable features  Medium range of adjustable features  Preprocess reference, medium range of adjustable features  Preprocess reference, medium range of adjustable features  Medium range of adjustable features  Individual scripts, few adjustable features  Few adjustable features  aOutput  7  10  9  6  6  6  6  9  2  2  Paired-end  No  10 (Yes)  No  No  10 (Yes)  10 (Yes)  10 (Yes)  No  No  No  SE FASTA file only  PE only for UMI data  SE FASTA file only  Quality  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  No  Only for PE  SE FASTA file only  Error correction  10 (Yes)  10 (Yes)  10 (Yes)  3 (Partial)  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  –  UMI allowed  No  10 (Yes, with MiGEC)  10 (Yes)  No  7 (Considers number of characters, not structure)  No  No  No  10 (Yes)  No  (Specific for a protocol in unpublished version)  TCR gamma/delta  Not mentioned  10 (Yes)  10 (Yes)  Not mentioned  Not mentioned  Not mentioned  Not mentioned  10 (Yes)  10 (Yes)  10 (Yes)  BCR  No  10 (Yes)  No  10 (Yes)  10 (Yes)  10 (Yes)Stated that can be used  10 (Yes)  10 (Yes)  No  10 (Yes)  Reference genome flexibility  No  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  –  –  Human/mouse  Human/mouse  Human/mouse  Human, Mouse, etc.  Human or mouse  Human or mouse  Additional features  NA  NA  Basic statistics  SNP, lineage mutation tree  NA  VJ profiling and statistics  Statistics, profiles and plots  Unexpected recombination  NA  Nonregular TCR detection  License  GNU GPL V3  Restricted or commercial  GNU GPLV3.0  Open source  GPL V2 open source  GNU GPL V3.0  Open source  GNU GPLv3 (For research use only)  Open source  GNU GPLv3  Feature score  37/90  90/90  65/90  34/90  71/90  61/90  61/90  57/90  43/90  27/90  Note: These tools are scored in a subjective manner per feature. a Output is scored based on how detailed information is available per detected clones. – Information not clear/found, SE; Single-end, PE: Paired-end. Table 2. General features of 10 tools under investigation are represented Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMSEQ  TCRklass  IMonitor  Vidjil  Decombinator  Trig  Version  1.0.3  2.1.2  1.0  1.2.1  1.1.0  0.7.2  2.0  0.03  3.1 (Unpublished)  1.0  Year published  2013  2015  2016  2015  2015  2015  2015  2016  2017  2016  Operating System  Linux/Windows  Linux/Windows  Linux  Linux/Windows  Linux  Linux  Linux  Linux/Web  Linux  Linux  Linux installation and external tools  Java  Java  Bowtie 2, Python packages  Java  C ++, Zlib  HTQC package, CMake  Perl, R, BLAST (included local version)  C ++, Java/Python  Acora, Biopython, Regex, Python-Levenshtein  MUMMER  Ease of usage and customizability  10  10  6  5  8  5  5  8  1  5  Wide range of adjustable features  Wide range of adjustable features  Certain adjustable features  Few adjustable features  Medium range of adjustable features  Preprocess reference, medium range of adjustable features  Preprocess reference, medium range of adjustable features  Medium range of adjustable features  Individual scripts, few adjustable features  Few adjustable features  aOutput  7  10  9  6  6  6  6  9  2  2  Paired-end  No  10 (Yes)  No  No  10 (Yes)  10 (Yes)  10 (Yes)  No  No  No  SE FASTA file only  PE only for UMI data  SE FASTA file only  Quality  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  No  Only for PE  SE FASTA file only  Error correction  10 (Yes)  10 (Yes)  10 (Yes)  3 (Partial)  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  –  UMI allowed  No  10 (Yes, with MiGEC)  10 (Yes)  No  7 (Considers number of characters, not structure)  No  No  No  10 (Yes)  No  (Specific for a protocol in unpublished version)  TCR gamma/delta  Not mentioned  10 (Yes)  10 (Yes)  Not mentioned  Not mentioned  Not mentioned  Not mentioned  10 (Yes)  10 (Yes)  10 (Yes)  BCR  No  10 (Yes)  No  10 (Yes)  10 (Yes)  10 (Yes)Stated that can be used  10 (Yes)  10 (Yes)  No  10 (Yes)  Reference genome flexibility  No  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  –  –  Human/mouse  Human/mouse  Human/mouse  Human, Mouse, etc.  Human or mouse  Human or mouse  Additional features  NA  NA  Basic statistics  SNP, lineage mutation tree  NA  VJ profiling and statistics  Statistics, profiles and plots  Unexpected recombination  NA  Nonregular TCR detection  License  GNU GPL V3  Restricted or commercial  GNU GPLV3.0  Open source  GPL V2 open source  GNU GPL V3.0  Open source  GNU GPLv3 (For research use only)  Open source  GNU GPLv3  Feature score  37/90  90/90  65/90  34/90  71/90  61/90  61/90  57/90  43/90  27/90  Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMSEQ  TCRklass  IMonitor  Vidjil  Decombinator  Trig  Version  1.0.3  2.1.2  1.0  1.2.1  1.1.0  0.7.2  2.0  0.03  3.1 (Unpublished)  1.0  Year published  2013  2015  2016  2015  2015  2015  2015  2016  2017  2016  Operating System  Linux/Windows  Linux/Windows  Linux  Linux/Windows  Linux  Linux  Linux  Linux/Web  Linux  Linux  Linux installation and external tools  Java  Java  Bowtie 2, Python packages  Java  C ++, Zlib  HTQC package, CMake  Perl, R, BLAST (included local version)  C ++, Java/Python  Acora, Biopython, Regex, Python-Levenshtein  MUMMER  Ease of usage and customizability  10  10  6  5  8  5  5  8  1  5  Wide range of adjustable features  Wide range of adjustable features  Certain adjustable features  Few adjustable features  Medium range of adjustable features  Preprocess reference, medium range of adjustable features  Preprocess reference, medium range of adjustable features  Medium range of adjustable features  Individual scripts, few adjustable features  Few adjustable features  aOutput  7  10  9  6  6  6  6  9  2  2  Paired-end  No  10 (Yes)  No  No  10 (Yes)  10 (Yes)  10 (Yes)  No  No  No  SE FASTA file only  PE only for UMI data  SE FASTA file only  Quality  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  No  Only for PE  SE FASTA file only  Error correction  10 (Yes)  10 (Yes)  10 (Yes)  3 (Partial)  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  –  UMI allowed  No  10 (Yes, with MiGEC)  10 (Yes)  No  7 (Considers number of characters, not structure)  No  No  No  10 (Yes)  No  (Specific for a protocol in unpublished version)  TCR gamma/delta  Not mentioned  10 (Yes)  10 (Yes)  Not mentioned  Not mentioned  Not mentioned  Not mentioned  10 (Yes)  10 (Yes)  10 (Yes)  BCR  No  10 (Yes)  No  10 (Yes)  10 (Yes)  10 (Yes)Stated that can be used  10 (Yes)  10 (Yes)  No  10 (Yes)  Reference genome flexibility  No  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  –  –  Human/mouse  Human/mouse  Human/mouse  Human, Mouse, etc.  Human or mouse  Human or mouse  Additional features  NA  NA  Basic statistics  SNP, lineage mutation tree  NA  VJ profiling and statistics  Statistics, profiles and plots  Unexpected recombination  NA  Nonregular TCR detection  License  GNU GPL V3  Restricted or commercial  GNU GPLV3.0  Open source  GPL V2 open source  GNU GPL V3.0  Open source  GNU GPLv3 (For research use only)  Open source  GNU GPLv3  Feature score  37/90  90/90  65/90  34/90  71/90  61/90  61/90  57/90  43/90  27/90  Note: These tools are scored in a subjective manner per feature. a Output is scored based on how detailed information is available per detected clones. – Information not clear/found, SE; Single-end, PE: Paired-end. In TCRklass and IMonitor, several independent programs are available to perform either partial or complete analysis. In addition to CDR3 containing sequences, their output also provides detailed V and J profile for CDR3 negative sequences. TCRklass and MiXCR output complete unclustered files that contain read IDs information. MiXCR and RTCR also generate intermediate files. IMonitor outputs clustered clonotypes files only based on CDR3 region; unlike other tools, it does not generate a clustered file containing VJ gene segments and CDR3 information. Vidjil generates clustered CDR3 output files with read IDs information. Trig does not produce clustered result files for clonotypes, and it also does not output information about CDR3 amino acid sequences. The final output of MiXCR followed by Vidjil provides the most detailed information about each clone, as compared with all other tools. Clonal composition analysis The analysis of 12 in silico samples with different clonal composition and error rates was performed with the tools evaluated in this study. To analyze the sensitivity and specificity of each algorithm, the varying clonal composition of two clones in each sample was estimated. The DA data set only contains Clone 1 (100%), and we assessed it at different error rates (DA1: 0, DA2: 0.001 and DA3: 0.01 error rates). Trig did not execute successfully on DA data sets. With a null error rate, Clone 1 percentage was homogeneously detected, but performance of most tools degraded by almost 50% in its detection on increasing error rates (Figure 1A). Nonetheless, MiXCR and RTCR were able to detect the expected clone frequency, despite the error rates assayed in DA2 and DA3. In the case of MiTCR, its performance was homogeneous for DA1 and DA2 and only decreased when analyzing the DA3 data set. Besides, differences in the number of reads detected for Clone 1 were analyzed and a similar trend was observed (Supplementary Figure S2a). RTCR and MiXCR showed more consistency in all error rates and assigned the highest number of reads to Clone 1. Where possible, we also determined the overall percentage of used reads before and after removal of out-of-frame and stop codon reads, and the data correlated with the trend observed in the analysis of read numbers detected for Clone 1 (Supplementary Figure S3a). Figure 1. View largeDownload slide Clones 1 and 2 percentages detected in samples (A) DA with 100% Clone 1, (B) DB with 10% Clone 1 and 2% Clone 2, (C) DC with 1% Clone 1 and 2% Clone 2 and (D) DD with 0.1% Clone 1 and 2% Clone 2. Each plot represents clonal percentage detection at different error rates (0, 0.001 and 0.01). Overall, RTCR followed by MiXCR, MiTCR and Vidjil showed better performance. Black and gray bars represent the expected percentages of Clones 1 and 2, respectively. Figure 1. View largeDownload slide Clones 1 and 2 percentages detected in samples (A) DA with 100% Clone 1, (B) DB with 10% Clone 1 and 2% Clone 2, (C) DC with 1% Clone 1 and 2% Clone 2 and (D) DD with 0.1% Clone 1 and 2% Clone 2. Each plot represents clonal percentage detection at different error rates (0, 0.001 and 0.01). Overall, RTCR followed by MiXCR, MiTCR and Vidjil showed better performance. Black and gray bars represent the expected percentages of Clones 1 and 2, respectively. DB consists of 10% Clone 1 and 2% Clone 2 sequences. The analysis of Clone 1 percentage in DB1 showed a general good performance for all tools. However, IMSEQ yielded values lower (5%) than expected. In case of Clone 2 within DB data sets and in line with DA results, IMSEQ reported 0.99% and Decombinator detected 1.6%. In case of DB2, MiTCR, MiXCR, RTCR and Vidjil exhibited good performance for Clones 1 and 2. A similar behavior was shown by RTCR in DB3 for Clones 1 and 2. About half of the expected percentages were returned by LymAnalyzer, TCRklass, IMonitor and Decombinator in DB3 (Figure 1B). IMSEQ showed analogous deviations in DB2 and DB3 as for DB1. In assigning number of reads for Clones 1 and 2, a similar behavior was observed for all DB samples as was seen in DA for Clone 1 (Supplementary Figure S2b). The percentage of total detected reads is presented in Supplementary Figure S3b. It is worthwhile to mention that we have analyzed a randomly selected subset of 10 000 reads from all DB samples by IMGT/High V-Quest and postprocessed with Change-O tool [49]. The percentage of Clone 1 in DB1, DB2 and DB3 samples was 9.98, 9.86 and 8.54, respectively, whereas the percentage of Clone 2 was 2.3, 2.2 and 2.0, respectively. We have not included any further analysis with this tool and also with IgBlast because of poor time efficiency and additional postprocessing. In case of DC and DD, comprising 1 and 0.1% Clone 1, respectively, and 2% Clone 2, the trend observed was similar to the one retrieved from DB data sets (Figure 1C and D). In the detection of number of reads for Clones 1 and 2 and the overall read numbers, a similar behavior as DB was observed with all DC and DD data sets (Supplementary Figures S2c, S3c, S2d and S3d, respectively). Overall in all 12 samples, RTCR showed better performance followed by MiXCR and MiTCR in the detection of Clones 1 and 2 percentages (Table 4). In general, RTCR maintained a more consistent behavior across all error rates followed by MiXCR, MiTCR and Vidjil. LymAnalyzer, IMonitor, Trig and TCRklass exhibited a highly similar response in detecting the clones across all data sets. Decombinator slightly underestimated the clonal percentages in some samples, whereas IMSEQ showed a large underestimation. Additionally, to understand the effect of altering a particular TCR sequence in clonal percentage detection, we have also generated and analyzed with MiXCR, RTCR and Decombinator an in silico data set DB2.X (similar to DB2) by using two different TCR sequences as Clones 1 and 2. However, no difference in the clonal composition detection was observed (Supplementary Figure S4). Clonal plane analysis We conducted the clonal plane analysis for in silico data sets DA2, DB2, DC2 and DD2. In the clonal plane representation, we used a nucleotide sequence containing CDR3-clustered file (default output) from tools, after removing out-of-frame and stop codon sequences. This procedure was not possible in the case of (1) IMonitor, as nucleotide CDR3 sequence frequency file does not include VJ information, so only this sequence file was used and (2) Trig, as no clustered file is generated and sequences were grouped manually based on nucleotide sequence. Therefore, in both of these tools, the selection of in-frame reads was not possible and could explain the severe deviation in the detected GR. In the case of Vidjil, the complete read sequence in the returned clustered file was used. In DA2, expected GR and GE values are zero, as only one clone is present in the data set. Accordingly, all tools showed close to zero GE. However, MiXCR (2) performance in calculating GR was superior among all tools. Vidjil reported a GR of about 5 and all other tools between 7 and 8 (Figure 2A). This first data set, on the clonal plane shows visually the capacity of each tool in reducing the excess of detected clones because of the noise injected into the reads. In DB2, DC2 and DD2, the expected GR was 8.39, 8.49 and 8.5 and the GE was 2.3, 3.91 and 3.91, respectively (Figure 2B–D). All tools, except IMSEQ (DB2, DC2 and DD2) and, marginally, Decombinator (DC2 and DD2), exhibited a profoundly similar trend in detecting the samples GE. This response correlates with the findings in the previous section (Figure 1), and indicates that the correction strategies for discarding malformed CDR3 sequences worked well in most tools. On the contrary, the detected GR is highly variable among groups of tools. RTCR, Vidjil, MiTCR and MiXCR are clustered around a value of GR that is smaller than the maximal expected value. This response indicates that these four tools are efficiently similarly removing false CDR3 regions that derive from mutated versions of clones, while all other tools are overestimating the number of unique CDR3 regions. Especially Trig, IMonitor and LymAnalyzer are finding a large excess of clones. Figure 2. View largeDownload slide Clonal plane depicting performance of tools in richness and evenness estimation on in silico data sets with 0.001 error rate and varying clonality (A) DA2 (100% Clone 1), (B) DB2 (10% Clone 1 and 2% Clone 2), (C) DC2 (1% Clone 1 and 2% Clone 2) and (D) DD2 (0.1% Clone 1 and 2% Clone 2). Black circles represent the expected value of GE and maximal expected value of GR, which is computed from the total number of TCR sequences produced by IgSimulator and also takes into account the nonfunctional reads. Figure 2. View largeDownload slide Clonal plane depicting performance of tools in richness and evenness estimation on in silico data sets with 0.001 error rate and varying clonality (A) DA2 (100% Clone 1), (B) DB2 (10% Clone 1 and 2% Clone 2), (C) DC2 (1% Clone 1 and 2% Clone 2) and (D) DD2 (0.1% Clone 1 and 2% Clone 2). Black circles represent the expected value of GE and maximal expected value of GR, which is computed from the total number of TCR sequences produced by IgSimulator and also takes into account the nonfunctional reads. Fragment length analysis The fragment analysis was performed to enhance the understanding of variations in the tool’s performance with different fragment lengths. Here, DB1 data set with 0 error rate was the source for four data sets with lengths of 250, 150, 100 and 75 bp each. The read numbers detected for Clones 1 and 2 were homogeneous for most tools with the read lengths of 250 and 150 bp (Figure 3). IMSEQ did not yield results when the reads are 150 and 100 bp. RTCR, Decombinator and TCRklass failed to detect Clones 1 and 2 within DB 100 bp reads and, for the 75 bp reads length data set, no tools were able to return any clone. Remarkably, all the programs showed a sharp decrease in the number of reads detected for each clone at shorter read lengths. For example, the 150 bp data set yielded ∼50% lower read number as compared with the data set with 250 bp long reads. A similar trend was observed in the percentage of the total used reads (Supplementary Figure S5). Figure 3. View largeDownload slide Number of reads detected for Clones 1 and 2 in sample DB1 (10% Clone 1 and 2% Clone 2 with 0 error rate) with read lengths of 250, 150, 100 and 75 bp. MiTCR, MiXCR, LymAnalyzer, IMonitor and Trig showed better performance allowing the detection of Clones 1 and 2 in three of the aforementioned read lengths. Figure 3. View largeDownload slide Number of reads detected for Clones 1 and 2 in sample DB1 (10% Clone 1 and 2% Clone 2 with 0 error rate) with read lengths of 250, 150, 100 and 75 bp. MiTCR, MiXCR, LymAnalyzer, IMonitor and Trig showed better performance allowing the detection of Clones 1 and 2 in three of the aforementioned read lengths. Analysis of experimental samples To estimate the accuracy of each method within biological samples, we generated four different samples with known clonalities by spiking-in Jurkat and CCRF-CEM clones in polyclonal T cells (Table 3). However, a disadvantage in analyzing T- and B-cell mRNA instead of DNA is that the expression levels of the receptor transcripts can diverge from 1:1 proportionality ratio depending on the cell type or its activation status. Therefore, we additionally performed qPCR (Supplementary Table S3) to independently estimate the proportion biases of the clonal transcripts. TCR beta chains were analyzed by RACE PCR using UMI to enable UMI-based studies using RTCR and MiGEC-MiXCR tools along with non-UMI analysis (Figure 4). We could not apply UMI Decombinator scripts on our samples, as the unpublished version of Decombinator is designed for a particular UMI protocol that was not compatible with the standard UMI structure of our reads, despite our attempt to test different UMI structures. Trig produced shallow sequence detection for CDR3 regions in all experimental samples; however, in the case of in silico data sets, it performed well. In Samples 1 and 2 (Figure 4A and B, respectively), IMSEQ and Trig showed the highest deviations from the expected output. It is worthwhile to mention that IMSEQ also allows barcode-based analysis, where the length of the barcode string in the prefix of the read is required. However, the analysis of our experimental samples by using these parameters showed a strong underestimation of clonal percentages, and the corresponding UMI analysis is not discussed. In Sample 3, IMSEQ maintained the similar behavior (Figure 4C). In the case of Sample 4, Trig and MiGEC-MiXCR with UMI analysis did not detect Clone 1 (Figure 4D). Besides, the number of reads detected for each clone is represented in Supplementary Figure S6. Figure 4. View largeDownload slide Percentage of Jurkat and CCRF-CEM clones detected by the different methods in four experimental samples with controlled clonalities. Estimated by qPCR (A) Sample 1 with Clone 1: 17.80% and Clone 2: 1.32%, (B) Sample 2 with Clone 1: 1.54% and Clone 2: 1.57%, (C) Sample 3 with Clone 1: 0.17% and Clone 2: 1.96% and (D) Sample 4 Clone 1: 0.03% and Clone 2: 2.26%. Black and gray bars represent the expected percentages of Clones 1 and 2, respectively, estimated by qPCR relative quantification. Figure 4. View largeDownload slide Percentage of Jurkat and CCRF-CEM clones detected by the different methods in four experimental samples with controlled clonalities. Estimated by qPCR (A) Sample 1 with Clone 1: 17.80% and Clone 2: 1.32%, (B) Sample 2 with Clone 1: 1.54% and Clone 2: 1.57%, (C) Sample 3 with Clone 1: 0.17% and Clone 2: 1.96% and (D) Sample 4 Clone 1: 0.03% and Clone 2: 2.26%. Black and gray bars represent the expected percentages of Clones 1 and 2, respectively, estimated by qPCR relative quantification. The clonal status of all analyzed samples was projected onto the clonal plane to compare each tools' performance in recognizing the sample clonality. In general, we expected a definite set of values in the clonality components, richness and evenness, from Sample 1 to Sample 4. The GR of our samples is driven by the T-cell amount and is not known a priori. As the amount of the dominant clones was evaluated in all samples via qPCR, we could determine the expected GE value (Renyi Entropy with α = +∞; see Supplementary Section S2 and Supplementary Table S3). It is possible to group the tools response by the position of the sample points on the clonal plane (Figure 5). In all samples, RTCR, MiXCR, MiTCR, Decombinator and IMonitor (Set 1) exhibit a remarkably similar clonality reconstruction. TCRklass and LymAnalyzer (Set 2) group together and merge with Set 1 in Samples 2, 3 and 4. The two methods using UMI, MiGEC-MiXCR-UMI and RTCR-UMI (Set 3) reduce the detected GR and increase the GE. This modification of the returned clonality in respect with the no UMI version (MiXCR and RTCR) is larger for MiGEC-MiXCR-UMI than for RTCR-UMI showing that the two tools are using this additional information differently. Vidjil returns a GR that is in line with the two previous tools without using the UMI but seems to underestimate the GE in all the samples. Trig and IMSEQ are the two methods showing a manifest deviation both in GR and GE. Trig decays in all the samples, whereas IMSEQ returns low values of GE in Samples 1 and 2. In Sample 3, IMSEQ showed an apparently reasonable estimation of GE, but it is because of an aberrant detection of the Jurkat clone instead of the CCRF-CEM clone (Figure 4C). Figure 5. View largeDownload slide RACE-PCR experimental samples are presented on clonal planes. Estimated by qPCR (A) Sample 1 with Clone 1: 17.80% and Clone 2: 1.32%, (B) Sample 2 with Clone 1: 1.54% and Clone 2: 1.57%, (C) Sample 3 with Clone 1: 0.17% and Clone 2: 1.96% and (D) Sample 4 Clone 1: 0.03% and Clone 2: 2.26%. In general, UMI-based analysis of RTCR and MiGEC-MiXCR along with Vidjil exhibited a decreased GR and other tools, except IMSEQ and Trig, shared similar trend with a higher GR. Horizontal gray line represents the expected GE obtained from the qPCR data. The vertical distance from this line indicates the deviation of the detected GE from the expected value. Figure 5. View largeDownload slide RACE-PCR experimental samples are presented on clonal planes. Estimated by qPCR (A) Sample 1 with Clone 1: 17.80% and Clone 2: 1.32%, (B) Sample 2 with Clone 1: 1.54% and Clone 2: 1.57%, (C) Sample 3 with Clone 1: 0.17% and Clone 2: 1.96% and (D) Sample 4 Clone 1: 0.03% and Clone 2: 2.26%. In general, UMI-based analysis of RTCR and MiGEC-MiXCR along with Vidjil exhibited a decreased GR and other tools, except IMSEQ and Trig, shared similar trend with a higher GR. Horizontal gray line represents the expected GE obtained from the qPCR data. The vertical distance from this line indicates the deviation of the detected GE from the expected value. Overall, Set 3 is capturing the sample richness and evenness well. Vidjil underestimates the GE in all the samples but is in line with Set 3 in detecting the GR. Sets 1 and 2 overestimate the clonal abundance but reasonably reconstruct the samples evenness. IMSEQ and Trig deviate widely from the other tools and from the expected GE. However, as previously mentioned Trig overall detected an insufficient number of CDR3 sequences that probably can result in this difference. Runtime efficiency To estimate the runtime performance of tools on varying data sets, we have compared the time consumption by each tool to analyze the in silico samples. Each sample comprised 1.0 M reads depicting an increase in clonality, i.e. DA1, DB1 and DC1 (Figure 6). Figure 6. View largeDownload slide Comparison of time consumption by tools in analyzing in silico data sets with different clonalities DA1, DB1 and DC1 comprising each 1.0 M reads. MiTCR is the most time efficient, and Vidjil is computationally most expensive among all tools, followed by TCRklass, LymAnalyzer and IMonitor. The red bar in Vidjil represents the additional time taken to generate all the translated amino acid CDR3 sequences, and by default, it generates only first 100 sequences. Figure 6. View largeDownload slide Comparison of time consumption by tools in analyzing in silico data sets with different clonalities DA1, DB1 and DC1 comprising each 1.0 M reads. MiTCR is the most time efficient, and Vidjil is computationally most expensive among all tools, followed by TCRklass, LymAnalyzer and IMonitor. The red bar in Vidjil represents the additional time taken to generate all the translated amino acid CDR3 sequences, and by default, it generates only first 100 sequences. MiTCR showed the smallest execution time followed by IMSEQ, Decombinator and MiXCR. Overall LymAnalyzer, IMonitor, RTCR and TCRklass time consumption was comparable, although TCRklass and IMonitor were slightly more computationally expensive. In the case of Vidjil, time efficiency was measured in two situations, once with CDR3 generation for the first 100 clones (recommended by the tool’s authors) and later for the CDR3 generation for all detected clones that required large processing time. Interestingly, MiTCR, MiXCR, IMonitor, Vidjil, Trig and IMSEQ analysis times slightly correlate with sample clonality. LymAnalyzer, Decombinator and RTCR depicted a more linear trend between the read number and the time consumed. Certain tools including MiTCR, MiXCR, RTCR, Trig, IMonitor and IMSEQ also allow partial parallel processing options. Time consideration is a key factor in selecting the analysis tool when a large set of samples has to be analyzed. Conclusion We have carried out a comparative study of 10 tools for TCR-sequencing data analysis by using various data sets to guide users in the selection of an appropriate analysis method. Our study demonstrates that, in general, the analysis results are profoundly affected by the choice of the method, and no single tool maintains the best efficiency in all cases. In general, the in silico data analysis revealed that RTCR, MiXCR and MiTCR seem a solid choice for analysis followed by LymAnalyzer, TCRklass, IMonitor, Vidjil and Trig. These tools showed a better accuracy on the in silico data sets for clonal composition and clonal plane analysis. The first three tools seem to be less affected in assigning different nucleotide strings, arising from the same sequence because of random mutations, to the same CDR3 peptide. However, MiTCR can be used only with built-in reference genome database of human and mouse sequences, constraining the researchers to a frozen version of reference genome for these two species. RTCR was more consistent in all error states, followed by MiXCR, MiTCR and Vidjil. LymAnalyzer and Trig lack a well-defined quality control description. The clonal plane analysis and the determination of specific clonal percentages in experimental samples revealed that the Trig and IMSEQ tools exhibit a significant deviation from the other programs output and from the expected values. However, it is equally important to mention that we have performed these comparative studies with default settings, wherever feasible. Therefore, it is possible that the optimization of available configurable parameters in each tool can potentially enhance its performance. Interestingly, LymAnalyzer, Vidjil and TCRklass tools reported the same read/CDR3 sequence as different clonotypes with respective frequency values. In Vidjil and LymAnalyzer, the same CDR3 sequence describes as different clonotype also had same V and J segments, whereas in case of TCRklass, two clones with same CDR3 sequence were regarded as different clonotypes only when they mapped to different V and J segments. This analysis behavior was more pronounced in case of monoclonal and oligoclonal samples, whereas it was not observed in highly polyclonal samples. Our study reveals that all tools under investigation are suitable for 250 and 150 bp (except IMSEQ) read lengths. However, RTCR, TCRklass, IMSEQ and Decombinator did not successfully execute for 100 bp reads. In general, a decrease in the read length from 250 to 150 bp results in a concomitant sharp decrease in number of detected reads by each method. This observation affects all the tools in this study and is essential during the preparation of the samples. Particular care should be taken in increasing the average length of the TCR fragments, in the sequenced reads, >200–250 bp. This precaution dramatically improves the fraction of reads that are used by the tools to reconstruct the TCR clonal repertoire. The output of each tool also varies in terms of information generated per detected sequence and additional supporting or intermediate files produced. TCRklass provides useful statistics and VJ profiling options like IMonitor. However, MiXCR tool followed by RTCR and Vidjil provides a more detailed output for detected clonotypes. IMonitor does not output the combined clustered result file with CDR3 and annotated VJ gene segments, which is a part of the standard output of all other tools. Trig does not generate clustered file and CDR3 amino acid sequences; Decombinator output is less comprehensive, as it does not provide clear information about VJ segments; however, it is an unpublished version and still under development. All these tools also vary greatly in terms of the accompanied documentation, available user configurable parameters and input data file formats, which are also crucial factors in tools selection. MiXCR provides the most detailed documentation and control over analysis by providing users a number of adjustable parameters, whereas in contrast, LymAnalyzer and Trig provide the least options for user customizability. MiXCR (along with MiGEC) and RTCR provide the added advantage of UMI-based analysis, which can further enhance the accuracy of analysis. In general, by taking into account all the different factors and results of our comparison, MiXCR and RTCR tools prove to be the most balanced in term of flexibility, usability and accuracy. Key Points Several tools have been recently developed for large-scale TCR analysis. However, no clear comparison is available to assist in choosing an optimal analysis approach. Tool selection has a huge impact on analysis results. The computational TCR analysis field still has space for improvements, as among currently available state-of-the-art tools, no single tool showed the best and consistent performance in all kinds of analyses. However, new versions of tools are constantly being released, thus showing that the field is under active development. Overall our comparison shows that MiXCR and RTCR are the optimal choice among all available methods for TCR analysis followed by MiTCR, Vidjil, TCRklass and LymAnalyzer. Advancement in the experimental protocols by using UMI has the potential to further improve the accuracy of TCR repertoire analysis, specifically in samples with enhanced complexity. Supplementary Data Supplementary data are available online at http://bib.oxfordjournals.org/. Funding This work was supported by the NCT 3.0 funding programs NCT3.0_2015.13 ImmunOmics and NCT3.0_2015.2 SPL/RP. Saira Afzal is a postdoctoral researcher at German Cancer Research Center. She is doing research in area of bioinformatics with focus on computational tools development and large-scale data analysis. Irene Gil-Farina is a postdoctoral researcher at German Cancer Research Center. Her research is mainly focused on gene therapy and immunology. Richard Gabriel is a postdoctoral researcher at German Cancer Research Center. He is doing research mainly in gene therapy. Shahzad Ahmad was a Master’s student at German Cancer Research Center. His research interests include bioinformatics tool development and omics data analysis. Christof von Kalle is one of the directors of National Center of Tumor Diseases and is leading the research division of Translational Oncology at German Cancer Research Center. Manfred Schmidt is a group leader at German Cancer Research Center. He has research expertise in wide range of areas including molecular and gene therapy. Raffaele Fronza is a postdoctoral researcher at German Cancer Research Center. His major research interests include algorithm development, computational analysis and system biology. Acknowledgement We would like to acknowledge that Polyclonal Monoclonal diversity index and code to visualize clonal plane was a part of original Master thesis (Life Science Informatics) of Ahmad, S., submitted to Bonn-Aachen International Center for Information Technology, Bonn Germany. References 1 Arstila TP, Casrouge A, Baron V, et al.   A direct estimate of the human αβ T cell receptor diversity. Science  1999; 286: 958– 61. Google Scholar CrossRef Search ADS PubMed  2 Douek DC, Betts MR, Brenchley JM, et al.   A novel approach to the analysis of specificity, clonality, and frequency of HIV-specific T cell responses reveals a potential mechanism for control of viral escape. J Immunol  2002; 168: 3099 LP– 104. Google Scholar CrossRef Search ADS   3 Wang C, Sanders CM, Yang Q, et al.   High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets. Proc Natl Acad Sci USA  2010; 107: 1518– 23. Google Scholar CrossRef Search ADS PubMed  4 Schatz DG, Ji Y. Recombination centres and the orchestration of V(D)J recombination. Nat Rev Immunol  2011; 11: 251– 63. Google Scholar CrossRef Search ADS PubMed  5 Dziubianau M, Hecht J, Kuchenbecker L, et al.   TCR repertoire analysis by next generation sequencing allows complex differential diagnosis of T cell-related pathology. Am J Transplant  2013; 13: 2842– 54. Google Scholar CrossRef Search ADS PubMed  6 Tumeh PC, Harview CL, Yearley JH, et al.   PD-1 blockade induces responses by inhibiting adaptive immune resistance. Nature  2014; 515: 568– 71. Google Scholar CrossRef Search ADS PubMed  7 Cha E, Klinger M, Hou Y, et al.   Improved survival with T cell clonotype stability after anti-CTLA-4 treatment in cancer patients. Sci Transl Med  2014; 6: 238ra70. Google Scholar CrossRef Search ADS PubMed  8 Snyder A, Nathanson T, Funt SA, et al.   Contribution of systemic and somatic factors to clinical response and resistance to PD-L1 blockade in urothelial cancer: an exploratory multi-omic analysis. PLoS Med  2017; 14: e1002309. Google Scholar CrossRef Search ADS PubMed  9 Jackson KJL, Liu Y, Roskin KM, et al.   Human responses to influenza vaccination show seroconversion signatures and convergent antibody rearrangements. Cell Host Microbe  2014; 16: 105– 14. Google Scholar CrossRef Search ADS PubMed  10 Clemente MJ, Przychodzen B, Jerez A, et al.   Deep sequencing of the T-cell receptor repertoire in CD8+ T-large granular lymphocyte leukemia identifies signature landscapes. Blood  2013; 122: 4077 LP– 85. Google Scholar CrossRef Search ADS   11 Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet  2008; 9: 387– 402. Google Scholar CrossRef Search ADS PubMed  12 Benati D, Galperin M, Lambotte O, et al.   Public T cell receptors confer high-avidity CD4 responses to HIV controllers. J Clin Invest  2016; 126: 2093– 108. Google Scholar CrossRef Search ADS PubMed  13 Birnbaum ME, Mendoza JL, Sethi DK, et al.   Deconstructing the peptide-MHC specificity of T cell recognition. Cell  2017; 157: 1073– 87. Google Scholar CrossRef Search ADS   14 Van Den Berg HA, Molina-París C, Sewell AK. Specific T-cell activation in an unspecific T-cell repertoire. Sci Prog  2011; 94: 245– 64. Google Scholar CrossRef Search ADS PubMed  15 Heather JM, Ismail M, Oakes T, et al.   High-throughput sequencing of the T-cell receptor repertoire: pitfalls and opportunities. Brief Bioinform  2017, doi: 10.1093/bib/bbx138. 16 Bolotin DA, Shugay M, Mamedov IZ, et al.   MiTCR: software for T-cell receptor sequencing data analysis. Nat Methods  2013; 10: 813– 4. Google Scholar CrossRef Search ADS PubMed  17 Shugay M, Britanova OV, Merzlyak EM, et al.   Towards error-free profiling of immune repertoires. Nat Methods  2014; 11: 653– 5. Google Scholar CrossRef Search ADS PubMed  18 Bolotin DA, Poslavsky S, Mitrophanov I, et al.   MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods  2015; 12: 380– 1. Google Scholar CrossRef Search ADS PubMed  19 Yu Y, Ceredig R, Seoighe C. LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins. Nucleic Acids Res  2015; 44: e31. Google Scholar CrossRef Search ADS PubMed  20 Kuchenbecker L, Nienen M, Hecht J, et al.   IMSEQ—a fast and error aware approach to immunogenetic sequence analysis. Bioinformatics  2015; 31: 2963– 71. Google Scholar CrossRef Search ADS PubMed  21 Yang X, Liu D, Lv N, et al.   TCRklass: a new K-string-based algorithm for human and mouse TCR repertoire characterization. J Immunol  2015; 194: 446– 54. Google Scholar CrossRef Search ADS PubMed  22 Zhang W, Du Y, Su Z, et al.   Imonitor: a robust pipeline for TCR and BCR repertoire analysis. Genetics  2015; 201: 459– 72. Google Scholar CrossRef Search ADS PubMed  23 Gerritsen B, Pandit A, Andeweg AC, et al.   RTCR: a pipeline for complete and accurate recovery of T cell repertoires from high throughput sequencing data. Bioinformatics  2016; 32: 3098– 106. Google Scholar CrossRef Search ADS PubMed  24 Duez M, Giraud M, Herbert R, et al.   Vidjil: a web platform for analysis of high-throughput repertoire sequencing. PLoS One  2016; 11: e0166126. Google Scholar CrossRef Search ADS PubMed  25 Thomas N, Heather J, Ndifon W, et al.   Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine. Bioinformatics  2013; 29: 542– 50. Google Scholar CrossRef Search ADS PubMed  26 Hung S-J, Chen Y-L, Chu C-H, et al.   TRIg: a robust alignment pipeline for non-regular T-cell receptor and immunoglobulin sequences. BMC Bioinformatics  2016; 17: 433. Google Scholar CrossRef Search ADS PubMed  27 Brochet X, Lefranc MP, Giudicelli V. IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res  2008; 36: 503– 8. Google Scholar CrossRef Search ADS   28 Ye J, Ma N, Madden TL, et al.   IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res  2013; 41: 34– 40. Google Scholar CrossRef Search ADS   29 Hurlbert SH. The nonconcept of species diversity: a critique and alternative parameters. Ecology  1971; 52: 577– 86. Google Scholar CrossRef Search ADS   30 Efron B, Thisted R. Estimating the number of unseen species: how many words did Shakespeare know? Biometrika  1976; 63: 435– 47. 31 Chao A. Nonparametric estimation of the number of classes in a population. Scand J Stat  1984; 30: 265– 70. 32 Shannon CE, Weaver W. The Mathematical Theory of Communication . Urbana: University of Illinois Press, 1949. 33 Simpson EH. Measurement of diversity. Nature  1949; 163: 688. Google Scholar CrossRef Search ADS   34 Nagendra H. Opposite trends in response for the Shannon and Simpson indices of landscape diversity. Appl Geogr  2002; 22: 175– 86. Google Scholar CrossRef Search ADS   35 Grice EA, Kong HH, Renaud G, et al.   A diversity profile of the human skin microbiota. Genome Res  2008; 18: 1043– 50. Google Scholar CrossRef Search ADS PubMed  36 Peet R. The measurements of species diversity. Ann Rev Eeal Syst  1974; 5: 285– 307. Google Scholar CrossRef Search ADS   37 Tóthmérész B. Comparison of different methods for diversity ordering. J Veg Sci  1995; 6: 283– 90. Google Scholar CrossRef Search ADS   38 Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol  1981; 147: 195– 7. Google Scholar CrossRef Search ADS PubMed  39 Camacho C, Coulouris G, Avagyan V, et al.   BLAST plus: architecture and applications. BMC Bioinformatics  2009; 10: 241. Google Scholar CrossRef Search ADS PubMed  40 Liao Y, Smyth GK, Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res  2013; 41: 41 Rasmussen KR, Stoye J, Myers EW, et al.   Efficient q-gram filters for finding all ε-matches over a given length. J Comput Biol  2006; 13: 296– 308. Google Scholar CrossRef Search ADS PubMed  42 Myers G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J ACM  1999; 46: 395– 415. Google Scholar CrossRef Search ADS   43 Ye J, McGinnis S, Madden TL. BLAST: improvements for better sequence analysis. Nucleic Acids Res  2006; 34: 6– 9. Google Scholar CrossRef Search ADS   44 Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods  2012; 9: 357– 9. Google Scholar CrossRef Search ADS PubMed  45 Safonova Y, Lapidus A, Lill J. IgSimulator: a versatile immunosequencing simulator. Bioinformatics  2015; 31: 3213– 5. Google Scholar CrossRef Search ADS PubMed  46 Lefranc MP, Giudicelli V, Ginestoux C, et al.   IMGT®, the international ImMunoGeneTics information system®. Nucleic Acids Res  2009; 37: 1006– 12. Google Scholar CrossRef Search ADS   47 Wgsim. https://github.com/lh3/wgsim. 48 Ruggiero E, Nicolay JP, Fronza R, et al.   High-resolution analysis of the human T-cell receptor repertoire. Nat Commun  2015; 6: 8081. Google Scholar CrossRef Search ADS PubMed  49 Gupta NT, Vander Heiden JA, Uduman M, et al.   Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics  2015; 31: 3356– 8. Google Scholar CrossRef Search ADS PubMed  © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Briefings in Bioinformatics Oxford University Press

Systematic comparative study of computational methods for T-cell receptor sequencing data analysis

Loading next page...
 
/lp/ou_press/systematic-comparative-study-of-computational-methods-for-t-cell-PLWdGd4iFj
Publisher
Oxford University Press
Copyright
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
1467-5463
eISSN
1477-4054
D.O.I.
10.1093/bib/bbx111
Publisher site
See Article on Publisher Site

Abstract

Abstract High-throughput sequencing technologies have exposed the possibilities for the in-depth evaluation of T-cell receptor (TCR) repertoires. These studies are highly relevant to gain insights into human adaptive immunity and to decipher the composition and diversity of antigen receptors in physiological and disease conditions. The major objective of TCR sequencing data analysis is the identification of V, D and J gene segments, complementarity-determining region 3 (CDR3) sequence extraction and clonality analysis. With the advancement in sequencing technologies, new TCR analysis approaches and programs have been developed. However, there is still a deficit of systematic comparative studies to assist in the selection of an optimal analysis approach. Here, we present a detailed comparison of 10 state-of-the-art TCR analysis tools on samples with different complexities by taking into account many aspects such as clonotype detection [unique V(D)J combination], CDR3 identification or accuracy in error correction. We used our in silico and experimental data sets with known clonalities enabling the identification of potential tool biases. We also established a new strategy, named clonal plane, which allows quantifying and comparing the clonality of multiple samples. Our results provide new insights into the effect of method selection on analysis results, and it will assist users in the selection of an appropriate analysis method. T-cell receptor, high-throughput sequencing, computational analysis, immunology, immunosequencing, benchmarking Introduction The T-cell receptor (TCR), present on the surface of T lymphocytes, is responsible for the recognition of antigens presented within the major histocompatibility complex. Therefore, the TCR population must reflect the high diversity of the antigenic epitopes that will be displayed during a subject’s lifetime. The complementarity-determining region 3 (CDR3) accounts for most of this diversity, as it constitutes the site of V(D)J recombination [1, 2]. In addition to the recombination of the V, D and J gene segments, the random insertion and deletion of nucleotides in junctions further contribute to the diversity in the immune repertoire [1–5]. Consequently, the investigation of TCR diversity and clonality, specifically CDR3 analysis, is critical for unraveling the complexity of the adaptive immune system. Besides, changes in the clonality of specific T cells also provide significant information in certain pathological conditions [6–8]. With the advent of next-generation sequencing, the vast amount of immunogenomics data allows qualitative and quantitative assessment of lymphocyte repertoire and its dynamics at a higher resolution. These studies have implications for medical and clinical research including autoimmune diseases, malignancies, vaccination and transplantation processes [5, 9–11]. In the past decade, many strategies were released to analyze TCR sequence data, from assigning raw reads to the correct V(D)J regions (primary tools class), determining the clinical meaning of that precise CDR3 region (secondary tools class) [12–14]. Recently, Heather et al. published a paper focused on a generalized review of TCR analysis tools belonging to both categories. The primary objective was to provide a brief outline of the steps involved in TCR analysis, as well as the challenges associated with the study [15]. As a further step toward comprehensive TCR analysis, the present work focuses on the dissection of tools belonging to the first class providing a systematic benchmarking of available analysis algorithms. Thorough and comprehensive guidelines allowing general users to choose a reliable method for analysis are still needed. We performed an independent systematic comparison between 10 published tools, MiTCR [16], MiGEC/MiXCR [17, 18], LymAnalyzer [19], IMSEQ [20], TCRklass [21], IMonitor [22], RTCR [23], Vidjil [24], Decombinator [25] and Trig [26]. Other relevant immunosequencing analysis tools V-QUEST/High V-QUEST [27] and IgBlast [28] are also available. The first is a Web-based tool with restrictions on the amount of input data and is not suitable for large-scale and Illumina-based sequencing data analysis. IgBlast is available as a web/stand-alone version, but it is also computationally expensive for large data sets. Additionally, these tools require external postprocessing to obtain clustered CDR3 clonotype outputs. However, we have tested one of these two tools that is IMGT/High V-Quest, on small subsets of the data sets. In our benchmarking analysis, we used the clonality of controlled experimental samples to compare the performances of the methods under investigation. Numerous diversity measures are used to quantify the repertoire clonality of TCR samples (Supplementary Section S1). The concept of diversity was founded in ecological studies, where richness and evenness are used for estimating the clonality of TCR repertoires. Ecological diversity could be schematized as a function of the number of species present at a particular site (richness) and the relative distribution of individuals (evenness) among these species [29–31]. This focus either on the evenness or the richness may lead to misleading conclusions. Shannon–Wiener [32] and Simpson’s [33] indices are based both on richness and evenness, but their sensitivity varies toward rare species. Shannon–Wiener index is more biased toward richness, while Simpson’s index highlights the evenness as the predominant component of diversity [34, 35]. Hulbert showed this opposite trend on an ideal data set and Peet [36] described the reason for the divergence between these two indices. Tóthmérész [37] suggested representing the diversity as a profile of a collection of indices such as the Rényi index family (Rényi entropies). This family is defined by the following equation:   Hα=11−αlog ∑i=1Spiα, Where α ∈[0,+∞], S is the number of species and pi is the probability of the species i. It is easy to prove that Shannon–Wiener index = Hα=1 = H1  and Simpson’s index = Hα=2 = H2. An intersection of the Rényi profiles of two samples within the interval α∈[1, 2] indicates that Sample 1 has a larger diversity as Sample 2 using Shannon and the opposite using Simpson, or vice versa. The significant monotonic property of Rényi entropies can be directed to compare the diversity of different sites. When α=0 is set in the Renyi equation, the formula can be simplified to H0 =log S, obtaining the GR of the sample. With α=+∞, the equation collapse to H∞=-log p' and generalizing the evenness (GE; p' is the maximal proportional abundance among all the species; see the Supplementary Section S1 for details). These two values allow defining an area, the Clonal Plane (Supplementary Figure S1), an area to plot the GR (H0) versus the GE (H∞) of one or more samples. It is remarkable to observe that: (1) only combinations of (GR, GE) where GR≥GE are possible; (2) closer is GE to GR, more even is the sample; (3) if GE ≫ GR, the sample contains a dominant species and the sample is closer to monoclonality (Supplementary Sections S2 and S3). Visually, this translates to the observation of the position of the sample points with respect to the bisector of the first quadrant (maximal theoretical polyclonality) and the x-axis (minimal theoretical monoclonality). Plotting the two extreme Rényi entropies of a sample set in the clonal plane permits to estimate and compare the samples clonality examining at the same time the generalized richness (GR), the generalized evenness and the distance from the maximal theoretical polyclonality/monoclonality (Supplementary Sections S3 and S4). Methods Tools for immunosequencing data analysis For comprehensive details, the user is referred to the original publication and documentation. MiTCR maps the query sequence to a reference by first searching seed n-mers in the query followed by an extension of the alignment in both directions in the sequence. It detects V, D and J segments and performs clonotyping with polymerase chain reaction (PCR) and sequencing error correction. It provides several adjustable parameters for error correction along with quality consideration and also rescues low-quality sequencing reads. MiGEC method is designed for data containing unique molecular identifiers (UMIs). It uses a seed-and-extend algorithm combined with a Smith–Waterman [38] strategy and a modified BLAST version [39]. MiXCR implements a particular version of K-mer algorithm referred to as KAligner [40]. In the case of paired-end reads, it combines pairing information and takes into account indels as well as mismatched information. It identifies V, D, J and C gene segments, performs clonotyping and retrieves information from low-quality data. In the clustering step, a multilayer heuristic approach is used and the resultant clonal hits are aligned by the Smith–Waterman algorithm. MiXCR can be used, along with MiGEC to analyze UMI-based data. LymAnalyzer allows novel single-nucleotide polymorphism and lineage mutation trees analysis for somatic hypermutation detection. It incorporates a fast-tag-searching algorithm for mapping the input sequences to reference gene segments, followed by extraction of CDR3 region. Obtained sequences are classified and clustered to obtain frequency per clonotype. It does not provide configurable parameters for quality control. IMSEQ generates short substrings from the reference gene segments corresponding to the conserved residues Phe/Cys and maps them to the query sequence. It processes selected reads by SWIFT filters [41] and Myers bit-vector algorithm (MBVA) [42], followed by CDR3 identification. In the repertoire generation step, the clonotypes are clustered to correct PCR and sequencing errors followed by identification of ambiguous V or J segments. TCRklass incorporates an algorithm based on k-string to identify CDR3. Initially, it generates two sub-data sets (assembled and unassembled) from the input data. The indexed reference database is required, and the position of the reference k-strings is used to find the CRD3 region in the input data. High- and low-quality reads are then processed in subsequent steps followed by evaluation of PCR, sequencing and mRNA synthesis errors. IMonitor performs trimming and quality control at the first step. Sequences are then aligned to reference by BLAST [43] followed by a second alignment step to identify corresponding genes with higher accuracy. In subsequent steps, the CDR3 region of the query sequence is identified and PCR or sequencing error correction is performed, followed by generation of statistical and graphical output. It allows only FASTA input format for single-end data. RTCR performs mapping of the germline V and J segments with the raw TCR data sequencing reads using Bowtie2 [44]. It identifies and extracts CDR3 regions from sequences and annotate them accordingly with aligned V and J segments. To deal with sequencing and PCR errors, it implements a statistical model to estimate the sequences with errors and percentage of errors in each sequence. RTCR allows analysis with and without UMI containing sequences. Vidjil performs a VDJ recombination analysis by using a seed-based method. This algorithm allows a time-efficient analysis by performing clustering on the V(D)J junction (of 50 bp sequence) in the first stage. It is followed by the assignment of particular V, D and J regions. It further allows downstream analysis including diversity estimation and visualization. Decombinator implements a modified version of Aho–Corasick pattern matching algorithm for detection of V and J segments. It searches for the substrings of V/J regions referred as ‘tag’ (i.e. the nucleotide sequence of 20 bp) by an exhaustive search approach to identify specific gene segments. However, this older version of Decombinator is not updated, and a new unpublished version v3 (used in our study) is available that has been specifically designed to work with UMI-based data mainly. Trig tool performs detection of proper and aberrant TCR and B-cell receptor (BCR) sequences. Trig performs mapping of sequencing reads to complete reference sequence and implements a heuristic strategy to distinguish ambiguous and nonambiguous regions. It allows FASTA input file format only. In silico data sets In silico data sets were generated to perform the comparative assessment of selected methods. Data sets were produced with four different clonality values and varying error rates (Table 1). The first data set (DA) consists of 100% Clone 1, whereas other three data sets (DB, DC and DD) comprised 10, 1 and 0.1% Clone 1, respectively. Clone 2 is present in a constant amount (2%) in each of these three samples, whereas the rest of the reads in each data set is composed of mixed TCR sequences with a uniform distribution. To generate several TCR combinations, we used only the first module of IgSimulator [45], creating the initial base sequence file of equally probable gene segments. It comprised recombinant VDJ strings, where somatic deletions of gene segments (start and end of V and J region, respectively) are simulated along with the addition of random nucleotides at junction regions within gene segments. Reference VDJ human TCR sequences were obtained from IMGT database [46]. These strings were then used to generate preliminary data sets with a different clonal composition of Clones 1 and 2. The read simulator wgsim [47] was used to obtain the final set of synthetic reads. We used three levels of error profiles, a zero, a low and a high error rate. To estimate our lower bound for error frequency, we also evaluated the expected amount of errors that were introduced only by the PCR step (Supplementary Section S5). The calculated error frequency (0.02%) was then scaled, arbitrarily, to 0.1% (one error in 1000 bases) to define the final low error rate. A frequency error rate >10 times (1%), defines the high error rate. We found that errors outside this range fall into the error rate regions that are outside the dynamic range of most of the tools. The number of potential clones present in all DA, DB, DC and DD was 1, 4402, 4852 and 4897, respectively. It is highly important to clarify that IgSimulator produces more nonfunctional than functional sequences. For this reason, the number of functional clones, the true richness of the samples, is expected to be smaller. Each data set among these 12 data sets comprised total 1 000 000 single-end reads with a length of 250 bp. These data sets are available at PowerFolder (https://powerfolder.dkfz.de/dl/fiCka39jmibXokiJyVUFkjBB/InSilicoSamples_BIB.zip). Table 1. Twelve in silico data sets were generated with different clonality composition and error rates (0, 0.001 and 0.01) Data sets  Sequences with error rate 1 (0%)  Sequences with error rate 2 (0.1%) or 0.001  Sequences with error rate 3 (1%) or 0.01  Data set 1  Data set 2  Data set 3  Data set A (100%)  DA1: 1 000 000 Clone 1  DA2: 1 000 000 Clone 1  DA3: 1 000 000 Clone 1  Data set B (10%+2%+Poly)  DB1: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB2: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB3: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  Data set C (1%+2%+Poly)  DC1: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC2: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC3: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  Data set D (0.1%+2%+Poly)  DD1: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD2: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD3: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  Data sets  Sequences with error rate 1 (0%)  Sequences with error rate 2 (0.1%) or 0.001  Sequences with error rate 3 (1%) or 0.01  Data set 1  Data set 2  Data set 3  Data set A (100%)  DA1: 1 000 000 Clone 1  DA2: 1 000 000 Clone 1  DA3: 1 000 000 Clone 1  Data set B (10%+2%+Poly)  DB1: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB2: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB3: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  Data set C (1%+2%+Poly)  DC1: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC2: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC3: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  Data set D (0.1%+2%+Poly)  DD1: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD2: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD3: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  Note: Data sets A, B, C and D comprised 100, 10, 1 and 0.1% Clone 1, respectively, and 2% Clone 2 was present in all data sets except DA. Table 1. Twelve in silico data sets were generated with different clonality composition and error rates (0, 0.001 and 0.01) Data sets  Sequences with error rate 1 (0%)  Sequences with error rate 2 (0.1%) or 0.001  Sequences with error rate 3 (1%) or 0.01  Data set 1  Data set 2  Data set 3  Data set A (100%)  DA1: 1 000 000 Clone 1  DA2: 1 000 000 Clone 1  DA3: 1 000 000 Clone 1  Data set B (10%+2%+Poly)  DB1: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB2: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB3: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  Data set C (1%+2%+Poly)  DC1: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC2: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC3: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  Data set D (0.1%+2%+Poly)  DD1: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD2: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD3: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  Data sets  Sequences with error rate 1 (0%)  Sequences with error rate 2 (0.1%) or 0.001  Sequences with error rate 3 (1%) or 0.01  Data set 1  Data set 2  Data set 3  Data set A (100%)  DA1: 1 000 000 Clone 1  DA2: 1 000 000 Clone 1  DA3: 1 000 000 Clone 1  Data set B (10%+2%+Poly)  DB1: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB2: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  DB3: 100 000 Clone 1+20 000 Clone 2+880 000 mix TCR  Data set C (1%+2%+Poly)  DC1: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC2: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  DC3: 10 000 Clone 1+20 000 Clone 2+970 000 mix TCR  Data set D (0.1%+2%+Poly)  DD1: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD2: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  DD3: 1000 Clone 1+20 000 Clone 2+979 000 mix TCR  Note: Data sets A, B, C and D comprised 100, 10, 1 and 0.1% Clone 1, respectively, and 2% Clone 2 was present in all data sets except DA. To estimate the effect of different fragment lengths, three distinct in silico data sets were generated. The data set B with 0% error rate was selected as template and reproduced with varying fragment lengths, i.e. 75, 100 and 150 bp each comprising 1 000 000 reads. It is worthwhile to mention that in silico data set file and quality formats were adjusted according to the requirements of each tool, where needed. In this study, all the tools have been used by using default settings in all cases. However, for data sets with highest error rates, the quality values were adjusted to zero, wherever possible. In certain tools, as MiTCR, MiXCR, RTCR, LymAnalyzer, Decombinator and Vidjil, it is possible to estimate the number of clustered clonotypes before and after removal of out-of-frame and stop codon-containing sequences. For tools like IMSEQ and TCRklass, the generated clustered result files containing VJ segment, CDR3 nucleotide sequence and sequence count information do not contain sequences with out-of-frame and stop codon. In the case of IMonitor, the clustered nucleotide sequence file with abundance information does not filter out-of-frame and stop codon sequences. Trig tool does not produce grouped clonotype result files and do not generate CDR3 amino acid-translated sequences. Where possible the statistics were composed by using filtered files (without out-of-frame/stop codon sequences). The expected percentages of Clones 1 and 2 in all samples were corrected for the total number of sequences that are removed by the tools because of producing nonfunctional CDR3 regions. The correction factor applied is given by the total number of reads divided by the number of refined reads (in-frame) detected per tool (except IMonitor and Trig, where the total number of reads were divided by the number of identified reads). In case of Decombinator tool, we have used two main modules for non-UMI-based evaluation. The nucleotide/amino acid sequence information present in the clustered output file was used to estimate the clonal composition. IMonitor outputs two clustered clonotype files: one only based on CDR3 nucleotide sequence with sequence count and other VJ pairing information with sequence count. The first sequence clustered file has been used in this study. In the case of Trig, no clustered file is generated; therefore, for analysis purpose, the output file containing nucleotide sequence and VJ information was manually clustered based on the sequences. The principal aim of this article is to compare the performance of high-throughput sequencing analysis tools that are capable of large-scale time-efficient analysis. The IMGT/High V-Quest cannot practically analyze data sets consisting each of 1.0 M reads. However, to estimate the TCR discrimination performance, we have generated random subsets of our data sets (DB1, DB2 and DB3) comprising 10 000 reads each and evaluated the performance measures by IMGT/High V-Quest. All the comparisons (except RTCR) were conducted on Ubuntu 10.04 LTE 64 bit operating system, with the specifications of 48 GB of RAM and a Xeon CPU 2.40 GHz with eight cores. Only the RTCR analysis was performed on an OpenStack virtual instance with the same hardware specification and CentOS Linux release 7.2. RTCR remained halted in the initial alignment step at aforementioned Linux system. Experimental data sets We generated four samples with controlled clonality to verify the ten methods under evaluation. Human T cells were isolated from healthy donor peripheral blood mononuclear cells with the EasySep™ Human T Cell Isolation Kit (StemCell Technologies), and purity was assessed to be >95% by flow cytometry. Total RNA was isolated from T cells, Jurkat, CCRF-CEM and HeLa. RNA quality was assessed with the Agilent Bioanalyzer. Isolated RNAs were mixed as indicated in Table 3, and each mix was used for both Rapid Amplification of cDNA Ends (RACE) PCR, as previously described with minor modifications [48], and for quantitative PCR (qPCR). Table 3. Four experimental samples generated with different clonalities by varying Jurkat and CCRF-CEM percentages Samples  Jurkat %  CCRF-CEM %  T cells %  Sample 1  10  2  88  Sample 2  1  2  97  Sample 3  0.1  2  97.9  Sample 4  0.01  2  97.99  Samples  Jurkat %  CCRF-CEM %  T cells %  Sample 1  10  2  88  Sample 2  1  2  97  Sample 3  0.1  2  97.9  Sample 4  0.01  2  97.99  Note: Jurkat percentage was decreased from Samples 1 to 4, whereas CCRF-CEM was kept constant. Table 3. Four experimental samples generated with different clonalities by varying Jurkat and CCRF-CEM percentages Samples  Jurkat %  CCRF-CEM %  T cells %  Sample 1  10  2  88  Sample 2  1  2  97  Sample 3  0.1  2  97.9  Sample 4  0.01  2  97.99  Samples  Jurkat %  CCRF-CEM %  T cells %  Sample 1  10  2  88  Sample 2  1  2  97  Sample 3  0.1  2  97.9  Sample 4  0.01  2  97.99  Note: Jurkat percentage was decreased from Samples 1 to 4, whereas CCRF-CEM was kept constant. Table 4. Subjective performance scores of each tool on in silico data sets are presented based on the clonal percentage detection Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMonitor  IMSEQ  TCRklass  Vidjil  Decombinator  Trig  DA1 in silico  10  10  10  10  10  10  10  10  10  –  DA2 in silico  9  10  8  5  3  3  5  7  6  –  DA3 in silico  8  9  10  5  3  3  5  7  6  –  DB1 in silico  10  10  10  10  10  1  10  10  2  10  DB2 in silico  10  10  10  6  6  1  6  7  2  6  DB3 in silico  9  8  10  6  6  1  6  7  6  6  DC1 in silico  10  10  10  10  10  1  10  10  2  10  DC2 in silico  10  10  10  6  6  1  6  10  2  6  DC3 in silico  8  7  10  6  6  1  6  10  6  6  DD1 in silico  10  10  10  10  10  1  10  10  2  10  DD2 in silico  10  10  10  6  6  1  6  10  6  6  DD3 in silico  8  7  10  6  6  1  6  10  6  6  Performance score  112/120  111/120  118/120  86/120  82/120  25/120  86/120  108/120  56/120  66/120  Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMonitor  IMSEQ  TCRklass  Vidjil  Decombinator  Trig  DA1 in silico  10  10  10  10  10  10  10  10  10  –  DA2 in silico  9  10  8  5  3  3  5  7  6  –  DA3 in silico  8  9  10  5  3  3  5  7  6  –  DB1 in silico  10  10  10  10  10  1  10  10  2  10  DB2 in silico  10  10  10  6  6  1  6  7  2  6  DB3 in silico  9  8  10  6  6  1  6  7  6  6  DC1 in silico  10  10  10  10  10  1  10  10  2  10  DC2 in silico  10  10  10  6  6  1  6  10  2  6  DC3 in silico  8  7  10  6  6  1  6  10  6  6  DD1 in silico  10  10  10  10  10  1  10  10  2  10  DD2 in silico  10  10  10  6  6  1  6  10  6  6  DD3 in silico  8  7  10  6  6  1  6  10  6  6  Performance score  112/120  111/120  118/120  86/120  82/120  25/120  86/120  108/120  56/120  66/120  Note: RTCR, MiXCR, Vidjil and MiTCR obtained highest score. Table 4. Subjective performance scores of each tool on in silico data sets are presented based on the clonal percentage detection Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMonitor  IMSEQ  TCRklass  Vidjil  Decombinator  Trig  DA1 in silico  10  10  10  10  10  10  10  10  10  –  DA2 in silico  9  10  8  5  3  3  5  7  6  –  DA3 in silico  8  9  10  5  3  3  5  7  6  –  DB1 in silico  10  10  10  10  10  1  10  10  2  10  DB2 in silico  10  10  10  6  6  1  6  7  2  6  DB3 in silico  9  8  10  6  6  1  6  7  6  6  DC1 in silico  10  10  10  10  10  1  10  10  2  10  DC2 in silico  10  10  10  6  6  1  6  10  2  6  DC3 in silico  8  7  10  6  6  1  6  10  6  6  DD1 in silico  10  10  10  10  10  1  10  10  2  10  DD2 in silico  10  10  10  6  6  1  6  10  6  6  DD3 in silico  8  7  10  6  6  1  6  10  6  6  Performance score  112/120  111/120  118/120  86/120  82/120  25/120  86/120  108/120  56/120  66/120  Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMonitor  IMSEQ  TCRklass  Vidjil  Decombinator  Trig  DA1 in silico  10  10  10  10  10  10  10  10  10  –  DA2 in silico  9  10  8  5  3  3  5  7  6  –  DA3 in silico  8  9  10  5  3  3  5  7  6  –  DB1 in silico  10  10  10  10  10  1  10  10  2  10  DB2 in silico  10  10  10  6  6  1  6  7  2  6  DB3 in silico  9  8  10  6  6  1  6  7  6  6  DC1 in silico  10  10  10  10  10  1  10  10  2  10  DC2 in silico  10  10  10  6  6  1  6  10  2  6  DC3 in silico  8  7  10  6  6  1  6  10  6  6  DD1 in silico  10  10  10  10  10  1  10  10  2  10  DD2 in silico  10  10  10  6  6  1  6  10  6  6  DD3 in silico  8  7  10  6  6  1  6  10  6  6  Performance score  112/120  111/120  118/120  86/120  82/120  25/120  86/120  108/120  56/120  66/120  Note: RTCR, MiXCR, Vidjil and MiTCR obtained highest score. Briefly, RACE PCR was performed on 250 ng RNA through first-strand complementary DNA (cDNA) synthesis using the SMARTScribe reverse transcriptase (Clontech) by RNA incubation with the primer annealing to TCRbeta constant region at 72 °C for 4 min and at 42 °C for 2 min. The template-switching oligo contained both a 12 bp UMI and a 12 bp sample barcode, and template switch was carried out at 42 °C for 60 min followed by incubation with uracil-DNA glycosylase (5 U/reaction, New England Biolabs) at 37 °C for 40 min. Two nested 18 cycles of exponential PCRs were subsequently performed using 5 μl of the single-stranded cDNA or 2 μl of the first exponential PCR, respectively. In the second exponential PCR, barcoded primers were added to enable library preparation. Libraries were purified with AMPure beads and pooled for 400+100 bp paired-end sequencing in the MiSeq platform. For qPCR, cDNA was synthesized from 250 ng RNA using the ProtoScript® II First Strand cDNA Synthesis Kit (NEB) and the NEBNext® Second Strand Synthesis Buffer and Enzyme Mix (NEB) according to manufacturer’s instructions. Real-time amplification was performed with 50 ng cDNA as input in a LightCycler ® 480 Instrument (Roche) using the LightCycler ® 480 SYBR Green I Master. Relative quantification of CCRF-CEM and Jurkat CDR3 and TRBC has been performed using glyceraldehyde 3-phosphate dehydrogenase as housekeeping gene. Primer sequences are listed in Supplementary Table S2. Scoring system The approach in the ranking of 10 tools is based on a simple scoring model composed of two classes of features. One class contains general features and includes installation, usage, output, BCR analysis, paired/single end, quality, error correction and the reference genome flexibility. The second class scores the tools’ performances on our 12 in silico data sets. We have not prioritized any particular feature. For each characteristic, a numeric score s is assigned to the ten software by the performance or presence/absence of the feature. The score s is a numeric value comprised from 0 to 10, where 10 is the best and 0 is the worst value. If n tools have the same performance, the score assigned is the s, and the next score assigned is then s-n. A method that misses a feature or does not return a result is marked with 0. The resulting scores are then added up. The tools are arranged from the highest to the lowest value. Clonal plane A set of R functions were developed to compute and visualize the clonal values according to the Renyi entropies definitions. The code, sufficient to generate the eight clonal planes in this manuscript, and the four in silico data sets are provided as a Supplementary Material. We used an RStudio playbook as the container for the code along with descriptions of its usage (https://github.com/G100DKFZ/ManuscriptNote). Results and discussion General features We have compared the general features for the ten selected tools, which mainly include the installation, usage, customization, reference genome flexibility and generated output to provide a brief overview (Table 2). The installation of MiTCR, MiXCR, LymAnalyzer and Vidjil requires Java. RTCR installation depends on a few Python packages and Bowtie2. IMonitor depends on BLAST (blastall routine) for alignment; however, it is provided by authors within the Linux package. For IMonitor, indexed nucleotide reference sequence files are needed, which requires alignment to find the start position of CDR3 region in V and end of CDR3 in J reference followed by indexing. TCRklass installation also depends on some external programs and libraries, and its use requires index reference database for nucleotide and amino acid V and J reference sequences. An alignment is needed for amino acid references to locate the position of the conserved residue ‘C’ in V and ‘F’ or ‘W’ in J followed by indexing. Installation of Decombinator depends on certain external dependencies, and individual modules need to be executed for stepwise analysis. Trig pipeline depends on an external alignment tool MUMMER. Table 2. General features of 10 tools under investigation are represented Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMSEQ  TCRklass  IMonitor  Vidjil  Decombinator  Trig  Version  1.0.3  2.1.2  1.0  1.2.1  1.1.0  0.7.2  2.0  0.03  3.1 (Unpublished)  1.0  Year published  2013  2015  2016  2015  2015  2015  2015  2016  2017  2016  Operating System  Linux/Windows  Linux/Windows  Linux  Linux/Windows  Linux  Linux  Linux  Linux/Web  Linux  Linux  Linux installation and external tools  Java  Java  Bowtie 2, Python packages  Java  C ++, Zlib  HTQC package, CMake  Perl, R, BLAST (included local version)  C ++, Java/Python  Acora, Biopython, Regex, Python-Levenshtein  MUMMER  Ease of usage and customizability  10  10  6  5  8  5  5  8  1  5  Wide range of adjustable features  Wide range of adjustable features  Certain adjustable features  Few adjustable features  Medium range of adjustable features  Preprocess reference, medium range of adjustable features  Preprocess reference, medium range of adjustable features  Medium range of adjustable features  Individual scripts, few adjustable features  Few adjustable features  aOutput  7  10  9  6  6  6  6  9  2  2  Paired-end  No  10 (Yes)  No  No  10 (Yes)  10 (Yes)  10 (Yes)  No  No  No  SE FASTA file only  PE only for UMI data  SE FASTA file only  Quality  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  No  Only for PE  SE FASTA file only  Error correction  10 (Yes)  10 (Yes)  10 (Yes)  3 (Partial)  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  –  UMI allowed  No  10 (Yes, with MiGEC)  10 (Yes)  No  7 (Considers number of characters, not structure)  No  No  No  10 (Yes)  No  (Specific for a protocol in unpublished version)  TCR gamma/delta  Not mentioned  10 (Yes)  10 (Yes)  Not mentioned  Not mentioned  Not mentioned  Not mentioned  10 (Yes)  10 (Yes)  10 (Yes)  BCR  No  10 (Yes)  No  10 (Yes)  10 (Yes)  10 (Yes)Stated that can be used  10 (Yes)  10 (Yes)  No  10 (Yes)  Reference genome flexibility  No  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  –  –  Human/mouse  Human/mouse  Human/mouse  Human, Mouse, etc.  Human or mouse  Human or mouse  Additional features  NA  NA  Basic statistics  SNP, lineage mutation tree  NA  VJ profiling and statistics  Statistics, profiles and plots  Unexpected recombination  NA  Nonregular TCR detection  License  GNU GPL V3  Restricted or commercial  GNU GPLV3.0  Open source  GPL V2 open source  GNU GPL V3.0  Open source  GNU GPLv3 (For research use only)  Open source  GNU GPLv3  Feature score  37/90  90/90  65/90  34/90  71/90  61/90  61/90  57/90  43/90  27/90  Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMSEQ  TCRklass  IMonitor  Vidjil  Decombinator  Trig  Version  1.0.3  2.1.2  1.0  1.2.1  1.1.0  0.7.2  2.0  0.03  3.1 (Unpublished)  1.0  Year published  2013  2015  2016  2015  2015  2015  2015  2016  2017  2016  Operating System  Linux/Windows  Linux/Windows  Linux  Linux/Windows  Linux  Linux  Linux  Linux/Web  Linux  Linux  Linux installation and external tools  Java  Java  Bowtie 2, Python packages  Java  C ++, Zlib  HTQC package, CMake  Perl, R, BLAST (included local version)  C ++, Java/Python  Acora, Biopython, Regex, Python-Levenshtein  MUMMER  Ease of usage and customizability  10  10  6  5  8  5  5  8  1  5  Wide range of adjustable features  Wide range of adjustable features  Certain adjustable features  Few adjustable features  Medium range of adjustable features  Preprocess reference, medium range of adjustable features  Preprocess reference, medium range of adjustable features  Medium range of adjustable features  Individual scripts, few adjustable features  Few adjustable features  aOutput  7  10  9  6  6  6  6  9  2  2  Paired-end  No  10 (Yes)  No  No  10 (Yes)  10 (Yes)  10 (Yes)  No  No  No  SE FASTA file only  PE only for UMI data  SE FASTA file only  Quality  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  No  Only for PE  SE FASTA file only  Error correction  10 (Yes)  10 (Yes)  10 (Yes)  3 (Partial)  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  –  UMI allowed  No  10 (Yes, with MiGEC)  10 (Yes)  No  7 (Considers number of characters, not structure)  No  No  No  10 (Yes)  No  (Specific for a protocol in unpublished version)  TCR gamma/delta  Not mentioned  10 (Yes)  10 (Yes)  Not mentioned  Not mentioned  Not mentioned  Not mentioned  10 (Yes)  10 (Yes)  10 (Yes)  BCR  No  10 (Yes)  No  10 (Yes)  10 (Yes)  10 (Yes)Stated that can be used  10 (Yes)  10 (Yes)  No  10 (Yes)  Reference genome flexibility  No  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  –  –  Human/mouse  Human/mouse  Human/mouse  Human, Mouse, etc.  Human or mouse  Human or mouse  Additional features  NA  NA  Basic statistics  SNP, lineage mutation tree  NA  VJ profiling and statistics  Statistics, profiles and plots  Unexpected recombination  NA  Nonregular TCR detection  License  GNU GPL V3  Restricted or commercial  GNU GPLV3.0  Open source  GPL V2 open source  GNU GPL V3.0  Open source  GNU GPLv3 (For research use only)  Open source  GNU GPLv3  Feature score  37/90  90/90  65/90  34/90  71/90  61/90  61/90  57/90  43/90  27/90  Note: These tools are scored in a subjective manner per feature. a Output is scored based on how detailed information is available per detected clones. – Information not clear/found, SE; Single-end, PE: Paired-end. Table 2. General features of 10 tools under investigation are represented Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMSEQ  TCRklass  IMonitor  Vidjil  Decombinator  Trig  Version  1.0.3  2.1.2  1.0  1.2.1  1.1.0  0.7.2  2.0  0.03  3.1 (Unpublished)  1.0  Year published  2013  2015  2016  2015  2015  2015  2015  2016  2017  2016  Operating System  Linux/Windows  Linux/Windows  Linux  Linux/Windows  Linux  Linux  Linux  Linux/Web  Linux  Linux  Linux installation and external tools  Java  Java  Bowtie 2, Python packages  Java  C ++, Zlib  HTQC package, CMake  Perl, R, BLAST (included local version)  C ++, Java/Python  Acora, Biopython, Regex, Python-Levenshtein  MUMMER  Ease of usage and customizability  10  10  6  5  8  5  5  8  1  5  Wide range of adjustable features  Wide range of adjustable features  Certain adjustable features  Few adjustable features  Medium range of adjustable features  Preprocess reference, medium range of adjustable features  Preprocess reference, medium range of adjustable features  Medium range of adjustable features  Individual scripts, few adjustable features  Few adjustable features  aOutput  7  10  9  6  6  6  6  9  2  2  Paired-end  No  10 (Yes)  No  No  10 (Yes)  10 (Yes)  10 (Yes)  No  No  No  SE FASTA file only  PE only for UMI data  SE FASTA file only  Quality  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  No  Only for PE  SE FASTA file only  Error correction  10 (Yes)  10 (Yes)  10 (Yes)  3 (Partial)  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  –  UMI allowed  No  10 (Yes, with MiGEC)  10 (Yes)  No  7 (Considers number of characters, not structure)  No  No  No  10 (Yes)  No  (Specific for a protocol in unpublished version)  TCR gamma/delta  Not mentioned  10 (Yes)  10 (Yes)  Not mentioned  Not mentioned  Not mentioned  Not mentioned  10 (Yes)  10 (Yes)  10 (Yes)  BCR  No  10 (Yes)  No  10 (Yes)  10 (Yes)  10 (Yes)Stated that can be used  10 (Yes)  10 (Yes)  No  10 (Yes)  Reference genome flexibility  No  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  –  –  Human/mouse  Human/mouse  Human/mouse  Human, Mouse, etc.  Human or mouse  Human or mouse  Additional features  NA  NA  Basic statistics  SNP, lineage mutation tree  NA  VJ profiling and statistics  Statistics, profiles and plots  Unexpected recombination  NA  Nonregular TCR detection  License  GNU GPL V3  Restricted or commercial  GNU GPLV3.0  Open source  GPL V2 open source  GNU GPL V3.0  Open source  GNU GPLv3 (For research use only)  Open source  GNU GPLv3  Feature score  37/90  90/90  65/90  34/90  71/90  61/90  61/90  57/90  43/90  27/90  Tools  MiTCR  MiXCR  RTCR  LymAnalyzer  IMSEQ  TCRklass  IMonitor  Vidjil  Decombinator  Trig  Version  1.0.3  2.1.2  1.0  1.2.1  1.1.0  0.7.2  2.0  0.03  3.1 (Unpublished)  1.0  Year published  2013  2015  2016  2015  2015  2015  2015  2016  2017  2016  Operating System  Linux/Windows  Linux/Windows  Linux  Linux/Windows  Linux  Linux  Linux  Linux/Web  Linux  Linux  Linux installation and external tools  Java  Java  Bowtie 2, Python packages  Java  C ++, Zlib  HTQC package, CMake  Perl, R, BLAST (included local version)  C ++, Java/Python  Acora, Biopython, Regex, Python-Levenshtein  MUMMER  Ease of usage and customizability  10  10  6  5  8  5  5  8  1  5  Wide range of adjustable features  Wide range of adjustable features  Certain adjustable features  Few adjustable features  Medium range of adjustable features  Preprocess reference, medium range of adjustable features  Preprocess reference, medium range of adjustable features  Medium range of adjustable features  Individual scripts, few adjustable features  Few adjustable features  aOutput  7  10  9  6  6  6  6  9  2  2  Paired-end  No  10 (Yes)  No  No  10 (Yes)  10 (Yes)  10 (Yes)  No  No  No  SE FASTA file only  PE only for UMI data  SE FASTA file only  Quality  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  No  Only for PE  SE FASTA file only  Error correction  10 (Yes)  10 (Yes)  10 (Yes)  3 (Partial)  10 (Yes)  10 (Yes)  10 (Yes)  –  10 (Yes)  –  UMI allowed  No  10 (Yes, with MiGEC)  10 (Yes)  No  7 (Considers number of characters, not structure)  No  No  No  10 (Yes)  No  (Specific for a protocol in unpublished version)  TCR gamma/delta  Not mentioned  10 (Yes)  10 (Yes)  Not mentioned  Not mentioned  Not mentioned  Not mentioned  10 (Yes)  10 (Yes)  10 (Yes)  BCR  No  10 (Yes)  No  10 (Yes)  10 (Yes)  10 (Yes)Stated that can be used  10 (Yes)  10 (Yes)  No  10 (Yes)  Reference genome flexibility  No  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  10 (Yes)  –  –  Human/mouse  Human/mouse  Human/mouse  Human, Mouse, etc.  Human or mouse  Human or mouse  Additional features  NA  NA  Basic statistics  SNP, lineage mutation tree  NA  VJ profiling and statistics  Statistics, profiles and plots  Unexpected recombination  NA  Nonregular TCR detection  License  GNU GPL V3  Restricted or commercial  GNU GPLV3.0  Open source  GPL V2 open source  GNU GPL V3.0  Open source  GNU GPLv3 (For research use only)  Open source  GNU GPLv3  Feature score  37/90  90/90  65/90  34/90  71/90  61/90  61/90  57/90  43/90  27/90  Note: These tools are scored in a subjective manner per feature. a Output is scored based on how detailed information is available per detected clones. – Information not clear/found, SE; Single-end, PE: Paired-end. In TCRklass and IMonitor, several independent programs are available to perform either partial or complete analysis. In addition to CDR3 containing sequences, their output also provides detailed V and J profile for CDR3 negative sequences. TCRklass and MiXCR output complete unclustered files that contain read IDs information. MiXCR and RTCR also generate intermediate files. IMonitor outputs clustered clonotypes files only based on CDR3 region; unlike other tools, it does not generate a clustered file containing VJ gene segments and CDR3 information. Vidjil generates clustered CDR3 output files with read IDs information. Trig does not produce clustered result files for clonotypes, and it also does not output information about CDR3 amino acid sequences. The final output of MiXCR followed by Vidjil provides the most detailed information about each clone, as compared with all other tools. Clonal composition analysis The analysis of 12 in silico samples with different clonal composition and error rates was performed with the tools evaluated in this study. To analyze the sensitivity and specificity of each algorithm, the varying clonal composition of two clones in each sample was estimated. The DA data set only contains Clone 1 (100%), and we assessed it at different error rates (DA1: 0, DA2: 0.001 and DA3: 0.01 error rates). Trig did not execute successfully on DA data sets. With a null error rate, Clone 1 percentage was homogeneously detected, but performance of most tools degraded by almost 50% in its detection on increasing error rates (Figure 1A). Nonetheless, MiXCR and RTCR were able to detect the expected clone frequency, despite the error rates assayed in DA2 and DA3. In the case of MiTCR, its performance was homogeneous for DA1 and DA2 and only decreased when analyzing the DA3 data set. Besides, differences in the number of reads detected for Clone 1 were analyzed and a similar trend was observed (Supplementary Figure S2a). RTCR and MiXCR showed more consistency in all error rates and assigned the highest number of reads to Clone 1. Where possible, we also determined the overall percentage of used reads before and after removal of out-of-frame and stop codon reads, and the data correlated with the trend observed in the analysis of read numbers detected for Clone 1 (Supplementary Figure S3a). Figure 1. View largeDownload slide Clones 1 and 2 percentages detected in samples (A) DA with 100% Clone 1, (B) DB with 10% Clone 1 and 2% Clone 2, (C) DC with 1% Clone 1 and 2% Clone 2 and (D) DD with 0.1% Clone 1 and 2% Clone 2. Each plot represents clonal percentage detection at different error rates (0, 0.001 and 0.01). Overall, RTCR followed by MiXCR, MiTCR and Vidjil showed better performance. Black and gray bars represent the expected percentages of Clones 1 and 2, respectively. Figure 1. View largeDownload slide Clones 1 and 2 percentages detected in samples (A) DA with 100% Clone 1, (B) DB with 10% Clone 1 and 2% Clone 2, (C) DC with 1% Clone 1 and 2% Clone 2 and (D) DD with 0.1% Clone 1 and 2% Clone 2. Each plot represents clonal percentage detection at different error rates (0, 0.001 and 0.01). Overall, RTCR followed by MiXCR, MiTCR and Vidjil showed better performance. Black and gray bars represent the expected percentages of Clones 1 and 2, respectively. DB consists of 10% Clone 1 and 2% Clone 2 sequences. The analysis of Clone 1 percentage in DB1 showed a general good performance for all tools. However, IMSEQ yielded values lower (5%) than expected. In case of Clone 2 within DB data sets and in line with DA results, IMSEQ reported 0.99% and Decombinator detected 1.6%. In case of DB2, MiTCR, MiXCR, RTCR and Vidjil exhibited good performance for Clones 1 and 2. A similar behavior was shown by RTCR in DB3 for Clones 1 and 2. About half of the expected percentages were returned by LymAnalyzer, TCRklass, IMonitor and Decombinator in DB3 (Figure 1B). IMSEQ showed analogous deviations in DB2 and DB3 as for DB1. In assigning number of reads for Clones 1 and 2, a similar behavior was observed for all DB samples as was seen in DA for Clone 1 (Supplementary Figure S2b). The percentage of total detected reads is presented in Supplementary Figure S3b. It is worthwhile to mention that we have analyzed a randomly selected subset of 10 000 reads from all DB samples by IMGT/High V-Quest and postprocessed with Change-O tool [49]. The percentage of Clone 1 in DB1, DB2 and DB3 samples was 9.98, 9.86 and 8.54, respectively, whereas the percentage of Clone 2 was 2.3, 2.2 and 2.0, respectively. We have not included any further analysis with this tool and also with IgBlast because of poor time efficiency and additional postprocessing. In case of DC and DD, comprising 1 and 0.1% Clone 1, respectively, and 2% Clone 2, the trend observed was similar to the one retrieved from DB data sets (Figure 1C and D). In the detection of number of reads for Clones 1 and 2 and the overall read numbers, a similar behavior as DB was observed with all DC and DD data sets (Supplementary Figures S2c, S3c, S2d and S3d, respectively). Overall in all 12 samples, RTCR showed better performance followed by MiXCR and MiTCR in the detection of Clones 1 and 2 percentages (Table 4). In general, RTCR maintained a more consistent behavior across all error rates followed by MiXCR, MiTCR and Vidjil. LymAnalyzer, IMonitor, Trig and TCRklass exhibited a highly similar response in detecting the clones across all data sets. Decombinator slightly underestimated the clonal percentages in some samples, whereas IMSEQ showed a large underestimation. Additionally, to understand the effect of altering a particular TCR sequence in clonal percentage detection, we have also generated and analyzed with MiXCR, RTCR and Decombinator an in silico data set DB2.X (similar to DB2) by using two different TCR sequences as Clones 1 and 2. However, no difference in the clonal composition detection was observed (Supplementary Figure S4). Clonal plane analysis We conducted the clonal plane analysis for in silico data sets DA2, DB2, DC2 and DD2. In the clonal plane representation, we used a nucleotide sequence containing CDR3-clustered file (default output) from tools, after removing out-of-frame and stop codon sequences. This procedure was not possible in the case of (1) IMonitor, as nucleotide CDR3 sequence frequency file does not include VJ information, so only this sequence file was used and (2) Trig, as no clustered file is generated and sequences were grouped manually based on nucleotide sequence. Therefore, in both of these tools, the selection of in-frame reads was not possible and could explain the severe deviation in the detected GR. In the case of Vidjil, the complete read sequence in the returned clustered file was used. In DA2, expected GR and GE values are zero, as only one clone is present in the data set. Accordingly, all tools showed close to zero GE. However, MiXCR (2) performance in calculating GR was superior among all tools. Vidjil reported a GR of about 5 and all other tools between 7 and 8 (Figure 2A). This first data set, on the clonal plane shows visually the capacity of each tool in reducing the excess of detected clones because of the noise injected into the reads. In DB2, DC2 and DD2, the expected GR was 8.39, 8.49 and 8.5 and the GE was 2.3, 3.91 and 3.91, respectively (Figure 2B–D). All tools, except IMSEQ (DB2, DC2 and DD2) and, marginally, Decombinator (DC2 and DD2), exhibited a profoundly similar trend in detecting the samples GE. This response correlates with the findings in the previous section (Figure 1), and indicates that the correction strategies for discarding malformed CDR3 sequences worked well in most tools. On the contrary, the detected GR is highly variable among groups of tools. RTCR, Vidjil, MiTCR and MiXCR are clustered around a value of GR that is smaller than the maximal expected value. This response indicates that these four tools are efficiently similarly removing false CDR3 regions that derive from mutated versions of clones, while all other tools are overestimating the number of unique CDR3 regions. Especially Trig, IMonitor and LymAnalyzer are finding a large excess of clones. Figure 2. View largeDownload slide Clonal plane depicting performance of tools in richness and evenness estimation on in silico data sets with 0.001 error rate and varying clonality (A) DA2 (100% Clone 1), (B) DB2 (10% Clone 1 and 2% Clone 2), (C) DC2 (1% Clone 1 and 2% Clone 2) and (D) DD2 (0.1% Clone 1 and 2% Clone 2). Black circles represent the expected value of GE and maximal expected value of GR, which is computed from the total number of TCR sequences produced by IgSimulator and also takes into account the nonfunctional reads. Figure 2. View largeDownload slide Clonal plane depicting performance of tools in richness and evenness estimation on in silico data sets with 0.001 error rate and varying clonality (A) DA2 (100% Clone 1), (B) DB2 (10% Clone 1 and 2% Clone 2), (C) DC2 (1% Clone 1 and 2% Clone 2) and (D) DD2 (0.1% Clone 1 and 2% Clone 2). Black circles represent the expected value of GE and maximal expected value of GR, which is computed from the total number of TCR sequences produced by IgSimulator and also takes into account the nonfunctional reads. Fragment length analysis The fragment analysis was performed to enhance the understanding of variations in the tool’s performance with different fragment lengths. Here, DB1 data set with 0 error rate was the source for four data sets with lengths of 250, 150, 100 and 75 bp each. The read numbers detected for Clones 1 and 2 were homogeneous for most tools with the read lengths of 250 and 150 bp (Figure 3). IMSEQ did not yield results when the reads are 150 and 100 bp. RTCR, Decombinator and TCRklass failed to detect Clones 1 and 2 within DB 100 bp reads and, for the 75 bp reads length data set, no tools were able to return any clone. Remarkably, all the programs showed a sharp decrease in the number of reads detected for each clone at shorter read lengths. For example, the 150 bp data set yielded ∼50% lower read number as compared with the data set with 250 bp long reads. A similar trend was observed in the percentage of the total used reads (Supplementary Figure S5). Figure 3. View largeDownload slide Number of reads detected for Clones 1 and 2 in sample DB1 (10% Clone 1 and 2% Clone 2 with 0 error rate) with read lengths of 250, 150, 100 and 75 bp. MiTCR, MiXCR, LymAnalyzer, IMonitor and Trig showed better performance allowing the detection of Clones 1 and 2 in three of the aforementioned read lengths. Figure 3. View largeDownload slide Number of reads detected for Clones 1 and 2 in sample DB1 (10% Clone 1 and 2% Clone 2 with 0 error rate) with read lengths of 250, 150, 100 and 75 bp. MiTCR, MiXCR, LymAnalyzer, IMonitor and Trig showed better performance allowing the detection of Clones 1 and 2 in three of the aforementioned read lengths. Analysis of experimental samples To estimate the accuracy of each method within biological samples, we generated four different samples with known clonalities by spiking-in Jurkat and CCRF-CEM clones in polyclonal T cells (Table 3). However, a disadvantage in analyzing T- and B-cell mRNA instead of DNA is that the expression levels of the receptor transcripts can diverge from 1:1 proportionality ratio depending on the cell type or its activation status. Therefore, we additionally performed qPCR (Supplementary Table S3) to independently estimate the proportion biases of the clonal transcripts. TCR beta chains were analyzed by RACE PCR using UMI to enable UMI-based studies using RTCR and MiGEC-MiXCR tools along with non-UMI analysis (Figure 4). We could not apply UMI Decombinator scripts on our samples, as the unpublished version of Decombinator is designed for a particular UMI protocol that was not compatible with the standard UMI structure of our reads, despite our attempt to test different UMI structures. Trig produced shallow sequence detection for CDR3 regions in all experimental samples; however, in the case of in silico data sets, it performed well. In Samples 1 and 2 (Figure 4A and B, respectively), IMSEQ and Trig showed the highest deviations from the expected output. It is worthwhile to mention that IMSEQ also allows barcode-based analysis, where the length of the barcode string in the prefix of the read is required. However, the analysis of our experimental samples by using these parameters showed a strong underestimation of clonal percentages, and the corresponding UMI analysis is not discussed. In Sample 3, IMSEQ maintained the similar behavior (Figure 4C). In the case of Sample 4, Trig and MiGEC-MiXCR with UMI analysis did not detect Clone 1 (Figure 4D). Besides, the number of reads detected for each clone is represented in Supplementary Figure S6. Figure 4. View largeDownload slide Percentage of Jurkat and CCRF-CEM clones detected by the different methods in four experimental samples with controlled clonalities. Estimated by qPCR (A) Sample 1 with Clone 1: 17.80% and Clone 2: 1.32%, (B) Sample 2 with Clone 1: 1.54% and Clone 2: 1.57%, (C) Sample 3 with Clone 1: 0.17% and Clone 2: 1.96% and (D) Sample 4 Clone 1: 0.03% and Clone 2: 2.26%. Black and gray bars represent the expected percentages of Clones 1 and 2, respectively, estimated by qPCR relative quantification. Figure 4. View largeDownload slide Percentage of Jurkat and CCRF-CEM clones detected by the different methods in four experimental samples with controlled clonalities. Estimated by qPCR (A) Sample 1 with Clone 1: 17.80% and Clone 2: 1.32%, (B) Sample 2 with Clone 1: 1.54% and Clone 2: 1.57%, (C) Sample 3 with Clone 1: 0.17% and Clone 2: 1.96% and (D) Sample 4 Clone 1: 0.03% and Clone 2: 2.26%. Black and gray bars represent the expected percentages of Clones 1 and 2, respectively, estimated by qPCR relative quantification. The clonal status of all analyzed samples was projected onto the clonal plane to compare each tools' performance in recognizing the sample clonality. In general, we expected a definite set of values in the clonality components, richness and evenness, from Sample 1 to Sample 4. The GR of our samples is driven by the T-cell amount and is not known a priori. As the amount of the dominant clones was evaluated in all samples via qPCR, we could determine the expected GE value (Renyi Entropy with α = +∞; see Supplementary Section S2 and Supplementary Table S3). It is possible to group the tools response by the position of the sample points on the clonal plane (Figure 5). In all samples, RTCR, MiXCR, MiTCR, Decombinator and IMonitor (Set 1) exhibit a remarkably similar clonality reconstruction. TCRklass and LymAnalyzer (Set 2) group together and merge with Set 1 in Samples 2, 3 and 4. The two methods using UMI, MiGEC-MiXCR-UMI and RTCR-UMI (Set 3) reduce the detected GR and increase the GE. This modification of the returned clonality in respect with the no UMI version (MiXCR and RTCR) is larger for MiGEC-MiXCR-UMI than for RTCR-UMI showing that the two tools are using this additional information differently. Vidjil returns a GR that is in line with the two previous tools without using the UMI but seems to underestimate the GE in all the samples. Trig and IMSEQ are the two methods showing a manifest deviation both in GR and GE. Trig decays in all the samples, whereas IMSEQ returns low values of GE in Samples 1 and 2. In Sample 3, IMSEQ showed an apparently reasonable estimation of GE, but it is because of an aberrant detection of the Jurkat clone instead of the CCRF-CEM clone (Figure 4C). Figure 5. View largeDownload slide RACE-PCR experimental samples are presented on clonal planes. Estimated by qPCR (A) Sample 1 with Clone 1: 17.80% and Clone 2: 1.32%, (B) Sample 2 with Clone 1: 1.54% and Clone 2: 1.57%, (C) Sample 3 with Clone 1: 0.17% and Clone 2: 1.96% and (D) Sample 4 Clone 1: 0.03% and Clone 2: 2.26%. In general, UMI-based analysis of RTCR and MiGEC-MiXCR along with Vidjil exhibited a decreased GR and other tools, except IMSEQ and Trig, shared similar trend with a higher GR. Horizontal gray line represents the expected GE obtained from the qPCR data. The vertical distance from this line indicates the deviation of the detected GE from the expected value. Figure 5. View largeDownload slide RACE-PCR experimental samples are presented on clonal planes. Estimated by qPCR (A) Sample 1 with Clone 1: 17.80% and Clone 2: 1.32%, (B) Sample 2 with Clone 1: 1.54% and Clone 2: 1.57%, (C) Sample 3 with Clone 1: 0.17% and Clone 2: 1.96% and (D) Sample 4 Clone 1: 0.03% and Clone 2: 2.26%. In general, UMI-based analysis of RTCR and MiGEC-MiXCR along with Vidjil exhibited a decreased GR and other tools, except IMSEQ and Trig, shared similar trend with a higher GR. Horizontal gray line represents the expected GE obtained from the qPCR data. The vertical distance from this line indicates the deviation of the detected GE from the expected value. Overall, Set 3 is capturing the sample richness and evenness well. Vidjil underestimates the GE in all the samples but is in line with Set 3 in detecting the GR. Sets 1 and 2 overestimate the clonal abundance but reasonably reconstruct the samples evenness. IMSEQ and Trig deviate widely from the other tools and from the expected GE. However, as previously mentioned Trig overall detected an insufficient number of CDR3 sequences that probably can result in this difference. Runtime efficiency To estimate the runtime performance of tools on varying data sets, we have compared the time consumption by each tool to analyze the in silico samples. Each sample comprised 1.0 M reads depicting an increase in clonality, i.e. DA1, DB1 and DC1 (Figure 6). Figure 6. View largeDownload slide Comparison of time consumption by tools in analyzing in silico data sets with different clonalities DA1, DB1 and DC1 comprising each 1.0 M reads. MiTCR is the most time efficient, and Vidjil is computationally most expensive among all tools, followed by TCRklass, LymAnalyzer and IMonitor. The red bar in Vidjil represents the additional time taken to generate all the translated amino acid CDR3 sequences, and by default, it generates only first 100 sequences. Figure 6. View largeDownload slide Comparison of time consumption by tools in analyzing in silico data sets with different clonalities DA1, DB1 and DC1 comprising each 1.0 M reads. MiTCR is the most time efficient, and Vidjil is computationally most expensive among all tools, followed by TCRklass, LymAnalyzer and IMonitor. The red bar in Vidjil represents the additional time taken to generate all the translated amino acid CDR3 sequences, and by default, it generates only first 100 sequences. MiTCR showed the smallest execution time followed by IMSEQ, Decombinator and MiXCR. Overall LymAnalyzer, IMonitor, RTCR and TCRklass time consumption was comparable, although TCRklass and IMonitor were slightly more computationally expensive. In the case of Vidjil, time efficiency was measured in two situations, once with CDR3 generation for the first 100 clones (recommended by the tool’s authors) and later for the CDR3 generation for all detected clones that required large processing time. Interestingly, MiTCR, MiXCR, IMonitor, Vidjil, Trig and IMSEQ analysis times slightly correlate with sample clonality. LymAnalyzer, Decombinator and RTCR depicted a more linear trend between the read number and the time consumed. Certain tools including MiTCR, MiXCR, RTCR, Trig, IMonitor and IMSEQ also allow partial parallel processing options. Time consideration is a key factor in selecting the analysis tool when a large set of samples has to be analyzed. Conclusion We have carried out a comparative study of 10 tools for TCR-sequencing data analysis by using various data sets to guide users in the selection of an appropriate analysis method. Our study demonstrates that, in general, the analysis results are profoundly affected by the choice of the method, and no single tool maintains the best efficiency in all cases. In general, the in silico data analysis revealed that RTCR, MiXCR and MiTCR seem a solid choice for analysis followed by LymAnalyzer, TCRklass, IMonitor, Vidjil and Trig. These tools showed a better accuracy on the in silico data sets for clonal composition and clonal plane analysis. The first three tools seem to be less affected in assigning different nucleotide strings, arising from the same sequence because of random mutations, to the same CDR3 peptide. However, MiTCR can be used only with built-in reference genome database of human and mouse sequences, constraining the researchers to a frozen version of reference genome for these two species. RTCR was more consistent in all error states, followed by MiXCR, MiTCR and Vidjil. LymAnalyzer and Trig lack a well-defined quality control description. The clonal plane analysis and the determination of specific clonal percentages in experimental samples revealed that the Trig and IMSEQ tools exhibit a significant deviation from the other programs output and from the expected values. However, it is equally important to mention that we have performed these comparative studies with default settings, wherever feasible. Therefore, it is possible that the optimization of available configurable parameters in each tool can potentially enhance its performance. Interestingly, LymAnalyzer, Vidjil and TCRklass tools reported the same read/CDR3 sequence as different clonotypes with respective frequency values. In Vidjil and LymAnalyzer, the same CDR3 sequence describes as different clonotype also had same V and J segments, whereas in case of TCRklass, two clones with same CDR3 sequence were regarded as different clonotypes only when they mapped to different V and J segments. This analysis behavior was more pronounced in case of monoclonal and oligoclonal samples, whereas it was not observed in highly polyclonal samples. Our study reveals that all tools under investigation are suitable for 250 and 150 bp (except IMSEQ) read lengths. However, RTCR, TCRklass, IMSEQ and Decombinator did not successfully execute for 100 bp reads. In general, a decrease in the read length from 250 to 150 bp results in a concomitant sharp decrease in number of detected reads by each method. This observation affects all the tools in this study and is essential during the preparation of the samples. Particular care should be taken in increasing the average length of the TCR fragments, in the sequenced reads, >200–250 bp. This precaution dramatically improves the fraction of reads that are used by the tools to reconstruct the TCR clonal repertoire. The output of each tool also varies in terms of information generated per detected sequence and additional supporting or intermediate files produced. TCRklass provides useful statistics and VJ profiling options like IMonitor. However, MiXCR tool followed by RTCR and Vidjil provides a more detailed output for detected clonotypes. IMonitor does not output the combined clustered result file with CDR3 and annotated VJ gene segments, which is a part of the standard output of all other tools. Trig does not generate clustered file and CDR3 amino acid sequences; Decombinator output is less comprehensive, as it does not provide clear information about VJ segments; however, it is an unpublished version and still under development. All these tools also vary greatly in terms of the accompanied documentation, available user configurable parameters and input data file formats, which are also crucial factors in tools selection. MiXCR provides the most detailed documentation and control over analysis by providing users a number of adjustable parameters, whereas in contrast, LymAnalyzer and Trig provide the least options for user customizability. MiXCR (along with MiGEC) and RTCR provide the added advantage of UMI-based analysis, which can further enhance the accuracy of analysis. In general, by taking into account all the different factors and results of our comparison, MiXCR and RTCR tools prove to be the most balanced in term of flexibility, usability and accuracy. Key Points Several tools have been recently developed for large-scale TCR analysis. However, no clear comparison is available to assist in choosing an optimal analysis approach. Tool selection has a huge impact on analysis results. The computational TCR analysis field still has space for improvements, as among currently available state-of-the-art tools, no single tool showed the best and consistent performance in all kinds of analyses. However, new versions of tools are constantly being released, thus showing that the field is under active development. Overall our comparison shows that MiXCR and RTCR are the optimal choice among all available methods for TCR analysis followed by MiTCR, Vidjil, TCRklass and LymAnalyzer. Advancement in the experimental protocols by using UMI has the potential to further improve the accuracy of TCR repertoire analysis, specifically in samples with enhanced complexity. Supplementary Data Supplementary data are available online at http://bib.oxfordjournals.org/. Funding This work was supported by the NCT 3.0 funding programs NCT3.0_2015.13 ImmunOmics and NCT3.0_2015.2 SPL/RP. Saira Afzal is a postdoctoral researcher at German Cancer Research Center. She is doing research in area of bioinformatics with focus on computational tools development and large-scale data analysis. Irene Gil-Farina is a postdoctoral researcher at German Cancer Research Center. Her research is mainly focused on gene therapy and immunology. Richard Gabriel is a postdoctoral researcher at German Cancer Research Center. He is doing research mainly in gene therapy. Shahzad Ahmad was a Master’s student at German Cancer Research Center. His research interests include bioinformatics tool development and omics data analysis. Christof von Kalle is one of the directors of National Center of Tumor Diseases and is leading the research division of Translational Oncology at German Cancer Research Center. Manfred Schmidt is a group leader at German Cancer Research Center. He has research expertise in wide range of areas including molecular and gene therapy. Raffaele Fronza is a postdoctoral researcher at German Cancer Research Center. His major research interests include algorithm development, computational analysis and system biology. Acknowledgement We would like to acknowledge that Polyclonal Monoclonal diversity index and code to visualize clonal plane was a part of original Master thesis (Life Science Informatics) of Ahmad, S., submitted to Bonn-Aachen International Center for Information Technology, Bonn Germany. References 1 Arstila TP, Casrouge A, Baron V, et al.   A direct estimate of the human αβ T cell receptor diversity. Science  1999; 286: 958– 61. Google Scholar CrossRef Search ADS PubMed  2 Douek DC, Betts MR, Brenchley JM, et al.   A novel approach to the analysis of specificity, clonality, and frequency of HIV-specific T cell responses reveals a potential mechanism for control of viral escape. J Immunol  2002; 168: 3099 LP– 104. Google Scholar CrossRef Search ADS   3 Wang C, Sanders CM, Yang Q, et al.   High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets. Proc Natl Acad Sci USA  2010; 107: 1518– 23. Google Scholar CrossRef Search ADS PubMed  4 Schatz DG, Ji Y. Recombination centres and the orchestration of V(D)J recombination. Nat Rev Immunol  2011; 11: 251– 63. Google Scholar CrossRef Search ADS PubMed  5 Dziubianau M, Hecht J, Kuchenbecker L, et al.   TCR repertoire analysis by next generation sequencing allows complex differential diagnosis of T cell-related pathology. Am J Transplant  2013; 13: 2842– 54. Google Scholar CrossRef Search ADS PubMed  6 Tumeh PC, Harview CL, Yearley JH, et al.   PD-1 blockade induces responses by inhibiting adaptive immune resistance. Nature  2014; 515: 568– 71. Google Scholar CrossRef Search ADS PubMed  7 Cha E, Klinger M, Hou Y, et al.   Improved survival with T cell clonotype stability after anti-CTLA-4 treatment in cancer patients. Sci Transl Med  2014; 6: 238ra70. Google Scholar CrossRef Search ADS PubMed  8 Snyder A, Nathanson T, Funt SA, et al.   Contribution of systemic and somatic factors to clinical response and resistance to PD-L1 blockade in urothelial cancer: an exploratory multi-omic analysis. PLoS Med  2017; 14: e1002309. Google Scholar CrossRef Search ADS PubMed  9 Jackson KJL, Liu Y, Roskin KM, et al.   Human responses to influenza vaccination show seroconversion signatures and convergent antibody rearrangements. Cell Host Microbe  2014; 16: 105– 14. Google Scholar CrossRef Search ADS PubMed  10 Clemente MJ, Przychodzen B, Jerez A, et al.   Deep sequencing of the T-cell receptor repertoire in CD8+ T-large granular lymphocyte leukemia identifies signature landscapes. Blood  2013; 122: 4077 LP– 85. Google Scholar CrossRef Search ADS   11 Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet  2008; 9: 387– 402. Google Scholar CrossRef Search ADS PubMed  12 Benati D, Galperin M, Lambotte O, et al.   Public T cell receptors confer high-avidity CD4 responses to HIV controllers. J Clin Invest  2016; 126: 2093– 108. Google Scholar CrossRef Search ADS PubMed  13 Birnbaum ME, Mendoza JL, Sethi DK, et al.   Deconstructing the peptide-MHC specificity of T cell recognition. Cell  2017; 157: 1073– 87. Google Scholar CrossRef Search ADS   14 Van Den Berg HA, Molina-París C, Sewell AK. Specific T-cell activation in an unspecific T-cell repertoire. Sci Prog  2011; 94: 245– 64. Google Scholar CrossRef Search ADS PubMed  15 Heather JM, Ismail M, Oakes T, et al.   High-throughput sequencing of the T-cell receptor repertoire: pitfalls and opportunities. Brief Bioinform  2017, doi: 10.1093/bib/bbx138. 16 Bolotin DA, Shugay M, Mamedov IZ, et al.   MiTCR: software for T-cell receptor sequencing data analysis. Nat Methods  2013; 10: 813– 4. Google Scholar CrossRef Search ADS PubMed  17 Shugay M, Britanova OV, Merzlyak EM, et al.   Towards error-free profiling of immune repertoires. Nat Methods  2014; 11: 653– 5. Google Scholar CrossRef Search ADS PubMed  18 Bolotin DA, Poslavsky S, Mitrophanov I, et al.   MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods  2015; 12: 380– 1. Google Scholar CrossRef Search ADS PubMed  19 Yu Y, Ceredig R, Seoighe C. LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins. Nucleic Acids Res  2015; 44: e31. Google Scholar CrossRef Search ADS PubMed  20 Kuchenbecker L, Nienen M, Hecht J, et al.   IMSEQ—a fast and error aware approach to immunogenetic sequence analysis. Bioinformatics  2015; 31: 2963– 71. Google Scholar CrossRef Search ADS PubMed  21 Yang X, Liu D, Lv N, et al.   TCRklass: a new K-string-based algorithm for human and mouse TCR repertoire characterization. J Immunol  2015; 194: 446– 54. Google Scholar CrossRef Search ADS PubMed  22 Zhang W, Du Y, Su Z, et al.   Imonitor: a robust pipeline for TCR and BCR repertoire analysis. Genetics  2015; 201: 459– 72. Google Scholar CrossRef Search ADS PubMed  23 Gerritsen B, Pandit A, Andeweg AC, et al.   RTCR: a pipeline for complete and accurate recovery of T cell repertoires from high throughput sequencing data. Bioinformatics  2016; 32: 3098– 106. Google Scholar CrossRef Search ADS PubMed  24 Duez M, Giraud M, Herbert R, et al.   Vidjil: a web platform for analysis of high-throughput repertoire sequencing. PLoS One  2016; 11: e0166126. Google Scholar CrossRef Search ADS PubMed  25 Thomas N, Heather J, Ndifon W, et al.   Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine. Bioinformatics  2013; 29: 542– 50. Google Scholar CrossRef Search ADS PubMed  26 Hung S-J, Chen Y-L, Chu C-H, et al.   TRIg: a robust alignment pipeline for non-regular T-cell receptor and immunoglobulin sequences. BMC Bioinformatics  2016; 17: 433. Google Scholar CrossRef Search ADS PubMed  27 Brochet X, Lefranc MP, Giudicelli V. IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res  2008; 36: 503– 8. Google Scholar CrossRef Search ADS   28 Ye J, Ma N, Madden TL, et al.   IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res  2013; 41: 34– 40. Google Scholar CrossRef Search ADS   29 Hurlbert SH. The nonconcept of species diversity: a critique and alternative parameters. Ecology  1971; 52: 577– 86. Google Scholar CrossRef Search ADS   30 Efron B, Thisted R. Estimating the number of unseen species: how many words did Shakespeare know? Biometrika  1976; 63: 435– 47. 31 Chao A. Nonparametric estimation of the number of classes in a population. Scand J Stat  1984; 30: 265– 70. 32 Shannon CE, Weaver W. The Mathematical Theory of Communication . Urbana: University of Illinois Press, 1949. 33 Simpson EH. Measurement of diversity. Nature  1949; 163: 688. Google Scholar CrossRef Search ADS   34 Nagendra H. Opposite trends in response for the Shannon and Simpson indices of landscape diversity. Appl Geogr  2002; 22: 175– 86. Google Scholar CrossRef Search ADS   35 Grice EA, Kong HH, Renaud G, et al.   A diversity profile of the human skin microbiota. Genome Res  2008; 18: 1043– 50. Google Scholar CrossRef Search ADS PubMed  36 Peet R. The measurements of species diversity. Ann Rev Eeal Syst  1974; 5: 285– 307. Google Scholar CrossRef Search ADS   37 Tóthmérész B. Comparison of different methods for diversity ordering. J Veg Sci  1995; 6: 283– 90. Google Scholar CrossRef Search ADS   38 Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol  1981; 147: 195– 7. Google Scholar CrossRef Search ADS PubMed  39 Camacho C, Coulouris G, Avagyan V, et al.   BLAST plus: architecture and applications. BMC Bioinformatics  2009; 10: 241. Google Scholar CrossRef Search ADS PubMed  40 Liao Y, Smyth GK, Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res  2013; 41: 41 Rasmussen KR, Stoye J, Myers EW, et al.   Efficient q-gram filters for finding all ε-matches over a given length. J Comput Biol  2006; 13: 296– 308. Google Scholar CrossRef Search ADS PubMed  42 Myers G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J ACM  1999; 46: 395– 415. Google Scholar CrossRef Search ADS   43 Ye J, McGinnis S, Madden TL. BLAST: improvements for better sequence analysis. Nucleic Acids Res  2006; 34: 6– 9. Google Scholar CrossRef Search ADS   44 Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods  2012; 9: 357– 9. Google Scholar CrossRef Search ADS PubMed  45 Safonova Y, Lapidus A, Lill J. IgSimulator: a versatile immunosequencing simulator. Bioinformatics  2015; 31: 3213– 5. Google Scholar CrossRef Search ADS PubMed  46 Lefranc MP, Giudicelli V, Ginestoux C, et al.   IMGT®, the international ImMunoGeneTics information system®. Nucleic Acids Res  2009; 37: 1006– 12. Google Scholar CrossRef Search ADS   47 Wgsim. https://github.com/lh3/wgsim. 48 Ruggiero E, Nicolay JP, Fronza R, et al.   High-resolution analysis of the human T-cell receptor repertoire. Nat Commun  2015; 6: 8081. Google Scholar CrossRef Search ADS PubMed  49 Gupta NT, Vander Heiden JA, Uduman M, et al.   Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics  2015; 31: 3356– 8. Google Scholar CrossRef Search ADS PubMed  © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Journal

Briefings in BioinformaticsOxford University Press

Published: Sep 23, 2017

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off