TY - JOUR
AU - Saier, Milton H
AU - Reddy, Vamsee S
AU - Moreno-Hagelsieb, Gabriel
AU - Hendargo, Kevin J
AU - Zhang, Yichi
AU - Iddamsetty, Vasu
AU - Lam, Katie Jing Kay
AU - Tian, Nuo
AU - Russum, Steven
AU - Wang, Jianing
AU - Medrano-Soto, Arturo
AB - Abstract

The Transporter Classification Database (TCDB; tcdb.org) is a freely accessible reference resource, which provides functional, structural, mechanistic, medical and biotechnological information about transporters from organisms of all types. TCDB is the only transport protein classification database adopted by the International Union of Biochemistry and Molecular Biology (IUBMB) and now (October 1, 2020) consists of 20 653 proteins classified in 15 528 non-redundant transport systems with 1567 tabulated 3D structures, 18 336 reference citations describing 1536 transporter families, of which 26% are members of 82 recognized superfamilies. Overall, this is an increase of over 50% since the last published update of the database in 2016. This comprehensive update of the database contents and features includes (i) adoption of a chemical ontology for substrates of transporters, (ii) inclusion of new superfamilies, (iii) a domain-based characterization of transporter families for the identification of new members as well as functional and evolutionary relationships between families, (iv) development of novel software to facilitate curation and use of the database, (v) addition of new subclasses of transport systems including 11 novel types of channels and 3 types of group translocators and (vi) the inclusion of many man-made (artificial) transmembrane pores/channels and carriers.

INTRODUCTION

Membrane transporters constitute a diverse group of proteins that form intricate networks of channels, carriers, pumps, group translocators and electron flow carriers that determine the molecular compositions and energy status of cells (1).
These proteins, which account for ∼10% of all cellular proteins, transfer nutrients, end products of metabolism, toxic substances, macromolecules, signalling molecules, drugs, electrons, etc., from source to sink, resulting in the cellular uptake and extrusion of compounds and energy sources (2). Of particular importance to the fields of oncology, microbial pathogenesis and virology, drug efflux pumps play a dominant role in drug resistance both in pathogenic organisms and in cancer cells (3,4). Thousands of researchers worldwide contribute to the collective understanding of molecular transport across cellular membranes (5). TCDB (tcdb.org) is accessed over a million times per year by about 100 000 different users.

In June 2001, the International Union of Biochemistry and Molecular Biology (IUBMB) formally adopted the Transporter Classification (TC) system as the only internationally recognized system for the organization of transport protein information derived from all organisms in the Earth's biosphere (6,7). Metagenomic sequencing and progress in computational biology over the last 5 years have led to the discovery of dozens of novel phyla of nanobacteria and archaea. TCDB has accordingly expanded to include potential families of transporters found only in these novel phyla as well as proteins distantly related to members of pre-existing families. All transport systems are classified from evolutionary, structural and functional standpoints according to the original TCDB design (2,8). TCDB is not intended to be a collection of all transport-related protein sequences; instead, the ultimate goal is to identify and classify representative proteins of all transporter families in nature.

Earlier versions of the TC database (TCDB) have been described in previous publications in Nucleic Acids Research (6,9–11). TCDB currently provides access to data published in over 18 300 research papers.
This information is integrated into descriptions and hierarchical structures within TCDB. The database now contains 15 528 single- or multi-component transport systems from all kinds of living organisms, and for 1567 of them, accessions for high resolution 3D structures in PDB (12) are available. These systems are classified into 1536 transporter families based on their phylogenies and functions. Many of these families have been found to be distantly related and are now classified into 82 superfamilies (13). This represents an overall growth of over 50% relative to our 2016 report (9). TCDB is continually updated as our software becomes more refined and as novel published research regarding transport systems becomes available. We hope these advances will increase the utility of the database for the international scientific community.

RECENT DEVELOPMENTS

Database expansion due to novel genomic and metagenomic analyses

Since the publication of our last update (9), we have carried out multiple genome and metagenome comparative analyses that have allowed the discovery of numerous novel transporter families derived from a variety of organismal types. Most, but not all, of these are of prokaryotic origin, as progress in the discovery of novel phyla of bacteria and archaea has revolutionized our concept of the tree of life. We have carried out studies to identify transporters in many groups of pathogenic and probiotic bacteria. These include numerous pathogenic, symbiotic and free-living species of spirochetes such as (a) Leptospira (14) and (b) Treponema (15); (c) species of Bdellovibrio that kill and eat other Gram-negative bacteria, either from within (intracellular) or from without (extracellular) the prey bacteria (16); (d) probiotic and pathogenic strains of Escherichia coli (17); and (e) probiotic and pathogenic Bacteroides species (18).
In addition, we have ongoing projects on probiotic and pathogenic Lactobacillus species, seven species of sulfur-reducing proteobacteria, four different phyla of the recently discovered ‘Candidate Phyla Radiation’ (CPR) nanobacteria (19,20), and four phyla of the newly discovered Asgard archaeal superphylum, presumably the closest prokaryotic relatives of Eukarya (21). These analyses have resulted in 552 new proteins from metagenomes being added to TCDB, organized in 189 relatively small families. These proteins are representative of at least 100 000 non-redundant homologs in NCBI (BLAST E-value < 10⁻⁶, coverage >70%, and <90% sequence redundancy). Of the 189 families with metagenomic members, 119 (62%) involve transporters of unknown mechanism of action (TC subclass 9.A) and putative transporters (TC subclass 9.B), including 17 families that seem to be exclusive to the newly discovered CPR and Asgard superphyla. Finally, the genomic analyses of some less well characterized eukaryotes such as red algae (22), as well as the plethora of experimental findings reported by many research groups worldwide, have allowed tremendous expansion of TCDB to include both well characterized and uncharacterized (putative) transport systems.

In our laboratory, transportome characterization requires the separate analysis of single- and multiple-component transport systems based on the knowledge available in TCDB. Furthermore, each newly sequenced (meta)genome provides the opportunity to identify potential transporters with little or no similarity to known transporters. Incorporation of these proteins into TCDB will continue to increase the coverage and sequence diversity in the database. We have developed a number of programs to tackle these challenges (see Table 1 and Supplementary File 1).

Table 1. Programs developed for this TCDB update.
All programs running in the UNIX shell can be downloaded from the laboratory's software repository (https://github.com/SaierLaboratory).

Program (a,b,c) | Description
Length Statistics Tool | Plots the distribution of protein lengths across any TC class, superfamily or family (tcdb.org/progs/?tool=lens).
Disease Explorer Tool | Searches TCDB for disease-related transport systems (tcdb.org/progs/index.php?tool=disease).
Re-entrant Loop Finder Tool | Identifies re-entrant P-loops within transporters. This online tool is available in every page describing transport systems in TCDB.
getSubstrates | Generates a tab-delimited table with the ChEBI ID and names of the substrates annotated for all systems in TCDB (tcdb.org/cgi-bin/substrates/getSubstrates.py).
listSuperfamilies | Generates a tab-delimited table with the superfamily assigned to each system, subfamily and family in TCDB (tcdb.org/cgi-bin/substrates/listSuperfamilies.py).
mkProteinClusters | Generates a protein similarity tree based on BLAST or Smith-Waterman bit scores (30).
extractFamily | Downloads protein sequences from TCDB in various formats (30).
famXpander | Extracts non-redundant homologs from NCBI for a given TCDB family (30).
areFamiliesHomologous | Applies the transitivity property of homology to test whether two families are homologous and if they can be expanded into a superfamily (32).
quod | Generates highly customizable hydropathy plots (32).
hvordan | Generates graphical reports for each significant hit produced by areFamiliesHomologous (32).
getDomainTopology | Identifies conserved Pfam domains within a family of transporters (32).
tmsRepeat | Identifies repeat units of TMSs in transporter sequences (32).
deuterocol | Identifies homology between families by comparing bundles of transmembranal α-helices in 3D structures (32).
findDistFamilyMembers | Scans NCBI for distant members of a given TCDB family.
singEasy | Extracts all hits with single component systems in TCDB from GBlast results and classifies them into two groups, based on the reliability of the matches.
getMultiCompSystems | Exhaustive search of full genomes for matches against multicomponent systems in TCDB. This provides the raw data for programs that decide which systems are complete.
YutanpaNet | Network-based approach to infer complete multicomponent systems present in (meta)genomes based on the output of getMultiCompSystems.
showSubnet | Given the output of YutanpaNet, it generates a web-based graphical layout of the subnetwork for an input list of user-specified multicomponent systems.
searchMissComponents | Downloads homologs from NCBI matching user-specified functional keywords of missing target components and compares them against the query genome.
searchPseudogenes | Searches for missing components at the DNA level in the query genome (i.e. sequencing errors or pseudogenes).
getOrthologs | Compares full proteomes and infers orthologs as reciprocal best hits.
matchDomains | Compares protein sequences against databases of domains (i.e. Pfam, CDD, TIGRFAMs).
cleanDomains | Parses and filters the results of program matchDomains.
findNovelTrasporters | Identifies potential transporters within a genome that show little or no sequence similarity to any transporter in TCDB.
prepNewIMPs4TCDBupload | Extracts non-redundant homologs for each protein identified by findNovelTrasporters in preparation for upload to TCDB.
(a) Four additional programs developed during the previous TCDB update were also extended and optimized (i.e. GBlast, tmsplit, AveHAS, WHAT).
(b) Programs were developed by Kevin Hendargo, Pranav Iddamsetty, Yichi Zhang, Vamsee Reddy, Gabriel Moreno-Hagelsieb and Arturo Medrano-Soto.
(c) For more details on each program see Supplementary File 1.

Family and superfamily expansion

A major effort in our bioinformatics laboratory has been to develop software that allows the identification of distant relationships between families with the concomitant creation of superfamilies (Table 1 and Supplementary File 1). In TCDB, the operational definitions of family and superfamily have become more robust over time following the development of bioinformatics. A family is defined as a set of homologous proteins with common sequence, structural and functional attributes (e.g., topology, domains, motifs and folds). A superfamily, on the other hand, is a group of distantly related families that may exhibit topological and functional differences, but for which interfamily comparisons reveal lower (but reliable) levels of sequence and structural similarity. Superfamilies can thus be created even when no single member has been functionally characterized.

We have identified many new superfamilies and have expanded others that we had previously identified. We had previously defined the LysE Superfamily with three families (23), but new work has revealed that it includes eleven families (24), each of which is specific for a different class of compounds. For example, the LysE family (TC# 2.A.75) exports basic amino acids while the RhtB family (TC# 2.A.76) exports small neutral amino acids, and the NAAT family (TC# 2.A.95) takes up neutral amino acids.
However, four newly identified families of the LysE superfamily include members that transport a variety of inorganic ions: Cd²⁺ (CadD family; TC# 2.A.77), Mn²⁺ (Mtp family; TC# 2.A.107), Fe²⁺/Pb²⁺ (ILT family; TC# 2.A.108), or Ni²⁺/Co²⁺ (NicO family; TC# 2.A.113), depending on the family. The GAP family (TC# 2.A.116) seems to be specific for peptidoglycolipids which are exported to the cell walls of mycobacterial species, although only one member has been characterized (25). However, this family has members from a wide range of bacteria and archaea and will probably prove to export different lipid types and possibly drugs. Surprisingly, still another member of this superfamily is the well characterized DsbD family (TC# 5.A.1) that exports electron pairs from a cytoplasmic electron donor to extracellular disulfide-containing proteins (26,27).

The Tetraspan Junctional Complex (4JC) Superfamily is almost exclusively eukaryotic, with only one of the 15 member families being derived from bacteria (28). These proteins are the principal constituents of tight and gap junctions in animals, but some of these families are represented only in fungi while others have representation in a broader group of eukaryotes including viruses. Some form simple one-constituent endomembrane channels rather than multiprotein junctions.

The Membrane Attack Complex/Perforin (MACPF) superfamily includes three families that are derived exclusively from one of the three domains of life or are found more broadly in more than one (29). They can be part of the immune-related complement system of animals that combat infections by pathogenic microbes, or they can be of bacterial origin and cause diseases in animals.

The Anoctamin Superfamily includes seven eukaryotic families of calcium-activated cation or anion channels as well as lipid scramblases (30).
Members of the first five families include proteins that are functionally characterized, some of which can both transport ions and scramble lipids, but the last two families have not yet been characterized except from bioinformatic standpoints. Presumably, the proteins of this superfamily do not have a prokaryotic ancestor; they may have evolved in eukaryotes. Novel mechanistic inferences, based on structure, have recently been forthcoming (31).

The Transporter/Opsin/G protein-coupled receptor (TOG) Superfamily has recently been expanded to include twelve families, and surprisingly, not all of the member families are typical transporters (32). They include light-dependent integral membrane chaperone proteins, photoreceptors and G-protein receptors. However, some of the photoreceptors can be mutated to transport cations, and some have a central channel which is blocked to prevent transport. A number of different G-protein receptors have been shown to flip lipids from one side of a bilayer to the other, and another transports cholesterol through a pore-type mechanism (33–35).

In 1993, Marger and Saier (36) described the Major Facilitator Superfamily (MFS) which then included five previously recognized families. Since then, we have published several papers that have substantially expanded this superfamily, so that it is now the largest transporter superfamily known, with over one hundred distinct families (37). It is even larger than the prominent ATP-binding Cassette (ABC) superfamily of uptake/efflux porters, which is polyphyletic (38). Until recently, all MFS family members were believed to be transporters. They utilize the proton motive force, not ATP hydrolysis, to energize transport. It has been known for some time that a few MFS carriers have gained receptor functions, either together with, or instead of, their transport functions.
However, with more sensitive methods, we were able to identify new, more distantly related families that do not have known transport functions (37). These include (a) the lysyl tRNA synthetase family (LysS; TC# 9.B.111); (b) lysyl phosphatidyl glycerol synthases (MprF; TC# 4.H.1), which have been reported to flip their lipid products from the inside to the outside of the cell (39); (c) eukaryotic cytochrome b561 (Cytb561; TC# 5.B.2); (d) a poorly characterized family of proteins somehow involved in fungal conidiation and conidia germination (CCGP; TC# 9.B.57); and (e) an uncharacterized family in which all members contain domains of unknown function (DUF1275; TC# 9.B.143). Not all of these proteins are homologous to known MFS members throughout their entire lengths, but they at least share the basic repeat unit of six α-helical transmembrane segments (TMSs) of the MFS domain. These observations provide interesting clues as to how proteins of similar domain structures evolve to serve very different biochemical functions (37).

Holins are phage or prokaryotic proteins that form ‘holes’ in cytoplasmic membranes to allow uptake of protons or release of proteins, and even to facilitate phage particle release. There are 58 holin families in TCDB, and some of them are large with multiple subfamilies. We have identified seven holin superfamilies, each with up to seven families (40,41).

There is a total of 76 outer membrane porin families in TCDB, and they are from Gram-negative bacteria, actinobacteria and eukaryotic organelles. Fifty-six of the porin families (74%) fall into a single superfamily (OMPP-I). Within this superfamily, one family derives members from eukaryotic organelles, a second family functions to transfer electrons across the bacterial envelope, and several have not been functionally characterized. All others transport solutes either selectively or non-specifically. The other four superfamilies (OMPP-II–V) include only two families each (42).
Two of these last four superfamilies (OMPP-II and V) have members derived only from actinobacteria while the other two are exclusively from plastids (OMPP-III) or from various eukaryotic organelles (plastids, mitochondria and peroxisomes) (OMPP-IV).

A substrate ontology for TCDB

A recent addition to our transporter database is the comprehensive annotation of transported substrates based on the ontology of Chemical Entities of Biological Interest (ChEBI) (43). This is a formal representation of the properties of chemical entities and their relationships, thus allowing the organization of raw data into information and knowledge. As a result, each known substrate molecule or type is now associated with a ChEBI ID. Prior to this improvement, it was an arduous task to extract, for example, all transporters that utilized a carbohydrate substrate. It would also have been impossible for third party software to organize and access our database according to chemical type. We have completed a semi-automatic process of annotating over 7800 individual transport systems with a ChEBI ID that corresponds to the most selective substrates a system is known to transport. This process used natural language processing to extract chemical names from text descriptions, and all resulting annotations were manually curated for accuracy. When new transport systems are added to TCDB, the ChEBI IDs of the corresponding substrates (if available) are now manually annotated at the time of entry.

The advantage of adopting the ChEBI ontology is that each ChEBI ID exists within a hierarchy of terms with ‘parents’, ‘children’ and ‘synonyms’. The parent–children relationship is illustrated by the following example: ‘glucose’, ‘fructose’ and ‘galactose’ are ‘hexoses’, which are ‘carbohydrates’. Therefore, a search for ‘hexose’ will retrieve everything down the relational tree (e.g. glucose, fructose and galactose).
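The parent–children lookup described above can be illustrated with a minimal sketch. It assumes a toy hierarchy and made-up system labels (`system-A` to `system-D`); it does not use real ChEBI IDs, the ChEBI API, or actual TCDB annotations, and is only meant to show why a search for a parent term also retrieves systems annotated with its children.

```python
from collections import deque

# Toy is-a hierarchy (child -> parents) mirroring the example in the text.
# These are plain names, not real ChEBI identifiers.
PARENTS = {
    "glucose": ["hexose"],
    "fructose": ["hexose"],
    "galactose": ["hexose"],
    "hexose": ["carbohydrate"],
    "sucrose": ["carbohydrate"],
}

# Hypothetical annotations: system label -> annotated substrate term.
ANNOTATIONS = {
    "system-A": "glucose",
    "system-B": "galactose",
    "system-C": "sucrose",
    "system-D": "hexose",
}


def descendants(term, parents=PARENTS):
    """Return the term itself plus every term below it in the hierarchy."""
    # Invert the child -> parents map into parent -> children.
    children = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, []).append(child)
    # Breadth-first walk down the tree.
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def systems_for(term, annotations=ANNOTATIONS):
    """Return systems annotated with the term or any of its descendants."""
    wanted = descendants(term)
    return sorted(s for s, substrate in annotations.items() if substrate in wanted)


print(systems_for("hexose"))        # systems annotated with hexose or a child term
print(systems_for("carbohydrate"))  # broader query, includes the sucrose system
```

In practice the hierarchy would come from ChEBI itself and the annotations from the TCDB substrate table; both are replaced by small dictionaries here to keep the sketch self-contained.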
Substrate annotations for systems in TCDB are now available for download in tabular format via a webservice (see Table 1), which can be easily integrated by third party software. This opens the door to many potential future applications. Substrate annotations can be used for genomic analysis, functional predictions, family/superfamily characterizations, and more. For example, genomes can be studied by observing the distribution of the types of transport systems present. A distribution of imported/exported substrates or substrate-types adds a valuable dimension to the analysis. A substrate-annotated genome can even benefit metabolic modelling projects. The ability to retrieve proteins that transport a particular substrate or substrate class may allow for improved methods for the prediction of molecular function and drug targets. Common motifs may be identified and associated with substrates, or statistical models such as hidden Markov models may be trained to predict substrates of unknown transporters. Another benefit of adopting the ChEBI ontology is that users can develop their own software tools to mine TCDB information while taking advantage of the application programming interface provided by ChEBI to navigate the ontology (44).

The substrate ontology system fulfills a long-standing need in the transport biology and bioinformatics community. To the best of our knowledge, this is the first comprehensive and manually curated database of transport systems that standardizes substrate annotations by adopting a chemical ontology. In the future we plan to expand this ontology to include system modulators (e.g. activators and inhibitors).

Domain characterization of TCDB families (tcDoms)

We have started to produce tcDoms (Transporter Classification Domains), a set of Hidden Markov Models derived from families in TCDB.
These tcDoms will be a useful resource (a) to sort query proteins into their corresponding TCDB families, (b) to infer relationships between families, and thus characterize superfamilies, or, when applicable, (c) to infer the substrates and functions of putative transporters. Although many transport-related domains are currently available in Pfam (45), CDD (46) and other databases, tcDoms are specialized in cellular transport, and are designed to assist in manual curation of the database, as well as to increase the robustness of family definitions in TCDB. In the initial phase, we are focusing on families composed exclusively of single component systems. In our approach, we first performed an all-vs-all comparison of all proteins in TCDB using BLAST (47) and selected those families in which all protein members found each other. We then produced multiple alignments for these proteins using MUSCLE (48). Next, the program hmmbuild, from the HMMER software suite (49), was used to produce hidden Markov models from the multiple alignments. Finally, the performance of the resulting models was benchmarked using leave-one-out cross-validation and by checking for cross-contamination with unrelated families (manuscript in preparation). Of the 364 tcDoms currently built for 277 families, 295 overlap with 382 known Pfam models, and metagenomic proteins contribute to 66 (24%) of the families with tcDoms. Note that a one-to-one relationship between Pfam and tcDoms cannot be expected because different regions of proteins in a TCDB family can match different, nonoverlapping Pfam domains, and different tcDoms can match the same Pfam domain. Our tcDoms are meant to help distinguish members of one family from those of another family, even if they belong to the same superfamily. Thus, we expect to produce more than one tcDom per family.
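The family-selection step described above (keeping families in which all members find each other in the all-vs-all BLAST comparison) can be sketched as follows. This is only an illustration under stated assumptions: hits are taken to be already filtered (query, subject) pairs, "found each other" is read as every member retrieving every other member, and single-member families are skipped. The function name and input layout are hypothetical and do not reproduce the authors' software.

```python
from itertools import permutations


def mutually_detected_families(families, hits):
    """Return families whose members all retrieve one another.

    families: dict mapping a family ID to a list of protein IDs.
    hits: iterable of (query, subject) pairs from an all-vs-all search,
          assumed to have been filtered by score or E-value beforehand.
    """
    hit_set = set(hits)
    selected = []
    for family, members in families.items():
        # Assumption of this sketch: a single-member family has no pair
        # to check, so it is left out rather than accepted trivially.
        if len(members) < 2:
            continue
        # Require a hit for every ordered pair of distinct members.
        if all((q, s) in hit_set for q, s in permutations(members, 2)):
            selected.append(family)
    return sorted(selected)
```

Families that pass such a filter would then proceed to multiple alignment and model building, as outlined in the text.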
The number of tcDoms per family depends on their relationships with other domain collections (e.g., Pfam and CDD), and the number of distinctive characteristic domains that we can identify for each family in TCDB. The initial set of hidden Markov models in tcDoms is available for download through the TCDB web site (tcdb.org/tcDoms.php). Software tools Multiple programs have been developed and/or updated since 2016 to enhance the functionality of TCDB, mine the wealth of information stored therein, and support research projects that ultimately discover relevant data that can be added to TCDB. Table 1 presents a list of the programs developed for this update, and Supplementary File 1 gives details on each program. Most programs run in the UNIX shell, and a few are offered as web services. DISCUSSION Since 2016, the number of representative transport systems and families in TCDB has grown by over 50%. This has been possible in part because of novel metagenomic data; 36% of the new families in this update include proteins from metagenomes. Moreover, with the improved software tools and developments in the field, TCDB now provides excellent opportunities to (a) explore relationships between families of transporters (5,13,28–32,37,40–42), (b) characterize and perform comparative analyses of transportomes encoded within genomes and metagenomes (14–18,22) and (c) derive inferences of function to be followed by experimental verification (50). The characterization of transporter families in TCDB, based on shared domains (tcDoms), allows rapid evaluation of family memberships for individual proteins, identification of evolutionary relationships, and inference of substrates and molecular functions. The adoption of the ChEBI ontology allows TCDB users to consider motif/domain relationships between non-homologous transporters with similar substrates, possibly identifying novel features that determine substrate recognition.
The ontology can also be used to evaluate the distributions of substrates used not only within types of organisms, but also within the various families and superfamilies currently in TCDB. Since entirely new phyla of bacteria and archaea have been discovered within the last five years through metagenomic analyses, the new sequence data have allowed the identification of previously unrecognized putative transporter families unique to these types of organisms. The new data also allow extension of relationships between organisms (e.g., archaea and eukaryotes). While monumental progress has been made in these regards, we are aware that the field is still young and rapidly expanding. The organization and structure of TCDB allow us and other users to accommodate these additions and to ask more global questions than ever before. For example, which families in TCDB that do not belong to the same superfamily are potentially related or share regulatory domains? Of particular interest are transmembrane domains shared between established superfamilies, as they offer the possibility of grouping superfamilies at a higher level (e.g. ultra-superfamilies or super-superfamilies including two or more superfamilies). We have found evidence for such higher-order relationships, for example, between the family NicO (TC# 2.A.113), a member of the LysE superfamily, and the family NiCoT (TC# 2.A.52), a member of the TOG superfamily (32). Such relationships may arise because of the addition or subtraction of TMSs in a repeat unit. Similarly, we are interested in identifying multidomain proteins that link different superfamilies and in investigating their impact on the evolution of families and their relationships. We are confident that TCDB will continue to be useful to the scientific community, and we always welcome suggestions for improvement. DATA AVAILABILITY TCDB home page: tcdb.org.
All software developed and maintained by the Saier lab for TCDB functionality and our research is available in our GitHub repository (github.com/SaierLaboratory). SUPPLEMENTARY DATA Supplementary Data are available at NAR Online. ACKNOWLEDGEMENTS We thank all professionals and students in the Saier Laboratory who contributed to the advances reported in this TCDB update. FUNDING National Institutes of Health [GM077402 to M.H.S.]. Funding for open access charge: National Institutes of Health [GM077402]. Conflict of interest statement. None declared. REFERENCES 1. Cook G.M., Greening C., Hards K., Berney M. Energetics of pathogenic bacteria and opportunities for drug development. Adv. Microb. Physiol. 2014; 65: 1–62. 2. Busch W., Saier M.H. Jr The transporter classification (TC) system, 2002. Crit. Rev. Biochem. Mol. Biol. 2002; 37: 287–337. 3. Delmar J.A., Su C.C., Yu E.W. Bacterial multidrug efflux transporters. Annu. Rev. Biophys. 2014; 43: 93–117. 4. Putman M., van Veen H.W., Konings W.N. Molecular properties of bacterial multidrug transporters. Microbiol. Mol. Biol. Rev. 2000; 64: 672–693. 5. Yen M.R., Choi J., Saier M.H. Jr Bioinformatic analyses of transmembrane transport: novel software for deducing protein phylogeny, topology, and evolution. J. Mol. Microbiol. Biotechnol. 2009; 17: 163–176. 6. Saier M.H. Jr, Tran C.V., Barabote R.D. TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic Acids Res. 2006; 34: D181–D186. 7. Busch W., Saier M.H. Jr The IUBMB-endorsed transporter classification system. Methods Mol. Biol. 2003; 227: 21–36.
8. Saier M.H. Jr Tracing pathways of transport protein evolution. Mol. Microbiol. 2003; 48: 1145–1156. 9. Saier M.H. Jr, Reddy V.S., Tsu B.V., Ahmed M.S., Li C., Moreno-Hagelsieb G. The Transporter Classification Database (TCDB): recent advances. Nucleic Acids Res. 2016; 44: D372–D379. 10. Saier M.H. Jr, Reddy V.S., Tamang D.G., Vastermark A. The transporter classification database. Nucleic Acids Res. 2014; 42: D251–D258. 11. Saier M.H. Jr, Yen M.R., Noto K., Tamang D.G., Elkan C. The Transporter Classification Database: recent advances. Nucleic Acids Res. 2009; 37: D274–D278. 12. Burley S.K., Berman H.M., Christie C., Duarte J.M., Feng Z., Westbrook J., Young J., Zardecki C. RCSB Protein Data Bank: Sustaining a living digital data resource that enables breakthroughs in scientific research and biomedical education. Protein Sci. 2018; 27: 316–330. 13. Saier M.H. Jr Transport protein evolution deduced from analysis of sequence, topology and structure. Curr. Opin. Struct. Biol. 2016; 38: 9–17. 14. Buyuktimkin B., Saier M.H. Jr Comparative analyses of transport proteins encoded within the genomes of Leptospira species. Microb. Pathog. 2016; 98: 118–131. 15. Buyuktimkin B., Zafar H., Saier M.H. Jr Comparative genomics of the transportome of Ten Treponema species. Microb. Pathog. 2019; 132: 87–99. 16. Heidari Tajabadi F., Medrano-Soto A., Ahmadzadeh M., Salehi Jouzani G., Saier M.H.
Jr Comparative Analyses of Transport Proteins Encoded within the Genomes of Bdellovibrio bacteriovorus HD100 and Bdellovibrio exovorus JSS. J. Mol. Microbiol. Biotechnol. 2017; 27: 332–349. 17. Do J., Zafar H., Saier M.H. Jr Comparative genomics of transport proteins in probiotic and pathogenic Escherichia coli and Salmonella enterica strains. Microb. Pathog. 2017; 107: 106–115. 18. Zafar H., Saier M.H. Jr Comparative genomics of transport proteins in seven Bacteroides species. PLoS One. 2018; 13: e0208151. 19. Castelle C.J., Brown C.T., Thomas B.C., Williams K.H., Banfield J.F. Unusual respiratory capacity and nitrogen metabolism in a Parcubacterium (OD1) of the Candidate Phyla Radiation. Sci. Rep. 2017; 7: 40101. 20. Hug L.A., Baker B.J., Anantharaman K., Brown C.T., Probst A.J., Castelle C.J., Butterfield C.N., Hernsdorf A.W., Amano Y., Ise K. et al. A new view of the tree of life. Nat Microbiol. 2016; 1: 16048. 21. Zaremba-Niedzwiedzka K., Caceres E.F., Saw J.H., Backstrom D., Juzokaite L., Vancaester E., Seitz K.W., Anantharaman K., Starnawski P., Kjeldsen K.U. et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature. 2017; 541: 353–358. 22. Lee J., Ghosh S., Saier M.H. Jr Comparative genomic analyses of transport proteins encoded within the red algae Chondrus crispus, Galdieria sulphuraria, and Cyanidioschyzon merolae(11). J Phycol. 2017; 53: 503–521. 23. Vrljic M., Garg J., Bellmann A., Wachi S., Freudl R., Malecki M.J., Sahm H., Kozina V.J., Eggeling L., Saier M.H. Jr et al.
The LysE superfamily: topology of the lysine exporter LysE of Corynebacterium glutamicum, a paradyme for a novel superfamily of transmembrane solute translocators. J. Mol. Microbiol. Biotechnol. 1999; 1: 327–336. 24. Tsu B.V., Saier M.H. Jr The LysE superfamily of transport proteins involved in cell physiology and pathogenesis. PLoS One. 2015; 10: e0137184. 25. Sonden B., Kocincova D., Deshayes C., Euphrasie D., Rhayat L., Laval F., Frehel C., Daffe M., Etienne G., Reyrat J.M. Gap, a mycobacterial specific integral membrane protein, is required for glycolipid transport to the cell surface. Mol. Microbiol. 2005; 58: 426–440. 26. Zhou Y., Bushweller J.H. Solution structure and elevator mechanism of the membrane electron transporter CcdA. Nat. Struct. Mol. Biol. 2018; 25: 163–169. 27. Williamson J.A., Cho S.H., Ye J., Collet J.F., Beckwith J.R., Chou J.J. Structure and multistate function of the transmembrane electron transporter CcdA. Nat. Struct. Mol. Biol. 2015; 22: 809–814. 28. Chou A., Lee A., Hendargo K.J., Reddy V.S., Shlykov M.A., Kuppusamykrishnan H., Medrano-Soto A., Saier M.H. Jr Characterization of the Tetraspan Junctional Complex (4JC) superfamily. Biochim. Biophys. Acta Biomembr. 2017; 1859: 402–414. 29. Moreno-Hagelsieb G., Vitug B., Medrano-Soto A., Saier M.H. Jr The Membrane Attack Complex/Perforin Superfamily. J. Mol. Microbiol. Biotechnol. 2017; 27: 252–267. 30. Medrano-Soto A., Moreno-Hagelsieb G., McLaughlin D., Ye Z.S., Hendargo K.J., Saier M.H.
Jr Bioinformatic characterization of the Anoctamin Superfamily of Ca2+-activated ion channels and lipid scramblases. PLoS One. 2018; 13: e0192851. 31. Maity K., Heumann J.M., McGrath A.P., Kopcho N.J., Hsu P.K., Lee C.W., Mapes J.H., Garza D., Krishnan S., Morgan G.P. et al. Cryo-EM structure of OSCA1.2 from Oryza sativa elucidates the mechanical basis of potential membrane hyperosmolality gating. Proc. Natl. Acad. Sci. U.S.A. 2019; 116: 14309–14318. 32. Medrano-Soto A., Ghazi F., Hendargo K.J., Moreno-Hagelsieb G., Myers S., Saier M.H. Jr Expansion of the Transporter-Opsin-G protein-coupled receptor superfamily with five new protein families. PLoS One. 2020; 15: e0231085. 33. Ernst O.P., Menon A.K. Phospholipid scrambling by rhodopsin. Photochem. Photobiol. Sci. 2015; 14: 1922–1931. 34. Goren M.A., Morizumi T., Menon I., Joseph J.S., Dittman J.S., Cherezov V., Stevens R.C., Ernst O.P., Menon A.K. Constitutive phospholipid scramblase activity of a G protein-coupled receptor. Nat. Commun. 2014; 5: 5115. 35. Menon I., Huber T., Sanyal S., Banerjee S., Barre P., Canis S., Warren J.D., Hwa J., Sakmar T.P., Menon A.K. Opsin is a phospholipid flippase. Curr. Biol. 2011; 21: 149–153. 36. Marger M.D., Saier M.H. Jr A major superfamily of transmembrane facilitators that catalyse uniport, symport and antiport. Trends Biochem. Sci. 1993; 18: 13–20. 37. Wang S.C., Davejan P., Hendargo K.J., Javadi-Razaz I., Chou A., Yee D.C., Ghazi F., Lam K.J.K., Conn A.M., Madrigal A. et al.
Expansion of the Major Facilitator Superfamily (MFS) to include novel transporters as well as transmembrane-acting enzymes. Biochim. Biophys. Acta Biomembr. 2020; 1862: 183277. 38. Wang B., Dukarevich M., Sun E.I., Yen M.R., Saier M.H. Jr Membrane porters of ATP-binding cassette transport systems are polyphyletic. J. Membr. Biol. 2009; 231: 1–10. 39. Ernst C.M., Peschel A. Broad-spectrum antimicrobial peptide resistance by MprF-mediated aminoacylation and flipping of phospholipids. Mol. Microbiol. 2011; 80: 290–299. 40. Saier M.H. Jr, Reddy B.L. Holins in bacteria, eukaryotes, and archaea: multifunctional xenologues with potential biotechnological and biomedical applications. J. Bacteriol. 2015; 197: 7–17. 41. Reddy B.L., Saier M.H. Jr Topological and phylogenetic analyses of bacterial holin families and superfamilies. Biochim. Biophys. Acta. 2013; 1828: 2654–2671. 42. Reddy B.L., Saier M.H. Jr Properties and phylogeny of 76 families of bacterial and eukaryotic organellar outer membrane pore-forming proteins. PLoS One. 2016; 11: e0152733. 43. Hastings J., Owen G., Dekker A., Ennis M., Kale N., Muthukrishnan V., Turner S., Swainston N., Mendes P., Steinbeck C. ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res. 2016; 44: D1214–D1219. 44. Swainston N., Hastings J., Dekker A., Muthukrishnan V., May J., Steinbeck C., Mendes P. libChEBI: an API for accessing the ChEBI database. J. Cheminform. 2016; 8: 11. 45. El-Gebali S.
, Mistry J., Bateman A., Eddy S.R., Luciani A., Potter S.C., Qureshi M., Richardson L.J., Salazar G.A., Smart A. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019; 47: D427–D432. 46. Lu S., Wang J., Chitsaz F., Derbyshire M.K., Geer R.C., Gonzales N.R., Gwadz M., Hurwitz D.I., Marchler G.H., Song J.S. et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res. 2020; 48: D265–D268. 47. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25: 3389–3402. 48. Edgar R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004; 32: 1792–1797. 49. Eddy S.R. Accelerated profile HMM searches. PLoS Comput. Biol. 2011; 7: e1002195. 50. Rodionova I.A., Heidari Tajabadi F., Zhang Z., Rodionov D.A., Saier M.H. Jr A riboflavin transporter in Bdellovibrio exovorous JSS. J. Mol. Microbiol. Biotechnol. 2019; 29: 27–34. © The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
TI - The Transporter Classification Database (TCDB): 2021 update JF - Nucleic Acids Research DO - 10.1093/nar/gkaa1004 DA - 2021-01-08 UR - https://www.deepdyve.com/lp/oxford-university-press/the-transporter-classification-database-tcdb-2021-update-u0wzk9yNik SP - D461 EP - D467 VL - 49 IS - D1 DP - DeepDyve ER -