The improved de Bruijn graph for multitask learning: predicting functions, subcellular localization, and interactions of noncoding RNAs
Wei, Yuxiao; Zhang, Qi; Liu, Liwei
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae627; pmid: 39592154
Noncoding RNAs are RNAs that do not encode proteins. Among them, lncRNAs and miRNAs play crucial regulatory roles in organisms, and their aberrant expression is closely related to various diseases. Traditional experimental methods for validating the interactions of these RNAs have limitations, and existing prediction models offer relatively limited functionality: they rely on isolated feature extraction and perform poorly on various types of small-sample tasks. This paper proposes an improved de Bruijn graph that can inject RNA structural information into the graph while preserving sequence information. Furthermore, by introducing richer edge relationships, the improved de Bruijn graph enables graph neural networks to learn broader dependencies and correlations among data. The multitask learning model proposed in this paper, DVMnet, can handle multiple related tasks, and we optimize its parameters by integrating the total loss of three tasks. This enables multitask prediction of RNA interactions, disease associations, and subcellular localization. Compared with the best existing models in this field, DVMnet achieves the best performance, with a 3% improvement in the area under the curve value, and demonstrates robust results in predicting diseases and subcellular localization. The improved de Bruijn graph is also applicable to various scenarios and can unify the sequence and structural information of various nucleic acids into a single graph.
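For orientation, the following is a minimal sketch of a plain (unimproved) de Bruijn graph built from an RNA sequence. It is not the construction used by DVMnet: the abstract does not specify how structural information or the richer edge relationships are added, so none of that is modeled here. The function name `de_bruijn_edges` and the example sequence are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_edges(sequence, k):
    """Build a plain de Bruijn graph from one sequence.

    Nodes are (k-1)-mers; every k-mer in the sequence contributes one
    directed edge from its prefix (k-1)-mer to its suffix (k-1)-mer.
    Repeated k-mers produce repeated edges.
    """
    if k < 2 or k > len(sequence):
        raise ValueError("k must satisfy 2 <= k <= len(sequence)")
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy RNA sequence, chosen only for illustration.
graph = de_bruijn_edges("AUGGCAUGC", k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Because every k-mer maps to an edge, walking the edges in order reconstructs the input sequence, which is the sense in which a de Bruijn graph preserves sequence information.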
MCGAE: unraveling tumor invasion through integrated multimodal spatial transcriptomics
Yang, Yiwen; Zhang, Chengming; Liu, Zhaonan; Aihara, Kazuyuki; Zhang, Chuanchao; Chen, Luonan; Wei, Wu
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae608; pmid: 39576225
Spatially Resolved Transcriptomics (SRT) serves as a cornerstone in biomedical research, revealing the heterogeneity of tissue microenvironments. Integrating multimodal data including gene expression, spatial coordinates, and morphological information poses significant challenges for accurate spatial domain identification. Herein, we present the Multi-view Contrastive Graph Autoencoder (MCGAE), a cutting-edge deep computational framework specifically designed for the intricate analysis of spatial transcriptomics (ST) data. MCGAE advances the field by creating multi-view representations from gene expression and spatial adjacency matrices. Utilizing modular modeling, contrastive graph convolutional networks, and attention mechanisms, it generates modality-specific spatial representations and integrates them into a unified embedding. This integration process is further enriched by the inclusion of morphological image features, markedly enhancing the framework’s capability to process multimodal data. Applied to both simulated and real SRT datasets, MCGAE demonstrates superior performance in spatial domain detection, data denoising, trajectory inference, and 3D feature extraction, outperforming existing methods. Specifically, in colorectal cancer liver metastases, MCGAE integrates histological and gene expression data to identify tumor invasion regions and characterize cellular molecular regulation. This breakthrough extends ST analysis and offers new tools for cancer and complex disease research.
ONDSA: a testing framework based on Gaussian graphical models for differential and similarity analysis of multiple omics networks
Chen, Jiachen; Murabito, Joanne M; Lunetta, Kathryn L
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae610; pmid: 39581869
The Gaussian graphical model (GGM) is a statistical network approach that represents conditional dependencies among components, enabling a comprehensive exploration of disease mechanisms using high-throughput multi-omics data. Analyzing differential and similar structures in biological networks across multiple clinical conditions can reveal significant biological pathways and interactions associated with disease onset and progression. However, most existing methods for estimating group differences in sparse GGMs only apply to comparisons between two groups, and the challenging problem of multiple testing across multiple GGMs persists. This limitation hinders the ability to uncover complex biological insights that arise from comparing multiple conditions simultaneously. To address these challenges, we propose the Omics Networks Differential and Similarity Analysis (ONDSA) framework, specifically designed for continuous omics data. ONDSA tests for structural differences and similarities across multiple groups, effectively controlling the false discovery rate (FDR) at a desired level. Our approach focuses on entry-wise comparisons of precision matrices across groups, introducing two test statistics to sequentially estimate structural differences and similarities while adjusting for correlated effects in FDR control procedures. We show via comprehensive simulations that ONDSA outperforms existing methods under a range of graph structures and is a valuable tool for joint comparisons of multiple GGMs. We also illustrate our method through the detection of neuroinflammatory pathways in a multi-omics dataset from the Framingham Heart Study Offspring cohort, involving three apolipoprotein E genotype groups. This application highlights ONDSA’s ability to provide a more holistic view of biological interactions and disease mechanisms through multi-omics data integration.
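To make the idea of FDR control over many entry-wise tests concrete, here is a minimal sketch of the standard Benjamini-Hochberg step-up rule applied to one p-value per candidate edge. This is a generic stand-in, not ONDSA's procedure: the abstract states that ONDSA adjusts for correlated effects, which this sketch does not do. The p-values are made-up illustrative numbers.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per hypothesis (Benjamini-Hochberg step-up).

    Finds the largest rank k (1-based, in ascending p-value order) with
    p_(k) <= k / m * alpha and rejects the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per precision-matrix entry (candidate edge).
edge_p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
flags = benjamini_hochberg(edge_p_values, alpha=0.05)
print(flags)
```

With five tests at alpha = 0.05, the per-rank thresholds are 0.01, 0.02, 0.03, 0.04, and 0.05, so only the two smallest p-values are rejected in this toy input.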
VGAE-CCI: variational graph autoencoder-based construction of 3D spatial cell–cell communication network
Zhang, Tianjiao; Zhang, Xiang; Wu, Zhenao; Ren, Jixiang; Zhao, Zhongqian; Zhang, Hongfei; Wang, Guohua; Wang, Tao
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae619; pmid: 39581873
Cell–cell communication plays a critical role in maintaining normal biological functions, regulating development and differentiation, and controlling immune responses. The rapid development of single-cell RNA sequencing and spatial transcriptomics sequencing (ST-seq) technologies provides essential data support for in-depth and comprehensive analysis of cell–cell communication. However, ST-seq data are often incomplete and contain systematic biases, which may reduce the accuracy and reliability of predicting cell–cell communication. Furthermore, other methods for analyzing cell–cell communication mainly focus on individual tissue sections, neglecting cell–cell communication across multiple tissue layers, and fail to comprehensively elucidate cell–cell communication networks within three-dimensional tissues. To address these issues, we propose VGAE-CCI, a deep learning framework based on the Variational Graph Autoencoder, capable of identifying cell–cell communication across multiple tissue layers. Additionally, this model can be applied to spatial transcriptomics data with missing or incomplete values and can cluster cells at single-cell resolution based on spatial encoding information within complex tissues, thereby enabling more accurate inference of cell–cell communication. Finally, we tested our method on six datasets and compared it with other state-of-the-art methods for predicting cell–cell communication. Our method outperformed other methods across multiple metrics, demonstrating its efficiency and reliability in predicting cell–cell communication.
CosGeneGate selects multi-functional and credible biomarkers for single-cell analysis
Liu, Tianyu; Long, Wenxin; Cao, Zhiyuan; Wang, Yuge; He, Chuan Hua; Zhang, Le; Strittmatter, Stephen M; Zhao, Hongyu
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae626; pmid: 39592241
Motivation: Selecting representative genes or marker genes to distinguish cell types is an important task in single-cell sequencing analysis. Although many methods have been proposed to select marker genes, the selected genes may be redundant and/or may not show the cell-type-specific expression patterns needed to distinguish cell types. Results: Here, we present a novel model, named CosGeneGate, for more effective marker gene selection. CosGeneGate combines the advantages of selecting marker genes based on both cell-type classification accuracy and marker-gene-specific expression patterns. We demonstrate that the marker genes selected by CosGeneGate perform better than those from existing methods in various downstream analyses, on both public datasets and newly sequenced datasets. The non-redundant marker genes identified by CosGeneGate for major cell types and tissues in humans can be found at the following website: https://github.com/VivLon/CosGeneGate/blob/main/marker gene list.xlsx.
A versatile pipeline to identify convergently lost ancestral conserved fragments associated with convergent evolution of vocal learning
Li, Xiaoyi; Zhu, Kangli; Zhen, Ying
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae614; pmid: 39581870
Molecular convergence in convergently evolved lineages provides valuable insights into the shared genetic basis of converged phenotypes. However, most methods are limited to coding regions, overlooking the potential contribution of regulatory regions. We focused on the independently evolved vocal learning ability in multiple avian lineages, and developed a whole-genome-alignment-free approach to identify genome-wide Convergently Lost Ancestral Conserved fragments (CLACs) in these lineages, encompassing noncoding regions. We discovered 2711 CLACs that are overrepresented in noncoding regions. Proximal genes of these CLACs exhibit significant enrichment in neurological pathways, including the glutamate receptor signaling and axon guidance pathways. Moreover, their expression is highly enriched in brain tissues associated with speech formation. Notably, several of these genes have known functions in speech and language learning, including the ROBO family, SLIT2, GRIN1, and GRIN2B. Additionally, we found significantly enriched motifs in noncoding CLACs, which match binding motifs of transcription factors involved in neurogenesis and gene expression regulation in the brain. Furthermore, we discovered 19 candidate genes that harbor CLACs in both human and multiple avian vocal learning lineages, suggesting their potential contribution to the independent evolution of vocal learning in both birds and humans.
Repun: an accurate small variant representation unification method for multiple sequencing platforms
Zheng, Zhenxian; Ren, Yingxuan; Chen, Lei; Wong, Angel On Ki; Li, Shumin; Yu, Xian; Lam, Tak-Wah; Luo, Ruibang
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae613; pmid: 39584701
Ensuring a unified variant representation that aligns with the sequencing data is critical for downstream analysis, as variant representation may differ across platforms and sequencing conditions. Current approaches typically treat variant unification as a post-processing step following variant calling and are incapable of measuring the correct variant representation from the outset. Aligning variant representations with the alignment before variant calling has benefits such as providing reliable training labels for deep learning-based variant callers and enabling direct assessment of alignment quality. However, it also poses challenges due to the large number of candidates to handle. Here, we present Repun, a haplotype-aware variant-alignment unification algorithm that harmonizes the variant representation between provided variants and alignments in different sequencing platforms. Repun leverages phasing to facilitate equivalent haplotype matches between variants and alignments. Our approach reduces the number of comparisons between variant haplotypes and candidate haplotypes by using only haplotypes with read evidence, which speeds up the unification process. Repun achieved >99.99% precision and >99.5% recall in extensive evaluations of various Genome in a Bottle Consortium samples encompassing three sequencing platforms: Oxford Nanopore Technology, Pacific Biosciences, and Illumina. Repun is open-source and available at https://github.com/zhengzhenxian/Repun.
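The underlying problem can be illustrated with a small sketch showing that two differently written variants can describe the same haplotype, which is why representations need to be compared at the haplotype level. This is not Repun's algorithm; the helper `apply_variant`, the toy reference, and the two variant records are hypothetical and use 0-based positions for simplicity.

```python
def apply_variant(reference, pos, ref_allele, alt_allele):
    """Apply one VCF-style variant to a reference string.

    `pos` is 0-based here (real VCF positions are 1-based). Returns the
    resulting haplotype sequence.
    """
    if reference[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("REF allele does not match the reference at this position")
    return reference[:pos] + alt_allele + reference[pos + len(ref_allele):]


# Toy reference with a CA repeat; deleting any one CA unit gives the same result.
reference = "GGCACACATT"

# Two representations of the same 2-bp deletion, anchored at different positions.
left_aligned = apply_variant(reference, 1, "GCA", "G")
right_shifted = apply_variant(reference, 5, "ACA", "A")

print(left_aligned, right_shifted, left_aligned == right_shifted)
```

Both records yield the haplotype `GGCACATT`, so a position-by-position comparison would wrongly treat them as different variants, while a haplotype comparison treats them as equivalent.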
LIMO-GCN: a linear model-integrated graph convolutional network for predicting Alzheimer disease genes
Lin, Cui-Xiang; Li, Hong-Dong; Wang, Jianxin
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae611; pmid: 39592152
Alzheimer’s disease (AD) is a complex disease whose genetic etiology is not fully understood. Gene network-based methods have proven promising in predicting AD genes. However, existing approaches are limited in their ability to model the nonlinear relationship between networks and disease genes, because (i) any data can theoretically be decomposed into the sum of a linear part and a nonlinear part, (ii) the linear part is best modeled by a linear model, since a nonlinear model is biased and can easily overfit, and (iii) existing methods do not separate the linear part from the nonlinear part when building the disease gene prediction model. To address this limitation, we propose the linear model-integrated graph convolutional network (LIMO-GCN), a generic disease gene prediction method that models both the linearity and the nonlinearity in the data by integrating a linear model with a GCN. We use a GCN because it is by design naturally suited to network data, and we integrate a linear model because the linearity in the data is best modeled by a linear model. The weighted sum of the predictions of the two components is used as the final prediction of LIMO-GCN. We then apply LIMO-GCN to the prediction of AD genes. LIMO-GCN outperforms state-of-the-art approaches including GCN, network-wide association studies, and random walk. Furthermore, we show that the top-ranked genes are significantly associated with AD based on molecular evidence from heterogeneous genomic data. Our results indicate that LIMO-GCN provides a novel method for prioritizing AD genes.
Comprehensive human respiratory genome catalogue underlies the high resolution and precision of the respiratory microbiome
Li, Yinhu; Pan, Guangze; Wang, Shuai; Li, Zhengtu; Yang, Ru; Jiang, Yiqi; Chen, Yu; Li, Shuai Cheng; Shen, Bairong
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae620; pmid: 39581874
The human respiratory microbiome plays a crucial role in respiratory health, but there is no comprehensive respiratory genome catalogue (RGC) for studying the microbiome. In this study, we collected whole-metagenome shotgun sequencing data from 4067 samples and sequenced long reads of 124 samples, yielding 9.08 and 0.42 Tbp of short- and long-read data, respectively. By processing these data with a novel assembly algorithm, we obtained a comprehensive human RGC. This high-quality RGC contains 190,443 contigs over 1 kbp and has an N50 length exceeding 13 kbp; it comprises 159 high-quality and 393 medium-quality genomes, including 117 previously uncharacterized respiratory bacteria. Moreover, the RGC contains 209 respiratory-specific species not captured by the unified human gastrointestinal genome. Using the RGC, we revisited a study on a pediatric pneumonia dataset and identified 17 pneumonia-specific respiratory pathogens, reversing an inaccurate etiological conclusion due to the previous incomplete reference. Furthermore, we applied the RGC to the data of 62 participants with a clinical diagnosis of infection. Compared with the Nucleotide database, the RGC yielded greater specificity (0.444 versus 0) and sensitivity (0.881 versus 0.852), suggesting that the RGC provides superior sensitivity and specificity for the clinical diagnosis of respiratory diseases.
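The summary statistics quoted above (N50, sensitivity, specificity) follow standard definitions, sketched below. The contig lengths and confusion-matrix counts are hypothetical numbers chosen for illustration and are not taken from the study.

```python
def n50(contig_lengths):
    """N50: length of the shortest contig in the smallest set of longest
    contigs that together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical contig lengths (total 290; the two longest cover 150 >= 145).
example_n50 = n50([80, 70, 50, 40, 30, 20])

# Hypothetical diagnostic counts.
sens, spec = sensitivity_specificity(tp=8, fp=1, tn=9, fn=2)

print(example_n50, sens, spec)
```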
RNADiffFold: generative RNA secondary structure prediction using discrete diffusion models
Wang, Zhen; Feng, Yizhen; Tian, Qingwen; Liu, Ziqi; Yan, Pengju; Li, Xiaolin
2024 Briefings in Bioinformatics
doi: 10.1093/bib/bbae618; pmid: 39581872
Ribonucleic acid (RNA) molecules are essential macromolecules that perform diverse biological functions in living beings. Precise prediction of RNA secondary structures is instrumental in deciphering their complex three-dimensional architecture and functionality. Traditional methodologies for RNA structure prediction, including energy-based and learning-based approaches, often depict RNA secondary structures from a static perspective and rely on stringent a priori constraints. Inspired by the success of diffusion models, in this work, we introduce RNADiffFold, an innovative generative approach to predicting RNA secondary structures based on multinomial diffusion. We reconceptualize the prediction of contact maps as akin to pixel-wise segmentation and accordingly train a denoising model to progressively refine the contact maps, starting from a noise-infused state. We also devise a potent conditioning mechanism that harnesses features extracted from RNA sequences to steer the model toward generating an accurate secondary structure. These features encompass one-hot encoded sequences, probabilistic maps generated from a pre-trained scoring network, and embeddings and attention maps derived from an RNA foundation model. Experimental results on both within- and cross-family datasets demonstrate RNADiffFold’s competitive performance compared with current state-of-the-art methods. Additionally, RNADiffFold has shown a notable proficiency in capturing the dynamic aspects of RNA structures, a claim corroborated by its performance on datasets comprising multiple conformations.
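As a rough illustration of the "noise-infused state" in multinomial diffusion, the sketch below implements one forward noising step on a binary contact map, using the commonly cited form in which each category's probability becomes (1 - beta) * x + beta / K. It is a simplified assumption-laden sketch, not RNADiffFold's implementation: the noise schedule, the denoising network, and the conditioning features are all omitted, entries are resampled independently (so the symmetry of a real contact map is not preserved), and the function names and toy map are hypothetical.

```python
import random


def forward_step_probs(one_hot, beta):
    """Category probabilities after one noising step: (1 - beta) * x + beta / K."""
    k = len(one_hot)
    return [(1.0 - beta) * x + beta / k for x in one_hot]


def noise_contact_map(contact_map, beta, rng):
    """Resample each 0/1 entry independently from its one-step noised distribution."""
    noised = []
    for row in contact_map:
        new_row = []
        for value in row:
            # Two categories: index 0 = unpaired, index 1 = paired.
            probs = forward_step_probs([1.0 - value, float(value)], beta)
            new_row.append(1 if rng.random() < probs[1] else 0)
        noised.append(new_row)
    return noised


# Hypothetical 4x4 contact map: bases 0 and 3 are paired.
contact_map = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

rng = random.Random(0)
slightly_noised = noise_contact_map(contact_map, beta=0.1, rng=rng)
for row in slightly_noised:
    print(row)
```

With beta = 0 the map is returned unchanged, and with beta = 1 every entry is drawn uniformly from the two categories, which corresponds to the fully noised starting point that a denoising model would refine.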