Uncertainty quantification and multi-stage variable selection for personalized treatment regimes
Bi, Jiefeng; Borrotti, Matteo; Nipoti, Bernardo
doi: 10.1093/biomtc/ujag081
pmid: 42166188
A dynamic treatment regime is a sequence of medical decisions that adapts to the evolving clinical status of a patient over time. To facilitate personalized care, it is crucial to assess the probability of each available treatment option being optimal for a specific patient, while also identifying the key prognostic factors that determine the optimal sequence of treatments. This task has become increasingly challenging due to the growing number of individual prognostic factors typically available. In response to these challenges, we propose a Bayesian model for optimizing dynamic treatment regimes that addresses the uncertainty in identifying optimal decision sequences and incorporates dimensionality reduction to manage high-dimensional individual covariates. The first task is achieved through a suitable augmentation of the model to handle counterfactual variables. For the second, we introduce a novel class of spike-and-slab priors for the multi-stage selection of significant factors, to favor the sharing of information across stages. The effectiveness of the proposed approach is demonstrated through an extensive simulation study and illustrated using clinical trial data on severe acute arterial hypertension.
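For readers unfamiliar with the term, a generic single-stage spike-and-slab prior on a regression coefficient can be written as below. This is the textbook construction only, not the multi-stage class proposed in the paper; the symbols $\beta_j$ (coefficient), $\gamma_j$ (inclusion indicator), $\pi$ (prior inclusion probability), $\tau^2$ (slab variance) and $\delta_0$ (point mass at zero) are illustrative notation introduced here.

```latex
\beta_j \mid \gamma_j \sim \gamma_j \, \mathcal{N}(0, \tau^2) + (1 - \gamma_j) \, \delta_0,
\qquad
\gamma_j \sim \mathrm{Bernoulli}(\pi),
\qquad j = 1, \dots, p.
```

Under this form, the posterior probability that $\gamma_j = 1$ serves as a measure of how strongly the data support including factor $j$.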
A zero-inflated hierarchical generalized transformation model to address non-normality in spatially-informed cell-type deconvolution
Melton, Hunter J; Bradley, Jonathan R; Wu, Chong
doi: 10.1093/biomtc/ujag055
pmid: 41994891
Oral squamous cell carcinomas (OSCC), the predominant head and neck cancer, pose significant challenges due to late-stage diagnoses and low five-year survival rates. Spatial transcriptomics offers a promising avenue to decipher the genetic intricacies of OSCC tumor microenvironments. In spatial transcriptomics, cell-type deconvolution is a crucial inferential goal; however, current methods fail to account for the high zero-inflation present in OSCC data. To address this, we develop a novel zero-inflated version of the hierarchical generalized transformation model (ZI-HGT) and apply it to Conditional AutoRegressive Deconvolution (CARD) for cell-type deconvolution. The ZI-HGT serves as an auxiliary Bayesian technique for CARD, reconciling the highly zero-inflated OSCC spatial transcriptomics data with CARD’s normality assumption. The combined ZI-HGT + CARD framework achieves enhanced cell-type deconvolution accuracy and quantifies uncertainty in the estimated cell-type proportions. We demonstrate its superior performance through simulations and analysis of the OSCC data. Furthermore, our approach enables the determination of the locations of the diverse fibroblast population in the tumor microenvironment, which is critical for understanding tumor growth and immunosuppression in OSCC.
A mixed effect similarity matrix regression model (SMRmix) for integrating multiple microbiome datasets at the community level
He, Mengyu; Zhao, Ni
doi: 10.1093/biomtc/ujag077
pmid: 42127284
Recent studies have highlighted the importance of the human microbiota in health and disease. However, in many areas of research, individual microbiome studies often provide inconsistent results due to limited sample sizes and heterogeneity in study populations and experimental procedures. This inconsistency underscores the need for integrative analysis of multiple microbiome datasets. Despite the critical need, statistical methods that incorporate multiple microbiome datasets and account for study heterogeneity are not available in the literature. To address this, we propose a mixed effect similarity matrix regression (SMRmix) approach for identifying community-level microbiome shifts associated with outcomes. SMRmix is closely connected to the microbiome kernel association test, one of the most popular approaches for such a task, but that test is applicable only to a single study. SMRmix enables researchers to consolidate findings from diverse microbiome studies. Through extensive simulations, we show that SMRmix maintains well-controlled Type I error rates and achieves higher power than competing methods. We further demonstrate its utility on two real-world datasets (17 HIV gut dysbiosis studies and 11 colorectal cancer studies), showing that SMRmix provides consistent results on community-level shifts in both applications.
A novel exact confidence interval for the difference of proportions in paired data using a restricted most probable statistic
Cao, Xingyun; Wang, Weizhen; Xie, Tianfa
doi: 10.1093/biomtc/ujag061
pmid: 42053378
Inference on the difference between two proportions in paired data is a key issue, particularly in biomedical research and clinical trials. Numerous methods exist for constructing confidence intervals for this difference. However, approximate methods that rely on asymptotic normality can be unreliable, underscoring the need for exact confidence intervals to improve reliability. In this paper, we develop a novel interval based on the restricted most probable method, which is further optimized using the h-function method to yield an optimal exact interval, ensuring both reliability and precision. We compare the proposed interval with other exact intervals developed through methodologies such as the score method, two Tang methods, the Wang method, the adjusted Wald method, and the score method with continuity correction. Our comparative analysis, utilizing the infimum coverage probability and total interval length as evaluation metrics, demonstrates the uniformly superior performance of the proposed interval. Additionally, an example illustrates its practical application in real-world scenarios. Supplementary Materials provide another example, numerical results on coverage and non-coverage probabilities, and R code.
Nonparametric estimation of the total treatment effect with multiple outcomes in the presence of terminal events
Gronsbell, Jessica; McCaw, Zachary R; Nogues, Isabelle-Emmanuella; Kong, Xiangshan; Cai, Tianxi; Tian, Lu; Wei, L J
doi: 10.1093/biomtc/ujag053
pmid: 42166186
As standards of care advance, patients are living longer and once-fatal diseases are becoming manageable. Clinical trials increasingly focus on reducing disease burden, which can be quantified by the timing and occurrence of multiple non-fatal clinical events. Most existing methods for the analysis of multiple event-time data require stringent modeling assumptions that can be difficult to verify empirically, leading to treatment efficacy estimates that forgo interpretability when the underlying assumptions are not met. Moreover, many methods do not appropriately account for informative terminal events, such as premature treatment discontinuation or death, which prevent the occurrence of subsequent events. To address these limitations, we derive and validate estimation and inference procedures for the area under the mean cumulative function (AUMCF), an extension of the restricted mean survival time to the multiple event-time setting. The AUMCF is clinically interpretable, properly accounts for terminal competing risks, and can be estimated nonparametrically. To enable covariate adjustment, we also develop an augmentation estimator whose efficiency at least equals, and often exceeds, that of the unadjusted estimator. The utility and interpretability of the AUMCF are illustrated with extensive simulation studies and through an analysis of multiple heart-failure-related endpoints using data from the Beta-Blocker Evaluation of Survival Trial. Our open-source R package MCC makes conducting AUMCF analyses straightforward and accessible.
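To make the AUMCF concrete, the following is a minimal Python sketch of one standard nonparametric construction: mean-cumulative-function increments are weighted by the Kaplan-Meier survival probability for the terminal event, and the resulting step function is integrated up to a truncation time. It is an illustration under stated assumptions, not the authors' implementation or the MCC package; the function name `aumcf`, the argument `tau`, and the tuple layout of `subjects` are hypothetical, and event times are assumed not to exceed each subject's end of follow-up.

```python
def aumcf(subjects, tau):
    """Area under an estimated mean cumulative function up to time tau.

    subjects: list of (event_times, end_time, died) tuples, where
      event_times is a list of non-fatal event times (each <= end_time),
      end_time is the end of follow-up, and
      died is True if follow-up ended with the terminal event,
      False if it ended by censoring.
    """
    # Distinct times up to tau at which an event or a terminal event occurs.
    event_set = {t for ev, _, _ in subjects for t in ev if t <= tau}
    death_set = {end for _, end, died in subjects if died and end <= tau}
    times = sorted(event_set | death_set)

    surv = 1.0  # Kaplan-Meier terminal-event survival just before time s
    area = 0.0
    for s in times:
        at_risk = sum(1 for _, end, _ in subjects if end >= s)
        n_events = sum(ev.count(s) for ev, _, _ in subjects)
        n_deaths = sum(1 for _, end, died in subjects if died and end == s)
        # Jump of the mean cumulative function at s.
        jump = surv * n_events / at_risk
        # A jump at s contributes jump * (tau - s) to the area under the curve.
        area += jump * (tau - s)
        # Update survival after accounting for terminal events at s.
        surv *= 1.0 - n_deaths / at_risk
    return area


# Usage: two subjects, the first dies at time 4, the second is censored at 10.
example = [([2.0], 4.0, True), ([6.0], 10.0, False)]
result = aumcf(example, tau=10.0)
```

A treatment effect in this framing would be a contrast (difference or ratio) of the two arms' areas; inference and covariate adjustment are beyond this sketch.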
Transfer learning estimation of the accelerated failure time model based on high-dimensional data
Lou, Yichen; Du, Mingyue; Zhao, Hui; Sun, Jianguo
doi: 10.1093/biomtc/ujag103
pmid: 42259651
Motivated by a study aimed at improving end-of-life care for seriously ill hospitalized adults, we consider estimation of the accelerated failure time model, one of the most commonly used models for regression analysis of failure time data. Although many methods have been developed for the problem, standard approaches may fail or underperform when available information is limited. To address this issue, we propose two transfer learning estimation procedures that leverage auxiliary information from multiple source datasets. The first is a data-driven source detection procedure that classifies the source datasets into positively and negatively transferable groups and performs estimation using only the positively transferable, or informative, source datasets. The second is an ensemble-based approach that adaptively assigns weights to source datasets based on their relevance to the target dataset. Theoretical justifications are provided for the proposed methods, and an extensive simulation study indicates that they work well in practice. Finally, the methods are applied to the study above and identify some prognostic factors that could not be identified using existing methods.
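For reference, the accelerated failure time model in its usual log-linear form is shown below. The notation is introduced here for illustration and is not taken from the paper: $T_i$ is the failure time, $X_i$ the covariate vector, $\beta$ the regression coefficients, and $\varepsilon_i$ an error term.

```latex
\log T_i = X_i^\top \beta + \varepsilon_i, \qquad i = 1, \dots, n.
```

In the high-dimensional setting the abstract refers to, the dimension of $\beta$ may be large relative to $n$, which is where borrowing information from source datasets becomes attractive.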
Two-phase designs for biomarker studies when disease processes are under intermittent observation
Li, Kecheng; Cook, Richard J
doi: 10.1093/biomtc/ujag088
pmid: 42166187
Multistate models offer an appealing framework for studying the onset and progression of chronic diseases in large cohort studies. Such studies often involve the collection and storage of biospecimens at an initial assessment, and intermittent observation of the disease process at future assessment times. We consider the design of two-phase biomarker studies in such settings where budgetary constraints prohibit assaying all biospecimens. A subsample of individuals is instead chosen to have their biospecimens assayed to facilitate examination of the association between a biomarker of interest and the disease process. Analyses based on likelihood, conditional likelihood, and estimating functions are considered, and the efficiency gains from various subsampling strategies are investigated. Pseudo-score residual-dependent sampling strategies are shown to yield highly efficient maximum likelihood estimates of biomarker effects on disease progression. These sampling strategies, along with competing methods, are studied empirically and applied to a motivating study of the relationship between the HLA-B27 marker and joint damage in patients with psoriatic arthritis.
Decentralized EM algorithm for Gaussian mixtures under data heterogeneity and partial labeling
Li, Xuetong; Wu, Shuyuan; Du, Bin; Wang, Hansheng
doi: 10.1093/biomtc/ujag092
pmid: 42201842
We systematically study several network-based Expectation–Maximization (EM) algorithms for the Gaussian mixture model within decentralized federated learning (DFL). Our theoretical investigation reveals that directly extending the classic EM algorithm to DFL leads to a seriously biased estimator if the data are heterogeneously distributed across different sites. To address this issue, we introduce a momentum network EM (MNEM) algorithm, which integrates information from both current and historical estimators from previous DFL iterations. We further develop a semi-supervised MNEM (semi-MNEM) algorithm, which utilizes valuable information provided by partially labeled data. Rigorous theoretical analysis demonstrates that the MNEM estimator can achieve the same asymptotic efficiency as the whole sample estimator under appropriate regularity conditions, even if the data are heterogeneously distributed. Moreover, the semi-MNEM estimator significantly improves the convergence speed of the MNEM algorithm, even if different mixture components are poorly separated. Extensive simulations are conducted, and a widely used chest X-ray dataset is analyzed to demonstrate the finite-sample performance of the proposed methods.
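As background, the classic single-site EM iteration that the network-based variants extend can be sketched as follows for a two-component, one-dimensional Gaussian mixture. This is the textbook centralized algorithm, not the MNEM or semi-MNEM procedure from the paper; the function names, the initialization at the sample minimum and maximum, and the variance floor are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, n_iter=100):
    """Classic EM for a two-component 1-D Gaussian mixture.

    Returns (weight of component 0, [mean0, mean1], [var0, var1]).
    """
    n = len(data)
    # Crude initialization (an assumption): means at the extremes,
    # both variances equal to the overall sample variance.
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    weight = 0.5
    mu = [min(data), max(data)]
    var = [overall_var, overall_var]

    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 0.
        resp = []
        for x in data:
            a = weight * normal_pdf(x, mu[0], var[0])
            b = (1.0 - weight) * normal_pdf(x, mu[1], var[1])
            resp.append(a / (a + b))

        # M-step: responsibility-weighted updates of weight, means, variances.
        n0 = sum(resp)
        n1 = n - n0
        weight = n0 / n
        mu = [
            sum(r * x for r, x in zip(resp, data)) / n0,
            sum((1.0 - r) * x for r, x in zip(resp, data)) / n1,
        ]
        var = [
            max(sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, data)) / n0, 1e-6),
            max(sum((1.0 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, data)) / n1, 1e-6),
        ]
    return weight, mu, var


# Usage on simulated data with true means -3 and 3 and unit variances.
random.seed(0)
sample = [random.gauss(-3.0, 1.0) for _ in range(300)]
sample += [random.gauss(3.0, 1.0) for _ in range(300)]
est_weight, est_mu, est_var = em_two_gaussians(sample)
```

In a decentralized setting each site would hold only part of `sample` and exchange estimates with its neighbors; according to the abstract, a direct extension of this iteration becomes seriously biased when sites' data are heterogeneous, which motivates the momentum correction.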