journal article
Open Access Collection
Unsupervised and supervised approaches for breast cancer subtype classification: hierarchical clustering and machine learning with hyperparameter optimization
Miranda Valentin, Ana Beatriz; Bressan, Glaucia Maria; da Silva Lizzi, Elisângela Ap.
doi: 10.1007/s42600-026-00484-0
pmid: N/A
Purpose Breast cancer is a major public health concern and comprises distinct subtypes, making accurate classification critical for personalized treatment. This study proposes a complementary analytical framework that applies both unsupervised and supervised learning techniques to breast cancer subtype classification using gene expression data from The Cancer Genome Atlas (TCGA).
Methods First, hierarchical clustering with Pearson correlation and Euclidean distance as similarity measures is employed to explore the intrinsic structure of the dataset. Subsequently, supervised machine learning models, including Logistic Regression, Support Vector Machine (SVM), Random Forest, and Multilayer Perceptron (MLP), are trained for the classification task. Hyperparameter optimization is performed systematically using the Optuna framework, and model interpretability is enhanced through SHapley Additive exPlanations (SHAP).
Results Among the optimized classifiers, the MLP achieved the highest classification accuracy (79.14%), followed by Logistic Regression (78.07%) and SVM (77.54%), all of which outperformed Random Forest (69.52%).
Conclusion The results highlight the effectiveness of combining clustering methods and machine learning to improve classification accuracy and interpretability, contributing to the development of more accurate diagnostic tools and more personalized treatment strategies in breast cancer.
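
The choice between Pearson correlation and Euclidean distance in the clustering step can change which samples are grouped together. The following minimal sketch illustrates the difference on made-up toy profiles. It is not the authors' implementation: the abstract does not state the linkage method, so average linkage is assumed here, and the helper names (`pearson_distance`, `euclidean_distance`, `agglomerative`) and all data values are hypothetical.

```python
import math
from itertools import combinations


def pearson_distance(a, b):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


def euclidean_distance(a, b):
    """Return the straight-line distance between two profiles."""
    return math.dist(a, b)


def agglomerative(samples, distance, n_clusters):
    """Naive average-linkage agglomerative clustering (assumed linkage).

    Returns a sorted list of clusters, each a sorted list of sample indices.
    """
    clusters = [[i] for i in range(len(samples))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(distance(samples[i], samples[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Toy expression profiles (rows = samples, columns = genes); values are made up.
profiles = [
    [1.0, 2.0, 3.0, 4.0],      # rising pattern, low magnitude
    [10.0, 20.0, 30.0, 40.0],  # rising pattern, high magnitude
    [4.0, 3.0, 2.0, 1.0],      # falling pattern, low magnitude
    [40.0, 30.0, 20.0, 10.0],  # falling pattern, high magnitude
]

by_pearson = agglomerative(profiles, pearson_distance, n_clusters=2)
by_euclid = agglomerative(profiles, euclidean_distance, n_clusters=2)

print(by_pearson)  # [[0, 1], [2, 3]]
print(by_euclid)   # [[0, 2], [1, 3]]
```

On this toy input, the correlation-based distance groups samples by the shape of their expression pattern regardless of magnitude, while Euclidean distance groups them by overall magnitude. This is one reason the two measures may reveal different structure in the same gene expression dataset.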