Ergonomic risk assessment for supporting surgeons’ well-being using CNN and dragonfly optimization
Azyabi, Abdulmajeed; Khamaj, Abdulrahman; Ali, Abdulelah M.; Alghamdi, Saleh Y.; Hamzi, Ahmed; Hassan, Shabbir; Sidhwa, Haroonhaider; Ahmad, Mohammad Tauheed; Alam, Md. Mottahir
doi: 10.1007/s00521-026-12207-8
pmid: N/A
Surgeons face many musculoskeletal problems, mainly because they must adopt awkward positions and carry out repetitive motions for long periods. These problems not only put the surgeon’s health at risk but also adversely affect the quality of patient care. Traditional methods for evaluating ergonomic hazards, which rely on questionnaires and surveys, are often ineffective. The present study takes a different route by creating an intelligent system that automatically evaluates ergonomic risks through continuous observation of surgeons’ movements and postures during surgical operations. The method relies chiefly on a deep-learning-based Convolutional Neural Network (CNN), which extracts the most important features from visual data in a manner analogous to a continuous monitoring system. The CNN is further optimized with the Dragonfly Optimization Algorithm (DOA). DOA balances the exploration of new candidate solutions with the refinement of the most promising ones, thus enhancing the model’s learning capacity and addressing key challenges such as overfitting and inefficiency. All components were implemented and tested in MATLAB using a publicly accessible dataset of surgical postures. The proposed CNN-DOA method achieves a classification accuracy of 97.56%. It also outperformed various conventional machine learning models, not only in accuracy but also in precision, recall, and computational efficiency. This method could help surgeons recognize and address ergonomic problems before they turn into more serious issues.
CLIP-AL: an adaptive CLIP-based active learning framework for offline signature forgery verification
V, Aravinda C.; Shaffi, Noushath; Ashwath, Soumya; MS, Sannidhan
doi: 10.1007/s00521-026-12117-9
pmid: N/A
Signature verification is a cornerstone of biometric authentication, where document authenticity and forgery prevention are crucial. This study introduces an improved signature authentication framework that combines Contrastive Language–Image Pretraining (CLIP) embeddings with an active learning loop to reliably detect forged and synthetic signatures. The CLIP model extracts robust multimodal representations of signatures, while the active learning component dynamically queries uncertain predictions for human validation, allowing iterative improvement with limited labeled data. Comprehensive experimental evaluation includes pairwise genuine–forged and genuine–synthetic verification settings, person-disjoint testing, and analysis of uncertainty-driven decision behavior. Experiments on the ICDAR 2011 dataset show notable improvements in accuracy (83.9%), recall (95.9%), and AUC (0.89), outperforming state-of-the-art baselines such as Vision Transformer (ViT), Swin Transformer, and ResNet-50, as well as a CLIP-last static baseline. Additional analyses examine the confidence and calibration characteristics of static baselines to motivate the proposed uncertainty-aware human-in-the-loop design. The proposed method reduces annotation requirements and enhances robustness against synthetic forgeries. These results highlight the promise of integrating CLIP embeddings with human-in-the-loop learning to develop scalable, explainable biometric verification systems for real-world use.
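The querying step of an active learning loop like the one described above can be illustrated with plain uncertainty sampling. This is only a minimal sketch under stated assumptions: the abstract does not say which uncertainty measure CLIP-AL uses, so the function name `select_uncertain`, the "distance from 0.5" criterion, and the example probabilities are all hypothetical.

```python
def select_uncertain(forgery_probs, k):
    """Return indices of the k samples whose predicted forgery
    probability is closest to 0.5 (the least confident predictions).

    Assumption: a binary verifier that outputs P(forged) per sample.
    These indices would be sent to a human annotator for labeling.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(range(len(forgery_probs)),
                    key=lambda i: abs(forgery_probs[i] - 0.5))
    return ranked[:k]


# Hypothetical verifier outputs for five unlabeled signatures.
probs = [0.95, 0.52, 0.10, 0.45, 0.70]
queried = select_uncertain(probs, k=2)
```

In a full loop, the queried samples would be labeled, added to the training set, and the verifier refit before the next round of querying.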
Multi-scale local-global fusion network with temporal attention for speech emotion recognition
Ma, Wenning; Jin, Nanlin
doi: 10.1007/s00521-026-12157-1
pmid: N/A
Speech Emotion Recognition (SER) is essential for affective computing, intelligent dialogue, and mental health assessment. However, the dynamic nature of speech signals poses challenges for accurately capturing both local acoustic cues and long-range emotional context. To address this, we propose a novel architecture named Multi-scale Local-Global Fusion Network (MLGFNet). MLGFNet features three core modules: (1) a Local Feature Extraction Module (LFEM), which uses Inception-style multi-branch depthwise convolutions to extract emotional patterns at different temporal resolutions; (2) a Global Key Context Focusing Module (GKCF), which uses hierarchical strip convolutions to generate frame-wise attention maps, allowing the network to highlight emotionally salient frames across the utterance; and (3) a new Attention-based Fusion Mechanism, which adaptively integrates the outputs of LFEM and GKCF. We evaluate the proposed MLGFNet on four public SER datasets: IEMOCAP, RAVDESS, SAVEE, and EMOVO. The experimental results show that MLGFNet consistently outperforms competitive baselines in terms of unweighted and weighted average recall. Ablation studies verify the effectiveness of LFEM and GKCF individually and jointly. Furthermore, t-Distributed Stochastic Neighbor Embedding (t-SNE) visualizations demonstrate that MLGFNet learns more separable and emotion-aware feature representations. These findings highlight MLGFNet’s robustness and interpretability, making it a promising solution for speech emotion recognition.
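The two metrics named in the abstract above differ in how they treat class imbalance. As a small sketch, assuming the usual SER definitions (unweighted average recall is the mean of per-class recalls, weighted average recall is the overall fraction of correct predictions), they can be computed as follows. The function name and the toy labels are illustrative and not taken from the paper.

```python
from collections import Counter


def recall_scores(y_true, y_pred):
    """Return (UAR, WAR) for two equal-length label sequences.

    UAR: mean of per-class recalls, so every class counts equally.
    WAR: recalls weighted by class support, which equals accuracy.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class = [hits[c] / support[c] for c in support]
    uar = sum(per_class) / len(per_class)
    war = sum(hits.values()) / len(y_true)
    return uar, war


# Toy imbalanced example: a model that always predicts "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
uar, war = recall_scores(y_true, y_pred)
```

On imbalanced corpora the two values can diverge sharply, which is why SER papers commonly report both.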
MP-TDN: Multi-path tumor delineation network for brain tumor segmentation using bidirectional approach
Patel, Ronak R.; Patel, Miral
doi: 10.1007/s00521-026-12250-5
pmid: N/A
Glioblastoma is a high-grade brain tumor that carries a high risk of death. Early detection of such tumors helps improve patient outcomes. MRI is one of the most widely used diagnostic tools for identifying such complex diseases, and it helps determine tumor size, location, and aggressiveness. The proposed architecture uses a hybrid bidirectional approach to share features between a prior branch and a domain branch. The domain branch focuses on the volumetric context to enhance the tumor boundary. The prior branch fuses 2D and 3D information to identify spatial information for highly affected cells. Identifying sharp tumor boundaries is challenging. Based on the intensity of all modalities, the Residual Feature Interaction Network (RFIN) focuses on the non-enhancing regions. In turn, based on the spatial information, the Domain Knowledge Interaction Network (DKIN) focuses on the whole tumor (WT). RFIN and DKIN act in a bidirectional manner for better identification of the tumor region. The proposed architecture gives remarkable results on the benchmark datasets BraTS2019 and BraTS2020. The approach is evaluated with the mean Dice similarity coefficient (DSC), which reaches 0.8184, 0.8735, and 0.8902 for the enhancing tumor (ET), whole tumor (WT), and tumor core (TC), respectively.
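For readers unfamiliar with the evaluation metric above, the Dice similarity coefficient between a predicted mask P and a reference mask T is 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flattened binary masks. It is illustrative only: the function name is hypothetical, and returning 1.0 when both masks are empty is one common convention, not something stated in the abstract.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient for two flattened binary masks.

    Each mask is a sequence of 0/1 (or False/True) voxel labels.
    Assumption: two empty masks count as a perfect match (1.0).
    """
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        return 1.0
    return 2.0 * overlap / total


# Toy 4-voxel example: one voxel of overlap, two voxels in each mask.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

In BraTS-style evaluation this would be computed once per region (ET, WT, TC) per case and then averaged across cases.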
A multimodal deep learning framework for symptom-based disease prediction and clinical decision support
Kishore, Ashish; Naruganahalli Gavirangaiah, Girish Kumar
doi: 10.1007/s00521-026-12231-8
pmid: N/A
This paper presents AyuSeva, a multimodal clinical decision support system integrating the novel D2B2C-IIFNN architecture (DenseNet, Dual Attention, Bidirectional Long Short-Term Memory, One-Dimensional Convolution with Intra-Inter Fusion Neural Network) with a confidence-gated generative conversational module for symptom-based disease diagnosis from structured and unstructured clinical inputs. The Symptom Model processes 132 clinically ordered binary features through hierarchical DenseNet encoding, dual attention saliency weighting, and BiLSTM-driven bidirectional context modelling across 41 disease categories. The NLP Model maps unstructured clinical narratives into a 256-dimensional semantic space via Embedding with SpatialDropout1D, hierarchical dual Conv1D blocks, BiLSTM, dot-product Attention, and a dual-path fusion (Flatten + GlobalAveragePooling1D) producing a 2,560-dimensional hybrid representation across 55 disease categories. Under rigorous 5-fold stratified cross-validation with fresh model instantiation per fold, the Symptom Model achieves 99.52% ± 1.08% test accuracy (macro-F1: 0.9935, ROC-AUC: 0.9999, ECE: 0.0244, Brier: 0.0224) and the NLP Model achieves 93.91% ± 0.94% (macro-F1: 0.9063, ROC-AUC: 0.9975, ECE: 0.0329, Brier: 0.0958). Probability calibration analysis, confidence threshold validation with tripartite criteria (safety, reliability, efficiency), and asymmetric severity-weighted cost modelling collectively establish the 95% confidence threshold as an empirically validated, non-arbitrary, Pareto-optimal decision boundary, achieving zero acute false-negative rate for structured inputs and perfect Critical-tier recall (1.0000) on both diagnostic pathways. To bridge diagnostic inference with patient-centric care, AyuSeva incorporates a confidence-gated generative conversational model. Through domain-constrained contextual conditioning and structured semantic segmentation, it generates high-confidence, empathetic clinical guidance addressing triage, treatment, lifestyle modification, and emergency protocols. The system leverages a memory-enabled real-time interface to facilitate interactive, personalised dialogues while enforcing strict medical safety and ethical standards. Extensive ablation studies confirm the synergistic importance of BiLSTM and dual attention mechanisms in resolving semantic ambiguity and optimising feature representation. Benchmark comparisons against state-of-the-art algorithms, including Temporal Convolutional Networks (TCN), Multilayer Perceptrons (MLP), Gated Recurrent Units (GRU), and Random Forest, validate AyuSeva’s superior architectural robustness, generalisability, and training stability. With a scalable design suitable for telemedicine and rural outreach, AyuSeva redefines healthcare artificial intelligence by merging algorithmic depth with human-centred design, setting a new standard for intelligent, ethical, and empathetic clinical assistants.
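The expected calibration error (ECE) reported in this abstract measures the gap between a model's stated confidence and its actual accuracy. The following is a rough sketch of the standard binned estimate. The abstract does not give the binning scheme, so the ten equal-width bins, the function name, and the toy inputs are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: support-weighted mean of |confidence - accuracy|.

    confidences: top-class probability for each prediction, in [0, 1].
    correct: matching booleans, True where the prediction was right.
    Assumption: equal-width bins; a confidence of 1.0 goes in the last bin.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece


# Toy case: four predictions at 0.9 confidence, three of them correct.
ece = expected_calibration_error([0.9] * 4, [True, True, True, False])
```

A low ECE is what makes a fixed confidence gate, such as the 95% threshold mentioned above, meaningful: the gate only works as intended if predicted confidence tracks observed accuracy.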
Speech emotion recognition using deep learning: from basic to complex emotions in unimodal and multimodal frameworks
Lai, Rachel Si Ting; Theng, Lau Bee; Tee, Mark Kit Tsun; Tan, Colin Choon Lin; Chua, Caslon
doi: 10.1007/s00521-026-12186-w
pmid: N/A
Speech Emotion Recognition (SER) is an advanced technology for developing intuitive and empathetic human-computer interfaces (HCI). While traditional SER systems have achieved a certain degree of success in recognising basic emotions from acted speech in closed environments, real-world applications necessitate the recognition of more complex emotions. This paper presents a systematic review of deep learning approaches in SER from 2019 to the present, following the PRISMA guidelines, with a specific focus on bridging basic and complex SER within unimodal (audio-only) and multimodal frameworks. The review analyses the landscape of emotion models, datasets, and state-of-the-art (SOTA) model architectures, including CNNs, RNNs, Transformers, and their hybrids. The results reveal that deep learning has improved performance, with hybrid models improving considerably; however, unimodal models still struggle with the subtle and often overlapping acoustic features of complex emotions. In contrast, multimodal models that leverage complementary information are consistently superior. Nevertheless, challenges remain, such as the over-reliance on a limited range of non-naturalistic datasets, the subjectivity associated with labelling complex emotions, and the failure of models to generalise to real-world variability. The review concludes with a strategic roadmap for further research on recognising complex emotions, including the efficient creation of large naturalistic datasets for future modelling, the development of more advanced techniques for multimodal fusion, and the use of available but so far unconsidered acoustic features to better model the complexity of human emotions.
Adaptive learning system based on virtual reality and reinforcement learning
Lu, Wenyi; Ren, Jianhong
doi: 10.1007/s00521-026-12218-5
pmid: N/A
Adaptive learning systems in Virtual Reality (VR) education have limited real-time adjustment capability because they underuse multimodal behavior data and lack dynamic strategies. To address this problem, this paper proposes a framework integrating VR and Deep Reinforcement Learning (DRL). A reinforcement learning (RL) model is built on the Deep Q-Network (DQN), with a high-dimensional state space constructed from the learner’s cognitive state characteristics. The action space covers the dynamic adjustment of teaching content, difficulty, and feedback, and the reward function jointly optimizes learning gain and time efficiency. The paper develops a multi-scenario virtual learning environment, integrates motion capture and eye tracking technology to collect multimodal behavior data in real time, and realizes dynamic decision-making and deployment of teaching actions through a closed-loop strategy optimization mechanism driven by multi-source data. The experimental results show that the system significantly improves the mastery of professional course knowledge: accuracy on two core knowledge points, binary modulation comparison and common-emitter amplifier circuit analysis, increases to 83.6% and 85.6%, respectively. In simple content-adjustment scenarios, the average response time of eye-movement selection is 143 milliseconds. The framework demonstrates the synergistic effect of multimodal data and real-time strategy optimization, forms a reusable adaptive learning technology paradigm, and provides theoretical support and a methodological reference for constructing a dynamic decision-making closed loop in education systems.
Trend analysis of simulated streamflows via NARX-RNN and under CMIP6 climate scenarios in the Amazon River basin
de Cássia Lobato Soares, Amanda; Blanco, Claudio; de Mendonça, Leonardo Melo; da Silva Cruz, Josias
doi: 10.1007/s00521-026-12146-4
pmid: N/A
This study aims to simulate streamflow using machine learning in a basin located in the Brazilian Amazon under two future climate scenarios from CMIP6, and to analyze the impacts of climate change on streamflow until 2100 through trend analysis. A Nonlinear Auto Regressive Recurrent Neural Network with Exogenous Inputs (NARX) was trained to project streamflow under the SSP2-4.5 and SSP5-8.5 scenarios. Precipitation projected by the Global Circulation Models (GCMs) GFDL-ESM4, FGOALS-g3, and CESM2 was used as input to the model. The maximum streamflows simulated by the NARX model in the reference period were underestimated. This underestimation was attributed to a systematic error in the precipitation projected by the GCMs, characterized by a delay in the peak. Therefore, the empirical quantile mapping (EQM) method was applied to correct the bias in the simulated streamflow using observed data from the reference period. The Mann–Kendall method was used to analyze future streamflow trends. The results show that the overall performance of the simulations was classified as good, with Kling-Gupta efficiency (KGE) values ranging from 0.73 to 0.77 in both scenarios. This performance was also reflected in the cumulative distribution function (CDF) curves of streamflow, which showed good agreement with the observed streamflow distribution. The Mann–Kendall test results for the simulated streamflow under the SSP2-4.5 scenario indicate stability in the streamflow regime. In contrast, for SSP5-8.5, a stronger signal appears mainly in GFDL-ESM4, which shows a significant decreasing trend (p = 0.0071) with Sen’s slope = −0.313, indicating possible streamflow reduction. FGOALS-g3 (p = 0.5120) and CESM2 (p = 0.1527) showed no significant trends, although their slopes suggest opposite tendencies (0.0566 and −0.1533, respectively). The multi-model ensemble (MME) also showed no significant trend (p = 0.1029), but its negative slope (−0.1488) suggests a general tendency toward decreasing streamflow.
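The trend statistics quoted above (Mann–Kendall p-values and Sen's slope) can be reproduced in principle with a few lines of code. The sketch below is a simplified textbook version, not the authors' implementation: it uses the normal approximation with continuity correction, applies no correction for tied values or autocorrelation, and runs on a made-up series rather than the study's streamflow data.

```python
import math
from statistics import median


def mann_kendall(series):
    """Two-sided Mann-Kendall trend test plus Sen's slope.

    Returns (S, Z, p_value, sen_slope) for an evenly spaced series.
    Assumptions: no tie correction and a normal approximation for Z,
    which is only rough for very short series.
    """
    n = len(series)
    if n < 3:
        raise ValueError("need at least three observations")
    s = 0
    slopes = []
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)      # sign of the pairwise change
            slopes.append(diff / (j - i))     # pairwise slope per time step
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return s, z, p_value, median(slopes)


# Hypothetical annual series with a steady decline.
s, z, p_value, slope = mann_kendall([5.0, 4.0, 3.0, 2.0, 1.0])
```

A negative Sen's slope with a small p-value corresponds to the kind of significant decreasing trend reported for GFDL-ESM4 under SSP5-8.5, while a slope paired with a large p-value, as for FGOALS-g3, indicates no significant trend.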
EATR: Emotion-aware temporal reinforcement learning for serendipitous multi-device recommendation ecosystems
Batra, Amit
doi: 10.1007/s00521-026-12221-w
pmid: N/A
In this work, we introduce Emotion-Aware Temporal Reinforcement Learning (EATR) to optimize serendipitous recommendations in multi-device environments. Existing work uses reinforcement learning to optimize recommendation timing, but it does not always consider users’ affective states or the cross-device temporal responses that influence perceived serendipity. We combine cross-device temporal feedback loops with real-time affective computing signals in a single policy learning scheme. Through hierarchical reinforcement learning, we align the timing of recommendations across devices, including smartphones, wearables, and smart TVs. Experiments on a synthetic dataset augmented with emotion signals and on realistic cross-device interaction data show that EATR improves serendipity, engagement, and long-term reward compared with the current state of the art. This study provides an exploratory contribution toward integrating affective computing, temporal decision-making, and intra-/inter-device recommendation mechanisms within a unified reinforcement learning framework.
Improving visual differentiation of drones and birds in aerial surveillance using trajectory features
Luesutthiviboon, Salil; de Croon, Guido C. H. E.; Altena, Anique; Snellen, Mirjam; Voskuijl, Mark
doi: 10.1007/s00521-026-12080-5
pmid: N/A
Detecting malicious drones using aerial surveillance cameras is challenging when the distance is large, because the drone then occupies only a few pixels. Current optical detection methods rely mostly on visual appearance features. Hence, they struggle to differentiate drones from other flying objects, especially birds, when the apparent object size is small. Fortunately, the observed trajectory over time can help improve the differentiation accuracy. Here, we propose to combine classification neural networks of the object’s trajectory features and visual appearance features. We train and test the networks using our dataset containing infrared videos of drones and birds, where the variation of drone configurations and flight patterns is larger than in other publicly available datasets. We show that, particularly for small objects with high motion, the inclusion of trajectory features for visual classification achieves up to 22% higher frame-wise classification accuracy compared to when only visual appearance features are used. We further demonstrate that integrating both feature types provides improved accuracy over all of the considered trajectories, with 4% more of the trajectories being classified correctly. Consistent results are also shown on an open dataset, confirming the generalizability. Our study demonstrates the crucial role of information beyond the frame-wise visual appearance features in extending the operational range of aerial surveillance cameras.