1 Introduction

1.1 Background

In the complex environment of modern financial markets, stock prediction has become a widely studied research field, and market dynamics and investment decisions face unprecedented challenges and opportunities. Stock prices are influenced by a multitude of factors, including the macroeconomic environment, market sentiment, and socio-economic dynamics. As research in this area has advanced, it has become widely recognized that stock market volatility is strongly influenced by social media and public sentiment. For instance, in August 2018, Elon Musk's tweet about taking Tesla private led to a 7% rise in the stock price, while another tweet stating that the price was "too high" caused a 10% drop. In January 2021, the Reddit user-led short squeeze on GameStop and AMC Entertainment drove their stock prices up to $400 (an increase of over 1900%) and $73 (an increase of 3500%), respectively, seriously affecting market stability. Research from Maryland Smith has also shown that the sentiment of a company's tweets can trigger short-term fluctuations in its stock price. With the rapid development of information technology, researchers have gradually shifted from traditional statistical analysis to more advanced machine learning and deep learning techniques [12]. However, existing prediction models face two main challenges when dealing with complex stock market data [12,13]. First, in stock relationship modeling, existing methods have difficulty capturing the complex nonlinear correlation patterns between stocks; especially during periods of severe market fluctuation, traditional models fail to accurately characterize the dynamic mutual influence between stocks, resulting in poor predictive performance [14]. Second, in anomaly detection, existing models have limited capabilities for integrating multi-source data, making it difficult to identify and respond to abnormal market fluctuations in a timely manner. Such prediction inaccuracies at critical moments not only limit model practicality but also introduce significant uncertainty and risk into investment decisions.

1.2 Literature review

In the current field of stock prediction research, an increasing number of researchers have adopted hybrid models to improve prediction accuracy and stability. Xu et al. proposed an enhanced nonlinear fusion model based on GAN that integrates ACNN, LSTM, and ARIMA models for effective stock price prediction, demonstrating its superiority in experiments [2]; however, the model still has limitations in capturing complex temporal features. Dong et al. combined the SARIMA model with the Monte Carlo method to overcome the limitations of single models, proposing a hybrid approach for stock value prediction that improves accuracy [3], but it did not consider external variables such as macroeconomic indicators, limiting the model's flexibility. Meanwhile, the application of deep learning methods to stock prediction has also been on the rise. Liu et al. introduced a deep learning model that integrates mixed-frequency data for stock volatility prediction, showing good performance on high-frequency data [5], though there remains room for improvement in capturing high-frequency data features. To address prediction challenges in complex market environments, Zhang et al.
proposed the MDF-DMC model, which combines multi-perspective stock data features with dynamic market-related information and effectively improves stock price prediction accuracy by dynamically learning correlations between stocks [10]; however, the model's stability under complex market conditions still requires further verification. Wang et al. proposed a model based on PCA and an improved GRU (IGRU), focusing on reducing redundant input information to improve training efficiency and prediction performance [4], but the method still gives limited consideration to the interrelationships between stocks. Current research trends mainly focus on hybrid models, deep learning methods, and dynamic fusion of data features to overcome the shortcomings of traditional methods in prediction accuracy and in capturing market dynamics [27]. However, as summarized in Table 1, these methods still face challenges in the comprehensiveness of data features, the integration of external factors, and model adaptability and stability, necessitating further research and improvement. The application and improvement of these methods provide new ideas and stronger interpretability for stock prediction, but to fully realize their potential in practical applications, the existing limitations must be addressed.

Table 1. Literature review table. https://doi.org/10.1371/journal.pone.0313772.t001

1.3 Our contributions

Combination of graph attention network and variational graph autoencoder: This study proposes an innovative algorithm combining the Graph Attention Network (GAT) and Variational Graph Autoencoder (VGAE). GAT is used to efficiently aggregate correlations between stocks, while VGAE deeply encodes stock features in latent space to generate more representative high-dimensional latent representations, capturing complex interactions between stocks more effectively.

Dynamic modeling with a sparse spatiotemporal convolutional network: To address anomalies in financial data, this study introduces the Sparse Spatiotemporal Convolutional Network (STCN). By dynamically modeling stock features in the temporal and spatial dimensions, STCN can efficiently detect abnormal changes in the data, and sparse regularization enhances its sensitivity and robustness to complex dynamic features.

Comparative and ablation experiments on a real dataset: We conducted comprehensive comparative and ablation experiments on a real financial dataset to validate the effectiveness of the proposed STAGE framework in capturing complex relationships between stocks and handling anomalies. The results show that the complete STAGE framework significantly outperforms simplified variants with key modules removed, in both prediction accuracy and robustness to anomalous data.
2 Methodology

2.1 Problem description

In the stock prediction problem, given a series of stock prices at different time points, the goal is to predict the stock price at a future time point. Suppose the observed time series data is given; our goal is to predict the stock price at time ρ. Let the input sequence be X and the corresponding predicted value be ŷ. The prediction process can be defined as a function f(·), such that

ŷ = f(X) + ε,   (1)

where f(·) represents the mapping of the input sequence X to the predicted value for the future time point, and ε is a noise term representing uncertainty and random disturbance in the prediction process. To improve prediction performance, a composite loss function is often used to evaluate the error between the predicted and true values:

L = (α/N) Σᵢ (yᵢ − ŷᵢ)² + ((1 − α)/N) Σᵢ |yᵢ − ŷᵢ|,   (2)

where N is the number of samples, ŷᵢ is the predicted value for the i-th sample, yᵢ is the corresponding true value, and α is a balancing parameter that controls the weight between the Mean Squared Error (MSE) and Mean Absolute Error (MAE). To describe the anomaly detection problem, suppose the dataset X contains anomalous data points. Our goal is to detect these anomalies by minimizing the following objective function:

min Σᵢ L(ŷᵢ, yᵢ) + β Σₖ Rₖ,   (3)

where L(ŷᵢ, yᵢ) is the loss between the predicted and true values, β is the regularization parameter, Rₖ is the k-th regularization term introduced to enhance the model's capability of detecting anomalies, and K is the number of regularization terms.
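To make the loss formulation concrete, the following minimal Python/PyTorch sketch implements the α-weighted MSE/MAE blend of Eq (2) and the regularized objective of Eq (3); the function names and the default values of α and β are illustrative assumptions, not values taken from the paper.

```python
import torch

def composite_loss(y_pred: torch.Tensor, y_true: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Eq (2): alpha-weighted blend of MSE and MAE over N samples."""
    mse = torch.mean((y_true - y_pred) ** 2)
    mae = torch.mean(torch.abs(y_true - y_pred))
    return alpha * mse + (1.0 - alpha) * mae

def regularized_objective(y_pred, y_true, reg_terms, beta: float = 1e-3) -> torch.Tensor:
    """Eq (3): prediction loss plus beta-weighted regularization terms that are
    meant to sharpen sensitivity to anomalous points."""
    return composite_loss(y_pred, y_true) + beta * sum(reg_terms)
```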
Problem 1. The ultimate goal of stock price prediction can be expressed as: (4) where f(·) represents the mapping of the input data features, L(·) is the loss function, the multi-objective weighting coefficients control the trade-off among objectives, the multi-dimensional socio-economic constraint functions capture external socio-economic factors, and K is the number of constraint terms; the formulation thus comprehensively considers the complex influence of various socio-economic factors on stock prices.

2.2 Interaction between stocks: combination of graph attention network and variational graph autoencoder

2.2.1 Synergistic advantages of graph networks and variational inference.

Traditional stock prediction methods often face limitations when capturing the complex interrelationships between stocks, failing to effectively model dynamic associations and nonlinear features among them [13,15,26]. These methods typically rely on fixed statistical models or single neural network architectures, making it difficult to fully reflect the potential interactions between different stocks and thereby compromising the accuracy and reliability of prediction results [16,28]. The algorithm proposed in this study combines the Graph Attention Network (GAT) with the Variational Autoencoder (VAE). Through graph structure modeling and latent space feature learning, it can effectively capture complex nonlinear relationships and dynamic interaction features among stocks. GAT is used to aggregate correlations between stocks, enabling the model to identify the influence of each stock in the overall market, while the VAE further encodes stock features to generate high-dimensional latent space representations, thereby enhancing the model's robustness to anomalies and complex relationships. Fig 1 shows the framework of the stock prediction model: stock data is fed into the GAT to learn the complex interrelationships between stocks, and the VAE then encodes the stock features to generate high-dimensional latent space representations.

Fig 1. Capturing dynamic associations between stocks. https://doi.org/10.1371/journal.pone.0318939.g001
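The paper does not state how the adjacency matrix A and node feature matrix X that enter the GAT are constructed; purely as an illustrative assumption, the sketch below connects stocks whose return correlations exceed a threshold and uses simple windowed return statistics as node features.

```python
import numpy as np

def build_stock_graph(returns: np.ndarray, threshold: float = 0.5):
    """Illustrative construction of the inputs to Algorithm 1.

    returns: array of shape (N_stocks, T_days) with daily returns.
    A[i, j] = 1 when |corr(i, j)| exceeds the threshold (undirected, no self-loops).
    X uses simple per-stock statistics as node features; both choices are
    assumptions, not the paper's prescription.
    """
    corr = np.corrcoef(returns)                      # (N, N) correlation matrix
    A = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(A, 0.0)                         # remove self-loops
    X = np.stack([returns.mean(axis=1),              # mean return
                  returns.std(axis=1),               # volatility
                  returns[:, -5:].mean(axis=1)],     # recent momentum
                 axis=1)                             # (N, 3) node feature matrix
    return A, X
```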
Algorithm 1: Stock relationship capture algorithm based on graph attention and variational autoencoder.
Require: Stock set V, edge set E, adjacency matrix A, node feature matrix X
Ensure: Final representation matrix of stock nodes Z
1: Initialize model parameters:
2:   Initialize GAT parameters: weight matrix W, attention vector a
3:   Initialize VAE parameters: mean encoder, variance encoder, decoder
4:   Initialize hyperparameters: λ, γ, η
5: Graph attention feature aggregation:
6: for each node i ∈ V do
7:   Calculate the attention weight between node i and its neighboring nodes j ∈ N(i) (refer to Eq 6)
8:   Update the node feature (refer to Eq 5)
9: end for
10: Variational autoencoder processing:
11: for each node i ∈ V do
12:   Encoding process:
13:     Calculate the mean and variance of the latent variable (refer to Eq 7)
14:     Sample the latent variable (refer to Eq 8)
15:   Decoding process:
16:     Reconstruct the node feature (refer to Eq 9)
17: end for
18: Loss function calculation and optimization:
19:   Calculate the VAE loss (refer to Eq 10)
20:   Optimize model parameters using gradient descent
21: Generate final node representation:
22:   Construct the final representation matrix Z (refer to Eq 11)
23: Generate sparse representation:
24:   Calculate the sparse representation Z_sparse (refer to Eq 13)
25: return Z_sparse

Let there be N stocks whose interrelationships are represented as an undirected graph G = (V, E), where V denotes the set of stocks and E the edges between stocks. Define the adjacency matrix of the graph as A and the node feature matrix as X, where F is the feature dimension of the nodes. Feature aggregation for each node is performed through a graph attention layer, yielding the updated feature representation for node i: (5) where N(i) is the set of neighbors of node i, α_ij is the attention weight between node i and node j, W is a learnable weight matrix, λ is a regularization parameter used to control neighborhood differences, and σ is an activation function. The attention weight is computed as: (6) where a is a learnable attention weight vector, ∥ denotes feature concatenation, and an integral term introduced via the mean value theorem is used to better capture the feature variation trend. To model the complex nonlinear relationships between stocks, the study further uses the Variational Autoencoder to encode the node features and learn latent space representations. First, the node features are mapped to the mean and variance of the latent space: (7) where the mean and variance are produced by two independent feedforward neural networks, a small perturbation is applied to the node features, and c is a constant. Based on the mean and variance, the latent variable is sampled as: (8) where ⊙ denotes element-wise multiplication and ϵ is a random noise vector following a standard normal distribution, with an additional noise term controlled by the gradient of the KL divergence loss to ensure diversity in sampling. The latent variable is then decoded to reconstruct the original node features: (9) where the decoder reconstructs node features from latent variables, ĉ is an adjustable constant that increases the nonlinearity of the reconstruction, and an additional nonlinear modulation function is applied. To train the VAE, the following loss function is used: (10) where MSE is the mean squared error loss and the Kullback-Leibler (KL) divergence measures the difference between the prior and posterior distributions of the latent variables, with additional gradient and second-derivative terms used to capture the sensitivity of the attention weights and reconstruction function to feature changes. Finally, GAT and VAE are combined and jointly trained to obtain the final representation for each stock node, which serves as the input for anomaly detection based on the Sparse Spatiotemporal Convolutional Network: (11)
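The following sketch shows how the GAT aggregation (Eqs 5-6) and the VAE encoding, reparameterization, and reconstruction steps (Eqs 7-10) fit together in their standard form; it deliberately omits the paper's additional regularization, integral, and derivative terms, and all layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATVAEBlock(nn.Module):
    """Minimal GAT aggregation followed by a VAE encoder/decoder.
    Assumes the adjacency matrix A includes self-loops so every row has
    at least one neighbor."""
    def __init__(self, in_dim: int, hid_dim: int, lat_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, hid_dim, bias=False)    # shared weight matrix W
        self.a = nn.Linear(2 * hid_dim, 1, bias=False)     # attention vector a
        self.mu = nn.Linear(hid_dim, lat_dim)              # mean encoder
        self.logvar = nn.Linear(hid_dim, lat_dim)          # variance encoder
        self.dec = nn.Linear(lat_dim, in_dim)              # decoder

    def forward(self, X: torch.Tensor, A: torch.Tensor):
        H = self.W(X)                                       # (N, hid)
        N = H.size(0)
        # pairwise scores e_ij = a([W h_i || W h_j]), masked by the adjacency
        pairs = torch.cat([H.unsqueeze(1).expand(N, N, -1),
                           H.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs)).squeeze(-1)
        e = e.masked_fill(A == 0, float('-inf'))
        alpha = torch.softmax(e, dim=-1)                    # Eq (6): attention weights
        H_agg = F.elu(alpha @ H)                            # Eq (5): neighborhood aggregation
        mu, logvar = self.mu(H_agg), self.logvar(H_agg)     # Eq (7): mean and variance
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # Eq (8): reparameterization
        X_rec = self.dec(z)                                 # Eq (9): reconstruction
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = F.mse_loss(X_rec, X) + kl                    # Eq (10): MSE + KL divergence
        return z, loss
```

A matrix Z of node representations returned by such a block would then be sparsified (Eq 13) and passed to the anomaly detection stage.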
Theorem 1. Let G = (V, E) be the stock relationship graph, and let the node representations be the final latent variable representations Z obtained through the joint training of GAT and VAE. The optimal solution of the joint loss function minimizes the reconstruction error, the KL divergence, and the regularization terms: (12) where γ is the weight coefficient of the second-order derivative regularization term, which controls the model's higher-order sensitivity to feature changes.

Corollary 1. From Theorem 1, the sparse representation of the node latent variable Z used as the input for anomaly detection satisfies the following condition, which ensures that important features are preserved during anomaly detection: (13) where η is the sparsification regularization parameter and the second-order derivative of the node representation with respect to time is used to improve the adaptability of the sparse representation to temporal dynamics.

2.3 Anomaly detection: dynamic modeling with sparse spatiotemporal convolutional network

2.3.1 Efficient modeling of sparsity and spatiotemporal features.

Traditional anomaly detection methods in stock prediction often struggle to capture the complex dependencies between spatiotemporal features and lack a deep perception of the dynamic changes and anomalous characteristics in stock data [17,18]; they are often unable to accurately identify potential anomalies in complex financial market environments [19]. The anomaly detection method based on the Sparse Spatiotemporal Convolutional Network (STCN) can effectively capture the dynamic changes of stock features in both the spatial and temporal dimensions. STCN comprehensively models the spatiotemporal dependencies of the stock latent variables and, through regularization, enhances sparsity and sensitivity to feature changes, better handling the complexity and volatility of financial data. Fig 2 illustrates how causal relationship-based stock data is used for anomaly detection through sparse graph structure and temporal feature construction: the causal graph data is transformed into a sparse graph structure, and temporal feature construction captures the dynamic changes of stock features in the spatial and temporal dimensions.

Fig 2. Anomaly detection in stock data. https://doi.org/10.1371/journal.pone.0318939.g002

2.3.2 Algorithm 2: Anomaly detection based on sparse spatiotemporal convolutional network.

Let the node representations obtained in the previous section be Z, where N is the number of stocks and each node representation has a fixed latent feature dimension. The node representation Z is preprocessed by the input layer of the STCN to construct the input tensor, where T denotes the time dimension, i.e., the sequence of node features over multiple time steps. For this input tensor, define a sparse convolutional kernel with separate kernel sizes for the spatial, feature, and temporal dimensions. The convolution operation can be expressed as: (14) where the output feature map is produced by the convolution, a bias term is added, and a weighting function adjusts the interaction between features, with an additional integral term to capture subtle changes during convolution. To enhance sparsity, an activation function ϕ(·) and regularization are applied to the convolution result: (15) where an additional second-order derivative regularization term is used to enhance the model's sparsity and capture the impact of input feature changes on the convolution result.
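As a rough illustration of Eqs (14)-(15), the sketch below applies a 3D convolution over the spatial, feature, and temporal dimensions and adds an L1 penalty on the activations to encourage sparse responses; the kernel size, penalty weight, and the use of torch.nn.Conv3d are assumptions, and the paper's integral and second-order derivative terms are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseSTConv(nn.Module):
    """Sketch of the sparse spatiotemporal convolution of Eqs (14)-(15):
    a 3D convolution over (node, feature, time) followed by an activation and
    an L1 penalty that encourages sparse responses."""
    def __init__(self, kernel=(3, 3, 3), l1_weight: float = 1e-4):
        super().__init__()
        self.conv = nn.Conv3d(1, 1, kernel_size=kernel,
                              padding=tuple(k // 2 for k in kernel))
        self.l1_weight = l1_weight

    def forward(self, x: torch.Tensor):
        # x: (batch, 1, N, F, T) — nodes, latent features, time steps
        h = F.relu(self.conv(x))                       # Eq (15): activation of the convolution output
        sparsity_penalty = self.l1_weight * h.abs().mean()  # L1 term promoting sparsity
        return h, sparsity_penalty
```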
A pooling operation is then applied to the convolved features to reduce the size of the feature map. Let the pooling kernel span the spatial, feature, and temporal dimensions; the pooling operation is defined as: (16) where a temporal smoothing function associated with the pooling operation adjusts the smoothness of the pooled result through an integral term. To capture spatiotemporal dependencies between features, a residual connection module is defined for the STCN, which improves gradient propagation through residual connections: (17) where the output of the residual module includes an additional second-order derivative term used to capture the complex nonlinear relationships between features. To detect anomalies, a self-attention mechanism is introduced to compute the importance weight of each node feature over the entire time series. The attention weights are calculated as: (18) where θ(·) and ϕ(·), together with a third independent feedforward network, compute the similarity between node features, and an integral term captures the variation trend of the node features. Based on the attention weights, the node features are weighted and summed to obtain the final anomaly score: (19) where additional second-order and third-order derivative regularization terms control the smoothness and nonlinear variation of the score. To determine the anomaly threshold, a threshold formula based on a normal distribution assumption is introduced, with the mean and standard deviation of the anomaly scores S denoted as μ_S and σ_S, respectively: (20) where δ is an adjustment parameter and the additional integral term captures the tail characteristics of the score distribution, enabling more accurate threshold setting in the offline detection scenario. This formulation leverages the availability of the complete historical data to establish a robust anomaly threshold. By comparing each node's anomaly score with the threshold τ, it is determined whether the node is an anomaly: (21) where the anomaly indicator of node i is obtained, with an additional integral term for dynamic threshold adjustment.
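A minimal sketch of the scoring and thresholding steps (Eqs 18-21), assuming a single-query self-attention over time and the normality-based threshold τ = μ_S + δ·σ_S; the higher-order derivative and integral terms are omitted, and the default δ is an illustrative assumption.

```python
import torch

def anomaly_scores(features: torch.Tensor) -> torch.Tensor:
    """Simplified self-attention scoring over time (Eqs 18-19).
    features: tensor of shape (N, T, F) of node features over time steps."""
    q = features.mean(dim=1, keepdim=True)                     # (N, 1, F) per-node query
    attn = torch.softmax((q @ features.transpose(1, 2))
                         / features.size(-1) ** 0.5, dim=-1)   # (N, 1, T) attention weights
    weighted = (attn @ features).squeeze(1)                    # (N, F) weighted summary
    return weighted.norm(dim=-1)                               # scalar anomaly score per node

def flag_anomalies(scores: torch.Tensor, delta: float = 2.0) -> torch.Tensor:
    """Eqs (20)-(21): tau = mean + delta * std under the normality assumption."""
    tau = scores.mean() + delta * scores.std()
    return scores > tau                                        # boolean anomaly indicator per node
```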
Theorem 2. Given the input tensor, the anomaly score obtained after the STCN convolution, pooling, self-attention, and residual connections satisfies the following optimality condition: (22) where the anomaly detection loss function includes regularization, second-order derivative regularization, and an integral term to enhance the model's sparsity, temporal smoothness, and anomaly detection accuracy.

Corollary 2. From Theorem 2, the final anomaly indicator of a node satisfies the following condition, which determines whether the node is an anomaly: (23) where μ_S and σ_S are the mean and standard deviation of the anomaly scores, δ is a parameter that adjusts sensitivity, and the additional integral term better captures the variation of the anomaly scores.

2.4 Complete algorithm: STAGE (Spatiotemporal attention graph embedding) framework

Time complexity analysis: The sparse convolution operation has a time complexity of O(N × T), where N is the number of nodes and T is the number of time steps. The activation and regularization steps traverse each node and its feature dimensions, giving a time complexity of O(N × T × F), where F is the feature dimension of the nodes. The pooling and residual connection operations each have a time complexity of O(N × T); pooling reduces the feature map dimensions to improve computational efficiency, and residual connections accelerate gradient propagation by skipping certain layers. The self-attention mechanism has a time complexity of O(N × T), since it only weights and sums the features of each node over the time steps, and the weight computation can be vectorized for efficient parallelization. The anomaly score calculation and anomaly detection step have a time complexity of O(N). Therefore, the overall time complexity of the framework is O(N × T × F).

Space complexity analysis: Storing the input tensor requires O(N × T × F) space, since all feature values of all nodes must be kept across all time steps. Only the intermediate results of the current time step need to be stored, with earlier results released, which keeps the working memory at O(N × F). Thus, the overall space complexity is dominated by the input tensor and is O(N × T × F).

The STAGE framework implements offline anomaly detection, in which the complete historical dataset is analyzed to capture comprehensive spatiotemporal dependencies and market patterns. This design enables accurate anomaly-threshold determination through global context analysis. For online detection scenarios, the framework can be adapted using sliding-window processing and local-statistics-based threshold calculations.
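For the online adaptation mentioned above, a sliding-window variant of the threshold rule can be sketched as follows; the window length and δ are illustrative assumptions.

```python
from collections import deque
import statistics

class SlidingThreshold:
    """Sketch of the online adaptation described in Sec 2.4: anomaly scores are
    kept in a sliding window and the threshold is recomputed from local statistics."""
    def __init__(self, window: int = 250, delta: float = 2.0):
        self.scores = deque(maxlen=window)
        self.delta = delta

    def update(self, score: float) -> bool:
        """Add a new score and report whether it exceeds the local threshold."""
        is_anomaly = False
        if len(self.scores) >= 2:
            mu = statistics.mean(self.scores)
            sigma = statistics.stdev(self.scores)
            is_anomaly = score > mu + self.delta * sigma
        self.scores.append(score)
        return is_anomaly
```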
Theorem 2. Let the input tensor be defined as above; the anomaly score obtained after the STCN, pooling, self-attention, and residual connections satisfies the following optimal condition: (22) where the anomaly detection loss function includes regularization, second-order derivative regularization, and an integral term to enhance the model's sparsity, temporal smoothness, and anomaly detection accuracy.

Corollary 2. From Theorem 2, the final anomaly indicator of a node satisfies the following condition to determine whether the node is an anomaly: (23) where the mean and standard deviation of the anomaly scores and the sensitivity parameter δ define the threshold, and the additional integral term is used to better capture the variation of the anomaly scores.

2.4 Complete algorithm: STAGE (Spatiotemporal attention graph embedding) framework

Time complexity analysis: The time complexity of the sparse convolution operation is O(N × T), where N is the number of nodes and T is the number of time steps. The activation and regularization processes traverse each node and its feature dimensions, so their cost additionally scales with the node feature dimension. The pooling and residual connection operations both have a time complexity of O(N × T): pooling reduces the feature map dimensions and improves computational efficiency, while residual connections accelerate gradient propagation by skipping certain layers. The self-attention mechanism also has a time complexity of O(N × T), since it only needs to weight and sum the features of each node within the time steps, and the weight computation can be vectorized for efficient parallelization. The anomaly score calculation and anomaly detection steps have a time complexity of O(N). The overall time complexity of the framework therefore grows linearly with the number of nodes and time steps, and with the node feature dimension for the activation and regularization steps.

Space complexity analysis: The space complexity of the input tensor corresponds to storing all feature values for all nodes across all time steps. Only the intermediate results for the current time step need to be kept, with previous computation results released, so the overall space complexity is bounded by the size of the input tensor. The STAGE framework implements offline anomaly detection, in which the complete historical dataset is analyzed to capture comprehensive spatiotemporal dependencies and market patterns. This design enables accurate anomaly threshold determination through global context analysis. For online detection scenarios, the framework can be adapted using sliding-window processing and local, statistics-based threshold calculations.

3 Experiments and results

3.1 Experimental parameters and dataset description

Dataset description: This study uses the publicly available dataset "Daily News for Stock Market Prediction" from the Kaggle platform. The dataset integrates news data from the Reddit WorldNews channel with stock data from the Dow Jones Industrial Average (DJIA). It is provided in CSV format and contains three files: RedditNews.csv, DJIA_table.csv, and Combined_News_DJIA.csv. The primary research file is Combined_News_DJIA.csv, which contains 27 columns: the date, the stock movement label, and the 25 daily top news headlines ranked by popularity. The study used a hardware platform equipped with an NVIDIA RTX 3090 graphics card (24 GB VRAM), an Intel Xeon processor (16 cores), and 128 GB RAM to ensure efficient model training and inference; a 2 TB SSD was used to store datasets and model checkpoints. The operating system was Windows 10, which provides good support and compatibility for deep learning frameworks. The specific parameters used in model training are detailed in Table 2.

Table 2. Detailed model parameter table. https://doi.org/10.1371/journal.pone.0313772.t002
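For readers who wish to reproduce the data preparation, the following is an illustrative sketch of loading the files described above with pandas. The column names (Date, Label, Top1–Top25) follow the public Kaggle release of the dataset, the file paths are placeholders, and the label convention (1 for a non-negative DJIA move) is taken from the dataset's documentation rather than from this paper.

# Illustrative loading of the Kaggle "Daily News for Stock Market Prediction" files.
import pandas as pd

combined = pd.read_csv("Combined_News_DJIA.csv", parse_dates=["Date"])
djia = pd.read_csv("DJIA_table.csv", parse_dates=["Date"])

headline_cols = [f"Top{i}" for i in range(1, 26)]   # the 25 ranked daily headlines
# Concatenate the 25 headlines of each day into a single text field.
X_text = combined[headline_cols].fillna("").astype(str).apply(" ".join, axis=1)
y = combined["Label"]                               # 1 if the DJIA did not fall, else 0

print(combined.shape)                               # (number of trading days, 27 columns)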
3.2 Experimental results

3.2.1 Experimental results without anomaly detection. In this experiment, the performance of the baseline LSTM model, the complete STAGE framework, and the STAGE framework with key algorithms removed was compared, as shown in Fig 3. The accuracy of the baseline LSTM model stagnated at 55.1% after 20 training epochs, failing to capture deep features. The STAGE framework without Algorithm 1 (the combination of the Graph Attention Network and the Variational Autoencoder) achieved an accuracy of 74.1% but showed significant fluctuations during the early stages of training (epochs 5 to 10), highlighting the importance of Algorithm 1 in handling complex relationships and maintaining stability. The STAGE framework without Algorithm 2 (dynamic modeling with the Sparse Spatiotemporal Convolutional Network) achieved an accuracy of 76.2%, slightly better than the variant without Algorithm 1, indicating that while the contribution of Algorithm 2 to anomaly detection is smaller, it is still indispensable. The complete STAGE framework, combining both algorithms, stabilized at an accuracy of 85% after 20 epochs and showed rapid and consistent improvement in the early stages.

Fig 3. Comparison of model accuracy performance without anomaly detection. https://doi.org/10.1371/journal.pone.0318939.g003

Fig 4 compares the loss trends of the four models in this experiment. Loss is an important metric for evaluating model performance, reflecting the error level and convergence speed during training. The complete STAGE framework showed the best performance in both the rate of loss reduction and the final loss value, reaching 0.11 after 20 epochs and demonstrating efficient feature learning and good generalization capability. The baseline LSTM model showed a gradual decline in loss but converged slowly, with a final value of 0.43. The STAGE framework without Algorithm 1 stabilized at a loss of 0.34, indicating the crucial role of Algorithm 1 in capturing complex relationships. The STAGE framework without Algorithm 2 had a loss of 0.20, failing to reach the performance of the complete framework and highlighting the importance of Algorithm 2 in anomaly detection.

Fig 4. Comparison of model loss performance without anomaly detection. https://doi.org/10.1371/journal.pone.0318939.g004

Because evaluating model performance within windowed time frames helps capture the temporal dynamics of the data and validates model stability and robustness, Fig 5 compares the performance of the four models across three time windows (Window 1, Window 2, Window 3) on five key metrics: accuracy, precision, recall, specificity, and loss.

Fig 5. Performance of models in different windows (without anomaly detection). https://doi.org/10.1371/journal.pone.0318939.g005

Specifically, the STAGE framework achieved an accuracy of 85% in Window 1, and 82% and 80% in Window 2 and Window 3, respectively; despite this slight decline, the accuracy remained at a high level.
The STAGE framework also performed well in terms of loss, with values of 0.11 in Window 1 and 0.15 and 0.18 in Window 2 and Window 3, respectively, still much lower than those of the other models. In contrast, the baseline LSTM model performed poorly, with an accuracy of only 55% in Window 1 that further decreased to 53% and 51% in Window 2 and Window 3; its loss values were 0.43, 0.45, and 0.48 in the three windows, all significantly higher than those of the STAGE framework, indicating that the baseline model had a large prediction error and could not adapt to the dynamic changes in the data. Meanwhile, the model without Algorithm 1 achieved an accuracy of 74% in Window 1 with a loss of 0.34, while the model without Algorithm 2 had an accuracy of 76% and a loss of 0.20 in Window 1.

3.2.2 Experimental results with anomaly detection. This experiment compared the accuracy of the baseline LSTM model, the complete STAGE framework, and the STAGE framework with either Algorithm 1 or Algorithm 2 removed, as shown in Fig 6. The complete STAGE framework quickly improved to 95% after the 9th epoch and remained stable, demonstrating its significant advantages in handling complex tasks. The STAGE framework without Algorithm 2 ultimately reached an accuracy of 85.2% but showed deficiencies in handling anomalous data. The framework without Algorithm 1 performed worse, with a final accuracy of 72.5%. In contrast, the baseline LSTM model achieved an accuracy of only 63.4%, significantly lower than the other models.

Fig 6. Comparison of model accuracy performance with anomaly detection. https://doi.org/10.1371/journal.pone.0318939.g006

To further verify the impact of anomaly detection on model performance, we compared the loss of four model configurations: the baseline LSTM model, the STAGE framework without Algorithm 1, the STAGE framework without Algorithm 2, and the complete STAGE framework. The results are shown in Fig 7. The complete STAGE framework significantly outperformed the other configurations in both convergence speed and final performance: in the early training phase (epochs 1–5) its loss dropped sharply from an initial value of 0.83 to 0.19, eventually reaching a minimum of 0.008. The STAGE framework without Algorithm 2 performed second best, with a final loss of 0.05; the framework without Algorithm 1 performed slightly worse, with a final loss of 0.1. The baseline LSTM model showed the worst performance, with the loss only dropping to 0.25.

Fig 7. Comparison of model loss performance with anomaly detection. https://doi.org/10.1371/journal.pone.0318939.g007

Based on the results in Fig 8, we analyzed the models' performance across different time windows. The complete STAGE framework performed well across all time windows, with accuracy remaining between 93% and 95% and very low loss values (0.008 to 0.015), demonstrating good stability and generalization capability. In contrast, the baseline LSTM model performed the worst, with accuracy between 59% and 63%, high loss values (0.25 to 0.29), and evaluation metrics significantly lower than those of the other models. The ablation study showed that removing Algorithm 1 caused a significant drop in accuracy, to 68%–72%, highlighting its importance to model performance; the impact of removing Algorithm 2 was smaller, but accuracy still decreased to 81%–85%.

Fig 8. Performance of models in different windows (with anomaly detection). https://doi.org/10.1371/journal.pone.0318939.g008
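The windowed results in Figs 5 and 8 can be reproduced with a simple protocol: order the test set chronologically, split it into consecutive windows, and compute each metric per window. The sketch below, with an assumed helper name and three windows as in the figures, illustrates only the accuracy computation.

# Sketch of the windowed evaluation protocol assumed for Figs 5 and 8.
import numpy as np

def windowed_accuracy(y_true: np.ndarray, y_pred: np.ndarray, n_windows: int = 3):
    """Return per-window accuracy over a chronologically ordered test set."""
    accs = []
    for window in np.array_split(np.arange(len(y_true)), n_windows):
        accs.append(float((y_true[window] == y_pred[window]).mean()))
    return accs

y_true = np.random.randint(0, 2, size=300)   # placeholder labels
y_pred = np.random.randint(0, 2, size=300)   # placeholder predictions
print(windowed_accuracy(y_true, y_pred))     # [acc_window1, acc_window2, acc_window3]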
3.2.3 Comparison with state-of-the-art studies. The STAGE framework demonstrated advantages in model performance from multiple perspectives. First, as shown in Table 3, compared to traditional deep learning models, including the RNN-LSTM model in [20] (accuracy 89.2%, precision 85.1%, recall 87.3%) and the baseline LSTM model in [21] (accuracy 88.4%, precision 84.7%, recall 86.2%), the STAGE framework achieved better results. The LSTM model used in [22] achieved an accuracy of 92.1%, while the study combining LSTM and ARIMA in [23] reached 93.0%. These results indicate that, although existing models already perform well in terms of accuracy, the STAGE framework achieves higher prediction accuracy by integrating multiple advanced techniques.

Table 3. Comparison of model performance with other studies. https://doi.org/10.1371/journal.pone.0313772.t003

Furthermore, we conducted additional comparisons with recent state-of-the-art hybrid models, as shown in Table 4. The SMP-DL framework [24] combines LSTM with BiGRU and achieves good performance in terms of RMSE and MAE. Similarly, the DLEF-SM approach [25] demonstrates impressive accuracy across different market conditions through its integration of deep reinforcement learning with artificial neural networks. The STAGE framework shows competitive performance against these recent advances, particularly in prediction accuracy and error metrics, while maintaining better computational efficiency through its staged learning process.

Table 4. Comparison with recent state-of-the-art hybrid models. https://doi.org/10.1371/journal.pone.0313772.t004

Compared with the recent hybrid models in Table 4, the STAGE framework shows both strengths and trade-offs. While DLEF-SM [25] achieves a higher accuracy of 98.23% compared to our 95.3%, our framework shows better stability in the error metrics, with a lower RMSE (0.2534 vs 0.2657) and MAE (0.1865 vs 0.1986). This indicates that, although DLEF-SM may have better classification performance, our model produces more stable and consistent predictions with smaller average errors. Moreover, our framework achieves the highest R² value (0.9972), suggesting better explanatory power for price movements. Compared to SMP-DL [24], our model shows improvements across all metrics, with relative improvements of 12.1% in RMSE and 11.1% in MAE.

To further validate the practical value and robustness of the framework, additional financial performance analysis and statistical stability tests were conducted, as shown in Table 5. From a financial perspective, the STAGE model achieves a Sharpe ratio of 1.78 and a maximum drawdown (MDD) of –16.5%, outperforming both SMP-DL (Sharpe: 1.45, MDD: –21.3%) and DLEF-SM (Sharpe: 1.62, MDD: –18.7%). These improvements in risk-adjusted returns and downside protection demonstrate the model's practical value for real-world trading applications.

Table 5. Financial performance and statistical stability analysis. https://doi.org/10.1371/journal.pone.0313772.t005
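For reference, the two financial metrics reported in Table 5 can be computed from a strategy's daily return series as sketched below. The annualization factor of 252 trading days and the zero risk-free rate are assumptions for illustration; the paper does not state its exact conventions.

# Sketch of the financial metrics reported in Table 5 (conventions assumed, not the paper's).
import numpy as np

def sharpe_ratio(daily_returns: np.ndarray, risk_free: float = 0.0) -> float:
    # Annualized Sharpe ratio assuming 252 trading days and a constant risk-free rate.
    excess = daily_returns - risk_free
    return float(np.sqrt(252) * excess.mean() / excess.std())

def max_drawdown(daily_returns: np.ndarray) -> float:
    # Largest peak-to-trough decline of the cumulative equity curve (a negative number).
    equity = np.cumprod(1.0 + daily_returns)
    running_peak = np.maximum.accumulate(equity)
    return float((equity / running_peak - 1.0).min())

returns = np.random.normal(0.0005, 0.01, size=750)   # placeholder daily strategy returns
print(sharpe_ratio(returns), max_drawdown(returns))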
The robustness analysis based on bootstrap samples reveals strong statistical stability in the STAGE framework's performance. Its standard deviations of accuracy (1.56%) and RMSE (0.0184) are notably lower than those of both comparison models, with SMP-DL showing the highest variation (accuracy std: 2.24%, RMSE std: 0.0248) and DLEF-SM displaying intermediate stability (accuracy std: 1.89%, RMSE std: 0.0215). While DLEF-SM shows impressive accuracy, its DRL-ANN architecture may be more susceptible to overfitting on specific market patterns, as evidenced by its higher performance variation.

3.3 Discussion

The STAGE framework proposed in this study, based on the Graph Attention Network (GAT), the Variational Autoencoder (VAE), and the Sparse Spatiotemporal Convolutional Network (STCN), demonstrated significant advantages in addressing complex relationships and anomalies in the stock market. Based on the experimental results, three points are discussed.

Importance of capturing complex relationships between stocks: The experimental results indicate that models without GAT exhibit clear deficiencies in capturing the complex dynamic interactions between stocks, resulting in a drop in prediction accuracy of nearly 15 percentage points, from 95% to 80%. In contrast, the complete STAGE framework models the relationships between stocks using GAT, giving the model a stronger awareness of the associations between different stocks in the market.

Role of the sparse spatiotemporal convolutional network in anomaly detection: The STCN demonstrated significant advantages in anomaly detection. Experimental data showed that removing the STCN increased the loss on anomalous data by approximately 25%, from 0.35 to 0.44, further validating the importance of the STCN in improving anomaly detection accuracy. Moreover, through its regularization strategies, the STCN significantly enhanced the model's sensitivity to anomalies, making the complete framework more stable when dealing with anomalous data.

Limitations of the STAGE framework and future applications: Although the STAGE framework demonstrated strong robustness and adaptability in the stock prediction tasks, with rapidly converging loss values and significantly improved prediction accuracy, it still has limitations in handling more diverse types of financial data and more complex market environments. Future research will focus on further optimizing the model structure so that the STAGE framework maintains stable prediction performance in broader financial scenarios.
4 Conclusion

This study proposed the STAGE framework, which combines the Graph Attention Network (GAT), the Variational Autoencoder (VAE), and the Sparse Spatiotemporal Convolutional Network (STCN) to improve the accuracy of stock prediction and the robustness of anomaly detection. The experimental results demonstrated that the GAT and STCN components play key roles in capturing complex relationships between stocks and in handling anomalous data, significantly improving model performance. Compared to the baseline LSTM model, the complete STAGE framework performed better across multiple metrics, including accuracy and loss convergence speed, and showed particularly strong robustness and learning capability when dealing with anomalies in the stock market. Future research will further optimize the structure of the STAGE framework to extend its advantages to more diverse financial scenarios.

Appendix: Mathematical theorems and corollary proofs

Theorem 1. Let G = (V, E) be the stock relationship graph, and let the node representations be the final latent variable representations Z obtained through the joint training of GAT and VAE. The optimal solution to the joint loss function minimizes the reconstruction error, the KL divergence, and the regularization terms.

Proof 1: We first define the VAE loss function, which includes the reconstruction error, KL divergence, and regularization terms. To better capture the high-order nonlinear characteristics of the model, we introduce an additional regularization term in the VAE loss function; combining the original loss function with this additional term gives the new loss function. To ensure the sensitivity of the reconstructed node features to the input, we introduce a gradient constraint on the decoder's output, and the total loss function is updated accordingly. To further improve the robustness of the model, we consider the mutual influence between nodes and introduce a neighborhood difference regularization, which yields the final optimization objective. Through the above steps, Theorem 1 is proved: the optimal solution minimizes the combination of reconstruction error, KL divergence, regularization terms, and neighborhood difference regularization.

Corollary 1. From Theorem 1, the sparse representation of the node latent variable Z used as the input for anomaly detection satisfies a condition that ensures the preservation of important features during anomaly detection.

Proof. To derive the sparse representation, we start from the latent variable representation Z obtained through the joint training of GAT and VAE. To enhance the robustness and sparsity of Z, we apply several transformations and regularizations. First, we introduce a nonlinear transformation to capture temporal dynamics more effectively.
Let g(Z, t) be a nonlinear function that captures temporal variations, integrated over the interval [0, 1]. Next, to ensure the sparsity of the representation, we add a logistic sparsification term, which emphasizes the most relevant features while reducing the impact of less important ones. In addition, we incorporate a second-order temporal derivative term that penalizes rapid changes in the latent representation, thereby promoting smoothness and ensuring that the representation adapts well to temporal dynamics; to further enhance this adaptability, we introduce an additional regularization term based on the third-order temporal derivative. Moreover, we add a neighborhood interaction term to account for the influence of neighboring nodes and to maintain consistency between neighboring nodes in the graph, and we introduce a fourth-order interaction term between neighboring nodes to capture more complex dependencies. Additionally, we incorporate a cross-term regularization to capture interactions between different features within the same node and a temporal cross-derivative term to account for the interaction between temporal changes and feature changes. Finally, we include a higher-order neighborhood smoothing term to ensure that the representation is robust to minor variations in neighboring nodes. The final sparse representation is obtained by combining all the above terms. This formulation ensures that the sparse representation preserves important features, captures temporal dynamics, maintains smoothness, incorporates neighborhood information, and captures complex feature interactions, making it well suited for anomaly detection. □

Theorem 2. Let the input tensor be defined as above; the anomaly score obtained after the STCN, pooling, self-attention, and residual connections satisfies the following optimal condition: (24) where the anomaly detection loss function includes regularization, second-order derivative regularization, and an integral term to enhance the model's sparsity, temporal smoothness, and anomaly detection accuracy.

To prove this theorem, we start from the convolution output and the activation function. After applying the activation function to the convolution output, regularization and a second-order derivative regularization term are included to control the model's sparsity and to capture the influence of the input features on the convolution results. Next, the pooling operation is defined, with an integral term added to capture the temporal smoothness of the pooling results. The residual connection module is then defined, with a second-order derivative term added to capture complex nonlinear relationships between features. The self-attention mechanism computes the importance weight of each node feature over the entire time series, with an integral term capturing the variation trend of the node features. Based on the attention weights, the final anomaly score is calculated, with second-order and third-order derivative regularization terms added to control the smoothness and nonlinear variation of the score.
Finally, the anomaly detection loss function incorporates regularization, a second-order derivative term, and an integral term to enhance the model's sparsity, temporal smoothness, and anomaly detection accuracy, with an additional integral term included to capture the tail characteristics of the score distribution and allow a more accurate threshold setting for anomaly detection.

Corollary 2. From Theorem 2, the final anomaly indicator of a node satisfies the following condition to determine whether the node is an anomaly: (25) where the mean and standard deviation of the anomaly scores and the sensitivity parameter δ define the threshold, and the additional integral term is used to better capture the variation of the anomaly scores.

To prove this corollary, we first consider the distribution of the anomaly scores, which are assumed to follow a normal distribution with the stated mean and standard deviation. The threshold is defined from these statistics, with the additional integral term capturing the tail characteristics of the score distribution and improving the accuracy of the threshold setting. The anomaly indicator is then determined by comparing each node's anomaly score with the threshold τ, with the integral term used for dynamic adjustment of the threshold to accommodate changes in the anomaly scores. To clarify the definition of the anomaly indicator, the following supplemental quantities are introduced: an adjusted mean, which better captures the tail characteristics of the anomaly scores; an adjusted standard deviation, which increases the model's sensitivity to anomalies; a final threshold calculated from the adjusted mean and standard deviation; and the resulting condition determining whether node i is an anomaly. Through this process, we derive the final expression for the anomaly indicator, which includes several adjustment terms to improve the accuracy and robustness of anomaly detection.

TI - STAGE framework: A stock dynamic anomaly detection and trend prediction model based on graph attention network and sparse spatiotemporal convolutional network
JF - PLoS ONE
DO - 10.1371/journal.pone.0318939
DA - 2025-03-17
UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/stage-framework-a-stock-dynamic-anomaly-detection-and-trend-prediction-5PLSkJcgYJ
SP - e0318939
VL - 20
IS - 3
DP - DeepDyve
ER -