Introduction

Rock type identification has long been an important research topic in geology and resource extraction. Accurate identification of rock information provides critical data support for geologic analysis, resource assessment, and extraction strategies [1–3]; it not only helps to predict subsurface structure [4] and resource distribution [5] but also improves the efficiency of exploration and development while reducing risk and cost. Application areas related to rock type identification include resource assessment, stratigraphic analysis, drilling strategy development [6–8], reservoir characterization [9], mining technology selection, and environmental protection [10–12]. Accurately identifying and analyzing rock structural composition is therefore crucial for developing oil and gas strategies that optimize resource production.

Many data tools exist for rock structural analysis, spanning traditional physical measurements and image-based processing, such as scanning electron microscopy [13], computed tomography (CT) [14], infrared remote sensing [15], numerical simulation [16], electrical impedance [17], rock images [18], and acoustic emission (AE) signals [19]. Researchers have conducted a series of studies combining these tools. Shehata et al. used an artificial neural network (ANN) with core and borehole image data and logging data to identify five rock phases and clarified the specific parameters that contribute to rock phase identification [20]. Liu et al. proposed a novel hybrid attention semantic segmentation (HASS) network for semantic segmentation of Martian terrain rocks to support Mars rover route planning and autonomous navigation [21]. Such image-based methods, however, may be limited for structures deep in the earth. As AE measurement instruments have become increasingly sophisticated, scholars have begun to use AE to explore areas that are visually unreachable [22, 23]. AE refers to the transient elastic waves generated by the rapid release of localized energy in a material or structure. For instance, Lee et al. proposed an AE processing method based on the Hilbert-Huang transform (HHT); the processed AE can be combined with signal parameters (AESP) to jointly identify rock fracture deformation stages [24]. Zhang et al. used the short-time Fourier transform (STFT) to obtain the peak frequency of AE signals and applied the fuzzy C-means method to classify four types of AE signals [25]. These early AE processing approaches are in urgent need of innovation: their susceptibility to noise interference, limited sample capacity, low accuracy, slow speed, and high operating costs make them difficult to sustain. Rock structure identification based on traditional AE analysis still dominates, but such methods are gradually becoming auxiliary and clearly cannot meet the growing engineering demand for high efficiency and high precision. At present, engineering fields in general face a massive influx of sensor data from job platforms, which increases the demand for intelligent signal processing [26, 27].
Certain fundamental artificial intelligence algorithms are steadily gaining popularity, including support vector machines (SVMs) [28], convolutional neural networks (CNNs) [29], and generative adversarial networks (GANs) [30]. Intelligent methods greatly improve recognition efficiency and accuracy while saving time and labor. For example, Byun et al. proposed a method that employs a deep CNN to segment rock crack images, achieving effective recognition by distinguishing cracks from various interference features [31]. However, AE is an unstructured continuous time series [32], and rock fracture identification requires determining more parameters than images alone, resulting in multimodal data [33]. Recognizing rock activity processes from AE signals entails analyzing the intricate features of material movement. Summarizing the transition from traditional to intelligent methods above, three challenges require attention: (1) the risk of overfitting when dealing with strongly noisy data; (2) recognition accuracy and model robustness; and (3) experimental uncertainty leading to unbalanced sample sizes across categories and scarce labeled data.

Given these considerations, we propose a novel method for recognizing rock fracturing AE signals using a combination of CNNs. Much research currently surrounds the transformer, but most of it focuses on machine language [34, 35] or visual images [36, 37]; no existing method applies CNNs combined with a transformer encoder to rock fracturing AE recognition. To address these challenges and improve the identification of rock fracture types, this study introduces a computational model called 3CTNet. To tackle the overfitting and sample imbalance problems associated with noisy data, the raw data are first wavelet denoised [27]. To address sample imbalance, five data oversampling methods were compared: random oversampling [29], ADASYN [30], the synthetic minority oversampling technique (SMOTE) [31], SVM-SMOTE, and Kmeans-SMOTE [32]. The AE signals are then processed with ADASYN oversampling for the first time, enhancing both the internal features of the data and the distinguishing features between classes. To enhance robustness, we combined CNNs with a transformer encoder and reconstructed the position encoding module; this combination allows the model to extract local features and capture global connections within the data. Additionally, the model employs Dropout, which randomly drops some neurons during training to prevent overfitting. To validate the new model, we carried out ablation experiments within the model and comparison experiments between models, and we quantitatively analyzed the effects of noise and sample-size changes on robustness; the model's recognition accuracy reaches 98.78%, with good robustness to changes in noise and sample size.

The remainder of this article is organized as follows. Proposed model details the CNN and transformer encoder components of the model as well as the overall architecture and its parameters. Method describes the data processing and the rock mechanics analysis based on AE.
Experiments describes the overall implementation process, the model comparison experiments, and the model evaluation indexes. Results and discussion reviews the experiments and analyzes and discusses the results, and Conclusion summarizes the overall study.

Proposed model

This section introduces the modules of the proposed AE processing model. The model comprises three sequential convolution modules and a transformer module, leveraging fracturing AE signals for rock type identification. The proposed model separates the position encoding module of the transformer encoder from the multihead attention module. Initially, position encoding is applied to the original input data, mapping the discrete data representation into a continuous vector space. Subsequently, convolutional feature extraction is conducted on the output information pertaining to each mark's relative position. These extracted high-level features are then input into the multihead attention layer for rock type classification. This model, termed 3CTNet, is introduced as part of our contributions.

Feature extraction by CNN

To extract high-level representations from AE signals, a CNN module comprising three sequentially connected convolutional layers is used for low-level feature extraction. After positional encoding is added to each time point of the original AE signal, the discrete time series is transformed into dense embedding vectors. These are fed into three concatenated convolutional layers: 64 filters with a 64x1 kernel and a stride of 64, 128 filters with a 16x1 kernel and a stride of 1, and 16 filters with a 16x1 kernel and a stride of 1. A dropout rate of 30% is applied to each layer to mitigate overfitting, and the rectified linear unit (ReLU) [38] activation function is used, defined as y_i = max(0, x_i). The softmax output of the classifier is softmax(x_i) = exp(x_i) / Σ_{j=1}^{n} exp(x_j), where n denotes the number of inputs to the neuron. Finally, max pooling (MaxPooling) reduces the feature size, which helps alleviate overfitting. The convolutional layer's operation on the input signal is illustrated in Fig 1.

Fig 1. Convolutional operation of AE signals. https://doi.org/10.1371/journal.pone.0309165.g001
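A minimal sketch of this three-layer convolutional front end is given below, written with tensorflow.keras as an assumed framework (the paper does not name one). The filter counts, kernel sizes, strides, dropout rate, and activation follow the values quoted above; the padding, pooling size, and input length of 1024 (the per-sample length reported later in Method) are illustrative choices.

```python
# Sketch of the three-layer CNN front end described above, assuming
# tensorflow.keras. Hyperparameters follow the values quoted in the text;
# padding and pooling size are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_front_end(input_length=1024, embed_dim=1):
    inputs = tf.keras.Input(shape=(input_length, embed_dim))
    x = layers.Conv1D(filters=64, kernel_size=64, strides=64,
                      activation="relu")(inputs)
    x = layers.Dropout(0.3)(x)  # 30% dropout after each conv layer
    x = layers.Conv1D(filters=128, kernel_size=16, strides=1,
                      padding="same", activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Conv1D(filters=16, kernel_size=16, strides=1,
                      padding="same", activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    x = layers.MaxPooling1D(pool_size=2)(x)  # reduce the feature size
    return models.Model(inputs, x)

model = build_cnn_front_end()
model.summary()
```

Note that the 2x256 output quoted in the next subsection depends on padding and pooling details not fully specified in the text, so this sketch reproduces the layer sequence rather than the exact output shape.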
Multihead attention encoder

The use of convolutional layers with a transformer is inspired by the success of transformers and their variants in audio signal processing [39, 40]. Given the similarity between audio signals and AE signals, the transformer is applied here to AE signal recognition. However, transformers typically exhibit high computational complexity: when time-series data are processed at the level of basic units, significant computing resources are consumed. To address this, a CNN extracts detailed features and preprocesses the data before it enters the transformer, ensuring optimal use of the transformer's computational capabilities. After traversing the convolutional layers, the output x_α has size 2x256. The processed input is then refined by the multihead attention layer, followed by weight adjustment via layer normalization and residual connections. Additionally, a feedforward neural network is placed between the multihead attention layer and the normalization layer; two fully connected layers nonlinearly transform the output of the attention module. The constructed encoder architecture is illustrated in Fig 2.

Fig 2. Single mode recognition transformer encoder architecture. https://doi.org/10.1371/journal.pone.0309165.g002

For the multihead attention module, the query Q, key K, and value V are linearly projected m times with different linear projections, where m is the number of heads:

Q_i = Q W_i^Q  (1)
K_i = K W_i^K  (2)
V_i = V W_i^V  (3)

where W_i^Q, W_i^K ∈ ℝ^{d_m×d_k} and W_i^V ∈ ℝ^{d_m×d_v} are trainable parameters and d_m is the output feature dimension. On each of these projected versions of the queries Q_i, keys K_i, and values V_i, dot-product attention is performed in parallel, and the outputs of the attention functions, denoted head_i, i ∈ [1, m], are computed as:

head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i  (4)

If the original input signal X is segmented into n segments, each segment undergoes its linear transformations separately, the above steps are executed independently for each segment, and the transformed segments are concatenated to form a complete sequence with the same dimensionality. Concatenating the heads yields the final values H_head, which embodies the multihead attention mechanism:

H_head = Concat(head_1, …, head_m) W^O  (5)

where W^O ∈ ℝ^{m·d_v×d_m} is a trainable projection matrix. Therefore, the sequence length of the output H_head equals the sequence length of the query Q.
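The following NumPy sketch mirrors Eqs. (1)–(5) directly; the head count m = 4, the sequence length, and the feature dimension are illustrative assumptions, and the random weight matrices stand in for the trainable parameters.

```python
# NumPy sketch of the multihead attention of Eqs. (1)-(5); shapes and the
# head count m = 4 are illustrative assumptions, not values from the paper.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(Q, K, V, m=4, rng=np.random.default_rng(0)):
    d_m = Q.shape[-1]
    d_k = d_v = d_m // m
    heads = []
    for _ in range(m):
        # Eqs. (1)-(3): per-head linear projections (random stand-ins for
        # the trainable weights W_i^Q, W_i^K, W_i^V)
        Wq, Wk, Wv = (rng.standard_normal((d_m, d_k)) for _ in range(3))
        Qi, Ki, Vi = Q @ Wq, K @ Wk, V @ Wv
        # Eq. (4): scaled dot-product attention per head
        heads.append(softmax(Qi @ Ki.T / np.sqrt(d_k)) @ Vi)
    # Eq. (5): concatenate the heads and project back to d_m with W^O
    Wo = rng.standard_normal((m * d_v, d_m))
    return np.concatenate(heads, axis=-1) @ Wo

Y = np.random.default_rng(1).standard_normal((16, 64))  # (sequence, d_m)
print(multihead_attention(Y, Y, Y).shape)  # (16, 64): same length as the query
```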
Model structure

For the input data X ∈ ℝ^{T×d} (T is the sequence length, d is the embedding dimension), positional embedding adds information to each position of the sequence so that the model can capture sequential relationships:

PE(pos, 2i) = sin(pos / 10000^{2i/d}),  PE(pos, 2i+1) = cos(pos / 10000^{2i/d})  (6)

where pos denotes the position within the input sequence, with range [1, T], and i is the index of the embedding dimension, with range [0, d]. Various periodic patterns are generated with sine and cosine functions and combined to produce a code for each position in the input sequence. The position code is then added to the input sequence to embed positional information:

X′ = Input + Pos_emb  (7)

where Input is the original input matrix and Pos_emb is the positional embedding, giving Output = X′ ∈ ℝ^{T×d}. When the data reach the CNN blocks, X′ is convolved with kernel k in width and height, and ReLU activates the neurons to obtain the output y:

y = ReLU(X′ * k)  (8)

The output y is regularized by randomly discarding 30% of the neurons:

ŷ = (y ⊙ E) / (1 − p)  (9)

where E is a binary mask matrix with the same shape as y and p is the discard probability. These features are then passed to the max-pooling layer, which divides the input into nonoverlapping regions and takes the maximum value from each. The data pass through three such blocks and finally reach the encoder. Assuming the arriving sequence is Y ∈ ℝ^{T×d}, the encoding process is expressed as:

Q = Y W^Q  (10)
K = Y W^K  (11)
V = Y W^V  (12)
H_head = MultiHead(Q, K, V)  (13)
Y′ = LayerNorm(Y + H_head)  (14)
Z = LayerNorm(Y′ + FFN(Y′))  (15)

where FFN is the feedforward neural network, yielding Output = Z ∈ ℝ^{T×d} after encoding. Then, batch normalization is performed on Z, followed by global average pooling to reduce the number of parameters:

Z̄_i = (1 / (J·K)) Σ_{j=1}^{J} Σ_{k=1}^{K} Z_ijk  (16)

where Z_ijk is the element at the i-th channel, j-th row, and k-th column of the input feature matrix, and Z̄_i, the i-th feature channel of the output, is sent to two fully connected layers for classification. Then, softmax is used for activation:

softmax(Z)_i = exp(Z_i) / Σ_{c=1}^{C} exp(Z_c)  (17)

where softmax(Z)_i is the probability of the i-th category and C is the number of categories. In this manner, after the fully connected layers and softmax activation, the model output can be expressed as a probability distribution over the categories. The detailed model architecture is illustrated in Fig 3.

Fig 3. 3CTNet for processing AE signals. https://doi.org/10.1371/journal.pone.0309165.g003

The number of layers, convolution kernel size, pooling kernel size, and stride were fine-tuned through iterative trials. The specific parameters of the model are presented in Table 1.

Table 1. Model architecture and parameters. https://doi.org/10.1371/journal.pone.0309165.t001
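Returning to Eqs. (6)–(7), the sinusoidal positional embedding can be sketched in a few lines of NumPy. The base of 10000 follows the standard transformer formulation that the sine/cosine description above matches (an assumption, since the text does not state the base), and T and d are illustrative; the code indexes positions from 0 rather than 1.

```python
# Sketch of the sinusoidal positional embedding of Eqs. (6)-(7) in NumPy.
# The 10000 base is the standard transformer choice (assumed here); T and d
# are illustrative, and positions are 0-indexed.
import numpy as np

def positional_embedding(T, d):
    pos = np.arange(T)[:, None]              # sequence positions
    i = np.arange(d)[None, :]                # embedding-dimension indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])     # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])     # odd dimensions: cosine
    return pe

X = np.random.default_rng(0).standard_normal((1024, 8))  # stand-in input, T x d
X_prime = X + positional_embedding(*X.shape)             # Eq. (7)
```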
Method

Data preprocessing

Datasets and statistical characteristics. The AE data processed in the experiments were obtained from the Brazilian splitting tests conducted by Song et al. [41] on six sets of laboratory-prepared rock samples. The specimens comprised three types of natural rock with different mineral compositions and elastic moduli: dolomite (mainly dolomite and calcite) (dl), sandstone (mainly quartz and feldspar) (ss), and shale (mainly quartz, kaolinite, and pyrite) (sh); and three types of artificial rock-like concrete with different particle contents: pure cement (pc), cement and sand at 1:1 (cs1:1), and cement and sand at 2:1 (cs2:1). The data are visualized in Fig 4. The interquartile range (IQR) was then used to analyze the statistical characteristics of the dataset and observe its distribution shape, as depicted in Fig 5. The IQR was compared with the full range (the difference between the maximum and minimum values, FRD) to assess the degree of data dispersion, as illustrated in Fig 6. The dataset exhibits a small IQR but a large FRD, indicating that the data are concentrated in the middle area with notable outliers, considerable overall variation, and an uneven distribution.

Fig 4. AE representation of Brazilian splitting of six types of rocks. https://doi.org/10.1371/journal.pone.0309165.g004

Fig 5. Representing the distribution shape of data through IQR. https://doi.org/10.1371/journal.pone.0309165.g005

Fig 6. Degree of dispersion of data. https://doi.org/10.1371/journal.pone.0309165.g006

Each of the six specimens was tested as a disk with a thickness-to-diameter ratio of 1:2. To collect the AE released by each specimen during Brazilian splitting, eight AE transducers were placed on the surface of each disk: three each on the front and back faces and two on the sides. Following the operating principles of the sensors and the rock fracture mechanism, AE data were collected from the fracture of the six rock specimens. Each sample comprises signal amplitudes measured at 1024 consecutive time points.
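The IQR-versus-FRD comparison above amounts to two NumPy one-liners; the sketch below applies them to a synthetic 1024-point signal with injected outliers (illustrative data, not the paper's).

```python
# Quick sketch of the IQR-versus-full-range (FRD) comparison described above,
# applied to one synthetic 1024-point AE amplitude sample.
import numpy as np

signal = np.random.default_rng(0).standard_normal(1024)
signal[::200] *= 15                      # inject a few outliers for illustration

q1, q3 = np.percentile(signal, [25, 75])
iqr = q3 - q1                            # spread of the central 50% of the data
frd = signal.max() - signal.min()        # full range: max minus min
print(f"IQR = {iqr:.2f}, FRD = {frd:.2f}, ratio = {iqr / frd:.2f}")
# A small IQR relative to the FRD indicates data concentrated in the middle
# with pronounced outliers, as observed for this dataset.
```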
Wavelet threshold denoising. To mitigate noise in the original AE signal, wavelet threshold denoising was used. The method separates the low-frequency and high-frequency components and processes the high-frequency components to isolate noise from the main information, thus preserving important features. The wavelet transform is well suited to signal and image denoising because it efficiently captures both temporal and frequency-domain information. Its principle can be expressed as:

W_f(a, τ) = (1/√a) ∫ f(t) ψ*((t − τ)/a) dt  (18)

where the wavelet basis functions are:

ψ_{a,τ}(t) = (1/√a) ψ((t − τ)/a)  (19)

The study employs the Daubechies 4 (db4) wavelet basis for denoising. The 4th-order basis was selected after comparison with wavelet functions of other orders; db4 effectively preserves crucial signal information while efficiently filtering out noise. This denoising yields clean AE data and minimizes any impact on the accuracy of model training.

ADASYN oversampling. The acquired dataset, influenced by the physical properties of the rocks and various factors during testing, is unevenly distributed. This imbalance leads to poor detection results for certain classes and thus affects the overall recognition accuracy of the model [42]. To address this, the more disturbed data were filtered out of the six sample classes, and five oversampling methods were evaluated: random oversampling, ADASYN, SMOTE, SVM-SMOTE, and Kmeans-SMOTE. The model trained on the dataset produced by ADASYN achieved better fitting and higher recognition accuracy on a real test set. ADASYN accomplishes two critical objectives: it mitigates the learning bias introduced by the initial data imbalance, and it dynamically shifts the decision boundary toward samples that are difficult to learn [43]. Suppose the AE training dataset has m samples {x_i, y_i}, i = 1, 2, …, m, where x_i is an instance in the n-dimensional feature space X and y_i ∈ Y = {1, −1} is the category label associated with x_i; m_s and m_l denote the numbers of minority and majority samples, respectively, so that m_s ≤ m_l and m_s + m_l = m. The ADASYN algorithm proceeds as follows:

1) Calculate the degree of class imbalance:

d = m_s / m_l  (20)

where d ∈ [0, 1].

2) If d
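A minimal sketch of the two preprocessing steps described in this section, db4 wavelet threshold denoising followed by ADASYN oversampling, using PyWavelets and imbalanced-learn as assumed libraries: the decomposition level, the soft universal threshold, and the toy data shapes are illustrative choices, since the paper states only the db4 basis and the use of ADASYN.

```python
# Sketch of the preprocessing pipeline: db4 wavelet threshold denoising
# (PyWavelets), then ADASYN oversampling (imbalanced-learn). Decomposition
# level, soft universal threshold, and data shapes are illustrative.
import numpy as np
import pywt
from imblearn.over_sampling import ADASYN

def wavelet_denoise(signal, wavelet="db4", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Universal threshold, with sigma estimated from the finest detail band
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

# X: (n_samples, 1024) AE amplitudes; y: rock-type labels (toy, imbalanced)
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 1024))
y = np.array([0] * 250 + [1] * 50)

X_clean = np.apply_along_axis(wavelet_denoise, 1, X)   # denoise each sample
X_res, y_res = ADASYN(random_state=0).fit_resample(X_clean, y)
print(X_res.shape, np.bincount(y_res))                 # classes now balanced
```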