# A neural network for noise correlation classification

## Summary

We present an artificial neural network (ANN) for the classification of ambient seismic noise correlations into two categories, suitable and unsuitable for noise tomography. By using only a small manually classified data subset for network training, the ANN allows us to classify large data volumes with low human effort and to encode the valuable subjective experience of data analysts that cannot be captured by a deterministic algorithm. Based on a new feature extraction procedure that exploits the wavelet-like nature of seismic time-series, we efficiently reduce the dimensionality of noise correlation data while keeping the features needed for automated classification. Using global- and regional-scale data sets, we show that classification errors of 20 per cent or less can be achieved when the network training is performed with as little as 3.5 per cent and 16 per cent of the data sets, respectively. Furthermore, the ANN trained on the regional data can be applied to the global data, and vice versa, without a significant increase of the classification error. An experiment in which four students manually classified the data revealed that the classification error they would assign to each other (>35 per cent) is substantially larger than the classification error of the ANN. This indicates that reproducibility would be hampered more by human subjectivity than by imperfections of the ANN.

Key words: Neural networks, fuzzy logic; Computational seismology; Seismic noise.

## 1 INTRODUCTION

Ambient noise correlations have become a standard tool to investigate Earth structure and its temporal changes (e.g. Sabra et al. 2005; Shapiro et al. 2005; Brenguier et al. 2008; Stehly et al. 2009; Mordret et al. 2014; de Ridder et al. 2014). In contrast to earthquake- or explosion-based data, the number of noise correlations scales with the square of the number of stations.
The resulting data volumes preclude manual data selection and quality control, needed to eliminate correlations that are not plausible approximations of the inter-station Green’s function. While data selection algorithms based on various seismogram attributes have been applied successfully (Maggi et al. 2009; Krischer et al. 2015), it remains desirable to develop automated approaches that encode the invaluable human experience and intuition. Complementing attribute-based methods, artificial neural networks (ANNs) provide a means to mimic human behaviour in data analysis without compromising efficiency and reproducibility (e.g. Valentine & Woodhouse 2010; Agliz & Atmani 2013). The concept of ANNs for large-scale data classification is to manually analyse only a small data subset, and to use this subset for network training. The trained network can then be used to automatically classify the remaining data. In addition to reducing the human effort, the ANN also allows us to encode the valuable subjective experience of a data analyst that cannot be captured by a deterministic algorithm. An essential pre-requisite for using ANNs is the reduction of dimensionality of the input data, which is needed to reduce computational requirements. For this, the original data, where each time sample corresponds to a dimension, are transformed into a set of features that represent the data sufficiently well. Feature extraction is application-dependent, and it arguably represents the most subjective component of ANN design. The main objectives of this work are to improve earlier ANNs based on Fourier-domain feature extraction (Valentine & Woodhouse 2010), to quantify the required size of the training data set and to compare the ANN classification to human classification by different data analysts. 
This paper is organized as follows: in Section 2, we review the concept of a multilayer perceptron, the type of ANN we use for correlation classification. This provides the background for Section 3, where we introduce wavelet-based feature extraction. Real-data examples are presented in Sections 4 and 5.

## 2 ARTIFICIAL NEURAL NETWORKS

An ANN is a mathematical construct, inspired by the human brain, and built from interconnected processing units called neurons. It has the ability to ‘learn’ complex relations between input and output that may otherwise be difficult to describe analytically. The theory of neural networks is well described in the literature (e.g. Poulton 2001; Haykin 2009; Aminzadeh et al. 2013), and will be summarized here only briefly. A single neuron, illustrated in Fig. 1, takes n input values x1, …, xn, and transforms them into one output value y,

$$y = \varphi \left( b + \sum _{i=1}^n w_{i} x_i \right)\, . \qquad (1)$$

The weights wi determine the sensitivity of the neuron to the individual input values. The bias b shifts the argument of the activation function and thereby controls the overall importance of the input. The reaction level of the neuron to some input stimulation is described by the activation function φ, typically chosen to be the sigmoid $$\varphi (v) = (1 + e^{-v})^{-1}$$ or the linear function $$\varphi (v) = v$$.

Figure 1. Illustration of a single neuron after Haykin (2009). In this example, the neuron transforms three inputs x1, x2, x3, and the bias b into the output y, according to eq. (1).

The multilayer perceptron, illustrated in Fig. 2, organizes neurons in layers: an input layer with nodes xj, an output layer with nodes yi, and various hidden layers in between.
The output of a neuron in layer l serves as input to all neurons in the subsequent layer l + 1. The weights and biases are allowed to be different for each neuron.

Figure 2. Schematic illustration of a multilayer perceptron with two hidden layers. Circles represent individual neurons, shown in more detail in Fig. 1.

Using a training data set, typically a small subset of all data that can be analysed manually, the weights are adjusted such that the ANN reproduces the expected output optimally. For this, we minimize the L2 misfit χ between the true output $$y_i^{\rm true}$$ and the computed output yi, summed over all output neurons, $$\chi =\sum _i ( y_i^{\rm true} - y_i )^2$$. This minimization can be performed efficiently using gradient-descent algorithms with randomly chosen initial weights. The learning process is described in more detail in Valentine & Woodhouse (2010), and we follow their approach of using a learning rate that decreases with the number of samples to ensure convergence. We also train multiple networks separately (so-called ‘committees’) and average their output to increase confidence in the classification (Valentine & Woodhouse 2010). These and other technical details are described further in the Supporting Information, which includes software for noise correlation classification.

For our application, we use an ANN with two hidden layers and two output values, (y1, y2). The hidden layers consist of 20 and 60 neurons, respectively. All activation functions are sigmoids, except in the output layer where they are linear. We train 10 networks separately and average their output for testing. We tested various combinations of the ANN parameters.
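As a minimal sketch of the forward pass described above (not the authors' software, which is provided in the Supporting Information), the following Python code evaluates eq. (1) layer by layer for a network with 20 and 60 sigmoid hidden neurons and two linear outputs, and averages a committee of 10 networks. The 18 inputs anticipate the feature count introduced in Section 3. The function names, the random weight range of ±0.5 and the random input vector are illustrative assumptions, and no training is performed.

```python
import math
import random


def sigmoid(v):
    """Sigmoid activation, phi(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def layer(inputs, weights, biases, activation):
    """Apply eq. (1) to every neuron of one layer."""
    return [activation(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]


def random_network(sizes, rng):
    """Random weights and biases (assumed range +-0.5) for the given layer sizes."""
    net = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        net.append((weights, biases))
    return net


def forward(net, x):
    """Sigmoid hidden layers followed by a linear output layer."""
    for k, (weights, biases) in enumerate(net):
        is_output = k == len(net) - 1
        activation = (lambda v: v) if is_output else sigmoid
        x = layer(x, weights, biases, activation)
    return x


def committee_output(nets, x):
    """Average the outputs of separately initialised networks."""
    outputs = [forward(net, x) for net in nets]
    return [sum(column) / len(nets) for column in zip(*outputs)]


rng = random.Random(0)
sizes = [18, 20, 60, 2]  # inputs, two hidden layers, outputs (y1, y2)
committee = [random_network(sizes, rng) for _ in range(10)]
features = [rng.uniform(-1.0, 1.0) for _ in range(18)]  # stand-in feature vector
y1, y2 = committee_output(committee, features)
```

With untrained random weights the pair (y1, y2) carries no meaning; the sketch only illustrates how the layer sizes, activation functions and committee averaging fit together.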
By trial and error, we found the two-hidden-layer network to be the most accurate, while still being computationally feasible. Also by trial and error, we determined the number of training iterations and the maximum misfit at which to stop the training. The network design and parameter decisions may have a strong influence on the classification results. A possible approach to overcome the dependence on network architecture was presented by MacKay (1996), who proposed to include many different network architectures within the committee of neural networks. Because this would drastically increase the computational costs in our case, we chose our ‘best’ network architecture by trial and error.

In the training data set, accepted (= suitable for tomography) and rejected (= unsuitable for tomography) correlations are labelled (y1, y2) = (1, 0) and (y1, y2) = (0, 1), respectively. While our manual classification of the training data set is binary, the ANN outputs a continuous range of tuples (y1, y2) with distances $$d=\sqrt{y_1^2 + (y_2-1)^2}$$ from the ideal rejected trace (0, 1). To enable classification despite the non-binary nature of d, we follow Valentine & Woodhouse (2010) in defining a threshold d0 above which a noise correlation is accepted. The data set-dependent and somewhat subjectively chosen d0 controls the trade-off between rejected traces classified as accepted, and accepted traces classified as rejected. The non-binary output (y1, y2) in combination with the distance d and threshold d0 adds another degree of freedom and adjustability to the classification. The threshold d0 can be adjusted depending on the targeted application. It can therefore shift the classification towards one class at the expense of the other without the need to re-train the network. While beyond the scope of this work, the non-binary output additionally enables the definition of more than two classes.
An additional class could be, for instance, an intermediate class that would need to be investigated further by an expert, or several intermediate classes that offer a more refined quality grading. This flexibility is not available in Support Vector Machines (SVMs), which only produce a sharply defined binary classification (Haykin 2009), though a single ‘none-of-the-above’ class may be added (Reynen & Audet 2017).

## 3 FEATURE EXTRACTION

The computational cost for training the ANN and the amount of data needed to achieve convergence increase rapidly with the number of input nodes; a facet of the curse of dimensionality (Keogh & Mueen 2011, pp. 257–258). Therefore, the dimension of the input data should be reduced as much as possible, which is typically achieved by feature extraction. It is important to design a feature extraction process that is sensitive to the particular characteristics of the data type(s) to be classified. In the long-period, teleseismic case, Valentine & Woodhouse (2010) found that a simple frequency-domain representation of the seismograms contains sufficient information to allow identification of high-quality waveforms. However, this ‘Frequency-Domain Feature Extraction’ (FDFE) proved too inaccurate for our problem, as described in Section 5. We therefore develop our own feature extraction algorithm, which is described below.

Taking advantage of the wavelet-like nature of seismic time-series, we propose the following wavelet-based feature extraction algorithm (WBFE), schematically illustrated in Fig. 3: the waveform around the largest amplitude is approximated by a shifted and dilated Morlet wavelet,

$$\psi (t,\omega )=\pi ^{-0.25} (e^{i\omega t} - e^{-0.5\, \omega ^2}) e^{-0.5\, t^2}, \qquad (2)$$

where the frequency ω is a proxy for the number of significant oscillations. The properties of the best-fitting wavelet, illustrated in Fig. 3, constitute a set of features.

Figure 3.
Illustration of two WBFE iterations that approximate the noise correlation time-series (grey) by two Morlet wavelets (black). For each of the wavelets, the following nine features are extracted: ω: frequency, ts: start time, td: length of wavelet, p: polarity (±1), A: amplitude ratio of the wavelet and the original noise correlation, T: ratio of estimated and predicted arrival time of a surface wave train (using a group velocity of 3900 m s−1 for the example shown), F: L1 misfit between noise correlation and the wavelet between ts and te, R: L1 norm of the complete noise correlation after subtraction of the wavelet, N: total number of time samples. The example is for a correlation of the regional data set introduced in Section 4.

The features are chosen to include information about the waveform itself, about the amplitudes and energy content of the correlation, meta-information, and information needed to reconstruct the wavelet over the full length of the correlation. After finding the best-fitting wavelet and extracting the first set of features, we subtract the wavelet from the correlation time-series.
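The wavelet of eq. (2) and the subtraction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parameter names `shift`, `scale` and `amplitude`, the use of the real part of the complex wavelet, and the synthetic trace are all hypothetical, and the search for the best-fitting wavelet (described in the Supporting Information) is not shown.

```python
import cmath
import math


def morlet(t, omega):
    """Complex Morlet wavelet of eq. (2) at dimensionless time t."""
    envelope = math.exp(-0.5 * t * t)
    correction = math.exp(-0.5 * omega * omega)
    return math.pi ** -0.25 * (cmath.exp(1j * omega * t) - correction) * envelope


def wavelet_trace(times, omega, shift, scale, amplitude):
    """Real part of a shifted, dilated and scaled wavelet sampled at `times`."""
    return [amplitude * morlet((t - shift) / scale, omega).real for t in times]


def subtract(trace, wavelet):
    """Remove a fitted wavelet from a trace, sample by sample."""
    return [a - b for a, b in zip(trace, wavelet)]


def l1_norm(trace):
    """Sum of absolute amplitudes, as used for the misfit-type features."""
    return sum(abs(a) for a in trace)


# Synthetic stand-in for a noise correlation: one wavelet centred at 800 s.
times = [float(k) for k in range(0, 2000, 10)]
trace = wavelet_trace(times, omega=5.0, shift=800.0, scale=100.0, amplitude=1.0)

# A hypothetical fit that recovers the shape but underestimates the amplitude.
fitted = wavelet_trace(times, omega=5.0, shift=800.0, scale=100.0, amplitude=0.9)
residual = subtract(trace, fitted)
```

In this toy case the residual retains one tenth of the original amplitude, so a second iteration would operate on a much weaker signal than the first.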
We then repeat the fitting and extraction iteratively, thereby producing an increasing number of features, nine per iteration. These features resolve more and more waveform details. This approach has the potential to subsequently include higher wave modes. A more detailed description of the WBFE algorithm is found in Supporting Information Section S3. For the ANN classification, we used the features of two iterations of the WBFE. The input xj to the ANN therefore consists of 18 extracted features. The reason to use two iterations and not one is to include a more detailed description of the correlation within the classification process.

## 4 CLASSIFICATION TESTS

To assess the performance of our ANN as a function of the training data set and the subjectively chosen acceptance threshold, we performed a series of classification tests that quantify the classification error.

### 4.1 Data sets and network setup

We consider two distinct data sets, one at regional scale and one at global scale, to assess the ability of our ANN to classify noise correlations. The regional data were recorded from 2012 January 1 to 2013 December 30 at 37 stations within and around the African continent. One year of global data, previously analysed by Ermert et al. (2016), was recorded at 113 stations in 2014. The respective station distributions are shown in Fig. 4(a). We applied standard processing, including the removal of the instrument response, and geometric normalization by trace energy to suppress earthquake signals (Schimmel et al. 2011). In the period range from 50 to 100 s, our regional and global data sets comprise 1271 and 11451 vertical-component correlations, respectively.

Figure 4. Data summary. (a) Regional and global distribution of stations, marked by red triangles. (b) Record section of selected regional-scale noise correlations for positive time lags and periods between 50 and 100 s. The dashed red line shows the group velocity of 3900 m s−1.
(c) Examples of an accepted and a rejected noise correlation from the regional data set.

We vary the acceptance threshold d0 from 0.6 to 1.1 in order to investigate the influence of this subjectively chosen parameter. A group velocity of 3900 m s−1 (from Fig. 4b) within the period range from 50 to 100 s is useful for both data sets.

### 4.2 Classification errors

To quantify the classification error, we manually classify both data sets completely (Fig. 4c). One part of the data set is then used for training, and the rest for testing the ANN’s ability to reproduce the human classification. The training data set always consists of the same number of ‘accepted’ and ‘rejected’ correlations, which are chosen purely randomly. Conservatively starting with 500 of the regional correlations for training (39 per cent of all correlations) and 771 for testing yields a classification error of 16 ± 1 per cent, where the range ± 1 per cent corresponds to d0 varying from 0.6 to 1.1. This error means that around 120 correlations in the testing data set were classified incorrectly. False positives and false negatives are weighted equally. Both are within the same range; decreasing d0 increases the number of false positives and decreases the number of false negatives. False positives and false negatives appear to balance at a value of d0 around 1.0. Depending on the preferred bias towards false positives or false negatives, this value of d0 might be chosen differently for different applications. This does not require re-training of the network.
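The role of the threshold can be illustrated with a short Python sketch that applies the distance d from Section 2 and counts false positives and false negatives for two values of d0. The four network outputs and their manual labels are made up for illustration and are not taken from either data set; the function names are likewise hypothetical.

```python
import math


def distance_from_rejected(y1, y2):
    """Distance d of the output (y1, y2) from the ideal rejected trace (0, 1)."""
    return math.hypot(y1, y2 - 1.0)


def is_accepted(y1, y2, d0):
    """A correlation is accepted when d exceeds the threshold d0."""
    return distance_from_rejected(y1, y2) > d0


def error_counts(outputs, labels, d0):
    """Return (false positives, false negatives); labels are True for accepted."""
    false_pos = false_neg = 0
    for (y1, y2), accepted in zip(outputs, labels):
        predicted = is_accepted(y1, y2, d0)
        if predicted and not accepted:
            false_pos += 1
        elif accepted and not predicted:
            false_neg += 1
    return false_pos, false_neg


# Made-up committee outputs and manual labels (illustrative only).
outputs = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5), (0.6, 0.3)]
labels = [True, False, False, True]

low = error_counts(outputs, labels, d0=0.6)    # (1, 0): one false positive
high = error_counts(outputs, labels, d0=1.1)   # (0, 1): one false negative
```

The same outputs are reused for both thresholds, which mirrors the statement that changing d0 does not require re-training.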
Reducing the training data set to 200 correlations (16 per cent of all correlations) increases the classification error to 19 ± 1 per cent. For the global data set, we obtain similar classification errors: 14 ± 1 per cent for a training data set of 1000 correlations (8.7 per cent of all correlations), and 19 ± 2 per cent for a training data set of 400 correlations (3.5 per cent of all correlations). The ‘ideal’ number of training samples depends on the data set and the application. Since the use of ANNs for seismic data classification is in its infancy, it is still too early for more generally valid statements on the required size of a training data set.

It is important to note that these classification errors were obtained with a carefully designed network and training setup. A range of maximum training iterations and training misfits has been investigated and tested. The parameters with the best general performance on the training data set have been adopted. Choosing the network and training parameters is a crucial part of the design of neural networks, especially to prevent over- or under-training. This issue is further discussed, for example, in Valentine & Trampert (2012).

Since network training is computationally expensive, it would be attractive to train the ANN on one data set only and then reuse it for the other one without retraining. Applying the ANN trained with 500 regional correlations to the testing data set of 10450 global correlations results in a classification error of 18 per cent (d0 = 1.1), compared to 14 ± 1 per cent for training with 1000 samples of the global data set. Conversely, the ANN trained with 400 global correlations classifies 19 per cent of the 1071 regional test correlations incorrectly (d0 = 1.1), compared to 19 ± 1 per cent for training with 200 samples of the regional data set. This cross-data-set approach can also be used to estimate over- or under-training of the artificial neural network.
The applicability of one trained network committee to another data set with similar classification results can be interpreted as a generalized classification and therefore as evidence against over-training.

## 5 COMPARISONS

To put these classification errors into a broader perspective, we compare them to results obtained with our implementation of the FDFE of Valentine & Woodhouse (2010), and to classifications by human data analysts. It must be mentioned that the FDFE algorithm we implemented does not consider event depth and source-receiver location, but only our interpretation of the N-point representation of the power spectrum as well as the L1 norm of the power spectrum and the interstation distance. A more detailed description of our algorithm as well as visualized examples can be found in Supporting Information Section S3.

### 5.1 Fourier-domain feature extraction

The FDFE, originally developed for the classification of earthquake recordings (Valentine & Woodhouse 2010), defines the spectral amplitudes of a time-series within a small number of frequency bands as features. These features are complemented by the integrated spectral power and the epicentral distance. To ensure a meaningful comparison, we implemented FDFE with 18 features, that is, the same number used for WBFE with two wavelets. In the classification tests described in Section 4, the ANN based on FDFE produced classification errors of around 40 ± 5 per cent, approximately twice the classification error of the ANN using WBFE. Despite extensive testing with different network topologies and acceptance thresholds, we were not able to improve these results. The reason could be that the power spectra of the ‘accepted’ and ‘rejected’ noise correlations are too similar. This is visualized in Fig. S2. The low classification accuracy, in fact, motivated the development of the WBFE in the early stages of this study.
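To make the structure of such a frequency-domain feature vector concrete, the following Python sketch builds 18 features from a trace: 16 band-averaged values of the power spectrum, the L1 norm of the power spectrum, and the interstation distance. It is a loose illustration, not the implementation used here or in Valentine & Woodhouse (2010). The split into 16 equal-width bands, the direct DFT, the function names and the sine-wave test trace are all assumptions.

```python
import cmath
import math


def power_spectrum(trace):
    """One-sided power spectrum via a direct DFT (adequate for short traces)."""
    n = len(trace)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * j / n)
                    for j, x in enumerate(trace))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def fdfe_features(trace, distance_km, n_bands=16):
    """Band-averaged power, L1 norm of the power spectrum, and distance."""
    power = power_spectrum(trace)
    if len(power) < n_bands:
        raise ValueError("trace too short for the requested number of bands")
    # Equal-width bands over the spectrum (an assumed band layout).
    edges = [round(i * len(power) / n_bands) for i in range(n_bands + 1)]
    bands = [sum(power[a:b]) / (b - a) for a, b in zip(edges[:-1], edges[1:])]
    # Power values are non-negative, so their sum is the L1 norm.
    return bands + [sum(power), distance_km]


# Illustrative input: a pure sine with 4 cycles over 64 samples.
trace = [math.sin(2.0 * math.pi * 4 * j / 64) for j in range(64)]
features = fdfe_features(trace, distance_km=1500.0)
```

Because such features discard phase and timing, two traces with similar power spectra map to nearly identical vectors, which is consistent with the explanation offered above for the weaker FDFE results.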
### 5.2 Human neural networks

As we can conclude from Section 4, our ANN can be considered a successful encoder of human intuition, with a failure rate of mostly less than 20 per cent. The inherent subjectivity in the preparation of the training data set raises the question of how other human neural networks (brains) would have solved the classification task. To address this issue, four students, in the following labelled A, B, C and D, were asked to classify 638 of the regional-scale noise correlations. This was the size of the regional data set before further available data were included. Because the visual classification is time-consuming, the experiment was not repeated with the full final data set. Student A is the main author of this paper. The presented traces are also included in the full regional data set, which makes the conditions comparable. While students A and C had prior experience with observed noise correlations, student B had only worked with earthquake data for tomography. Student D had only worked with synthetic noise correlations. The results of the experiment are summarized in Fig. 5.

Figure 5. Human neural network experiment. Left: classification of 638 regional-scale correlations by four students, labelled A, B, C and D. Two of the students (A and C) had prior experience with noise correlation data. Right: total number of acceptances by the four students.
Only 19 correlations were accepted by all of them. The first column labelled ‘NN(A)’ shows classifications by a neural network trained with the classifications of student A (‘NN Training’). The traces labelled ‘NN Testing’ are traces that are completely new to the trained ANN.

All classifications for the ANNs from Section 4 were made by student A and can be seen as the reference for the other participants in this study. To allow a comparison of the ‘human neural network’ to the artificial neural network, the ANN classification algorithm was applied to the classifications of student A. This is visualized in the first column of Fig. 5. The top 200 traces were chosen randomly and used for training the ANN. The traces labelled ‘NN Testing’ are the traces the neural network classified after training with the ‘NN Training’ data set. The total classification error of the ANN (on the testing data set) is around 22 per cent, which is good considering the small sample and training data-set size.

Students A and B produced identical classifications for ∼400 correlations, meaning that A would assign a classification error of ∼37 per cent to the human neural network B, and vice versa. Data analyst C and the noise correlation theoretician, D, accepted significantly fewer traces, so that only 19 correlations were accepted by all four students. The large discrepancies show that even within a group of seismologists, the subjective opinion of what is a ‘useful’ and ‘useless’ trace for tomography varies strongly. The difference in the classifications might also be due to different contextual information. The waveforms are usually not classified in isolation, but taking prior knowledge such as the data origin into account. The discrepancies between the classification by student A and that of the ANN trained with the ‘intuition’ of student A may also reflect inconsistencies in the classification by student A.
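The error that one analyst would assign to another is simply the fraction of traces on which their labels differ. The Python sketch below reproduces the ∼37 per cent quoted above from the rounded counts in the text (∼400 identical classifications out of 638) and shows the same bookkeeping on toy label lists, which are invented for illustration.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of traces that two analysts classified differently."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both analysts must classify the same traces")
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)


def accepted_by_all(*label_sets):
    """Number of traces that every analyst accepted."""
    return sum(all(votes) for votes in zip(*label_sets))


# Rounded counts quoted in the text: ~400 identical labels out of 638 traces.
error_a_vs_b = 1.0 - 400 / 638          # about 0.37

# Toy labels for five traces (True = accepted), invented for illustration.
student_a = [True, True, False, True, False]
student_b = [True, False, False, True, True]
student_c = [True, False, False, False, False]

toy_rate = disagreement_rate(student_a, student_b)             # 2 of 5 differ
toy_common = accepted_by_all(student_a, student_b, student_c)  # 1 trace
```

The measure is symmetric, which is why the text states that the ∼37 per cent applies to A judging B and vice versa.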
While the small number of participants in our test precludes general conclusions, it still indicates that an ANN may reproduce the data selection intuition of an expert more reliably than another expert (comparing A and B). Reproducibility and data quality, both major issues in seismological studies, may thus be improved by consistently using ANNs trained by an experienced data analyst.

## 6 DISCUSSION AND CONCLUSIONS

We presented an ANN for the classification of ambient noise correlations, using a new WBFE. Applied to a regional and a global correlation data set, we achieve classification errors of 20 per cent or less, when using as little as 16 per cent and 3.5 per cent of all data for network training, respectively. In the following paragraphs, we further discuss the origin of the remaining misclassifications, the subjectivity in the design of the network and its possibly limited universality, and the use of the ANN-classified correlations in future ambient noise tomographies.

### 6.1 Origin of classification errors

While the tests presented in Section 4 naturally cannot cover all possible scenarios, they indicate that a classification error below ∼20 per cent can be achieved when the training data set comprises ∼5–10 per cent and ∼15–20 per cent of all data for the global and regional data set, respectively. Increasing the size of the training data set did not reduce these errors substantially. These numbers and statements are, of course, specific to our application examples, and their general validity remains to be assessed in future applications.

Since the network training converged almost perfectly (an example convergence plot is presented in Fig. S6), classification inconsistencies remain as one of the most likely origins of the classification errors. For instance, one of two similar time-series may be accepted while the other is rejected by the data analyst who prepares the training data set.
If yet another time-series of ambiguous quality is then to be classified by the ANN, the outcome is rather unpredictable and not necessarily in line with the analyst’s subjective opinion. When applied to a genuinely different data set, the classification error of the ANN naturally increases because it may encounter data with properties that it has not seen before. This is the case for the ANN trained on regional data, applied to global data (Section 4). Possible origins of the classification errors are thus inconsistencies within the classification by the data analyst, as well as the analyst's use of contextual information about the background of the data set.

To overcome some of the mentioned problems with the presented learning algorithm, more advanced but also more complex machine learning algorithms such as ‘deep learning’ and unsupervised learning may be worth exploring in the future. Popular algorithms are discussed in Valentine & Trampert (2012) and Valentine & Kalnins (2016). Other dimensionality reduction algorithms such as autoencoders, as discussed in Valentine & Trampert (2012), could also be considered in seismology.

### 6.2 Subjectivity and universality

We determined many of the network properties, including its topology and feature extraction, by trial and error. Since not all possible options can be tested, network design has an unavoidable subjective and artistic component. This inherent subjectivity and the need to design a network using some specific data have far-reaching consequences: (1) The ANN presented here may not be the best one for our purpose. (2) The design of a network is data-dependent. Therefore, the ANN that we applied successfully to our data may not be useful for other data. This lack of universality may constitute a significant drawback of ANNs. If an ANN, including the feature extractor, needed to be re-designed to achieve acceptable results for a new data set, its usability would be rather limited.
The extent to which this lack of universality is indeed a problem remains to be seen, as ANN applications in seismology become more widespread. Considering the wide range of geophysical data types and their application range, it also seems unreasonable to expect a ‘universally true’ machine learning solution for data classification. The adaption of methods to target applications is certainly time-consuming and the suitability of machine learning techniques for data classification must be decided on a case-by-case basis. ACKNOWLEDGEMENTS The authors want to thank the editor Joerg Renner as well as the reviewer Andrew Valentine and one anonymous reviewer for reviews and comments of the manuscript. This work was supported by the Swiss National Supercomputing Centre (CSCS) through project ch1. We thank Laura Ermert, Korbinian Sager and Dirk-Philip van Herwaarden for participating in our classification experiment. All seismic data were obtained from the IRIS data centre (www.iris.edu). The Supporting Information includes software for the visual classification of seismic data, feature extraction, the preparation of training data sets, and the ANN-based seismogram classification. The documentation covers a small step-by-step example as well as more background on the feature extraction methods. The software and the example are available on the ETH Computational Seismology website (http://www.cos.ethz.ch/software.html). This download includes a documentation with further descriptions of the discussed algorithms and the accompanied software. REFERENCES Agliz D., Atmani A., 2013. Seismic signal classification using a multi-layer preceptron neural network, Int. J. Comput. Appl. , 79, 35– 43. Aminzadeh F., Sandham W., Leggett M., 2013. Geophysical Applications of Artificial Neural Networks and Fuzzy Logic , Springer. Brenguier F., Campillo M., Haziioannou C., Shapiro N.M., Nadeau R.M., Larose E., 2008. 
Postseismic relaxation along the San Andreas fault at Parkfield from continuous seismological observations, Science , 321, 1478– 1481. https://doi.org/10.1126/science.1160943 Google Scholar CrossRef Search ADS PubMed  de Ridder S.A.L., Biondi B.L., Clapp R.G., 2014. Time-lapse seismic noise correlation tomography at Valhall, Geophys. Res. Lett. , 41, 6116– 6122. https://doi.org/10.1002/2014GL061156 Google Scholar CrossRef Search ADS   Ermert L., Villasenor A., Fichtner A., 2016. Cross-correlation imaging of ambient noise sources, Geophys. J. Int. , 204, 347– 364. https://doi.org/10.1093/gji/ggv460 Google Scholar CrossRef Search ADS   Haykin S., 2009. Neural Networks and Learning Machines , Prentice Hall. Keogh E., Mueen A., 2011. Encyclopedia of Machine Learning , Springer. Krischer L., Fichtner A., Žukauskaitė S., Igel H., 2015. Large-scale seismic inversion framework, Seismol. Res. Lett. , 86, 1198– 1207. https://doi.org/10.1785/0220140248 Google Scholar CrossRef Search ADS   MacKay D.J., 1996. Hyperparameters: optimize, or integrate out?, Fundam. Theories Phys. , 62, 43– 60. Maggi A., Tape C., Chen M., Chao D., Tromp J., 2009. An automated time-window selection algorithm for seismic tomography, Geophys. J. Int. , 178, 257– 281. https://doi.org/10.1111/j.1365-246X.2009.04099.x Google Scholar CrossRef Search ADS   Mordret A., Shapiro N., Singh S., 2014. Seismic noise-based time-lapse monitoring of the Valhall overburden, Geophys. Res. Lett. , 41, 4945– 4952. https://doi.org/10.1002/2014GL060602 Google Scholar CrossRef Search ADS   Poulton M., 2001. Computational Neural Networks for Geophysical Data Processing , Elsevier. Reynen A., Audet P., 2017. Supervised machine learning on a network scale: application to seismic event classification and detection, Geophys. J. Int. , 210, 1394– 1409. https://doi.org/10.1093/gji/ggx238 Google Scholar CrossRef Search ADS   Sabra K.G., Gerstoft P., Roux P., Kuperman W.A., 2005. 
Surface wave tomography from microseisms in Southern California, Geophys. Res. Lett., 32. https://doi.org/10.1029/2005GL023155

Schimmel M., Stutzmann E., Gallart J., 2011. Using instantaneous phase coherence for signal extraction from ambient noise data at a local to a global scale, Geophys. J. Int., 184, 494–506. https://doi.org/10.1111/j.1365-246X.2010.04861.x

Shapiro N.M., Campillo M., Stehly L., Ritzwoller M., 2005. High resolution surface wave tomography from ambient seismic noise, Science, 307, 1615–1618. https://doi.org/10.1126/science.1108339

Stehly L., Fry B., Campillo M., Shapiro N.M., Guilbert J., Boschi L., Giardini D., 2009. Tomography of the Alpine region from observations of seismic ambient noise, Geophys. J. Int., 178, 338–350. https://doi.org/10.1111/j.1365-246X.2009.04132.x

Valentine A., Kalnins L., 2016. An introduction to learning algorithms and potential applications in geomorphometry and earth surface dynamics, Earth Surf. Dyn., 4, 445–460. https://doi.org/10.5194/esurf-4-445-2016

Valentine A.P., Trampert J., 2012. Data space reduction, quality assessment and searching of seismograms: autoencoder networks for waveform data, Geophys. J. Int., 189(2), 1183–1202. https://doi.org/10.1111/j.1365-246X.2012.05429.x

Valentine A.P., Woodhouse J.H., 2010. Approaches to automated data selection for global seismic tomography, Geophys. J. Int., 182, 1001–1012. https://doi.org/10.1111/j.1365-246X.2010.04658.x

### SUPPORTING INFORMATION

Supplementary data are available at GJI online.

Figure S1: The interface of the classification tool. The red line indicates the estimated surface wave arrival based on a specified surface wave velocity.
(Note: The seismograms are plotted normalized.)

Figure S2: Example of the FDFE for two accepted (top) and two rejected (bottom) noise correlations. The upper picture shows the processed noise correlation and the lower picture shows the power spectrum (black line) as well as the extracted FDFE features (grey dots, connected by a dashed grey line).

Figure S3: Example output plot of the fplt.py function (2 WBFE iterations, example data set, seismogram 30).

Figure S4: Example output plot of the fd fplt.py function (2 FDFE iterations, example data set, seismogram 30).

Figure S5: Example csv output (RESULTS).

Figure S6: Example convergence plot for an ANN that is trained with WBFE and FDFE data.

Table S1: Extracted variables for one WBFE iteration. If more than one iteration is performed, the next iteration is performed on the noise correlation with the first waveform removed. Every iteration therefore returns nine features.

Table S2: CSV output columns of the ANN classification.

Please note: Oxford University Press is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the paper.

© The Author(s) 2017. Published by Oxford University Press on behalf of The Royal Astronomical Society.

Geophysical Journal International, Oxford University Press

# A neural network for noise correlation classification

Geophysical Journal International, Volume 212 (2) – Feb 1, 2018
7 pages

Publisher
Oxford University Press
ISSN
0956-540X
eISSN
1365-246X
D.O.I.
10.1093/gji/ggx495

### Abstract

We present an artificial neural network (ANN) for the classification of ambient seismic noise correlations into two categories, suitable and unsuitable for noise tomography. By using only a small manually classified data subset for network training, the ANN allows us to classify large data volumes with low human effort and to encode the valuable subjective experience of data analysts that cannot be captured by a deterministic algorithm. Based on a new feature extraction procedure that exploits the wavelet-like nature of seismic time-series, we efficiently reduce the dimensionality of noise correlation data while retaining the features needed for automated classification. Using global- and regional-scale data sets, we show that classification errors of 20 per cent or less can be achieved when the network training is performed with as little as 3.5 per cent and 16 per cent of the data sets, respectively. Furthermore, the ANN trained on the regional data can be applied to the global data, and vice versa, without a significant increase of the classification error. An experiment in which four students manually classified the data revealed that the classification error they would assign to each other (>35 per cent) is substantially larger than the classification error of the ANN. This indicates that reproducibility would be hampered more by human subjectivity than by imperfections of the ANN.

Key words: Neural networks, fuzzy logic; Computational seismology; Seismic noise.

### 1 INTRODUCTION

Ambient noise correlations have become a standard tool to investigate Earth structure and its temporal changes (e.g. Sabra et al. 2005; Shapiro et al. 2005; Brenguier et al. 2008; Stehly et al. 2009; Mordret et al. 2014; de Ridder et al. 2014). In contrast to earthquake- or explosion-based data, the number of noise correlations scales with the square of the number of stations.
The resulting data volumes preclude manual data selection and quality control, needed to eliminate correlations that are not plausible approximations of the inter-station Green’s function. While data selection algorithms based on various seismogram attributes have been applied successfully (Maggi et al. 2009; Krischer et al. 2015), it remains desirable to develop automated approaches that encode the invaluable human experience and intuition. Complementing attribute-based methods, artificial neural networks (ANNs) provide a means to mimic human behaviour in data analysis without compromising efficiency and reproducibility (e.g. Valentine & Woodhouse 2010; Agliz & Atmani 2013).

The concept of ANNs for large-scale data classification is to manually analyse only a small data subset, and to use this subset for network training. The trained network can then be used to automatically classify the remaining data. In addition to reducing the human effort, the ANN also allows us to encode the valuable subjective experience of a data analyst that cannot be captured by a deterministic algorithm.

An essential pre-requisite for using ANNs is the reduction of dimensionality of the input data, which is needed to reduce computational requirements. For this, the original data, where each time sample corresponds to a dimension, are transformed into a set of features that represent the data sufficiently well. Feature extraction is application-dependent, and it arguably represents the most subjective component of ANN design.

The main objectives of this work are to improve earlier ANNs based on Fourier-domain feature extraction (Valentine & Woodhouse 2010), to quantify the required size of the training data set and to compare the ANN classification to human classification by different data analysts.

This paper is organized as follows: in Section 2, we review the concept of a multilayer perceptron, the type of ANN we use for correlation classification, providing the background for Section 3, where we introduce wavelet-based feature extraction. Real-data examples are presented in Sections 4 and 5.

### 2 ARTIFICIAL NEURAL NETWORKS

An ANN is a mathematical construct, inspired by the human brain, and built from interconnected processing units called neurons. It has the ability to ‘learn’ complex relations between input and output that may otherwise be difficult to describe analytically. The theory of neural networks is well described in the literature (e.g. Poulton 2001; Haykin 2009; Aminzadeh et al. 2013), and will be summarized here only briefly. A single neuron, illustrated in Fig. 1, takes n input values $$x_1, \ldots, x_n$$ and transforms them into one output value y,

$$y = \varphi \left( b + \sum_{i=1}^{n} w_i x_i \right). \qquad (1)$$

The weights $$w_i$$ determine the sensitivity of the neuron to the individual input values, and the bias b controls the overall importance of the input. The reaction level of the neuron to some input stimulation is described by the activation function φ, typically chosen to be the sigmoid $$\varphi(v) = (1 + e^{-v})^{-1}$$ or the linear function $$\varphi(v) = v$$. The bias shifts the argument of the activation function and thereby the input level at which the neuron responds.

Figure 1. Illustration of a single neuron after Haykin (2009). In this example, the neuron transforms three inputs $$x_1$$, $$x_2$$, $$x_3$$, and the bias b into the output y, according to eq. (1).

The multilayer perceptron, illustrated in Fig. 2, organizes neurons in layers: an input layer with nodes $$x_j$$, an output layer with nodes $$y_i$$, and various hidden layers in between.
The output of a neuron in layer l serves as input to all neurons in the subsequent layer l + 1. The weights and biases are allowed to be different for each neuron.

Figure 2. Schematic illustration of a multilayer perceptron with two hidden layers. Circles represent individual neurons, shown in more detail in Fig. 1.

Using a training data set, typically a small subset of all data that can be analysed manually, the weights are adjusted such that the ANN reproduces the expected output optimally. For this, we minimize the L2 misfit χ between the true output $$y_i^{\rm true}$$ and the computed output $$y_i$$, summed over all output neurons, $$\chi =\sum _i ( y_i^{\rm true} - y_i )^2$$. This minimization can be performed efficiently using gradient-based descent algorithms with randomly chosen initial weights. The learning process is described in more detail in Valentine & Woodhouse (2010), and we use their approach to achieve a learning rate that decreases with the number of samples to ensure convergence. We also train multiple networks separately (so-called ‘committees’) and average their output to increase the confidence in the classification (Valentine & Woodhouse 2010). These and other technical details are described further in the Supporting Information, which includes software for noise correlation classification.

For our application, we use an ANN with two hidden layers and two output values, $$(y_1, y_2)$$. The hidden layers consist of 20 and 60 neurons, respectively. All activation functions are sigmoids, except in the output layer where they are linear. We train 10 networks separately and average their output for testing. We tested various combinations of the ANN parameters.
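As a minimal sketch of the network described above (not the software from the Supporting Information), the following pure-Python code evaluates eq. (1) layer by layer for a committee of 10 networks with 18 inputs, hidden layers of 20 and 60 sigmoid neurons, and 2 linear outputs, and averages the committee output. The function names, the uniform initial weight range and the random input vector are assumptions made for illustration; no training is performed, so the printed misfit is that of untrained networks.

```python
import math
import random


def sigmoid(v):
    """Sigmoid activation, phi(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def linear(v):
    """Linear activation, phi(v) = v (used in the output layer)."""
    return v


def init_network(sizes, rng):
    """Random initial weights and biases; the range [-0.5, 0.5] is an assumption."""
    network = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        network.append((weights, biases))
    return network


def forward(network, x):
    """Apply eq. (1) neuron by neuron: sigmoid hidden layers, linear output layer."""
    for k, (weights, biases) in enumerate(network):
        phi = linear if k == len(network) - 1 else sigmoid
        x = [phi(b + sum(w * xi for w, xi in zip(row, x)))
             for row, b in zip(weights, biases)]
    return x


def committee_output(networks, x):
    """Average the outputs of separately initialized (and, in practice, trained) networks."""
    outputs = [forward(net, x) for net in networks]
    return [sum(vals) / len(vals) for vals in zip(*outputs)]


def l2_misfit(y_true, y):
    """L2 misfit chi, summed over all output neurons."""
    return sum((t - c) ** 2 for t, c in zip(y_true, y))


rng = random.Random(0)
sizes = [18, 20, 60, 2]                                # inputs, two hidden layers, outputs
committee = [init_network(sizes, rng) for _ in range(10)]
features = [rng.uniform(-1.0, 1.0) for _ in range(18)]  # placeholder feature vector
y = committee_output(committee, features)
print(y, l2_misfit((1.0, 0.0), y))                     # misfit against an 'accepted' label
```

In a real application the weights would be adjusted by gradient descent on the misfit before the committee average is used for classification.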
By trial and error, we identified the two-hidden-layer network as the most accurate one that is still computationally feasible. Also by trial and error, we determined the number of training iterations and the maximum misfit at which training stops. The network design and parameter choices may have a strong influence on the classification results. A possible approach to overcome the dependence on network architecture was presented by MacKay (1996), who proposed to include many different network architectures within the committee of neural networks. Because this would drastically increase the computational costs in our case, we chose our ‘best’ network architecture by trial and error.

In the training data set, accepted (= suitable for tomography) and rejected (= unsuitable for tomography) correlations are labelled $$(y_1, y_2) = (1, 0)$$ and $$(y_1, y_2) = (0, 1)$$, respectively. While our manual classification of the training data set is binary, the ANN outputs a continuous range of tuples $$(y_1, y_2)$$ with distances $$d=\sqrt{y_1^2 + (y_2-1)^2}$$ from the ideal rejected trace (0, 1). To enable classification despite the non-binary nature of d, we follow Valentine & Woodhouse (2010) in defining a threshold $$d_0$$ above which a noise correlation is accepted. The data-set-dependent and somewhat subjectively chosen $$d_0$$ controls the trade-off between rejected traces classified as accepted and accepted traces classified as rejected.

The non-binary output $$(y_1, y_2)$$, in combination with the distance d and the threshold $$d_0$$, adds another degree of freedom and adjustability to the classification. The threshold $$d_0$$ can be adjusted depending on the targeted application. It can therefore shift the classification towards one class at the expense of the other without the need to retrain the network. Although beyond the scope of this work, the non-binary output additionally enables the definition of more than two classes.
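The thresholding step can be sketched in a few lines. The function names and the default threshold are illustrative assumptions, as is the treatment of the tie case (a distance exactly equal to the threshold is rejected here).

```python
import math


def distance_from_rejected(y1, y2):
    """Distance d of the network output (y1, y2) from the ideal rejected trace (0, 1)."""
    return math.hypot(y1, y2 - 1.0)


def classify(y1, y2, d0=1.0):
    """Accept a correlation if its distance d lies above the threshold d0."""
    return "accepted" if distance_from_rejected(y1, y2) > d0 else "rejected"


# The same ambiguous network output is classified differently for two thresholds,
# without any retraining of the network.
for d0 in (0.6, 1.1):
    print(d0, classify(0.5, 0.5, d0))
```

An ideal accepted output (1, 0) has a distance of √2 ≈ 1.41 from (0, 1), so thresholds between 0.6 and 1.1 lie between the two ideal outputs.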
Additional classes could include, for instance, an intermediate class that would need to be investigated further by an expert, or intermediate classes that offer a more refined quality grading. This flexibility is not available in Support Vector Machines (SVMs), which only produce a sharply defined binary classification (Haykin 2009), though a single ‘none-of-the-above’ class may be added (Reynen & Audet 2017).

### 3 FEATURE EXTRACTION

The computational cost for training the ANN and the amount of data needed to achieve convergence increase rapidly with the number of input nodes, a facet of the curse of dimensionality (Keogh & Mueen 2011, pp. 257–258). Therefore, the dimension of the input data should be reduced as much as possible, which is typically achieved by feature extraction. It is important to design a feature extraction process that is sensitive to the particular characteristics of the data type(s) to be classified. In the long-period, teleseismic case, Valentine & Woodhouse (2010) found that a simple frequency-domain representation of the seismograms contains sufficient information to allow identification of high-quality waveforms. However, this ‘Frequency-Domain Feature Extraction’ (FDFE) proved too inaccurate for our problem, as described in Section 5. We therefore develop our own feature extraction algorithm, which is described below.

Taking advantage of the wavelet-like nature of seismic time-series, we propose the following wavelet-based feature extraction algorithm (WBFE), schematically illustrated in Fig. 3. The waveform around the largest amplitude is approximated by a shifted and dilated Morlet wavelet,

$$\psi (t,\omega )=\pi ^{-0.25} \left( e^{i\omega t} - e^{-0.5\, \omega ^2} \right) e^{-0.5\, t^2}, \qquad (2)$$

where the frequency ω is a proxy for the number of significant oscillations. The properties of the best-fitting wavelet, illustrated in Fig. 3, constitute a set of features.
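For illustration, eq. (2) can be evaluated directly with the Python standard library. The helper `shifted_dilated_morlet`, its `shift` and `scale` parameters and the use of the real part are assumptions about how a shifted and dilated wavelet might be compared to a real-valued correlation; the fitting procedure itself is described in the Supporting Information and is not reproduced here.

```python
import cmath
import math


def morlet(t, omega):
    """Complex Morlet wavelet of eq. (2) at dimensionless time t and frequency omega."""
    carrier = cmath.exp(1j * omega * t) - math.exp(-0.5 * omega ** 2)
    return math.pi ** -0.25 * carrier * math.exp(-0.5 * t ** 2)


def shifted_dilated_morlet(t, omega, shift, scale):
    """Real part of the wavelet after shifting by `shift` and dilating by `scale`.

    This parametrization is a hypothetical choice made for this sketch.
    """
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    return morlet((t - shift) / scale, omega).real


# Sample a wavelet centred at t = 100 s with a dilation of 20 s.
times = [float(t) for t in range(0, 201, 10)]
samples = [shifted_dilated_morlet(t, omega=5.0, shift=100.0, scale=20.0) for t in times]
print(max(samples), min(samples))
```

Larger values of ω place more oscillations under the Gaussian envelope, which is why ω serves as a proxy for the number of significant oscillations.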

Figure 3. Illustration of two WBFE iterations that approximate the noise correlation time-series (grey) by two Morlet wavelets (black). For each of the wavelets, the following nine features are extracted: ω: frequency, $$t_s$$: start time, $$t_d$$: length of wavelet, p: polarity (±1), A: amplitude ratio of the wavelet and the original noise correlation, T: ratio of estimated and predicted arrival time of a surface wave train (using a group velocity of 3900 m s⁻¹ for the example shown), F: L1 misfit between noise correlation and the wavelet between $$t_s$$ and the wavelet end time $$t_e$$, R: L1 norm of the complete noise correlation after subtraction of the wavelet, N: total number of time samples. The example is for a correlation of the regional data set introduced in Section 4.

The features are chosen to include information about the waveform itself, about the amplitudes and energy content of the correlation, meta-information, and information needed to reconstruct the wavelet over the full length of the correlation. After finding the best-fitting wavelet and extracting the first set of features, we subtract the wavelet from the correlation time-series.
We then repeat the fitting and extraction iteratively, thereby producing an increasing number of features (nine per iteration). These features resolve more and more waveform details. This approach has the potential to subsequently include higher wave modes. A more detailed description of the WBFE algorithm is found in Supporting Information Section S3. For the ANN classification, we used the features of two iterations of the WBFE. The input $$x_j$$ to the ANN therefore consists of 18 extracted features. We use two iterations rather than one in order to include a more detailed description of the correlation in the classification process.

### 4 CLASSIFICATION TESTS

To assess the performance of our ANN as a function of the training data set and the subjectively chosen acceptance threshold, we performed a series of classification tests that quantify the classification error.

#### 4.1 Data sets and network setup

We consider two distinct data sets, one at regional scale and one at global scale, to assess the ability of our ANN to classify noise correlations. The regional data were recorded from 2012 January 1 to 2013 December 30 at 37 stations within and around the African continent. One year of global data, previously analysed by Ermert et al. (2016), were recorded at 113 stations in 2014. The respective station distributions are shown in Fig. 4(a). We applied standard processing, including the removal of the instrument response, and geometric normalization by trace energy to suppress earthquake signals (Schimmel et al. 2011). In the period range from 50 to 100 s, our regional and global data sets comprise 1271 and 11451 vertical-component correlations, respectively.

Figure 4. Data summary. (a) Regional and global distribution of stations, marked by red triangles. (b) Record section of selected regional-scale noise correlations for positive time lags and periods between 50 and 100 s. The dashed red line shows the group velocity of 3900 m s⁻¹. (c) Examples of an accepted and a rejected noise correlation from the regional data set.

We vary the acceptance threshold $$d_0$$ from 0.6 to 1.1 in order to investigate the influence of this subjectively chosen parameter. A group velocity of 3900 m s⁻¹ (from Fig. 4b) within the period range from 50 to 100 s is suitable for both data sets.

#### 4.2 Classification errors

To quantify the classification error, we manually classify both data sets completely (Fig. 4c). One part of the data set is then used for training, and the rest for testing the ANN’s ability to reproduce the human classification. The training data set always consists of the same number of ‘accepted’ and ‘rejected’ correlations, chosen at random. Conservatively starting with 500 of the regional correlations for training (39 per cent of all correlations) and 771 for testing yields a classification error of 16 ± 1 per cent, where the range ± 1 per cent corresponds to $$d_0$$ varying from 0.6 to 1.1. This error means that around 120 correlations in the testing data set were classified incorrectly. False positives and false negatives are weighted equally. Both occur at similar rates: decreasing $$d_0$$ increases the number of false positives and decreases the number of false negatives. The two are approximately balanced for $$d_0 \approx 1.0$$. Depending on the preferred bias towards false positives or false negatives, a different value of $$d_0$$ might be chosen for a given application. This does not require retraining of the network.
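The bookkeeping behind such a test can be sketched as follows, assuming that manual and ANN classifications are stored as lists of booleans (True for accepted). The function names and the seed are hypothetical, and the example labels are synthetic placeholders, not the data of this study.

```python
import random


def balanced_training_indices(labels, n_train, seed=0):
    """Randomly pick n_train indices, half from accepted and half from rejected traces."""
    rng = random.Random(seed)
    accepted = [i for i, lab in enumerate(labels) if lab]
    rejected = [i for i, lab in enumerate(labels) if not lab]
    half = n_train // 2
    return sorted(rng.sample(accepted, half) + rng.sample(rejected, half))


def error_rates(manual, predicted):
    """Fractions of false positives, false negatives and their equally weighted sum."""
    n = len(manual)
    false_pos = sum(1 for m, p in zip(manual, predicted) if p and not m)
    false_neg = sum(1 for m, p in zip(manual, predicted) if m and not p)
    return {
        "false_positive": false_pos / n,
        "false_negative": false_neg / n,
        "total": (false_pos + false_neg) / n,
    }


# Synthetic placeholder labels: 30 accepted and 70 rejected correlations.
manual_labels = [True] * 30 + [False] * 70
train = balanced_training_indices(manual_labels, n_train=20)
test_idx = [i for i in range(len(manual_labels)) if i not in set(train)]
print(len(train), len(test_idx))
```

For the regional test quoted above, an error of 16 per cent on 771 test correlations corresponds to 0.16 × 771 ≈ 123 misclassified traces, consistent with the ‘around 120’ stated in the text.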
Reducing the training data set to 200 correlations (16 per cent of all correlations) increases the classification error to 19 ± 1 per cent. For the global data set, we obtain similar classification errors: 14 ± 1 per cent for a training data set of 1000 correlations (8.7 per cent of all correlations), and 19 ± 2 per cent for a training data set of 400 correlations (3.5 per cent of all correlations). The ‘ideal’ number of training samples depends on the data set and the application. Since the use of ANNs for seismic data classification is in its infancy, it is still too early for more generally valid statements on the required size of a training data set.

It is important to note that these classification errors were obtained with a carefully designed network and training setup. A range of maximum training iterations and training misfits was investigated and tested, and the parameters with the best general performance on the training data set were adopted. Choosing the network and training parameters is a crucial part of the design of neural networks, especially to prevent over- or under-training. This issue is discussed further, for example, in Valentine & Trampert (2012).

Since network training is computationally expensive, it would be attractive to train the ANN on one data set only and then reuse it for the other one without retraining. Applying the ANN trained with 500 regional correlations to the testing data set of 10450 global correlations results in a classification error of 18 per cent ($$d_0 = 1.1$$), compared to 14 ± 1 per cent for training with 1000 samples of the global data set. Conversely, the ANN trained with 400 global correlations classifies 19 per cent of the 1071 regional test correlations incorrectly ($$d_0 = 1.1$$), compared to 19 ± 1 per cent for training with 200 samples of the regional data set. This cross-data-set approach can also be used to estimate over- or under-training of the artificial neural network.
The applicability of one trained network committee to another data set with similar classification results can be interpreted as evidence that the classification generalizes and therefore that over-training has been avoided.

### 5 COMPARISONS

To put these classification errors into a broader perspective, we compare them to results obtained with our implementation of the FDFE of Valentine & Woodhouse (2010), and to classifications by human data analysts. Note that the FDFE algorithm we implemented does not consider event depth and source-receiver location, but only our interpretation of the N-point representation of the power spectrum as well as the L1 norm of the power spectrum and the interstation distance. A more detailed description of our algorithm as well as visualized examples can be found in Supporting Information Section S3.

#### 5.1 Fourier-domain feature extraction

The FDFE, originally developed for the classification of earthquake recordings (Valentine & Woodhouse 2010), defines the spectral amplitudes of a time-series within a small number of frequency bands as features. These features are complemented by the integrated spectral power and the epicentral distance. To ensure a meaningful comparison, we implemented FDFE with 18 features, that is, the same number used for WBFE with two wavelets. In the classification tests described in Section 4, the ANN based on FDFE produced classification errors of around 40 ± 5 per cent, approximately twice the classification error of the ANN using WBFE. Despite extensive testing with different network topologies and acceptance thresholds, we were not able to improve these results. The reason could be that the power spectra of the ‘accepted’ and ‘rejected’ noise correlations are too similar. This is visualized in Fig. S2. The low classification accuracy, in fact, motivated the development of the WBFE in the early stages of this study.
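A frequency-domain feature vector of this kind can be sketched as follows. This is an assumed, simplified reading of the description above (16 band-averaged power values plus the L1 norm of the power spectrum and the interstation distance, giving 18 features), not the implementation from the Supporting Information. The band layout and function names are hypothetical, and a naive discrete Fourier transform is used only to keep the example free of external dependencies.

```python
import cmath
import math


def power_spectrum(trace):
    """Power spectrum at non-negative frequencies via a naive O(N^2) DFT."""
    n = len(trace)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * j / n) for j, x in enumerate(trace))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def fdfe_features(trace, distance, n_bands=16):
    """Band-averaged power, L1 norm of the power spectrum, and interstation distance."""
    power = power_spectrum(trace)
    if len(power) < n_bands:
        raise ValueError("trace too short for the requested number of bands")
    # Equal-width bands in frequency-sample index (an assumption of this sketch).
    edges = [round(i * len(power) / n_bands) for i in range(n_bands + 1)]
    bands = [sum(power[a:b]) / (b - a) for a, b in zip(edges[:-1], edges[1:])]
    # The power spectrum is non-negative, so its L1 norm is simply its sum.
    return bands + [sum(power), distance]


# Synthetic test trace: a pure sinusoid with four cycles over 64 samples.
trace = [math.sin(2.0 * math.pi * 4.0 * j / 64.0) for j in range(64)]
features = fdfe_features(trace, distance=1500.0)
print(len(features))
```

Because such features discard phase and timing information, two traces with similar power spectra receive nearly identical feature vectors, which is consistent with the explanation offered above for the higher FDFE error.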
#### 5.2 Human neural networks

As we can conclude from Section 4, our ANN can be considered a successful encoder of human intuition, with a failure rate of mostly less than 20 per cent. The inherent subjectivity in the preparation of the training data set raises the question of how other human neural networks (brains) would have solved the classification task. To address this issue, four students, in the following labelled A, B, C and D, were asked to classify 638 of the regional-scale noise correlations. This is the initial regional data-set size before further available data were included. Because the visual classification is time-consuming, the experiment was not repeated with the full final data set. Student A is the main author of this paper. The presented traces are also included in the full regional data set, making the conditions comparable. While students A and C had prior experience with observed noise correlations, student B had only worked with earthquake data for tomography. Student D had only worked with synthetic noise correlations. The results of the experiment are summarized in Fig. 5.

Figure 5. Human neural network experiment. Left: classification of 638 regional-scale correlations by four students, labelled A, B, C and D. Two of the students (A and C) had prior experience with noise correlation data. Right: total number of acceptances by the four students. Only 19 correlations were accepted by all of them. The first column labelled ‘NN(A)’ shows classifications by a neural network trained with the classifications of student A (‘NN Training’). The traces labelled ‘NN Testing’ are traces that are completely new to the trained ANN.

All classifications for the ANNs from Section 4 were made by student A and can be seen as the reference for the other participants in this study. To allow a comparison of the ‘human neural network’ to the artificial neural network, the ANN classification algorithm was applied to the classifications of student A. This is visualized in the first column of Fig. 5. The top 200 traces were chosen randomly and used for training the ANN. The traces labelled ‘NN Testing’ are the traces the neural network classified after training it with the ‘NN Training’ data set. The total classification error of the ANN (on the testing data set) is around 22 per cent, which is good considering the small sample and training data-set size.

Students A and B produced identical classifications for ∼400 correlations, meaning that A would assign a classification error of ∼37 per cent to the human neural network B, and vice versa. Data analyst C and the noise correlation theoretician, D, accepted significantly fewer traces, so that only 19 correlations were accepted by all four students. The large discrepancies show that even within a group of seismologists, the subjective opinion of what is a ‘useful’ and a ‘useless’ trace for tomography varies strongly. The difference in the classifications might also be due to different contextual information. Waveforms are usually not classified in isolation, but taking prior knowledge such as the data origin into account. The discrepancies between the classifications of student A and those of the ANN trained with the ‘intuition’ of student A may also reflect inconsistencies in the classification by student A.
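The mutual classification error quoted for students A and B follows from simple counting, as the sketch below shows. The label lists are synthetic stand-ins that reproduce only the approximate number of agreements (∼400 of 638), not the actual classifications, and the function name is an assumption.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of traces on which two analysts disagree (symmetric in A and B)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both analysts must classify the same traces")
    differing = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return differing / len(labels_a)


# Synthetic stand-in: identical labels on 400 of 638 traces, different on the rest.
labels_a = [True] * 638
labels_b = [True] * 400 + [False] * 238
print(round(100.0 * disagreement_rate(labels_a, labels_b), 1))
```

With 400 agreements out of 638 traces, the rate is (638 − 400)/638 ≈ 0.37. The measure is symmetric, which is why the same error applies ‘vice versa’.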
While the small number of participants in our test precludes general conclusions, it still indicates that an ANN may reproduce the data selection intuition of an expert more reliably than another expert (comparing A and B). Reproducibility and data quality, both major issues in seismological studies, may thus be improved by consistently using ANNs trained by an experienced data analyst.

### 6 DISCUSSION AND CONCLUSIONS

We presented an ANN for the classification of ambient noise correlations, using a new WBFE. Applied to a regional and a global correlation data set, we achieve classification errors of 20 per cent or less when using as little as 16 per cent and 3.5 per cent of all data for network training, respectively. In the following paragraphs, we further discuss the origin of the remaining misclassifications, the subjectivity in the design of the network and its possibly limited universality, and the use of the ANN-classified correlations in future ambient noise tomographies.

#### 6.1 Origin of classification errors

While the tests presented in Section 4 naturally cannot cover all possible scenarios, they indicate that a classification error below ∼20 per cent can be achieved when the training data set comprises ∼5–10 per cent and ∼15–20 per cent of all data for the global and regional data set, respectively. Increasing the size of the training data set did not reduce these errors substantially. These numbers and statements are, of course, specific to our application examples, and their general validity remains to be assessed in future applications.

Since the network training converged almost perfectly (an example convergence plot is presented in Fig. S6), classification inconsistencies remain as one of the most likely origins of the classification errors. For instance, one of two similar time-series may be accepted while the other is rejected by the data analyst who prepares the training data set.
If yet another time-series of ambiguous quality is then to be classified by the ANN, the outcome is rather unpredictable and not necessarily in line with the analyst’s subjective opinion. When applied to a genuinely different data set, the classification error of the ANN naturally increases because it may encounter data with properties that it has not seen before. This is the case for the ANN trained on regional data and applied to global data (Section 4). Possible origins of the classification errors are thus inconsistencies in the classification by the data analyst and the analyst’s use of contextual information about the background of the data set.

To overcome some of the mentioned problems with the presented learning algorithm, more advanced but also more complex machine learning algorithms such as ‘deep learning’ and unsupervised learning may be worth exploring in the future. Popular algorithms are discussed in Valentine & Trampert (2012) and Valentine & Kalnins (2016). Other dimensionality reduction algorithms, such as autoencoders as discussed in Valentine & Trampert (2012), could also be considered in seismology.

#### 6.2 Subjectivity and universality

We determined many of the network properties, including its topology and feature extraction, by trial and error. Since not all possible options can be tested, network design has an unavoidable subjective and artistic component. This inherent subjectivity and the need to design a network using some specific data have far-reaching consequences: (1) The ANN presented here may not be the best one for our purpose. (2) The design of a network is data-dependent. Therefore, the ANN that we applied successfully to our data may not be useful for other data. This lack of universality may constitute a significant drawback of ANNs. If an ANN, including the feature extractor, needed to be redesigned to achieve acceptable results for a new data set, its usability would be rather limited.
The extent to which this lack of universality is indeed a problem remains to be seen, as ANN applications in seismology become more widespread. Considering the wide range of geophysical data types and their application range, it also seems unreasonable to expect a ‘universally true’ machine learning solution for data classification. The adaptation of methods to target applications is certainly time-consuming, and the suitability of machine learning techniques for data classification must be decided on a case-by-case basis.

ACKNOWLEDGEMENTS

The authors thank the editor Joerg Renner as well as the reviewer Andrew Valentine and one anonymous reviewer for their reviews of and comments on the manuscript. This work was supported by the Swiss National Supercomputing Centre (CSCS) through project ch1. We thank Laura Ermert, Korbinian Sager and Dirk-Philip van Herwaarden for participating in our classification experiment. All seismic data were obtained from the IRIS data centre (www.iris.edu). The Supporting Information includes software for the visual classification of seismic data, feature extraction, the preparation of training data sets, and the ANN-based seismogram classification. The documentation covers a small step-by-step example as well as more background on the feature extraction methods. The software and the example are available on the ETH Computational Seismology website (http://www.cos.ethz.ch/software.html). This download includes documentation with further descriptions of the discussed algorithms and the accompanying software.

REFERENCES

Agliz D., Atmani A., 2013. Seismic signal classification using a multi-layer perceptron neural network, Int. J. Comput. Appl., 79, 35–43.

Aminzadeh F., Sandham W., Leggett M., 2013. Geophysical Applications of Artificial Neural Networks and Fuzzy Logic, Springer.

Brenguier F., Campillo M., Hadziioannou C., Shapiro N.M., Nadeau R.M., Larose E., 2008.
Postseismic relaxation along the San Andreas fault at Parkfield from continuous seismological observations, Science, 321, 1478–1481. https://doi.org/10.1126/science.1160943

de Ridder S.A.L., Biondi B.L., Clapp R.G., 2014. Time-lapse seismic noise correlation tomography at Valhall, Geophys. Res. Lett., 41, 6116–6122. https://doi.org/10.1002/2014GL061156

Ermert L., Villasenor A., Fichtner A., 2016. Cross-correlation imaging of ambient noise sources, Geophys. J. Int., 204, 347–364. https://doi.org/10.1093/gji/ggv460

Haykin S., 2009. Neural Networks and Learning Machines, Prentice Hall.

Keogh E., Mueen A., 2011. Encyclopedia of Machine Learning, Springer.

Krischer L., Fichtner A., Žukauskaitė S., Igel H., 2015. Large-scale seismic inversion framework, Seismol. Res. Lett., 86, 1198–1207. https://doi.org/10.1785/0220140248

MacKay D.J., 1996. Hyperparameters: optimize, or integrate out?, Fundam. Theories Phys., 62, 43–60.

Maggi A., Tape C., Chen M., Chao D., Tromp J., 2009. An automated time-window selection algorithm for seismic tomography, Geophys. J. Int., 178, 257–281. https://doi.org/10.1111/j.1365-246X.2009.04099.x

Mordret A., Shapiro N., Singh S., 2014. Seismic noise-based time-lapse monitoring of the Valhall overburden, Geophys. Res. Lett., 41, 4945–4952. https://doi.org/10.1002/2014GL060602

Poulton M., 2001. Computational Neural Networks for Geophysical Data Processing, Elsevier.

Reynen A., Audet P., 2017. Supervised machine learning on a network scale: application to seismic event classification and detection, Geophys. J. Int., 210, 1394–1409. https://doi.org/10.1093/gji/ggx238

Sabra K.G., Gerstoft P., Roux P., Kuperman W.A., 2005.
Surface wave tomography from microseisms in Southern California, Geophys. Res. Lett., 32. https://doi.org/10.1029/2005GL023155

Schimmel M., Stutzmann E., Gallart J., 2011. Using instantaneous phase coherence for signal extraction from ambient noise data at a local to a global scale, Geophys. J. Int., 184, 494–506. https://doi.org/10.1111/j.1365-246X.2010.04861.x

Shapiro N.M., Campillo M., Stehly L., Ritzwoller M., 2005. High resolution surface wave tomography from ambient seismic noise, Science, 307, 1615–1618. https://doi.org/10.1126/science.1108339

Stehly L., Fry B., Campillo M., Shapiro N.M., Guilbert J., Boschi L., Giardini D., 2009. Tomography of the Alpine region from observations of seismic ambient noise, Geophys. J. Int., 178, 338–350. https://doi.org/10.1111/j.1365-246X.2009.04132.x

Valentine A., Kalnins L., 2016. An introduction to learning algorithms and potential applications in geomorphometry and earth surface dynamics, Earth Surf. Dyn., 4, 445–460. https://doi.org/10.5194/esurf-4-445-2016

Valentine A.P., Trampert J., 2012. Data space reduction, quality assessment and searching of seismograms: autoencoder networks for waveform data, Geophys. J. Int., 189(2), 1183–1202. https://doi.org/10.1111/j.1365-246X.2012.05429.x

Valentine A.P., Woodhouse J.H., 2010. Approaches to automated data selection for global seismic tomography, Geophys. J. Int., 182, 1001–1012. https://doi.org/10.1111/j.1365-246X.2010.04658.x

SUPPORTING INFORMATION

Supplementary data are available at GJI online.

Figure S1: The interface of the classification tool. The red line indicates the estimated surface wave arrival based on a specified surface wave velocity.
(Note: The seismograms are plotted normalized.)

Figure S2: Example for the FDFE for two accepted (top) and two rejected noise correlations (bottom). The upper picture shows the processed noise correlation and the lower picture shows the power spectrum (black line) as well as the extracted FDFE features (grey dots, connected with dashed grey line).

Figure S3: Example output plot of the fplt.py function (2 WBFE iterations, example dataset, seismogram 30).

Figure S4: Example output plot of the fd fplt.py function (2 FDFE iterations, example dataset, seismogram 30).

Figure S5: Example CSV output (RESULTS).

Figure S6: Example convergence plot for an ANN that is trained with WBFE and FDFE data.

Table S1: Extracted variables for one WBFE iteration. If more than one iteration is performed, the next iteration is performed on the noise correlation with the first waveform removed, so every iteration returns 9 features.

Table S2: CSV output columns of the ANN classification.

Please note: Oxford University Press is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the paper.

© The Author(s) 2017. Published by Oxford University Press on behalf of The Royal Astronomical Society.

### Journal

Geophysical Journal International, Oxford University Press

Published: Feb 1, 2018
