TY - JOUR
AU1 - Kojima, Simon
AU2 - Kanoh, Shin'ichiro
AB - Introduction

The brain-computer interface (BCI) has been studied as an effective communication tool for patients who have neuromuscular disorders, such as amyotrophic lateral sclerosis (ALS) [1, 2]. In such a system, the user's electrophysiological signal from the brain (e.g., the electroencephalogram (EEG)) is measured and analyzed to detect the user's intention and to control external devices. To date, many types of BCI systems have been reported; one common way to realize a BCI is to detect event-related potentials (ERPs). ERPs are time-locked brain responses that occur at a fixed time after a particular external or internal event [3].

The oddball paradigm is the standard set of circumstances for eliciting the P300 ERP [4]. In the oddball paradigm, standard stimuli are presented repeatedly and deviant stimuli are presented randomly at a low probability. When a subject attends to the deviant stimulus, the P300, a large positive response, is elicited approximately 300 ms after the onset of the deviant stimulus. The P300 can be decomposed into two major components, a frontally maximal P3a and a parietally maximal P3b [5]. It was reported that the P3b component is observed for targets that are infrequent but are in some sense expected or awaited, whereas the frontal P3a is elicited by stimuli that are truly unexpected or surprising [5]. The amplitude of the P300 over the midline electrodes (Fz, Cz, and Pz) increases from the frontal to the parietal electrode [6]. The context-updating model is a well-known account of the functional role of the P300: as stimuli are presented and evaluated, the degree to which the events are consistent with the current model of the context is assessed. When an event violates the expectations dictated by the model, and when the violation requires the model to be revised (i.e., context updating), a P300 is elicited [4].

Auditory BCIs

P300-based BCI systems reported to date mainly use the visual modality. One of the best-known visual P300 BCIs, the P300 speller, was introduced by Farwell and Donchin [7]. In this system, a six-by-six matrix containing the letters of the alphabet is presented, one of its rows or columns flashes in random order, and subjects are requested to attend to a target letter. The elicited P300 component is analyzed and detected to determine the target letter. Many studies have been conducted on the P300 speller [8-10]. However, systems using visual stimuli occupy the user's sight, and visually impaired people cannot use them.

Another way to create a BCI is to use auditory stimuli. One of the best-known early studies of auditory BCIs was conducted by Hill et al. [11]. Two oddball sequences with different interstimulus intervals (ISIs) were presented, one to each ear of a participant. Subjects were requested to pay attention to one of the two sequences. Recorded EEG signals were classified by a support vector machine (SVM) to detect the user's intention. Schreuder et al. proposed an auditory BCI that could be used to select one out of eight sequences, each presented from one of eight speakers surrounding the subject [12]. Eight different tone stimuli were presented from these speakers with a fixed interval in random order, and each speaker provided a fixed tone. Subjects were requested to pay attention to one of these sound sources.
The stimuli from the attended sound source elicited P300 activity, and machine learning was used to detect the sound source that users attended.

Auditory BCI systems using the auditory steady-state response (ASSR) have also been reported. Lopez et al. [13] and Kim et al. [14] proposed ASSR-based auditory BCIs. In these systems, two modulated (e.g., amplitude-modulated) tones with different frequencies were presented, one to each of the participant's ears. By attending to or ignoring these stimuli, the power of the alpha-band EEG and the ASSR changed, and the user's selection could be detected.

P300 spellers using auditory stimuli were tested by Furdea et al. [15]. A five-by-five matrix containing the letters of the alphabet was visually presented, and auditory stimuli of spoken numbers were assigned to each row and column. The visually presented matrix was used only for support and did not flash. Subjects were requested to attend to the target spoken number. The output character was decided in two steps: the row was chosen in the first step, and the column in the second. Markovinović et al. proposed a similar auditory speller supported by a convolutional neural network (CNN) [16].

In most of the proposed auditory BCI systems, users were requested to pay attention to one out of multiple tone sequences presented from different audio sources (e.g., the left or right ear, or one of several spatially arranged loudspeakers), and these systems did not use the properties of tones (e.g., frequency, intensity, and timbre) to create a variety of stimuli. In such systems, the number of audio sources must be increased if the number of selections is increased. However, more audio sources would decrease the participant's ability to detect the target source, and it is not practical to place many speakers around a subject.

The authors previously proposed a 2-class auditory BCI system based on auditory stream segregation [17]. Auditory stream segregation is an auditory illusion studied in the field of psychoacoustics [18-20]. When two kinds of tones (A and B) are presented alternately (ABABAB...), the tone sequence is perceived as two different auditory streams (AAAA... and BBBB...). The segregation becomes clearer as the frequency gap between the two tone sequences increases and as the time interval between tones decreases. In this BCI system, two oddball sequences consisting of tone bursts in two different frequency ranges were presented alternately to the subject's right ear with a short time interval so that they would be perceived as two segregated tone streams. Subjects were requested to pay attention to one of the two oddball sequences. P300 activity and mismatch negativity (MMN) were elicited by target stimuli in the attended oddball sequence, and the recorded EEG signals could be classified to detect which oddball sequence the subject attended. A similar system based on this research was also applied by Pokorny et al. to minimally conscious patients [21].

The current auditory BCI based on stream segregation offers only binary selection; therefore, its selection capability needs to be increased for the system to be used as a practical BCI. If the number of choices is increased, the number of presented streams increases as well; however, it then becomes more difficult to segregate the presented tone sequences into multiple streams. Thus, in this study, instead of the pure tones used in the previous study, musical tones with complex harmonics were adopted to facilitate discrimination among the streams.
Since musical tones with complex harmonics carry more information that allows users to group similar tones more easily than pure tones, discrimination among tone streams was expected to be easier. We tested a 3-class auditory BCI system based on auditory stream segregation in which the tone sequences consist of musical tones. Fig 1 shows a conceptual diagram of this system. Three tone streams consisting of musical tones are presented to the subject, and the subject pays attention to one of the streams. The attention to the streams is detected by analyzing and classifying the subject's EEG.

Fig 1. The conceptual diagram of this system. https://doi.org/10.1371/journal.pone.0303565.g001
Materials and methods

Musical tones generated by a digital audio workstation (Cakewalk by BandLab, BandLab Technologies, Singapore) were used as auditory stimuli. Piano tones (Grand Piano 1 SE) included in a MIDI sound source (SampleTank 3, IK Multimedia Production, Italy) were used.
A digital signal processor (System 3, Tucker-Davis Technologies, USA) and headphones (HDA 200, Sennheiser) were used to present these tones to the participants. The timing of the presented tones was generated by an Arduino UNO (Arduino, USA). Fig 2 shows the auditory paradigm used in the experiment, and Table 1 shows the frequency of each tone. Each stream n (n = 1, 2, 3) consists of a standard tone Sn and a deviant tone Dn. The probabilities of target (deviant) and nontarget (standard) stimuli were 0.1 and 0.9, respectively. The duration of each tone was 150 ms, and the stimulus onset asynchrony (SOA) was set to 180 ms. Auditory stimuli were presented only to the right ear of each participant.

Fig 2. Time chart of the presented tone sequence. https://doi.org/10.1371/journal.pone.0303565.g002

Table 1. Frequencies of tones. https://doi.org/10.1371/journal.pone.0303565.t001

The 64-channel EEG signals (Fp1, Fp2, AF7, AF3, AFz, AF4, AF8, F7, F5, F3, F1, Fz, F2, F4, F6, F8, FT9, FT7, FC5, FC3, FC1, FCz, FC2, FC4, FC6, FT8, FT10, T7, C5, C3, C1, Cz, C2, C4, C6, T8, TP9, TP7, CP5, CP3, CP1, CPz, CP2, CP4, CP6, TP8, TP10, P7, P5, P3, P1, Pz, P2, P4, P6, P8, PO7, PO3, POz, PO4, PO8, O1, Oz, and O2) were measured by Ag-AgCl electrodes (EASYCAP GmbH, Germany); see Fig 3 for the EEG montage. BrainAmp DC and BrainAmp MR plus amplifiers (Brain Products GmbH, Germany) were used for data acquisition. The reference and ground electrodes were placed on the right and left earlobes, respectively. Vertical and horizontal electrooculogram (EOG) signals were also recorded. Amplified signals were bandpass filtered from 0.1 Hz to 100 Hz and recorded at a sampling frequency of 1000 Hz.

Fig 3. EEG montage. https://doi.org/10.1371/journal.pone.0303565.g003

Ten male and one female subjects (aged 22-23 years) participated in the experiment. The study protocol was approved by the Review Board on Bioengineering Research Ethics of Shibaura Institute of Technology and was conducted in accordance with the Declaration of Helsinki. Before the experiment, subjects were given information orally and in writing, and written informed consent was obtained from all subjects. Subjects were recruited from July 18, 2023, to November 27, 2023.

Fig 4 shows the time chart of a session. First, all participants completed a familiarization block to learn the paradigm. Each experiment consisted of two task blocks, and three runs were conducted in each task block. Each run took five minutes to complete. Subjects were requested to count the number of target stimuli in Streams 1, 2, and 3 in the first, second, and third runs, respectively. The same block was repeated twice, and participants rested between blocks.

Fig 4. Time chart of the session. In a session, the subject was first familiarized with the task. Each task block consisted of three runs; in the first, second, and third runs, the subject was requested to attend to Stream 1, 2, and 3, respectively. The task block was conducted two times in total. https://doi.org/10.1371/journal.pone.0303565.g004
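For concreteness, the following minimal Python sketch generates one possible realization of such a three-stream oddball sequence. The round-robin interleaving of the streams (S1, S2, S3, S1, ...) is an assumption for illustration, based on Fig 2 and the authors' earlier two-stream design; real paradigms typically also constrain the spacing of deviants, and the actual tone frequencies are those of Table 1 and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

SOA = 0.180      # stimulus onset asynchrony in seconds (180 ms)
P_DEVIANT = 0.1  # probability of a deviant (target) tone within each stream
N_CYCLES = 200   # number of S1/S2/S3 presentation cycles

# Assumed round-robin interleaving: consecutive tones belong to
# streams 1, 2, 3, 1, 2, 3, ...  Each event is (onset_s, stream, tone),
# where tone is 'D' (deviant) with probability 0.1, else 'S' (standard).
events = []
for k in range(3 * N_CYCLES):
    stream = k % 3 + 1
    tone = "D" if rng.random() < P_DEVIANT else "S"
    events.append((k * SOA, stream, tone))

# First few events: (0.00, 1, 'S'), (0.18, 2, 'S'), (0.36, 3, 'S'), ...
```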
Data analysis was performed in MATLAB (R2021b, MathWorks). Recorded signals were bandpass filtered from 0.1 Hz to 40 Hz (zero-phase 2nd-order Butterworth IIR filter, slope 24 dB/octave). It has been suggested that a filter with a slope of between 12 and 24 dB/octave should be used for ERP analysis to avoid distortions produced by filtering [5]; in this study, a slope of 24 dB/octave was selected to minimize distortion while removing noise. Responses to each target stimulus in the range of −100 ms to 500 ms from onset were extracted, and the mean amplitude at baseline (−50 ms to 0 ms) was subtracted from each response. Epochs in which the amplitude exceeded ±100 μV on EEG recordings or ±500 μV on EOG recordings were excluded from further analysis. Epoched data were averaged over trials in attended and nonattended streams. Scalp topographies were plotted with EEGLAB (v2021.0) [22].

Responses to the target stimuli corresponding to the attended and nonattended streams were compared by Student's t-test (p < 0.01). Let $D_t \in \mathbb{R}^{N_{ch} \times N_t \times N_{ex}}$ denote the responses to the target stimuli corresponding to the attended stream, and let $D_{nt1}, D_{nt2} \in \mathbb{R}^{N_{ch} \times N_t \times N_{ex}}$ ($N_{ch}$ is the number of channels, $N_t$ is the number of time samples, and $N_{ex}$ is the number of epochs) denote the responses to the target stimuli corresponding to the nonattended streams, since there were one attended stream and two nonattended streams. The responses to the stimuli corresponding to the nonattended streams were concatenated over epochs into $D_{nt}$. For each channel and time sample, the significance of the difference between $D_t$ and $D_{nt}$ was tested.

Pattern classification

Fig 5 shows the flowchart of the classification pipeline. Pattern classification was performed in Python. Recorded signals were bandpass filtered from 1 Hz to 40 Hz (zero-phase 2nd-order Butterworth IIR filter, slope 24 dB/octave). All responses to the target stimuli in both the attended and nonattended streams were extracted. Each epoch was extracted in the range of −100 ms to 500 ms from onset, and the mean amplitude at baseline (−50 ms to 0 ms) was subtracted from each response. Epochs that exceeded ±100 μV on EEG recordings or ±500 μV on EOG recordings were rejected and excluded from further analysis. After that, xDAWN filters [23, 24] (number of components = 3) were estimated for both attended and nonattended epochs to enhance the ERP responses and reduce the dimension of the feature vector.

The xDAWN algorithm estimates a spatial filter that projects the EEG onto the evoked subspace containing most of the ERP response, thereby improving the signal-to-signal-plus-noise ratio (SSNR) of the ERP responses. The xDAWN filter $u$, which maximizes the SSNR, can be estimated as follows [24]. Let $X \in \mathbb{R}^{N_t \times N_s}$, where $N_t$ is the number of time samples and $N_s$ is the number of channels, be the recorded EEG data, and let $A \in \mathbb{R}^{N_1 \times N_s}$, where $N_1$ is the number of time samples of the epoch, be the ERP response, so that $X = DA + N$, where $D \in \mathbb{R}^{N_t \times N_1}$ is a Toeplitz matrix whose first column is 1 at the stimulus onsets and 0 otherwise, and $N$ is the residual (noise) activity. The SSNR is defined as
$$\rho(u) = \frac{u^{\mathsf T} \Sigma_A u}{u^{\mathsf T} \Sigma_X u},$$
with $\Sigma_A = E[(D\hat{A})^{\mathsf T} (D\hat{A})]$ and $\Sigma_X = E[X^{\mathsf T} X]$, where $\hat{A}$ is the least-squares estimate of $A$. Finally, the estimated xDAWN spatial filter is $\hat{u} = \arg\max_u \rho(u)$.

Fig 5. The flowchart of the classification pipeline. To detect selective attention to each stream, the responses to the deviant stimuli Dn in stream n were classified. The classification was done for every n ∈ {1, 2, 3}. https://doi.org/10.1371/journal.pone.0303565.g005
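To make the estimation above concrete: maximizing $\rho(u)$ reduces to a generalized eigenvalue problem in the pencil $(\Sigma_A, \Sigma_X)$. The sketch below estimates three xDAWN filters this way on synthetic placeholder data; all sizes, onset times, and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
Nt, Ns, N1 = 10_000, 8, 100            # toy sizes: samples, channels, epoch length
X = rng.standard_normal((Nt, Ns))      # stand-in for recorded EEG, X = D A + N

# Toeplitz onset matrix D (Nt x N1): D[o + j, j] = 1 when o is a stimulus onset.
onsets = np.arange(0, Nt - N1, 300)    # placeholder onset times
D = np.zeros((Nt, N1))
for o in onsets:
    D[o:o + N1, :] += np.eye(N1)

A_hat = np.linalg.lstsq(D, X, rcond=None)[0]   # least-squares ERP estimate

Sigma_A = A_hat.T @ (D.T @ D) @ A_hat / Nt     # evoked-signal covariance
Sigma_X = X.T @ X / Nt                         # total-signal covariance

# rho(u) = (u' Sigma_A u) / (u' Sigma_X u) is maximized by the leading
# generalized eigenvectors of the pencil (Sigma_A, Sigma_X).
eigvals, eigvecs = eigh(Sigma_A, Sigma_X)      # ascending eigenvalues
U = eigvecs[:, ::-1][:, :3]                    # three xDAWN spatial filters
components = X @ U                             # ERP-enhanced signal subspace
```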
Classifier based on Riemannian geometry

Barachant et al. [25-27] proposed a classifier based on Riemannian geometry. The main idea of this classifier is to handle covariance matrices, which carry the spatial information directly, and thus to exploit spatial information without explicit spatial filtering. Using Riemannian geometry, the distances between covariance matrices can be measured. In this framework, the center of gravity (Riemannian mean) of the covariance matrices derived from the epochs of each class is calculated, and the Riemannian distances between the covariance matrix of an unlabeled epoch and the Riemannian mean of each class are determined; the epoch is then assigned the label of the nearest class. This method is called minimum distance to mean (MDM). To classify covariance matrices with conventional classifiers (e.g., linear discriminant analysis (LDA) or logistic regression), the covariance matrices are projected onto the Riemannian tangent space so that they can be manipulated in Euclidean space and vectorized.

Covariance matrices were calculated by the following method [23, 24, 28]. Let $P_1 \in \mathbb{R}^{2C \times N_s}$ ($C$ is the number of xDAWN components and $N_s$ is the number of samples) be the estimated signal subspace derived by the xDAWN filters; since xDAWN filters were estimated for both the attended and nonattended classes, the dimension of $P_1$ is $2C \times N_s$. Each filtered epoch is defined as $X_i \in \mathbb{R}^{2C \times N_s}$. A super-trial $\tilde{X}_i$ was calculated by concatenating $P_1$ and $X_i$:
$$\tilde{X}_i = \begin{bmatrix} P_1 \\ X_i \end{bmatrix}. \tag{1}$$
Covariance matrices were built from these super-trials by using the sample covariance matrix (SCM) estimator [26]:
$$\Sigma_i = \frac{1}{N_s - 1} \tilde{X}_i \tilde{X}_i^{\mathsf T}. \tag{2}$$
The Riemannian distance between two covariance matrices can be computed by the following equation [25]:
$$\delta_R(\Sigma_1, \Sigma_2) = \left[ \sum_{i=1}^{4C} \log^2 \lambda_i \right]^{1/2}, \tag{3}$$
where $\lambda_i$, $i = 1 \dots 4C$, are the real eigenvalues of $\Sigma_1^{-1} \Sigma_2$. The Riemannian mean of a set of covariance matrices is then computed as [25]
$$G(\Sigma_1, \dots, \Sigma_I) = \arg\min_{\Sigma} \sum_{i=1}^{I} \delta_R^2(\Sigma, \Sigma_i). \tag{4}$$
Each covariance matrix can be vectorized by projecting it onto the Riemannian tangent space at $G$, so that it can be classified with a conventional classifier [25]:
$$v_i = \mathrm{upper}\!\left( \log\!\left( G^{-1/2} \Sigma_i G^{-1/2} \right) \right), \tag{5}$$
where $\mathrm{upper}(\cdot)$ vectorizes the upper triangular part of a symmetric matrix and $\log(\cdot)$ is the matrix logarithm.
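As a short numerical check of Eqs. (3)-(5), the following sketch uses pyRiemann's utility functions; the matrices here are random stand-ins for super-trial covariances, and the size 4C = 12 assumes C = 3 xDAWN components as in this study.

```python
import numpy as np
from scipy.linalg import eigvalsh
from pyriemann.utils.distance import distance_riemann
from pyriemann.utils.mean import mean_riemann
from pyriemann.utils.tangentspace import tangent_space

rng = np.random.default_rng(1)

def random_spd(n):
    """Toy symmetric positive-definite matrix standing in for a covariance."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

n = 12                                   # 4C with C = 3 xDAWN components
S1, S2 = random_spd(n), random_spd(n)

# Eq. (3): the lambda_i are the eigenvalues of S1^{-1} S2, i.e. the
# generalized eigenvalues of the pair (S2, S1).
lam = eigvalsh(S2, S1)
d = np.sqrt(np.sum(np.log(lam) ** 2))
assert np.isclose(d, distance_riemann(S1, S2))

# Eq. (4): Riemannian mean of a set of covariance matrices.
covs = np.stack([random_spd(n) for _ in range(20)])
G = mean_riemann(covs)

# Eq. (5): tangent-space projection at G, one feature vector per trial.
V = tangent_space(covs, G)               # shape (20, n * (n + 1) // 2)
```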
The feature vector $v_i$ was classified by logistic regression. Three binary classifiers were built to detect whether each stream was attended or not. Each stream consisted of $D_n$ and $S_n$; all responses to $D_n$ were used for classification, whereas responses to $S_n$ were not. Responses to $D_n$ when selective attention was paid to Stream n and responses when it was not paid were used, and whether each stream was attended was classified (binary classification). The classification performance was evaluated by 10-fold cross-validation. MNE-Python (0.23.3) [29], pyRiemann (0.2.7) [30], and scikit-learn (0.23.2) [31] were used to implement the classifier.

The classification results were evaluated by two metrics, accuracy and the Matthews correlation coefficient (MCC) [32, 33]. The MCC is derived by the following formula and takes values from −1 (worst) to 1 (best), with 0 corresponding to chance-level output:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \tag{6}$$
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The classification results were also evaluated by plotting the averaged confusion matrices. Since there were three classifiers for detecting selective attention to each stream, three confusion matrices were derived. For each subject and class, a confusion matrix was derived from the results of the 10-fold cross-validation, and it was averaged over the subjects.
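The classification stage described above maps closely onto the cited libraries. The sketch below approximates the pipeline under stated assumptions (random placeholder data; hyperparameters beyond nfilter = 3 and logistic regression are assumptions): pyRiemann's XdawnCovariances builds the super-trial covariances of Eqs. (1)-(2), TangentSpace implements Eq. (5), and accuracy and the MCC of Eq. (6) are computed from pooled 10-fold cross-validation predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.pipeline import make_pipeline
from pyriemann.estimation import XdawnCovariances
from pyriemann.tangentspace import TangentSpace

rng = np.random.default_rng(2)
# Placeholder epochs (n_epochs, n_channels, n_samples); real inputs would be
# the -100..500 ms responses to the deviant stimuli D_n of one stream.
X = rng.standard_normal((120, 64, 600))
y = rng.integers(0, 2, 120)            # 1 = stream attended, 0 = not attended

clf = make_pipeline(
    XdawnCovariances(nfilter=3),       # xDAWN filtering + super-trial covariances
    TangentSpace(metric="riemann"),    # Eq. (5): tangent-space feature vectors
    LogisticRegression(),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(clf, X, y, cv=cv)
print("accuracy:", accuracy_score(y, y_pred))
print("MCC:", matthews_corrcoef(y, y_pred))   # Eq. (6)
```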
Results and discussion

Fig 6 shows the averaged responses (electrode Cz) to target stimuli when subject A attended Stream 1 (red), Stream 2 (green), and Stream 3 (blue). Fig 6(a)-6(c) show the responses to target stimuli corresponding to Streams 1, 2, and 3, respectively. The gray boxes denote a significant difference between responses when subjects attended the corresponding stream and responses when subjects attended a noncorresponding stream. For subject A, P300 activity was elicited by target stimuli with a latency of approximately 350 ms when Stream 1 was attended and approximately 300 ms when Stream 2 or 3 was attended. The peak amplitudes were large, and P300 activity was elicited only by the target stimuli in the attended stream.

Fig 6. Averaged responses (electrode Cz) to target stimuli when subject A attended to each stream. In each panel, red, green, and blue denote Stream 1, Stream 2, and Stream 3, respectively. Responses to target stimuli corresponding to Streams 1, 2, and 3 are shown in (a), (b), and (c), respectively. Gray boxes show the t-test results and denote a significant difference between responses when subjects attended the corresponding stream and when they did not (Student's t-test, p < 0.01). https://doi.org/10.1371/journal.pone.0303565.g006

For subject B, P300 activity was elicited by target stimuli with a latency of approximately 250 ms when attending to Stream 2 or 3; when the subject attended to Stream 1, MMN was elicited by the target stimuli in the attended stream. For subject C, positive responses were observed in the occipital region with a latency of 250-500 ms. This activity cannot be conclusively identified as P300, but it was elicited only by the target stimuli corresponding to the attended stream. For subject G, P300 activity with small amplitudes was elicited by the target stimulus when attending to Stream 1; however, no significant response was observed when the subject attended to Stream 2 or 3.

Fig 7 shows the responses averaged over all eleven subjects. Red, blue, and green lines denote responses to target stimuli when subjects attended Streams 1, 2, and 3, respectively. Scalp topographies at each latency are also shown.
Fig 7(a)-7(c) show the responses to target stimuli corresponding to Streams 1, 2, and 3, respectively. The gray boxes show the t-test results and denote a significant difference between responses when subjects attended the corresponding stream and responses when subjects attended a noncorresponding stream.

Fig 7. Averaged responses averaged over all eleven subjects. Red, blue, and green lines denote responses to target stimuli when subjects attended to Streams 1, 2, and 3, respectively. Scalp topographies at each latency are also shown. Responses to target stimuli corresponding to Streams 1, 2, and 3 are shown in (a), (b), and (c), respectively. Gray boxes show the t-test results and denote a significant difference between responses when subjects attended to the corresponding stream and when they attended to a noncorresponding stream. https://doi.org/10.1371/journal.pone.0303565.g007

As shown in Fig 7(a), positive responses with a latency of approximately 300-400 ms had the parietally maximal scalp topography characteristic of P300 activity. Responses when subjects attended Stream 2 had smaller amplitudes, and responses when subjects attended Stream 3 had smaller amplitudes and frontally maximal scalp topographies. At a latency of 100-250 ms, the amplitude of the MMN response when the subject attended to Stream 1 was significantly larger than when the subject attended to Stream 2 or 3. In Fig 7(b), P300 activity can be observed clearly at a latency of approximately 250 ms. According to the scalp topographies, responses when the subject attended to Stream 2 had a parietal amplitude maximum, whereas responses when the subject attended to Stream 1 or 3 had a larger frontal amplitude. In Fig 7(c), P300 responses with a latency of approximately 250 ms when subjects attended to Stream 3 were significantly larger than responses when subjects attended to Stream 1 or 2. According to the scalp topographies, responses when subjects attended to Stream 3 had their amplitude maximum over the parietal region, whereas responses when subjects attended to Stream 1 or 2 had their maximum over the frontal region.

In all cases, P300 responses to target stimuli tended to have parietally maximal scalp topographies when subjects attended to the corresponding stream and frontally maximal scalp topographies when subjects attended to a noncorresponding stream. We consider the parietally maximal responses to be P3b, which reflects the subject's selective attention, and the frontally maximal responses to be P3a, which is evoked exogenously and does not reflect the subject's selective attention [5]. In five out of eleven subjects, P300 responses were elicited by the target stimulus only when subjects attended to the corresponding stream. Furthermore, in five subjects, the amplitude of the MMN response elicited by the target stimulus was larger when the subject attended to the corresponding stream than when the subject attended to a noncorresponding stream.

Table 2 shows the classification results for each subject, and Fig 8 shows the averaged classification scores for each subject and each stream. The averaged classification accuracy was over 80% for five subjects, 75%-79% for four subjects, and 65%-74% for two subjects. For subject G, as mentioned above, no significant response was observed unless Stream 1 was attended, and the average accuracy was 74%. The classification score was the lowest for subject J.
The classification accuracy was lower in some subjects than in others. The ERP plots for these subjects showed that P300 was elicited by deviant stimuli corresponding to a nonattended stream, or that the amplitude of the P300 elicited by deviant stimuli corresponding to the attended stream was quite small. A possible reason for these results is that these subjects could not perceive the presented sequences as three segregated streams. In the absence of stream perception, it may be difficult to find the deviant stimuli corresponding to the target stream, or the subjects may attend not only to the deviant stimuli of the target stream but to the deviant stimuli of all streams. For subject B, the average accuracy reached 88%. When this subject attended Stream 1, only MMN was elicited by the target stimuli, and P300 was not; nevertheless, 91% accuracy was obtained when Stream 1 was attended.

S1 Fig shows the grand-averaged confusion matrices. In this study, three classifiers were trained, and a confusion matrix for each is shown; the confusion matrices from all eleven subjects were averaged over subjects. The accuracy averaged over all subjects when Streams 1, 2, and 3 were attended was 82%, 77%, and 78%, respectively. According to this result, the accuracy for Stream 2 was slightly lower. A few subjects mentioned that attending to Stream 2 was harder than attending to the other streams, which is consistent with this result.

Fig 8. Classification results. (a) Averaged classification accuracy for each subject. (b) Averaged classification accuracy for each stream. https://doi.org/10.1371/journal.pone.0303565.g008

Table 2. Classification results. https://doi.org/10.1371/journal.pone.0303565.t002

The average accuracy of the overall results was 79%. In the proposed system, since the time required to present target stimuli in all streams was not constant, the classification interval fluctuated in the range of 0.54-9.18 s, with an expected value of 7.42 s. Assuming that the 3-class classification was performed with an accuracy of 79% every 7.42 s, the information transfer rate (ITR) is 5.12 bits/min. At a constant accuracy of 79%, the ITR ranges from 4.14 to 70.39 bits/min depending on the classification interval. Because of the fluctuating classification interval, a direct comparison with other systems is not appropriate; however, our previous system achieved an ITR of approximately 5 bits/min (with the classification interval fixed at 10 s), and the system developed by Schreuder et al. [12] achieved an average ITR of 17.39 bits/min. In the proposed system, shortening the classification interval (e.g., by using shorter SOAs or modifying the probabilities of target and nontarget stimuli) would improve the ITR; however, these approaches may also cause undesired modulation of the ERP responses. Hence, optimization is the key to improving the ITR and will be studied in future work.
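These ITR figures follow from the standard Wolpaw formula, $B = \log_2 N + P \log_2 P + (1 - P) \log_2[(1 - P)/(N - 1)]$, scaled by the number of selections per minute. The short check below reproduces them with N = 3 classes and P = 0.79.

```python
import math

def wolpaw_itr(n_classes: int, accuracy: float, interval_s: float) -> float:
    """Information transfer rate in bits/min under the Wolpaw model."""
    p, n = accuracy, n_classes
    bits = math.log2(n) + p * math.log2(p) + (1 - p) * math.log2((1 - p) / (n - 1))
    return bits * 60.0 / interval_s

print(wolpaw_itr(3, 0.79, 7.42))   # ~5.12 bits/min (expected interval)
print(wolpaw_itr(3, 0.79, 9.18))   # ~4.14 bits/min (longest interval)
print(wolpaw_itr(3, 0.79, 0.54))   # ~70.4 bits/min (shortest interval)
```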
In this study, one female out of eleven subjects participated, so the gender distribution of subjects was uneven. Only limited research has evaluated the influence of gender on BCI performance; it has been reported that the performance of motor-imagery-based and visual P300-based BCIs is higher in females [34] and males [35], respectively. There is no such report on auditory BCIs, and some other auditory BCI studies also have an uneven gender distribution [12, 15, 36, 37]. Regarding the auditory P300, it was reported that the amplitude of P300 tends to be higher in females, while the P300 latency is comparable between genders [38]. Regarding auditory scene perception, some studies showed better performance of males than females in localizing target sounds in a multisource sound environment [39, 40]; however, they also pointed out a large interindividual variability [40], and there is no clear conclusion. The effect of gender on auditory stream segregation is not yet known. Promising results were shown with a limited number of subjects in this study; still, further research is required to investigate the influence of gender differences on the performance of BCIs based on auditory stream segregation.

The proposed system can be used by presenting stimuli to a single ear, so patients who are deaf in one ear can use the system. Furthermore, this system does not require many speakers and a multichannel audio interface; it requires only headphones and an audio interface with as few as one output channel. This makes the system easier to use and less expensive; moreover, it has potential for practical and medical usage. In this study, 64-channel EEG signals were recorded; however, this number of channels is too large for practical usage. Since the xDAWN filter was applied, the number of channels can be reduced based on the contribution of each channel to the spatial filter; the channel reduction method based on the xDAWN algorithm proposed by Rivet et al. [24] is a possible option. Furthermore, more sophisticated machine learning methods could be used to improve the ITR and robustness. Additionally, optimizing the parameters of the tones themselves and their sequences (e.g., frequency, timbre, and SOA) could enhance users' ability to discriminate among streams and improve the system.

Conclusion

In this study, three oddball sequences consisting of musical tones were presented to each subject's right ear. Subjects were asked to pay attention to one of the presented sequences and count the number of target stimuli, and the responses to each target stimulus were analyzed and classified based on Riemannian geometry. P300 activity was elicited by the subject's selective attention to a tone stream, and the attended stream could be detected with high accuracy by classifying the responses elicited by the target stimuli in each stream.

Multiclass auditory BCI systems proposed to date mainly use the location of sound sources to create a variety of auditory stimuli [11, 12]; hence, these systems do not make the best use of the properties of tones, such as frequency, intensity, or timbre. In our previous research, an auditory illusion called stream segregation was tested with promising results; however, it allowed only a binary decision. In this study, by utilizing musical tones, the BCI system based on auditory stream segregation was extended to three classes. The present results indicate that auditory stimuli based on stream segregation can be used in a multiclass auditory BCI system and can enhance current systems.

Supporting information

S1 Fig. Grand-averaged confusion matrix. From the classification results of the 10-fold cross-validation for each subject, a confusion matrix was derived, and the confusion matrices from all subjects were averaged over subjects.
https://doi.org/10.1371/journal.pone.0303565.s001 (TIF)

TI - An auditory brain-computer interface based on selective attention to multiple tone streams
JF - PLoS ONE
DO - 10.1371/journal.pone.0303565
DA - 2024-05-23
UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/an-auditory-brain-computer-interface-based-on-selective-attention-to-0GwL1ekeCu
SP - e0303565
VL - 19
IS - 5
DP - DeepDyve
ER -