Learning long-term filter banks for audio source separation and audio scene classification

Learning long-term filter banks for audio source separation and audio scene classification Filter banks on short-time Fourier transform (STFT) spectrogram have long been studied to analyze and process audios. The frameshift in STFT procedure determines the temporal resolution. However, in many discriminative audio applications, long-term time and frequency correlations are needed. The authors in this work use Toeplitz matrix motivated filter banks to extract long-term time and frequency information. This paper investigates the mechanism of long-term filter banks and the corresponding spectrogram reconstruction method. The time duration and shape of the filter banks are well designed and learned using neural networks. We test our approach on different tasks. The spectrogram reconstruction error in audio source separation task is reduced by relatively 6.7% and the classification error in audio scene classification task is reduced by relatively 6.5%, when compared with the traditional frequency filter banks. The experiments also show that the time duration of long-term filter banks in classification task is much larger than in reconstruction task. Keywords: Long-term filter banks, Deep neural network, Audio scene classification, Audio source separation 1 Introduction makes it very different from natural images. For example Audios in a realistic environment are typically composed in Fig. 1,(a)and (b) are two audio fragments randomly of different sound sources. Yet humans have no problem selected from an audio of “cafe” scene. We first calculate in organizing the elements into their sources to recognize the average energy distribution of the two examples in the acoustic environment. This process is called auditory the frequency direction, which is shown in (c). And then scene analysis [1]. Studies in the central auditory sys- the temporal coherence of salient audio elements in each tem [2–4] have inspired numerous hypotheses and models frequency bin is measured as (d). 
It is obvious that the concerning the separation of audio elements. One promi- energy distribution and temporal coherence vary tremen- nent hypothesis that underlies most investigations is that dously in different frequency bins, but are similar in the audio elements are segregated whenever they activate same frequency bin of different spectrograms. Thus for well-separated populations of auditory neurons that are audio signals, the spectrogram structure is not equivalent selective to frequency [5, 6], which emphasizes the audio in time and frequency direction. In this paper, we propose distinction on the frequency dimension. At the same time, a novel network structure to learn the energy distribution other studies [7, 8] also suggest that auditory scenes are and temporal coherence in different frequency bins. essentially dynamic, containing many fast-changing, rela- tively brief acoustic events. Therefore an essential aspect 1.1 Related work of auditory scene analysis is the linking over time [9]. For audio separation [10, 11] and recognition [12, 13] Problems inherent to auditory scene analysis are sim- tasks, the time and frequency analysis is usually imple- ilar to those found in visual scene analysis. However, mented using well designed filter banks. the time and frequency characteristic of a spectrogram Filter banks are traditionally composed of finite or infi- nite response filters in principle [14], but the stability *Correspondence: teng-zhang10@mails.tsinghua.edu.cn of the filters is usually difficult to be guaranteed. For Department of Electronic Engineering, Tsinghua University, Beijing, China simplicity, filter banks on STFT spectrogram have been © The Author(s). 
2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Zhang and Wu EURASIP Journal on Audio, Speech, and Music Processing (2018) 2018:4 Page 2 of 13 Fig. 1 Spectrogram examples of “cafe” scene. a, b Two audio fragments randomly selected from “cafe” scene. c The average energy distribution of the two examples in frequency direction. d The temporal coherence of the two examples in different frequency bins investigated for a long time [15]. In this case, the time 1.2 Contribution of this paper resolution is determined by the frameshift in the STFT As shown in Fig. 1, when perceptual frequency scale procedure and the frequency resolution is modelled by is utilized to map the linear frequency domain to the the frequency response of the filter banks. Frequency fil- nonlinear perceptual frequency domain [27], the major ter banks can be parameterized in the frequency domain concern comes to be how to model the energy distri- with filter centre, bandwidth, gain and shapes [16]. If these bution and temporal coherence in different frequency parameters are learnable, deep neural networks (DNNs) bins. can be utilized to learn them discriminatively [17–19]. To obtain better time and frequency analysis results, These frequency filter banks are usually used to model the we divide the audio processing procedure into two stages. frequency selectivity of the auditory system, but cannot In the first stage, traditional frequency filter banks are represent the temporal coherence of audio elements. implemented on STFT spectrogram to extract frequency DNNs are often used as classifiers when the inputs features. 
Without loss of generality, the parameters of the are dynamic acoustic features such as filter bank-based frequency filter banks are set experimentally. In the sec- cepstral features and Mel-frequency cepstral coefficients ond stage, a novel long-term filter bank spanning several [20, 21]. When the input to DNNs is a magnitude spec- frames is constructed in each frequency bin. The long- trogram, time-frequency structure of the spectrogram term filter banks proposed here can be implemented by can be learned. Neural networks organized into a two- neural networks and trained jointly with the target of the dimensional space have been proposed to model the time specific task. and frequency organization of audio elements by Wang The major contributions are summarized as follows: and Chang [22]. They utilized two-dimensional Gaussian lateral connectivity and global inhibition to parameter- - Toeplitz matrix motivated long-term filter banks: ize the network, where the two dimensions correspond Unlike filter banks in frequency domain, our proposal to frequency and time respectively. In this model, time is of long-term filter banks spreads over the time converted into a spatial dimension, temporal coherence dimension. They can be parameterized with the time can take place in auditory organization much like in visual duration and shape constraints. For each frequency organization where an object is naturally represented in bin, the time duration is different, but for each frame, spatial dimensions. However, these two dimensions are the filter shape is constant. This mechanism can be not equivalent in a spectrogram according to our analysis. implemented using a Toeplitz matrix motivated And what is more, the parameters of the network are set network. empirically and not learnable, which is still significantly - Spectrogram reconstruction from filter bank dependent on domain knowledge and modelling skill. 
coefficients: Consistent with the audio processing In recent years, neural networks with special structures procedure, we also divide the reconstruction such as convolutional neural network (CNN) [23, 24]and procedure into two stages. The first stage is a dual long short-term memory (LSTM) [25, 26]havebeenused inverse process of the long-term filter banks and the to extract the long-term information of audios. But in both second stage is a dual inverse process of the network structures, the temporal coherence is considered frequency filter banks. This paper investigates the to be the same in different frequency bins, which is in spectrogram reconstruction problem using an contradiction with Fig. 1. elaborate neural network. Zhang and Wu EURASIP Journal on Audio, Speech, and Music Processing (2018) 2018:4 Page 3 of 13 This paper is organized as follows. The next section When the number of frequency filters is equal to m,the describes the detailed mechanism of the long-term fil- long-term filter banks can be parameterized by m linear ter banks and the spectrogram reconstruction method. transformations. The parameters will be labelled as θ and Then network structures used in our proposed method discussed in the following part of this section in detail. are introduced in Section 3.Section 4 conducts several The back-end processing modules vary from different experiments to show the performance of long-term filter applications. For audio scene classification task, they will banks regarding source separation and audio scene classi- be deep convolutional neural networks followed by a fication. Finally, we conclude our paper and give directions softmax layer to convert the feature maps to the corre- for future work in Section 5. sponding categories. However, for audio source separation task, the modules will be composed by a binary gating 2 Long-term filter banks layer and some spectrogram reconstruction layers. 
We For generality, we consider in this section a long-term fil- define them as nonlinear functions f .The long-termfilter ter bank learning framework based on neural networks as bank parameters θ can be trained jointly with the back- Fig. 2. end parameters γ using back propagation method [33]in The input audio signal is first transformed to a sequence neural networks. of vectors using STFT [28]; the STFT result can be repre- sented as X ={x , x , ..., x }. T is determined by the 2.1 Toeplitz motivation 1...T 1 2 T frame shift in STFT, the dimension of each vector x can be The long-term filter banks in our proposed method are labelled as N, which is determined by the frame length. used to extract the energy distribution and temporal The frequency filter banks can be simplified as a lin- coherence in different frequency bins which have been T T T discussed in Section 1. As shown in Fig. 4,the long-term ear transformation y = f x , f x , ..., f x ,where f t t t t 1 2 m k filter banks can be implemented by a series of filters with is the weights of the k-th frequency filter. In the his- different time durations. If the output of the frequency tory of auditory frequency filter banks [29], the rounded filter banks is y , and the long-term filter banks are param- exponential family [30] and the gammatone family [31] eterized as W = {w , w , ..., w }, the operation of the 1 2 m are the most widely used families. We use the sim- long-term filter banks can be mathematically represented plest form of these two families, triangular shape for the as Eq. 1. T is the length of the STFT output, m is the rounded exponential family and Gaussian shape for the dimension of y , which also represents the number of fre- gammatone family. For triangular filter banks, the band- quency bins, w is a set of T positive weights to represent width is 50% overlapped between neighbouring filters. For the time duration and shape of the k-th filter. In Fig. 
4 Gaussian filter banks, the bandwidth is 4σ,where σ rep- for example, w is a rectangular window with individual resents the standard deviation in the Gaussian function. width, each row of the spectrogram is convolved by the These two types of frequency filter banks are the base- corresponding filter. lines in this paper, respectively named TriFB and GaussFB. The triangular and gaussian examples distributed uni- formly in the Mel-frequency scale [32] can be seen z = y ∗ w ,1 ≤ k ≤ m (1) t,k i,k k,i−t in Fig. 3. i=1 Fig. 2 Long-term filter banks learning framework. The left part of the framework is the feature analysis procedure including STFT, frequency filter banks and long-term filter banks. The right part is the application examples of the extracted feature map, such as audio scene classification and audio source separation. Long-term filter banks in the feature analysis procedure and the back-end application modules are stacked into a deep neural network Zhang and Wu EURASIP Journal on Audio, Speech, and Music Processing (2018) 2018:4 Page 4 of 13 Fig. 3 Shape of frequency filter banks. a The triangular filter banks. b The gaussian filter banks As a matter of fact, the operation in Eq. 1 is a series of frequency bin is T. This assumption is unreasonable espe- one-dimensional convolutions along time axis. We rewrite cially when T is extremely large. The long-term correla- it using the Toeplitz matrix [34] for simplicity. In Eq. 2,the tion should be limited to a certain range according to our tensor S ={S , S , ..., S } represents the linear transfor- intuition. Inspired by traditional frequency filter banks, 1 2 m mation form of long-term filter banks in each frequency we attempt to use the parameterized window shape to bin. z in Eq. 2 is equivalent to {z , z , ..., z } in Eq. 1. limit the time duration of long-term filter banks. k 1,k 2,k T,k In this case, long-term filter banks can be represented In Fig. 
4, rectangular shapes with time durations of 3, as a simple form of tensor operation, which can be eas- 2, 1 and 2 frames are utilized as an interpretation. From ily implemented by a Toeplitz motivated network layer. the theory of frequency filter banks, triangular and gaus- According to [35], Toeplitz networks are mathematically sian shapes are also commonly used options. However, tractable and can be easily computed. rectangular and triangular shapes are not differentiable and unable to be incorporated into a scheme of a back- propagation algorithm. Thus in this paper, the shape of z = y ˆ S ,1 ≤ k ≤ m k k long-term filter banks is constrained using the Gaussian y ˆ = y , y , ..., y k 1,k 2,k T,k function as Eq. 3. The time duration of long-term filter ⎡ ⎤ w w ··· w banksislimited by σ , the strength of each frequency bin k,0 k,−1 k,1−T k ⎢ ⎥ w w ··· w is reconstructed by α , the total number of parameters k,1 k,0 k,2−T k ⎢ ⎥ S = (2) ⎢ ⎥ k . . . . reduces from 2mT in Eq. 2 to 2m in Eq. 3. . . . . ⎣ ⎦ . . . w w ··· w k,T −1 k,T −2 k,0 t w = α · exp − ,1 ≤ k ≤ m (3) k,t k 2.2 Shape constraint If W is totally independent, S is a dense Toeplitz matrix, When we initialize the parameters α and σ randomly, k k which means that the time duration of the filter in each we believe that the learning will be well behaved, which Fig. 4 Model architecture of long-term filter banks. Each row of the spectrogram is convolved by a filter bank with individual width. In this sketch map, time durations of the filter banks in the highest four frequency bins are 3, 2, 1 and 2 frames Zhang and Wu EURASIP Journal on Audio, Speech, and Music Processing (2018) 2018:4 Page 5 of 13 −1 is the so-called “no bad local minim” hypothesis [36]. y = z S ,1 ≤ k ≤ m (5) However, a different view presented in [37] is that the However, considering that S is a Toeplitz matrix, R underlying easiness of optimizing deep networks is rather k k canberepresented in asimpleway [40]. 
R is given by tightly connected to the intrinsic characteristics of the k ˆ ˆ Eq. 6,where A , B , A and B all are lower triangular data these models are run on. Thus for us, the initializa- k k k k Toeplitz matrices given by Eq. 7. tion of parameters is a tricky problem, especially when α and σ have clear physical meanings. 1 T If σ in Eq. 3 is initialized with a value larger than ˆ ˆ R = A B − B A (6) k k k k 1.0, the corresponding S is approximately equal to a k- tridiagonal Toeplitz matrix [38], where k is less than 3. ⎛ ⎞ Thus, if the totally independent W is initialized with an ⎛ ⎞ 0 ··· 00 a 0 ··· 0 identity matrix, similar results with limited time dura- ⎜ ⎟ a a ··· 0 a ··· 00 ⎜ ⎟ 2 1 n ⎜ ⎟ tions should be obtained. Whether it is the Gaussian ⎜ ⎟ A = . . . , A = ⎜ ⎟ k . k . . . ⎝ . . . . ⎠ . . . . . ⎝ ⎠ shape-constrained algorithm as Eq. 3 or is the totally . . . . . . . a a ··· a n n−1 1 independent W in Eq. 2, the initialization of parameters a ··· a 0 2 n ⎛ ⎞ is important and intractable when adapting to differ- ⎛ ⎞ 0 ··· 00 b 0 ··· 0 ent tasks. More details will be discussed and tested in ⎜ ⎟ b b ··· 0 b ··· 00 ⎜ n−1 n ⎟ 1 ⎜ ⎟ Section 4. ⎜ ⎟ ˆ B = , B = . . . ⎜ ⎟ k . k . . . ⎝ ⎠ . . . . . . . . ⎝ ⎠ . . . . . . . b b ··· b 1 2 n 2.3 Spectrogram reconstruction b ··· b 0 n−1 1 In our proposal of learning framework as Fig. 2,STFT (7) spectrogram is transformed into subband coefficients Note that a and b canalsoberegardedasthe solutions after frequency filter banks and long-term filter banks. of two linear systems, which can be learned using a fully The dimension of subband coefficients z is usually much connected neural work layer. In this case, the number of less than x to reduce computational cost and extract sig- parameters reduces from mT to 2mT. nificant features. 
In this case, the subband coefficients In conclusion, the spectrogram reconstruction proce- are incomplete, perfect spectrogram reconstruction from dure can be implemented using a two-layer neural net- subband coefficients is impossible. work. When the first layer is implemented as Eq. 5,the The spectrogram vector x is firstly transformed using total number of parameters is mN + mT . While when frequency filter banks described at the beginning of this section. Then long-term filter banks work as Eq. 2 to get the first layer is represented as Eq. 6,the totalnumber thesubband coefficients.Thusthe processofthe con- is mN + 2mT. Experiments in Section 4.1 will show the version from spectrogram vector to filter subband coef- difference between these two methods. ficients and the dual reconversion can be represented as Eq. 4. The operation of frequency filter banks f can be 3 Training the models As described in Section 2, the long-term filter banks we simplified as a singular matrix F where the number of proposed here can be integrated into a neural network rows is much less than columns. The reconversion process −1 (NN) structure. The parameters of the models are learned f is approximately the Moore-Penrose pseudoinverse jointly with the target of the specific task. In this section, [39]of F; this module can be easily implemented using a we introduce two NN-based structures respectively for fully connected network layer. However, the tensor opera- audio source separation and audio scene classification tion of long-term filter banks f is much more intractable. tasks. z = f (f (x )) t 2 1 t 3.1 Audio source separation −1 −1 x = f f (z ) (4) t t 1 2 In Fig. 5a, the procedures of STFT and frequency fil- ter banks in Fig. 2 are excluded from the NN structure because they are implemented empirically and have no Without regard to the special structure of Toeplitz −1 parameters. The NN structure for audio source separation matrix, f can be mathematically represented as Eq. 5. 
task is divided into four steps, in which three steps have S is a nonsingular matrix which has been defined in −1 been discussed in Section 2. The layers of long-term filter Eq. 2. In general, S is another nonsingular matrix R banks and inverse of long-term filter banks are imple- which can be learned using a fully connected network mented respectively as Eqs. 2 and 5, which can be denoted layer independently. There are m frequency bins in total, as h and h . The reconstruction layer is constructed using so m parallel fully connected network layers are needed 1 2 a fully connected layer and can be denoted as h . and the number of parameters is mT . 4 Zhang and Wu EURASIP Journal on Audio, Speech, and Music Processing (2018) 2018:4 Page 6 of 13 Fig. 5 NN-based structures with proposed method. a The NN structure for audio source separation task. b The NN structure for audio scene classification task We attempt the audio separation from an audio mixture 3.2 Audio scene classification using a simple masking method [41], which can be repre- In early pattern recognition studies [42], the input is first sented as the binary gating layer in Eq. 8 and denoted as converted into some features, which are usually defined h . The output of this layer is a linear projection modu- empirically by experts and believed to be identified with lated by the gates g . These gates multiply each element of the recognition targets. In Fig. 5b, a feature extraction the matrix Z and control the information passed on in the structure including the long-term filter banks is proposed hierarchy. Stacking these four layers on the top of input Y to systematically train the overall recognizer in a manner gives a representation of the separated clean spectrogram consistent with the minimization of recognition errors. X = h ◦ h ◦ h ◦ h (Y ). 
The NN structure for audio scene classification task can 4 3 2 1 also be divided into four steps, where the first layer of N long-term filter banks is implemented using Eq. 2.The g = sigmoid z v ti tj ji j=1 (8) convolutional layer and the pooling layer are conducted o = z g ti ti ti using the network structure described in [43]. In general, let z refer to the concatenation of frames after long- i:i+j Neural networks are trained on a frame error (FE) min- term filter banks z , z , ...z . The convolution operation i i+1 i+j hm imization criterion and the corresponding weights are involves a filter w ∈ R , which is applied to a window of adjusted to minimize the error squares over the whole h frames to produce a new feature. For example, a feature training data set. The error of the mapping is given c is generated from a window of frames z by Eq. 10, i:i+h−1 by Eq. 9,where x is the targeted clean spectrogram where b ∈ R is a bias term and f is a non-linear function. and x ˆ is the corresponding separated representation. This filter is applied to each possible window of frames to As commonly used, L2-regularization is typically chosen produce a feature map c =[ c , c , ...c ]. Then a max- 1 2 T −h+1 to impose a penalty on the complexity of the mapping, overtime pooling operation [44] over the feature map is which is the λ term in Eq. 9. However, when the layer applied and the maximum value c ˆ = max(c) is taken as of long-term filter banks is implemented by Eq. 3,the the feature corresponding to this filter. Thus one feature is elements of w have definitude physical meanings. Thus, extracted using one filter. This model uses multiple filters L2-regularization is operated only on the upper three lay- with varying window sizes to obtain multiple features. ers in this model. In this case, the network in Fig. 5a can be optimized by the back-propagation method. 
c = f (w · z + b) (10) i i:i+h−1 T 4 The features extracted from the convolutional and pool- 2 2 ˆ ing layers are then passed to a fully connected soft- =  x − x  +λ  w  (9) t t l t=1 max layer to output the probability distribution over l=2 Zhang and Wu EURASIP Journal on Audio, Speech, and Music Processing (2018) 2018:4 Page 7 of 13 categories. The classification loss of this model is given by the MIR-1K dataset [46]. The dataset consists of 1000 Eq. 11,where n is the number of audios, k is the num- song clips recorded at a sample rate of 16kHz, with berofcategories, y is the category labels and p is the durations ranging from 4 to 13 s. The dataset is then probability distribution produced by the NN structure. In utilized with 4 training/testing splits. In each split, 700 this case, the network in Fig. 5b can be optimized by the of the examples are randomly selected for training and back-propagation method. the others for testing. We use the mean average accu- racy over the 4 splits as the evaluation criterion. In order n k 4 to achieve a fair comparison, we use this dataset to cre- = y · log(p ) + λ  w  (11) i,j i,j l ate3sets of mixtures.For each clip,wemix thevocal i=1 j=1 l=2 and music track under various conditions, where the energy ratio between music and voice takes 0.1, 1 and 10 4 Experiments respectively. To illustrate the properties and performance of long-term We first test our methods on the outputs of frequency filter banks proposed in this paper, we conduct two groups filter banks. In this case, the combination of classical of experiments respectively on audio source separation frequency filter banks and our proposed temporal filter and audio scene classification. To achieve a fair compar- banks work as two-dimensional filter banks on mag- ison with traditional frequency filter banks, all experi- nitude spectrograms. 
Classical CNN models can learn ments conducted in this section utilize the same settings two-dimensional filters on spectrograms directly. Thus we and structures except for the items listed below. introduce a 1-layer CNN model as a comparison. The CNN model is implemented as [22], but the convolutional - Models: The models tested in this section are layer here is composed of learnable parameters, instead of different from each other in two aspects. The variants constant Gaussian lateral connectivity in [22]. This con- of frequency filter banks include TriFB and GaussFB, volution layer works as a two-dimensional filter whose as described in Section 2. For long-term filter banks, size is set to be 5 × 5, the outputs of this layer is then Gaussian shape-constrained filters introduced in processed as Fig. 5a.Weuse theNNmodel in [47]and Section 2.2 are named GaussLTFB and totally the one-layer CNN model as our baseline models. For independent filters are named FullLTFB. The our proposed long-term filter banks, we test two vari- baseline of our experiments has no long-term filter ant modules: GaussLTFB and FullLTFB which have been banks, which is labelled as Null. The initials of the defined at the beginning of Section 4. For FullLTFB situ- names are used to differentiate models. For example, ation, two initialization methods discussed in Section 2.2 when TriFB and FullLTFB are used in the model, the are tested respectively. The three variant modules Gaus- model is named TriFB-FullLTFB. sLTFB, FullLTFB-Random and FullLTFB-Identity can be - Initialization: When we use totally independent utilized on two types of frequency filter banks TriFB and filters as the long-term filter banks, two initialization GaussFB respectively, thus a total of six long-term filter methods discussed in Section 2.2 are tested in this banks related experiments are conducted in this part. section. When the parameters are initialized Table 1 shows the results of these experiments. 
From randomly, the method is named Random, while when the results, we can get conclusions as follows. First, the the parameters are initialized using an identity best results in the table are obtained using long-term filter matrix, the method is named Identity. banks, which demonstrates the effectiveness of our pro- - Reconstruction: When the spectrogram posal, especially when the energy of interference is larger reconstruction is implemented as Eq. 5, the method than music. As an example, when we use gaussian fre- is named Re_inv, while when the reconstruction is quency filter banks and the energy ratio between music implemented as Eq. 6, the method is named Re_toep. and voice is 1, the reconstruction error is reduced by In all experiments, the audio signal is first transformed relatively 6.7% by using Gaussian shape-constrained long- using short-time Fourier transform with a frame length of term filter banks. Second, totally independent filters are 1024 and a frameshift of 220. The number of frequency severely influenced by the initialization. When the param- filters is set to be 64; the detailed settings of NN structures eters are initialized using an identity matrix, the perfor- are shown in Fig. 5. All parameters in the neural network mance is close to the Gaussian shape-constrained filters are trained jointly using Adam [45] optimizer; the learning in this task. However, when the parameters are initialized rate is initialized with 0.001. randomly, the reconstruction error seems to be unable to converge effectively. 
This result has to do with the task itself, which will be further tested in Section 4.3.Then, 4.1 Audio source separation the one-layer CNN model improves the performance only In this experiment, we investigate the application of long- when the energy ratio between music and voice is 0.1, term filter banks in audio source separation task using Zhang and Wu EURASIP Journal on Audio, Speech, and Music Processing (2018) 2018:4 Page 8 of 13 Table 1 Reconstruction error of audio source separation using frequency filter banks as input Re_toep Re_inv Init Method M/V = 0.1 M/V =1M/V = 10 M/V = 0.1 M/V =1M/V = 10 – TriFB-Null 3.49 1.51 0.55 3.49 1.51 0.55 – GaussFB-Null 3.28 1.47 0.58 3.28 1.47 0.58 – TriFB-CNN-1layer 2.85 1.51 0.61 2.85 1.51 0.61 – GaussFB-CNN-1layer 2.91 1.50 0.64 2.91 1.50 0.64 – TriFB-GaussLTFB 2.66 1.38 0.50 3.65 1.80 0.74 – GaussFB-GaussLTFB 2.60 1.39 0.56 3.91 1.67 0.67 Random TriFB-FullLTFB 3.90 41.37 2.28 3.84 1.83 0.78 Random GaussFB-FullLTFB 3.55 1.99 0.86 3.85 1.64 0.66 Identity TriFB-FullLTFB 2.69 1.39 0.52 3.92 1.63 0.62 Identity GaussFB-FullLTFB 2.62 1.39 0.56 3.85 1.51 0.59 M/V represents the energy ratio between music and voice this can be attributed to the local sensitivity of recon- by relatively 5.0% by using Gaussian shape-constrained struction task. As a matter of fact, the time durations of long-term filter banks, this effect is less obvious than the long-term filter banks in most frequency bins we learned result in Table 1. This is because that the information of here are1.Thus, theconvolutionsize5 × 5istoo large. magnitude spectrograms is too rich, so the performance of Finally, Toeplitz inversion motivated reconstruction algo- the simplest NN model is also good. But when the energy rithm performs much better than the direct inverse matrix of interference is larger than music, the effectiveness of algorithm. When the direct inverse matrix algorithm is our long-term filter banks is obvious. 
utilized, the performance of our proposal of long-term fil- A direct perspective of the separation results can be ter banks becomes even worse than the frequency filter seen in Fig. 6. The figure shows the clean music spec- banks. trogram (a), mixed spectrogram (b)and theseparated We now test our methods on magnitude spectrograms spectrogram (c–e) when the energy ratio is 1. For this as described in [47]. In this situation, long-term filter example, (c) is the separated spectrogram from GaussFB- banks are used as one-dimensional filter banks to extract Null which has been defined at the beginning of this temporal information. The size of magnitude spectro- section, (d) is the separated spectrogram from GaussFB- grams is 513 × 128. The settings of NN structures in GaussLTFB and (e) is the separated spectrogram from Fig. 5a are modified correspondingly to adapt to this size. GaussFB-FullLTFB. When compared with (c), the results We also use the NN model in [47]and the1-layer CNN of our proposal of long-term filter banks (d)and (e)show model as our baseline models. The three variant modules significant temporal coherence in each frequency bin, GaussLTFB, FullLTFB-Random and FullLTFB-Identity are which is more approximate to the clean music spectro- utilized on magnitude spectrograms directly in this part. gram in (a). The results of these experiments are shown in Table 2. Compared with the results in Table 1,all theconclusions 4.2 Audio scene classification above remain unchanged. When the energy ratio between In this section, we apply the long-term filter banks to the music and voice is 1, the reconstruction error is reduced audio scene classification task. 
Table 2 Reconstruction error of audio source separation using magnitude spectrograms as input

Init      Method            Re_toep                          Re_inv
                            M/V = 0.1  M/V = 1  M/V = 10     M/V = 0.1  M/V = 1  M/V = 10
–         Null [47]         2.58       0.99     0.033        2.58       0.99     0.033
–         CNN-1layer [22]   2.83       0.96     0.047        2.83       0.96     0.047
–         GaussLTFB         2.49       0.94     0.037        2.60       0.95     0.034
Random    FullLTFB          2.77       1.12     0.080        2.85       1.03     0.043
Identity  FullLTFB          2.50       0.94     0.037        2.82       0.95     0.034

M/V represents the energy ratio between music and voice

Fig. 6 Reconstructed spectrogram of audio source separation task. The clean music spectrogram in a is randomly selected from the dataset. b The corresponding music and vocal mixture. c–e The reconstructed music spectrograms from the mixture spectrogram using different configurations

We employ LITIS ROUEN dataset [48] and DCASE2016 dataset [49] to conduct acoustic scene classification experiments. Details of these datasets are listed as follows.

- LITIS ROUEN dataset: This is the largest publicly available dataset for ASC to the best of our knowledge. The dataset contains about 1500 min of acoustic scene recordings belonging to 19 classes. Each audio recording is divided into 30-s examples without overlapping, thus obtaining 3026 examples in total. The sampling frequency of the audio is 22,050 Hz. The dataset is provided with 20 training/testing splits. In each split, 80% of the examples are kept for training and the other 20% for testing. We use the mean average accuracy over the 20 splits as the evaluation criterion.
- DCASE2016 dataset: The dataset is released as Task 1 of the DCASE2016 challenge. We use the development data in this paper. The development data contains about 585 min of acoustic scene recordings belonging to 15 classes. Each audio recording is divided into 30-s examples without overlapping, thus obtaining 1170 examples in total. The sampling frequency of the audio is 44,100 Hz. The dataset is divided into fourfold. Our experiments obey this setting, and the average performance will be reported.

For both datasets, the examples are 30 s long. In the data preprocessing step, we first divide the 30-s examples into 1-s clips with 50% overlap. Then each clip is processed using neural networks as in Fig. 5b. The classification results of all these clips are averaged to get an ensemble result for the 30-s example. The size of audio spectrograms is 64 × 128. For the CNN structure in Fig. 5b, the window sizes of the convolutional layers are 64 × 2 × 64, 64 × 3 × 64 and 64 × 4 × 64, and the fully connected layers are 196 × 128 × 19 (15). For the DCASE2016 dataset, we use a dropout rate of 0.5. For all these methods, the learning rate is 0.001, the l2 weight is 1e−4, and training is done using the Adam [45] update method and is stopped after 100 training epochs. In order to compute the results for each training-test split, we use the classification error over all classes. The final classification error is its average value over all splits.

We begin with experiments where we train different neural network models without long-term filter banks on both datasets. As described at the beginning of Section 4, our baseline systems take the outputs of frequency filter banks as input. TriFB and GaussFB are placed in the frequency domain to integrate the frequency information. Classical CNN models have the ability to learn two-dimensional filters on the spectrum directly. We introduce two CNN structures as a comparison. The first CNN model is implemented as in [50], which has multiple convolutional layers, pooling layers, and fully connected layers. The window size of the convolutional kernels is 5 × 5, the pooling size is 3, the output channels are [8, 16, 23], and the fully connected layers are 196 × 128 × 19 (15). The other CNN structure is the same as the one-layer CNN model described in Section 4.1; the output of this model is then processed as in Fig. 5b.

Table 3 Average performance comparison with related works on LITIS Rouen dataset and DCASE2016 dataset

Method               DCASE2016 (%)          LITIS Rouen (%)
                     Error    F-measure     Error    F-measure
TriFB-Null           23.12    76.08         3.76     96.19
GaussFB-Null         22.69    76.56         3.48     96.44
CNN-multilayer [50]  26.45    72.44         4.00     95.80
CNN-1layer [22]      23.29    75.82         2.97     96.91
RNN-Gam [26]         –        –             3.4      –
CNN-Gam [24]         –        –             4.2      –
MFCC-GMM [49]        27.5     –             –        –
DNN-CQT [51]         –        78.1          –        96.6
DNN-Mel [53]         23.6     –             –        –
CNN-Mel [54]         24.0     –             –        –

The results of these experiments are shown in Table 3. Compared with other CNN related works, our baseline models on both datasets achieve gains in accuracy. On LITIS Rouen dataset, recurrent neural network (RNN) [26] performs better than our baseline models, because of the powerful sequence modelling capabilities of RNN. The DNN model in [51] is the best-performing single model on both datasets; this can be attributed to the lack of training data and the stability of Constant Q-transform (CQT) [52] feature representations. On DCASE2016 dataset, only the DNN model using CQT features performs better than our baseline models. The classical CNN model with three layers performs almost the same as [24] on LITIS Rouen dataset, but gets a rapid deterioration of performance on DCASE2016 dataset. This can also be attributed to the lack of training data, especially on DCASE2016 dataset. The CNN model with one convolutional layer performs a little better, but still worse than our baseline models. These results show that the time-frequency structure of the spectrum is difficult to learn using two-dimensional convolution kernels in classical CNN models. For the two baseline models, GaussFB performs better than TriFB on both datasets, because Gaussian frequency filter banks can extract more global information. In conclusion, the results of our baseline models are in line with expectations on both datasets.

We now test our long-term filter banks on both datasets. We also test three variant modules in this part: GaussLTFB, FullLTFB-Random and FullLTFB-Identity. These three variant modules can be injected into neural networks directly as in Fig. 5b.

Table 4 Average performance comparison using different configurations on LITIS Rouen dataset and DCASE2016 dataset

Init      Method              DCASE2016 (%)          LITIS Rouen (%)
                              Error    F-measure     Error    F-measure
–         TriFB-Null          23.12    76.08         3.76     96.19
–         GaussFB-Null        22.69    76.56         3.48     96.44
–         TriFB-GaussLTFB     22.40    76.79         2.82     97.05
–         GaussFB-GaussLTFB   22.15    77.11         2.97     96.91
Random    TriFB-FullLTFB      22.67    76.49         3.47     96.35
Random    GaussFB-FullLTFB    21.21    78.05         2.96     96.92
Identity  TriFB-FullLTFB      23.35    75.69         3.67     96.18
Identity  GaussFB-FullLTFB    23.13    75.83         3.21     96.61

Fig. 7 Validation curves on LITIS ROUEN dataset and DCASE2016 dataset. a, b The proposed methods on LITIS ROUEN dataset. d, e The proposed methods on DCASE2016 dataset. c The classical CNN models on LITIS ROUEN dataset. f The classical CNN models on DCASE2016 dataset

Table 4 is the performance comparison on both datasets. Models with the GaussLTFB module perform consistently better than the corresponding baseline models. Although the performance fluctuates for different variants, the performance gain is obvious. For the FullLTFB situation, random initialization obtains a performance gain on both datasets, but identity initialization degrades the performance on DCASE2016 dataset. This can be attributed to the fact that in classification tasks, we need to extract a global representation of all frames; more details will be discussed in Section 4.3. On LITIS Rouen dataset, the TriFB-GaussLTFB model performs significantly better than the state-of-the-art result in [51] and obtains 2.82% on classification error. On DCASE2016 dataset, the GaussFB-FullLTFB model with random initialization reduces the classification error by relatively 6.5% and reaches the performance of the DNN model using CQT features in [51], meaning that the long-term filter banks make up for the lack of feature extractions.

Validation curves on both datasets are shown in Fig. 7. After 100 training epochs, experiments on DCASE2016 dataset encounter an overfitting problem; experiments on LITIS ROUEN dataset have almost converged. Figure 7c, e shows that the performance of the classical CNN model is significantly worse than models with only the frequency filter banks, which is consistent with the results in Table 3. The performance of the one-layer CNN model is between the TriFB and GaussFB models on both datasets. Figure 7a–e shows consistent results with Table 4.

4.3 Reconstruction vs classification
In the experiment of audio source separation task, when the parameters of totally independent long-term filter banks are initialized randomly, the result seems to be unable to converge effectively. However, it is completely the opposite in audio scene classification task.

Figure 8 is an explanation of the unconformity between the above two tasks. Figure 8a, b shows the filters learned on MIR-1K dataset. At low frequencies, the time durations of the filters are almost equal to 1; only at very high frequencies do the time durations become large. But for Fig. 8c, d, which are learned on DCASE2016 dataset, the time duration is much larger. It is intuitive that in audio source separation task, the time duration of the filters is much smaller than in audio scene classification task, especially at low frequencies. When the parameters of totally independent long-term filter banks are initialized randomly, the implicit assumption is that the time durations of the filters are as large as the number of all frames, which is not applicable. In reconstruction related tasks, for example, the long-term correlation is much more limited because our goal is to reconstruct the spectrogram frame by frame. However, in classification tasks, we need to extract a global representation of all frames, which is exactly in line with our hypothesis.

Fig. 8 Time durations of long-term filter banks in different tasks. a, b The long-term filters learned on MIR-1K dataset. c, d The long-term filters learned on DCASE2016 dataset

5 Conclusions
A novel framework of filter banks that can extract long-term time and frequency correlation is proposed in this paper. The new filters are constructed after traditional frequency filters and can be implemented using Toeplitz matrix motivated neural networks. Gaussian shape constraint is introduced to limit the time duration of the filters, especially in reconstruction-related tasks. Then a spectrogram reconstruction method using the Toeplitz matrix inversion is implemented using neural networks. The spectrogram reconstruction error in audio source separation task is reduced by relatively 6.7% and the classification error in audio scene classification task is reduced by relatively 6.5%. This paper provides a practical and complete framework to learn long-term filter banks for different tasks.

The former frequency filter banks are somehow interrelated with the long-term filter banks. Combining the idea of these two types of filter banks, future work will be an investigation on two-dimensional filter banks.

Funding
This work was partly funded by National Natural Science Foundation of China (Grant No: 61571266).

Authors' contributions
TZ designed the core methodology of the study, carried out the implementation and experiments, and he drafted the manuscript. JW participated in the study and helped to draft the manuscript. Both authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 21 November 2017 Accepted: 30 April 2018

References
1. AS Bregman, Auditory scene analysis: the perceptual organization of sound. (MIT Press, Cambridge, 1994)
2. S McAdams, A Bregman, Hearing musical streams. Comput. Music J. 3(4), 26–60 (1979)
3. AS Bregman, Auditory streaming is cumulative. J. Exp. Psychol. Hum. Percept. Perform. 4(3), 380 (1978)
4. GA Miller, GA Heise, The trill threshold. J. Acoust. Soc. Am. 22(5), 637–638 (1950)
5. MA Bee, GM Klump, Primitive auditory stream segregation: a neurophysiological study in the songbird forebrain. J. Neurophysiol. 92(2), 1088–1104 (2004)
6. D Pressnitzer, M Sayles, C Micheyl, IM Winter, Perceptual organization of sound begins in the auditory periphery. Curr. Biol. 18(15), 1124–1128 (2008)
7. H Attias, CE Schreiner, in Advances in Neural Information Processing Systems. Temporal low-order statistics of natural sounds (MIT Press, Cambridge, 1997), pp. 27–33
8. NC Singh, FE Theunissen, Modulation spectra of natural sounds and ethological theories of auditory processing. J. Acoust. Soc. Am. 114(6), 3394–3411 (2003)
9. SA Shamma, M Elhilali, C Micheyl, Temporal coherence and attention in auditory scene analysis. Trends. Neurosci. 34(3), 114–123 (2011)
10. DL Donoho, De-noising by soft-thresholding. IEEE Trans. Inf. Theory. 41(3), 613–627 (1995)
11. B Gao, W Woo, L Khor, Cochleagram-based audio pattern separation using two-dimensional non-negative matrix factorization with automatic sparsity adaptation. J. Acoust. Soc. Am. 135(3), 1171–1185 (2014)
12. A Biem, S Katagiri, B-H Juang, in Neural Networks for Processing [1993] III. Proceedings of the 1993 IEEE-SP Workshop. Discriminative feature extraction for speech recognition (IEEE, 1993), pp. 392–401
13. Á de la Torre, AM Peinado, AJ Rubio, VE Sánchez, JE Diaz, An application of minimum classification error to feature space transformations for speech recognition. Speech Commun. 20(3-4), 273–290 (1996)
14. S Akkarakaran, P Vaidyanathan, in Acoustics, Speech, and Signal Processing, 1999. Proceedings, 1999 IEEE International Conference On. New results and open problems on nonuniform filter-banks, vol. 3 (IEEE, 1999), pp. 1501–1504
15. S Davis, P Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoustics Speech Signal Process. 28(4), 357–366 (1980)
16. A Biem, S Katagiri, E McDermott, B-H Juang, An application of discriminative feature extraction to filter-bank-based speech recognition. IEEE Trans. Speech Audio Process. 9(2), 96–110 (2001)
17. TN Sainath, B Kingsbury, A-R Mohamed, B Ramabhadran, in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop On. Learning filter banks within a deep neural network framework (IEEE, 2013), pp. 297–302
18. H Yu, Z-H Tan, Y Zhang, Z Ma, J Guo, Dnn filter bank cepstral coefficients for spoofing detection. IEEE Access. 5, 4779–4787 (2017)
19. H Seki, K Yamamoto, S Nakagawa, in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference On. A deep neural network integrated with filterbank learning for speech recognition (IEEE, 2017), pp. 5480–5484
20. H Yu, Z-H Tan, Z Ma, R Martin, J Guo, Spoofing detection in automatic speaker verification systems using dnn classifiers and dynamic acoustic features. IEEE Trans. Neural Netw. Learn. Syst. 1–12 (2017)
21. H Yu, Z-H Tan, Z Ma, J Guo, Adversarial network bottleneck features for noise robust speaker verification (2017). arXiv preprint arXiv:1706.03397
22. D Wang, P Chang, An oscillatory correlation model of auditory streaming. Cogn. Neurodynamics. 2(1), 7–19 (2008)
23. S Lawrence, CL Giles, AC Tsoi, AD Back, Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Netw. 8(1), 98–113 (1997)
24. H Phan, L Hertel, M Maass, P Koch, R Mazur, A Mertins, Improved audio scene classification based on label-tree embeddings and convolutional neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1278–1290 (2017)
25. S Hochreiter, J Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
26. H Phan, P Koch, F Katzberg, M Maass, R Mazur, A Mertins, Audio scene classification with deep recurrent neural networks (2017). arXiv preprint arXiv:1703.04770
27. S Umesh, L Cohen, D Nelson, in Acoustics, Speech, and Signal Processing, 1999. Proceedings, 1999 IEEE International Conference On. Fitting the Mel scale, vol. 1 (IEEE, 1999), pp. 217–220
28. J Allen, Short term spectral analysis, synthesis, and modification by discrete fourier transform. IEEE Trans. Acoustics Speech Signal Process. 25(3), 235–238 (1977)
29. RF Lyon, AG Katsiamis, EM Drakakis, in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium On. History and future of auditory filter models (IEEE, 2010), pp. 3809–3812
30. S Rosen, RJ Baker, A Darling, Auditory filter nonlinearity at 2 khz in normal hearing listeners. J. Acoust. Soc. Am. 103(5), 2539–2550 (1998)
31. R Patterson, I Nimmo-Smith, J Holdsworth, P Rice, in a Meeting of the IOC Speech Group on Auditory Modelling at RSRE, vol. 2. An efficient auditory filterbank based on the gammatone function, (1987)
32. S Young, G Evermann, M Gales, T Hain, D Kershaw, X Liu, G Moore, J Odell, D Ollason, D Povey, et al, The htk book. Cambridge university engineering department. 3, 175 (2002)
33. DE Rumelhart, GE Hinton, RJ Williams, et al, Learning representations by back-propagating errors. Cogn. Model. 5(3), 1 (1988)
34. EH Bareiss, Numerical solution of linear equations with Toeplitz and vector Toeplitz matrices. Numerische Mathematik. 13(5), 404–424 (1969)
35. N Deo, M Krishnamoorthy, Toeplitz networks and their properties. IEEE Trans. Circuits Syst. 36(8), 1089–1092 (1989)
36. YN Dauphin, R Pascanu, C Gulcehre, K Cho, S Ganguli, Y Bengio, in Advances in Neural Information Processing Systems. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization (Curran Associates, Inc., 2014), pp. 2933–2941
37. O Shamir, Distribution-specific hardness of learning neural networks (2016). arXiv preprint arXiv:1609.01037
38. J Jia, T Sogabe, M El-Mikkawy, Inversion of k-tridiagonal matrices with toeplitz structure. Comput. Math. Appl. 65(1), 116–125 (2013)
39. A Ben-Israel, TN Greville, Generalized inverses: theory and applications, vol. 15. (Springer Science & Business Media, 2003)
40. ST Lee, H-K Pang, H-W Sun, Shift-invert arnoldi approximation to the Toeplitz matrix exponential. SIAM J. Sci. Comput. 32(2), 774–792 (2010)
41. X Zhao, Y Shao, D Wang, Casa-based robust speaker identification. IEEE Trans. Audio Speech Lang. Process. 20(5), 1608–1616 (2012)
42. RO Duda, PE Hart, DG Stork, Pattern classification. (Wiley, New York, 1973)
43. Y Kim, Convolutional neural networks for sentence classification (2014). arXiv preprint arXiv:1408.5882
44. R Collobert, J Weston, L Bottou, M Karlen, K Kavukcuoglu, P Kuksa, Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
45. D Kingma, J Ba, Adam: A method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980
46. C-L Hsu, JSR Jang, MIR Database (2010). http://sites.google.com/site/unvoicedsoundseparation/mir-1k/. Retrieved 10 Sept 2017
47. EM Grais, G Roma, AJ Simpson, MD Plumbley, Two-stage single-channel audio source separation using deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(9), 1773–1783 (2017)
48. A Rakotomamonjy, G Gasso, IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 142–153 (2015)
49. A Mesaros, T Heittola, T Virtanen, in Signal Processing Conference (EUSIPCO), 2016 24th European. Tut database for acoustic scene classification and sound event detection (IEEE, 2016), pp. 1128–1132
50. Y LeCun, Y Bengio, et al., Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks. 3361(10), 1995 (1995)
51. V Bisot, R Serizel, S Essid, G Richard, Feature learning with matrix factorization applied to acoustic scene classification. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1216–1229 (2017)
52. JC Brown, Calculation of a constant q spectral transform. J. Acoust. Soc. Am. 89(1), 425–434 (1991)
53. Q Kong, I Sobieraj, W Wang, M Plumbley, in Proceedings of DCASE 2016. Deep neural network baseline for dcase challenge 2016 (Tampere University of Technology. Department of Signal Processing, 2016)
54. D Battaglino, L Lepauloux, N Evans, F Mougins, F Biot, Acoustic scene classification using convolutional neural networks. DCASE2016 Challenge, Tech. Rep. (Tampere University of Technology. Department of Signal Processing, 2016)

EURASIP Journal on Audio, Speech, and Music Processing, Springer Journals

Learning long-term filter banks for audio source separation and audio scene classification

Publisher
Springer International Publishing
Copyright
Copyright © 2018 by The Author(s)
Subject
Engineering; Signal, Image and Speech Processing; Mathematics in Music; Acoustics; Engineering Acoustics
eISSN
1687-4722
D.O.I.
10.1186/s13636-018-0127-7

Abstract

Filter banks on short-time Fourier transform (STFT) spectrogram have long been studied to analyze and process audios. The frameshift in STFT procedure determines the temporal resolution. However, in many discriminative audio applications, long-term time and frequency correlations are needed. The authors in this work use Toeplitz matrix motivated filter banks to extract long-term time and frequency information. This paper investigates the mechanism of long-term filter banks and the corresponding spectrogram reconstruction method. The time duration and shape of the filter banks are well designed and learned using neural networks. We test our approach on different tasks. The spectrogram reconstruction error in audio source separation task is reduced by relatively 6.7% and the classification error in audio scene classification task is reduced by relatively 6.5%, when compared with the traditional frequency filter banks. The experiments also show that the time duration of long-term filter banks in classification task is much larger than in reconstruction task.

Keywords: Long-term filter banks, Deep neural network, Audio scene classification, Audio source separation

*Correspondence: teng-zhang10@mails.tsinghua.edu.cn
Department of Electronic Engineering, Tsinghua University, Beijing, China

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Zhang and Wu EURASIP Journal on Audio, Speech, and Music Processing (2018) 2018:4

1 Introduction
Audios in a realistic environment are typically composed of different sound sources. Yet humans have no problem in organizing the elements into their sources to recognize the acoustic environment. This process is called auditory scene analysis [1]. Studies in the central auditory system [2–4] have inspired numerous hypotheses and models concerning the separation of audio elements. One prominent hypothesis that underlies most investigations is that audio elements are segregated whenever they activate well-separated populations of auditory neurons that are selective to frequency [5, 6], which emphasizes the audio distinction on the frequency dimension. At the same time, other studies [7, 8] also suggest that auditory scenes are essentially dynamic, containing many fast-changing, relatively brief acoustic events. Therefore an essential aspect of auditory scene analysis is the linking over time [9].

Problems inherent to auditory scene analysis are similar to those found in visual scene analysis. However, the time and frequency characteristic of a spectrogram makes it very different from natural images. For example in Fig. 1, (a) and (b) are two audio fragments randomly selected from an audio of "cafe" scene. We first calculate the average energy distribution of the two examples in the frequency direction, which is shown in (c). And then the temporal coherence of salient audio elements in each frequency bin is measured as (d). It is obvious that the energy distribution and temporal coherence vary tremendously in different frequency bins, but are similar in the same frequency bin of different spectrograms. Thus for audio signals, the spectrogram structure is not equivalent in time and frequency direction. In this paper, we propose a novel network structure to learn the energy distribution and temporal coherence in different frequency bins.

Fig. 1 Spectrogram examples of "cafe" scene. a, b Two audio fragments randomly selected from "cafe" scene. c The average energy distribution of the two examples in frequency direction. d The temporal coherence of the two examples in different frequency bins

1.1 Related work
For audio separation [10, 11] and recognition [12, 13] tasks, the time and frequency analysis is usually implemented using well designed filter banks.

Filter banks are traditionally composed of finite or infinite response filters in principle [14], but the stability of the filters is usually difficult to be guaranteed. For simplicity, filter banks on STFT spectrogram have been investigated for a long time [15]. In this case, the time resolution is determined by the frameshift in the STFT procedure and the frequency resolution is modelled by the frequency response of the filter banks. Frequency filter banks can be parameterized in the frequency domain with filter centre, bandwidth, gain and shapes [16]. If these parameters are learnable, deep neural networks (DNNs) can be utilized to learn them discriminatively [17–19]. These frequency filter banks are usually used to model the frequency selectivity of the auditory system, but cannot represent the temporal coherence of audio elements.

DNNs are often used as classifiers when the inputs are dynamic acoustic features such as filter bank-based cepstral features and Mel-frequency cepstral coefficients [20, 21]. When the input to DNNs is a magnitude spectrogram, time-frequency structure of the spectrogram can be learned. Neural networks organized into a two-dimensional space have been proposed to model the time and frequency organization of audio elements by Wang and Chang [22]. They utilized two-dimensional Gaussian lateral connectivity and global inhibition to parameterize the network, where the two dimensions correspond to frequency and time respectively. In this model, time is converted into a spatial dimension, temporal coherence can take place in auditory organization much like in visual organization where an object is naturally represented in spatial dimensions. However, these two dimensions are not equivalent in a spectrogram according to our analysis. And what is more, the parameters of the network are set empirically and not learnable, which is still significantly dependent on domain knowledge and modelling skill.

In recent years, neural networks with special structures such as convolutional neural network (CNN) [23, 24] and long short-term memory (LSTM) [25, 26] have been used to extract the long-term information of audios. But in both network structures, the temporal coherence is considered to be the same in different frequency bins, which is in contradiction with Fig. 1.

1.2 Contribution of this paper
As shown in Fig. 1, when perceptual frequency scale is utilized to map the linear frequency domain to the nonlinear perceptual frequency domain [27], the major concern comes to be how to model the energy distribution and temporal coherence in different frequency bins.

To obtain better time and frequency analysis results, we divide the audio processing procedure into two stages. In the first stage, traditional frequency filter banks are implemented on STFT spectrogram to extract frequency features. Without loss of generality, the parameters of the frequency filter banks are set experimentally. In the second stage, a novel long-term filter bank spanning several frames is constructed in each frequency bin. The long-term filter banks proposed here can be implemented by neural networks and trained jointly with the target of the specific task.

The major contributions are summarized as follows:

- Toeplitz matrix motivated long-term filter banks: Unlike filter banks in frequency domain, our proposal of long-term filter banks spreads over the time dimension. They can be parameterized with the time duration and shape constraints. For each frequency bin, the time duration is different, but for each frame, the filter shape is constant. This mechanism can be implemented using a Toeplitz matrix motivated network.
- Spectrogram reconstruction from filter bank coefficients: Consistent with the audio processing procedure, we also divide the reconstruction procedure into two stages. The first stage is a dual inverse process of the long-term filter banks and the second stage is a dual inverse process of the frequency filter banks. This paper investigates the spectrogram reconstruction problem using an elaborate neural network.

This paper is organized as follows. The next section describes the detailed mechanism of the long-term filter banks and the spectrogram reconstruction method. Then network structures used in our proposed method are introduced in Section 3. Section 4 conducts several experiments to show the performance of long-term filter banks regarding source separation and audio scene classification. Finally, we conclude our paper and give directions for future work in Section 5.

2 Long-term filter banks
For generality, we consider in this section a long-term filter bank learning framework based on neural networks as Fig. 2.

Fig. 2 Long-term filter banks learning framework. The left part of the framework is the feature analysis procedure including STFT, frequency filter banks and long-term filter banks. The right part is the application examples of the extracted feature map, such as audio scene classification and audio source separation. Long-term filter banks in the feature analysis procedure and the back-end application modules are stacked into a deep neural network

The input audio signal is first transformed to a sequence of vectors using STFT [28]; the STFT result can be represented as $X_{1 \ldots T} = \{x_1, x_2, \ldots, x_T\}$. $T$ is determined by the frame shift in STFT, the dimension of each vector $x_t$ can be labelled as $N$, which is determined by the frame length.

The frequency filter banks can be simplified as a linear transformation $y_t = [f_1^{\top} x_t, f_2^{\top} x_t, \ldots, f_m^{\top} x_t]$, where $f_k$ is the weights of the $k$-th frequency filter. In the history of auditory frequency filter banks [29], the rounded exponential family [30] and the gammatone family [31] are the most widely used families. We use the simplest form of these two families, triangular shape for the rounded exponential family and Gaussian shape for the gammatone family. For triangular filter banks, the bandwidth is 50% overlapped between neighbouring filters. For Gaussian filter banks, the bandwidth is $4\sigma$, where $\sigma$ represents the standard deviation in the Gaussian function. These two types of frequency filter banks are the baselines in this paper, respectively named TriFB and GaussFB. The triangular and gaussian examples distributed uniformly in the Mel-frequency scale [32] can be seen in Fig. 3.

Fig. 3 Shape of frequency filter banks. a The triangular filter banks. b The gaussian filter banks

When the number of frequency filters is equal to $m$, the long-term filter banks can be parameterized by $m$ linear transformations. The parameters will be labelled as $\theta$ and discussed in the following part of this section in detail.

The back-end processing modules vary from different applications. For audio scene classification task, they will be deep convolutional neural networks followed by a softmax layer to convert the feature maps to the corresponding categories. However, for audio source separation task, the modules will be composed by a binary gating layer and some spectrogram reconstruction layers. We define them as nonlinear functions $f_{\gamma}$. The long-term filter bank parameters $\theta$ can be trained jointly with the back-end parameters $\gamma$ using back propagation method [33] in neural networks.

2.1 Toeplitz motivation
The long-term filter banks in our proposed method are used to extract the energy distribution and temporal coherence in different frequency bins which have been discussed in Section 1. As shown in Fig. 4, the long-term filter banks can be implemented by a series of filters with different time durations. If the output of the frequency filter banks is $y_t$, and the long-term filter banks are parameterized as $W = \{w_1, w_2, \ldots, w_m\}$, the operation of the long-term filter banks can be mathematically represented as Eq. 1. $T$ is the length of the STFT output, $m$ is the dimension of $y_t$, which also represents the number of frequency bins, $w_k$ is a set of $T$ positive weights to represent the time duration and shape of the $k$-th filter. In Fig. 4 for example, $w_k$ is a rectangular window with individual width, each row of the spectrogram is convolved by the corresponding filter.

$$z_{t,k} = \sum_{i=1}^{T} y_{i,k} \cdot w_{k,i-t}, \quad 1 \le k \le m \tag{1}$$

Fig. 4 Model architecture of long-term filter banks. Each row of the spectrogram is convolved by a filter bank with individual width. In this sketch map, time durations of the filter banks in the highest four frequency bins are 3, 2, 1 and 2 frames

As a matter of fact, the operation in Eq. 1 is a series of one-dimensional convolutions along time axis. We rewrite it using the Toeplitz matrix [34] for simplicity. In Eq. 2, the tensor $S = \{S_1, S_2, \ldots, S_m\}$ represents the linear transformation form of long-term filter banks in each frequency bin. $z_k$ in Eq. 2 is equivalent to $\{z_{1,k}, z_{2,k}, \ldots, z_{T,k}\}$ in Eq. 1. In this case, long-term filter banks can be represented as a simple form of tensor operation, which can be easily implemented by a Toeplitz motivated network layer. According to [35], Toeplitz networks are mathematically tractable and can be easily computed.

$$z_k = \hat{y}_k S_k, \quad 1 \le k \le m$$
$$\hat{y}_k = [\, y_{1,k}, y_{2,k}, \ldots, y_{T,k} \,]$$
$$S_k = \begin{bmatrix} w_{k,0} & w_{k,-1} & \cdots & w_{k,1-T} \\ w_{k,1} & w_{k,0} & \cdots & w_{k,2-T} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k,T-1} & w_{k,T-2} & \cdots & w_{k,0} \end{bmatrix} \tag{2}$$

2.2 Shape constraint
If $W$ is totally independent, $S_k$ is a dense Toeplitz matrix, which means that the time duration of the filter in each frequency bin is $T$. This assumption is unreasonable especially when $T$ is extremely large. The long-term correlation should be limited to a certain range according to our intuition. Inspired by traditional frequency filter banks, we attempt to use the parameterized window shape to limit the time duration of long-term filter banks.

In Fig. 4, rectangular shapes with time durations of 3, 2, 1 and 2 frames are utilized as an interpretation. From the theory of frequency filter banks, triangular and gaussian shapes are also commonly used options. However, rectangular and triangular shapes are not differentiable and unable to be incorporated into a scheme of a back-propagation algorithm. Thus in this paper, the shape of long-term filter banks is constrained using the Gaussian function as Eq. 3. The time duration of long-term filter banks is limited by $\sigma_k$, the strength of each frequency bin is reconstructed by $\alpha_k$, the total number of parameters reduces from $2mT$ in Eq. 2 to $2m$ in Eq. 3.

$$w_{k,t} = \alpha_k \cdot \exp\left(-\frac{t^2}{2\sigma_k^2}\right), \quad 1 \le k \le m \tag{3}$$

When we initialize the parameters $\alpha_k$ and $\sigma_k$ randomly, we believe that the learning will be well behaved, which is the so-called "no bad local minim" hypothesis [36]. However, a different view presented in [37] is that the underlying easiness of optimizing deep networks is rather

$$\hat{y}_k = z_k S_k^{-1}, \quad 1 \le k \le m \tag{5}$$

However, considering that $S_k$ is a Toeplitz matrix, $R_k$ can be represented in a simple way [40].
R is given by tightly connected to the intrinsic characteristics of the k ˆ ˆ Eq. 6,where A , B , A and B all are lower triangular data these models are run on. Thus for us, the initializa- k k k k Toeplitz matrices given by Eq. 7. tion of parameters is a tricky problem, especially when α and σ have clear physical meanings. 1 T If σ in Eq. 3 is initialized with a value larger than ˆ ˆ R = A B − B A (6) k k k k 1.0, the corresponding S is approximately equal to a k- tridiagonal Toeplitz matrix [38], where k is less than 3. ⎛ ⎞ Thus, if the totally independent W is initialized with an ⎛ ⎞ 0 ··· 00 a 0 ··· 0 identity matrix, similar results with limited time dura- ⎜ ⎟ a a ··· 0 a ··· 00 ⎜ ⎟ 2 1 n ⎜ ⎟ tions should be obtained. Whether it is the Gaussian ⎜ ⎟ A = . . . , A = ⎜ ⎟ k . k . . . ⎝ . . . . ⎠ . . . . . ⎝ ⎠ shape-constrained algorithm as Eq. 3 or is the totally . . . . . . . a a ··· a n n−1 1 independent W in Eq. 2, the initialization of parameters a ··· a 0 2 n ⎛ ⎞ is important and intractable when adapting to differ- ⎛ ⎞ 0 ··· 00 b 0 ··· 0 ent tasks. More details will be discussed and tested in ⎜ ⎟ b b ··· 0 b ··· 00 ⎜ n−1 n ⎟ 1 ⎜ ⎟ Section 4. ⎜ ⎟ ˆ B = , B = . . . ⎜ ⎟ k . k . . . ⎝ ⎠ . . . . . . . . ⎝ ⎠ . . . . . . . b b ··· b 1 2 n 2.3 Spectrogram reconstruction b ··· b 0 n−1 1 In our proposal of learning framework as Fig. 2,STFT (7) spectrogram is transformed into subband coefficients Note that a and b canalsoberegardedasthe solutions after frequency filter banks and long-term filter banks. of two linear systems, which can be learned using a fully The dimension of subband coefficients z is usually much connected neural work layer. In this case, the number of less than x to reduce computational cost and extract sig- parameters reduces from mT to 2mT. nificant features. 
In this case, the subband coefficients In conclusion, the spectrogram reconstruction proce- are incomplete, perfect spectrogram reconstruction from dure can be implemented using a two-layer neural net- subband coefficients is impossible. work. When the first layer is implemented as Eq. 5,the The spectrogram vector x is firstly transformed using total number of parameters is mN + mT . While when frequency filter banks described at the beginning of this section. Then long-term filter banks work as Eq. 2 to get the first layer is represented as Eq. 6,the totalnumber thesubband coefficients.Thusthe processofthe con- is mN + 2mT. Experiments in Section 4.1 will show the version from spectrogram vector to filter subband coef- difference between these two methods. ficients and the dual reconversion can be represented as Eq. 4. The operation of frequency filter banks f can be 3 Training the models As described in Section 2, the long-term filter banks we simplified as a singular matrix F where the number of proposed here can be integrated into a neural network rows is much less than columns. The reconversion process −1 (NN) structure. The parameters of the models are learned f is approximately the Moore-Penrose pseudoinverse jointly with the target of the specific task. In this section, [39]of F; this module can be easily implemented using a we introduce two NN-based structures respectively for fully connected network layer. However, the tensor opera- audio source separation and audio scene classification tion of long-term filter banks f is much more intractable. tasks. z = f (f (x )) t 2 1 t 3.1 Audio source separation −1 −1 x = f f (z ) (4) t t 1 2 In Fig. 5a, the procedures of STFT and frequency fil- ter banks in Fig. 2 are excluded from the NN structure because they are implemented empirically and have no Without regard to the special structure of Toeplitz −1 parameters. The NN structure for audio source separation matrix, f can be mathematically represented as Eq. 5. 
task is divided into four steps, in which three steps have S is a nonsingular matrix which has been defined in −1 been discussed in Section 2. The layers of long-term filter Eq. 2. In general, S is another nonsingular matrix R banks and inverse of long-term filter banks are imple- which can be learned using a fully connected network mented respectively as Eqs. 2 and 5, which can be denoted layer independently. There are m frequency bins in total, as h and h . The reconstruction layer is constructed using so m parallel fully connected network layers are needed 1 2 a fully connected layer and can be denoted as h . and the number of parameters is mT . 4 Zhang and Wu EURASIP Journal on Audio, Speech, and Music Processing (2018) 2018:4 Page 6 of 13 Fig. 5 NN-based structures with proposed method. a The NN structure for audio source separation task. b The NN structure for audio scene classification task We attempt the audio separation from an audio mixture 3.2 Audio scene classification using a simple masking method [41], which can be repre- In early pattern recognition studies [42], the input is first sented as the binary gating layer in Eq. 8 and denoted as converted into some features, which are usually defined h . The output of this layer is a linear projection modu- empirically by experts and believed to be identified with lated by the gates g . These gates multiply each element of the recognition targets. In Fig. 5b, a feature extraction the matrix Z and control the information passed on in the structure including the long-term filter banks is proposed hierarchy. Stacking these four layers on the top of input Y to systematically train the overall recognizer in a manner gives a representation of the separated clean spectrogram consistent with the minimization of recognition errors. X = h ◦ h ◦ h ◦ h (Y ). 
The NN structure for audio scene classification task can 4 3 2 1 also be divided into four steps, where the first layer of N long-term filter banks is implemented using Eq. 2.The g = sigmoid z v ti tj ji j=1 (8) convolutional layer and the pooling layer are conducted o = z g ti ti ti using the network structure described in [43]. In general, let z refer to the concatenation of frames after long- i:i+j Neural networks are trained on a frame error (FE) min- term filter banks z , z , ...z . The convolution operation i i+1 i+j hm imization criterion and the corresponding weights are involves a filter w ∈ R , which is applied to a window of adjusted to minimize the error squares over the whole h frames to produce a new feature. For example, a feature training data set. The error of the mapping is given c is generated from a window of frames z by Eq. 10, i:i+h−1 by Eq. 9,where x is the targeted clean spectrogram where b ∈ R is a bias term and f is a non-linear function. and x ˆ is the corresponding separated representation. This filter is applied to each possible window of frames to As commonly used, L2-regularization is typically chosen produce a feature map c =[ c , c , ...c ]. Then a max- 1 2 T −h+1 to impose a penalty on the complexity of the mapping, overtime pooling operation [44] over the feature map is which is the λ term in Eq. 9. However, when the layer applied and the maximum value c ˆ = max(c) is taken as of long-term filter banks is implemented by Eq. 3,the the feature corresponding to this filter. Thus one feature is elements of w have definitude physical meanings. Thus, extracted using one filter. This model uses multiple filters L2-regularization is operated only on the upper three lay- with varying window sizes to obtain multiple features. ers in this model. In this case, the network in Fig. 5a can be optimized by the back-propagation method. 
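The front end of the classification network described above (a long-term filter bank layer, then the convolution of Eq. 10 and max-over-time pooling) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the Gaussian window is assumed to take the form $\alpha_k \exp(-t^2/\sigma_k^2)$, ReLU stands in for the unspecified non-linearity $f$, and the toy input and filter values are invented for the example.

```python
import math

def gaussian_weights(alpha, sigma, T):
    # Assumed Gaussian window: w[d] = alpha * exp(-d^2 / sigma^2)
    # for every lag d in [-(T-1), T-1] (hypothetical helper).
    return {d: alpha * math.exp(-(d * d) / (sigma * sigma)) for d in range(1 - T, T)}

def long_term_filter(y, w):
    # Direct form for one frequency bin: z[t] = sum_i y[i] * w[i - t].
    T = len(y)
    return [sum(y[i] * w[i - t] for i in range(T)) for t in range(T)]

def toeplitz_filter(y, w):
    # Same operation as a row vector times a Toeplitz matrix, S[i][t] = w[i - t].
    T = len(y)
    S = [[w[i - t] for t in range(T)] for i in range(T)]
    return [sum(y[i] * S[i][t] for i in range(T)) for t in range(T)]

def conv_max_over_time(Z, filt, bias, h):
    # Z[k][t]: m bins by T frames. filt[j][k]: weight for frame offset j and bin k.
    # Returns the pooled feature and the full feature map of length T - h + 1.
    m, T = len(Z), len(Z[0])
    feature_map = []
    for i in range(T - h + 1):
        s = bias + sum(filt[j][k] * Z[k][i + j] for j in range(h) for k in range(m))
        feature_map.append(max(0.0, s))  # ReLU as the non-linearity (assumption)
    return max(feature_map), feature_map

# Toy input: 2 frequency bins, 5 frames (values invented for illustration).
Y = [[1.0, 0.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 1.0, 0.0, 0.0]]
alphas, sigmas = [1.0, 0.5], [0.3, 2.0]  # a narrow filter and a wide filter
Z = [long_term_filter(y, gaussian_weights(a, s, len(y)))
     for y, a, s in zip(Y, alphas, sigmas)]

filt = [[1.0, 1.0], [1.0, 1.0]]  # one filter spanning h = 2 frames and m = 2 bins
c_hat, c = conv_max_over_time(Z, filt, bias=0.0, h=2)
```

With a small width such as `sigma = 0.3`, the first bin passes through almost unchanged, while `sigma = 2.0` smears the second bin over neighbouring frames. In a trained network the `alphas`, `sigmas`, `filt` and `bias` values would be learned by back-propagation rather than set by hand.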
$$c_i = f\left(w \cdot z_{i:i+h-1} + b\right) \tag{10}$$

$$\varepsilon = \sum_{t=1}^{T} \left\| x_t - \hat{x}_t \right\|^2 + \lambda \sum_{l=2}^{4} \left\| w_l \right\|^2 \tag{9}$$

The features extracted from the convolutional and pooling layers are then passed to a fully connected softmax layer to output the probability distribution over categories. The classification loss of this model is given by Eq. 11, where $n$ is the number of audios, $k$ is the number of categories, $y$ is the category labels and $p$ is the probability distribution produced by the NN structure. In this case, the network in Fig. 5b can be optimized by the back-propagation method.

$$\varepsilon = -\sum_{i=1}^{n} \sum_{j=1}^{k} y_{i,j} \cdot \log(p_{i,j}) + \lambda \sum_{l=2}^{4} \left\| w_l \right\|^2 \tag{11}$$

4 Experiments

To illustrate the properties and performance of the long-term filter banks proposed in this paper, we conduct two groups of experiments, on audio source separation and audio scene classification respectively. To achieve a fair comparison with traditional frequency filter banks, all experiments conducted in this section use the same settings and structures except for the items listed below.

- Models: The models tested in this section differ from each other in two aspects. The variants of frequency filter banks include TriFB and GaussFB, as described in Section 2. For long-term filter banks, the Gaussian shape-constrained filters introduced in Section 2.2 are named GaussLTFB and totally independent filters are named FullLTFB. The baseline of our experiments has no long-term filter banks, which is labelled as Null. These names are combined to differentiate models. For example, when TriFB and FullLTFB are used in the model, the model is named TriFB-FullLTFB.
- Initialization: When we use totally independent filters as the long-term filter banks, the two initialization methods discussed in Section 2.2 are tested in this section. When the parameters are initialized randomly, the method is named Random; when the parameters are initialized using an identity matrix, the method is named Identity.
- Reconstruction: When the spectrogram reconstruction is implemented as Eq. 5, the method is named Re_inv; when the reconstruction is implemented as Eq. 6, the method is named Re_toep.

In all experiments, the audio signal is first transformed using the short-time Fourier transform with a frame length of 1024 and a frameshift of 220. The number of frequency filters is set to 64; the detailed settings of the NN structures are shown in Fig. 5. All parameters in the neural network are trained jointly using the Adam [45] optimizer; the learning rate is initialized to 0.001.

4.1 Audio source separation

In this experiment, we investigate the application of long-term filter banks to the audio source separation task using the MIR-1K dataset [46]. The dataset consists of 1000 song clips recorded at a sample rate of 16 kHz, with durations ranging from 4 to 13 s. The dataset is used with 4 training/testing splits. In each split, 700 of the examples are randomly selected for training and the others for testing. We use the mean average accuracy over the 4 splits as the evaluation criterion. In order to achieve a fair comparison, we use this dataset to create 3 sets of mixtures. For each clip, we mix the vocal and music tracks under various conditions, where the energy ratio between music and voice takes the values 0.1, 1 and 10 respectively.

We first test our methods on the outputs of frequency filter banks. In this case, the combination of classical frequency filter banks and our proposed temporal filter banks works as two-dimensional filter banks on magnitude spectrograms. Classical CNN models can learn two-dimensional filters on spectrograms directly. Thus we introduce a 1-layer CNN model as a comparison. The CNN model is implemented as in [22], but the convolutional layer here is composed of learnable parameters instead of the constant Gaussian lateral connectivity in [22]. This convolution layer works as a two-dimensional filter whose size is set to 5 × 5; the outputs of this layer are then processed as in Fig. 5a. We use the NN model in [47] and the one-layer CNN model as our baseline models. For our proposed long-term filter banks, we test two variant modules, GaussLTFB and FullLTFB, which have been defined at the beginning of Section 4. For FullLTFB, the two initialization methods discussed in Section 2.2 are tested respectively. The three variant modules GaussLTFB, FullLTFB-Random and FullLTFB-Identity can be applied to the two types of frequency filter banks, TriFB and GaussFB, so a total of six long-term filter bank experiments are conducted in this part.

Table 1 shows the results of these experiments. From the results, we can draw the following conclusions. First, the best results in the table are obtained using long-term filter banks, which demonstrates the effectiveness of our proposal, especially when the energy of interference is larger than music. As an example, when we use Gaussian frequency filter banks and the energy ratio between music and voice is 1, the reconstruction error is reduced by relatively 6.7% by using Gaussian shape-constrained long-term filter banks. Second, totally independent filters are severely influenced by the initialization. When the parameters are initialized using an identity matrix, the performance is close to that of the Gaussian shape-constrained filters in this task. However, when the parameters are initialized randomly, the reconstruction error seems to be unable to converge effectively. This result has to do with the task itself, which will be further tested in Section 4.3. Third, the one-layer CNN model improves the performance only when the energy ratio between music and voice is 0.1; this can be attributed to the local sensitivity of the reconstruction task. As a matter of fact, the time durations of the long-term filter banks we learned here are 1 in most frequency bins. Thus, the convolution size 5 × 5 is too large. Finally, the Toeplitz inversion motivated reconstruction algorithm performs much better than the direct inverse matrix algorithm. When the direct inverse matrix algorithm is used, the performance of our proposed long-term filter banks becomes even worse than that of the frequency filter banks.

Table 1 Reconstruction error of audio source separation using frequency filter banks as input

| Init | Method | Re_toep, M/V = 0.1 | Re_toep, M/V = 1 | Re_toep, M/V = 10 | Re_inv, M/V = 0.1 | Re_inv, M/V = 1 | Re_inv, M/V = 10 |
|---|---|---|---|---|---|---|---|
| – | TriFB-Null | 3.49 | 1.51 | 0.55 | 3.49 | 1.51 | 0.55 |
| – | GaussFB-Null | 3.28 | 1.47 | 0.58 | 3.28 | 1.47 | 0.58 |
| – | TriFB-CNN-1layer | 2.85 | 1.51 | 0.61 | 2.85 | 1.51 | 0.61 |
| – | GaussFB-CNN-1layer | 2.91 | 1.50 | 0.64 | 2.91 | 1.50 | 0.64 |
| – | TriFB-GaussLTFB | 2.66 | 1.38 | 0.50 | 3.65 | 1.80 | 0.74 |
| – | GaussFB-GaussLTFB | 2.60 | 1.39 | 0.56 | 3.91 | 1.67 | 0.67 |
| Random | TriFB-FullLTFB | 3.90 | 41.37 | 2.28 | 3.84 | 1.83 | 0.78 |
| Random | GaussFB-FullLTFB | 3.55 | 1.99 | 0.86 | 3.85 | 1.64 | 0.66 |
| Identity | TriFB-FullLTFB | 2.69 | 1.39 | 0.52 | 3.92 | 1.63 | 0.62 |
| Identity | GaussFB-FullLTFB | 2.62 | 1.39 | 0.56 | 3.85 | 1.51 | 0.59 |

M/V represents the energy ratio between music and voice

We now test our methods on magnitude spectrograms as described in [47]. In this situation, long-term filter banks are used as one-dimensional filter banks to extract temporal information. The size of the magnitude spectrograms is 513 × 128. The settings of the NN structures in Fig. 5a are modified correspondingly to adapt to this size. We again use the NN model in [47] and the 1-layer CNN model as our baseline models. The three variant modules GaussLTFB, FullLTFB-Random and FullLTFB-Identity are applied to magnitude spectrograms directly in this part.

The results of these experiments are shown in Table 2. Compared with the results in Table 1, all the conclusions above remain unchanged. When the energy ratio between music and voice is 1, the reconstruction error is reduced by relatively 5.0% by using Gaussian shape-constrained long-term filter banks; this effect is less obvious than the result in Table 1. This is because the information in magnitude spectrograms is very rich, so the performance of the simplest NN model is also good. But when the energy of interference is larger than music, the effectiveness of our long-term filter banks is obvious.

A direct view of the separation results can be seen in Fig. 6. The figure shows the clean music spectrogram (a), the mixed spectrogram (b) and the separated spectrograms (c–e) when the energy ratio is 1. For this example, (c) is the separated spectrogram from GaussFB-Null, which has been defined at the beginning of this section, (d) is the separated spectrogram from GaussFB-GaussLTFB and (e) is the separated spectrogram from GaussFB-FullLTFB. Compared with (c), the results of our proposed long-term filter banks in (d) and (e) show significant temporal coherence in each frequency bin, and are closer to the clean music spectrogram in (a).

4.2 Audio scene classification

In this section, we apply the long-term filter banks to the audio scene classification task.
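For the mixtures used in Section 4.1, where the music-to-voice energy ratio takes the values 0.1, 1 and 10, one plausible way to build such a mixture is to rescale one track before adding the two. The sketch below is an assumption about the procedure, not a description of the authors' code: it rescales the voice track and leaves the music untouched, the function names are hypothetical, and the short sample lists are invented stand-ins for real waveforms.

```python
import math

def energy(signal):
    # Sum of squared samples.
    return sum(v * v for v in signal)

def mix_at_energy_ratio(music, voice, ratio):
    # Scale the voice so that energy(music) / energy(scaled voice) == ratio,
    # then add the two tracks sample by sample (assumed mixing rule).
    gain = math.sqrt(energy(music) / (ratio * energy(voice)))
    scaled_voice = [gain * v for v in voice]
    mixture = [m + s for m, s in zip(music, scaled_voice)]
    return mixture, scaled_voice

# Invented toy tracks standing in for a music clip and a vocal clip.
music = [0.5, -0.25, 0.75, 0.1]
voice = [0.2, 0.4, -0.1, 0.3]
mixtures = {r: mix_at_energy_ratio(music, voice, r)[0] for r in (0.1, 1.0, 10.0)}
```

A ratio of 0.1 yields a mixture dominated by the voice, which is the condition the text refers to as the energy of interference being larger than music.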
Table 2 Reconstruction error of audio source separation using magnitude spectrograms as input

| Init | Method | Re_toep, M/V = 0.1 | Re_toep, M/V = 1 | Re_toep, M/V = 10 | Re_inv, M/V = 0.1 | Re_inv, M/V = 1 | Re_inv, M/V = 10 |
|---|---|---|---|---|---|---|---|
| – | Null [47] | 2.58 | 0.99 | 0.033 | 2.58 | 0.99 | 0.033 |
| – | CNN-1layer [22] | 2.83 | 0.96 | 0.047 | 2.83 | 0.96 | 0.047 |
| – | GaussLTFB | 2.49 | 0.94 | 0.037 | 2.60 | 0.95 | 0.034 |
| Random | FullLTFB | 2.77 | 1.12 | 0.080 | 2.85 | 1.03 | 0.043 |
| Identity | FullLTFB | 2.50 | 0.94 | 0.037 | 2.82 | 0.95 | 0.034 |

M/V represents the energy ratio between music and voice

Fig. 6 Reconstructed spectrogram of audio source separation task. The clean music spectrogram in a is randomly selected from the dataset. b The corresponding music and vocal mixture. c–e The reconstructed music spectrograms from the mixture spectrogram using different configurations

We employ the LITIS ROUEN dataset [48] and the DCASE2016 dataset [49] to conduct acoustic scene classification experiments. Details of these datasets are listed as follows.

- LITIS ROUEN dataset: This is the largest publicly available dataset for ASC to the best of our knowledge. The dataset contains about 1500 min of acoustic scene recordings belonging to 19 classes. Each audio recording is divided into 30-s examples without overlapping, giving 3026 examples in total. The sampling frequency of the audio is 22,050 Hz. The dataset is provided with 20 training/testing splits. In each split, 80% of the examples are kept for training and the other 20% for testing. We use the mean average accuracy over the 20 splits as the evaluation criterion.
- DCASE2016 dataset: The dataset is released as Task 1 of the DCASE2016 challenge. We use the development data in this paper. The development data contains about 585 min of acoustic scene recordings belonging to 15 classes. Each audio recording is divided into 30-s examples without overlapping, giving 1170 examples in total. The sampling frequency of the audio is 44,100 Hz. The dataset is divided into four folds. Our experiments follow this setting, and the average performance is reported.

For both datasets, the examples are 30 s long. In the data preprocessing step, we first divide the 30-s examples into 1-s clips with 50% overlap. Then each clip is processed using neural networks as in Fig. 5b. The classification results of all these clips are averaged to get an ensemble result for the 30-s examples. The size of the audio spectrograms is 64 × 128. For the CNN structure in Fig. 5b, the window sizes of the convolutional layers are 64 × 2 × 64, 64 × 3 × 64 and 64 × 4 × 64, and the fully connected layers are 196 × 128 × 19(15). For the DCASE2016 dataset, we use a dropout rate of 0.5. For all these methods, the learning rate is 0.001, the $l_2$ weight is 1e−4, and training is done using the Adam [45] update method and is stopped after 100 training epochs. In order to compute the results for each training-test split, we use the classification error over all classes. The final classification error is its average value over all splits.

We begin with experiments where we train different neural network models without long-term filter banks on both datasets. As described at the beginning of Section 4, our baseline systems take the outputs of frequency filter banks as input. TriFB and GaussFB are placed in the frequency domain to integrate the frequency information. Classical CNN models have the ability to learn two-dimensional filters on the spectrum directly. We introduce two CNN structures as a comparison. The first CNN model is implemented as in [50], which has multiple convolutional layers, pooling layers, and fully connected layers. The window size of the convolutional kernels is 5 × 5, the pooling size is 3, the output channels are [8, 16, 23], and the fully connected layers are 196 × 128 × 19(15). The other structure is the same as the one-layer CNN model described in Section 4.1; the outputs of this model are then processed as in Fig. 5b.

Table 3 Average performance comparison with related works on LITIS Rouen dataset and DCASE2016 dataset

| Method | DCASE2016 Error (%) | DCASE2016 F-measure (%) | LITIS Rouen Error (%) | LITIS Rouen F-measure (%) |
|---|---|---|---|---|
| TriFB-Null | 23.12 | 76.08 | 3.76 | 96.19 |
| GaussFB-Null | 22.69 | 76.56 | 3.48 | 96.44 |
| CNN-multilayer [50] | 26.45 | 72.44 | 4.00 | 95.80 |
| CNN-1layer [22] | 23.29 | 75.82 | 2.97 | 96.91 |
| RNN-Gam [26] | – | – | 3.4 | – |
| CNN-Gam [24] | – | – | 4.2 | – |
| MFCC-GMM [49] | 27.5 | – | – | – |
| DNN-CQT [51] | – | 78.1 | – | 96.6 |
| DNN-Mel [53] | 23.6 | – | – | – |
| CNN-Mel [54] | 24.0 | – | – | – |

The results of these experiments are shown in Table 3. Compared with other CNN related works, our baseline models achieve gains in accuracy on both datasets. On the LITIS Rouen dataset, the recurrent neural network (RNN) [26] performs better than our baseline models, because of the powerful sequence modelling capabilities of RNNs. The DNN model in [51] is the best-performing single model on both datasets; this can be attributed to the lack of training data and the stability of Constant Q-transform (CQT) [52] feature representations. On the DCASE2016 dataset, only the DNN model using CQT features performs better than our baseline models. The classical CNN model with three layers performs almost the same as [24] on the LITIS Rouen dataset, but its performance deteriorates sharply on the DCASE2016 dataset. This can also be attributed to the lack of training data, especially on the DCASE2016 dataset. The CNN model with one convolutional layer performs a little better, but still worse than our baseline models. These results show that the time-frequency structure of the spectrum is difficult to learn using two-dimensional convolution kernels in classical CNN models. For the two baseline models, GaussFB performs better than TriFB on both datasets, because Gaussian frequency filter banks can extract more global information. In conclusion, the results of our baseline models are in line with expectations on both datasets.

We now test our long-term filter banks on both datasets. We again test three variant modules in this part: GaussLTFB, FullLTFB-Random and FullLTFB-Identity. These three variant modules can be injected into neural networks directly as in Fig. 5b.

Table 4 Average performance comparison using different configurations on LITIS Rouen dataset and DCASE2016 dataset

| Init | Method | DCASE2016 Error (%) | DCASE2016 F-measure (%) | LITIS Rouen Error (%) | LITIS Rouen F-measure (%) |
|---|---|---|---|---|---|
| – | TriFB-Null | 23.12 | 76.08 | 3.76 | 96.19 |
| – | GaussFB-Null | 22.69 | 76.56 | 3.48 | 96.44 |
| – | TriFB-GaussLTFB | 22.40 | 76.79 | 2.82 | 97.05 |
| – | GaussFB-GaussLTFB | 22.15 | 77.11 | 2.97 | 96.91 |
| Random | TriFB-FullLTFB | 22.67 | 76.49 | 3.47 | 96.35 |
| Random | GaussFB-FullLTFB | 21.21 | 78.05 | 2.96 | 96.92 |
| Identity | TriFB-FullLTFB | 23.35 | 75.69 | 3.67 | 96.18 |
| Identity | GaussFB-FullLTFB | 23.13 | 75.83 | 3.21 | 96.61 |

Fig. 7 Validation curves on LITIS ROUEN dataset and DCASE2016 dataset. a, b The proposed methods on LITIS ROUEN dataset. d, e The proposed methods on DCASE2016 dataset. c The classical CNN models on LITIS ROUEN dataset. f The classical CNN models on DCASE2016 dataset

Table 4 is the performance comparison on both datasets. Models with the GaussLTFB module perform consistently better than the corresponding baseline models. Although the performance fluctuates for different variants, the performance gain is obvious. For FullLTFB, random initialization obtains a performance gain on both datasets, but identity initialization degrades the performance on the DCASE2016 dataset. This can be attributed to the fact that in classification tasks we need to extract a global representation of all frames; more details will be discussed in Section 4.3. On the LITIS Rouen dataset, the TriFB-GaussLTFB model performs significantly better than the state-of-the-art result in [51] and obtains a classification error of 2.82%. On the DCASE2016 dataset, the GaussFB-FullLTFB model with random initialization reduces the classification error by relatively 6.5% and reaches the performance of the DNN model using CQT features in [51], meaning that the long-term filter banks make up for the lack of feature extractions.

Validation curves on both datasets are shown in Fig. 7. After 100 training epochs, experiments on the DCASE2016 dataset encounter an overfitting problem; experiments on the LITIS ROUEN dataset have almost converged. Figure 7c, f shows that the performance of the classical CNN model is significantly worse than that of models with only the frequency filter banks, which is consistent with the results in Table 3. The performance of the one-layer CNN model is between the TriFB and GaussFB models on both datasets. Figure 7a–e shows consistent results with Table 4.

4.3 Reconstruction vs classification

In the audio source separation experiment, when the parameters of totally independent long-term filter banks are initialized randomly, the result seems to be unable to converge effectively. However, the opposite holds in the audio scene classification task.

Figure 8 explains the discrepancy between the above two tasks. Figure 8a, b shows the filters learned on the MIR-1K dataset. At low frequencies, the time durations of the filters are almost equal to 1; only at very high frequencies do the time durations become large. But for Fig. 8c, d, which is learned on the DCASE2016 dataset, the time duration is much larger. It is intuitive that in the audio source separation task the time duration of the filters is much smaller than in the audio scene classification task, especially at low frequencies. When the parameters of totally independent long-term filter banks are initialized randomly, the implicit assumption is that the time duration of the filters is as large as the number of all frames, which is not applicable. In reconstruction related tasks, for example, the long-term correlation is much more limited because our goal is to reconstruct the spectrogram frame by frame. However, in classification tasks, we need to extract a global representation of all frames, which is exactly in line with our hypothesis.

Fig. 8 Time durations of long-term filter banks in different tasks. a, b The long-term filters learned on MIR-1K dataset. c, d The long-term filters learned on DCASE2016 dataset

5 Conclusions

A novel framework of filter banks that can extract long-term time and frequency correlation is proposed in this paper. The new filters are constructed after traditional frequency filters and can be implemented using Toeplitz matrix motivated neural networks. A Gaussian shape constraint is introduced to limit the time duration of the filters, especially in reconstruction-related tasks. Then a spectrogram reconstruction method using the Toeplitz matrix inversion is implemented using neural networks. The spectrogram reconstruction error in the audio source separation task is reduced by relatively 6.7% and the classification error in the audio scene classification task is reduced by relatively 6.5%. This paper provides a practical and complete framework to learn long-term filter banks for different tasks.

The former frequency filter banks are somehow interrelated with the long-term filter banks. Combining the idea of these two types of filter banks, future work will be an investigation on two-dimensional filter banks.

Funding
This work was partly funded by National Natural Science Foundation of China (Grant No: 61571266).

Authors' contributions
TZ designed the core methodology of the study, carried out the implementation and experiments, and he drafted the manuscript. JW participated in the study and helped to draft the manuscript. Both authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 21 November 2017 Accepted: 30 April 2018

References
1. AS Bregman, Auditory scene analysis: the perceptual organization of sound. (MIT Press, Cambridge, 1994)
2. S McAdams, A Bregman, Hearing musical streams. Comput. Music J. 3(4), 26–60 (1979)
3. AS Bregman, Auditory streaming is cumulative. J. Exp. Psychol. Hum. Percept. Perform. 4(3), 380 (1978)
4. GA Miller, GA Heise, The trill threshold. J. Acoust. Soc. Am. 22(5), 637–638 (1950)
5. MA Bee, GM Klump, Primitive auditory stream segregation: a neurophysiological study in the songbird forebrain. J. Neurophysiol. 92(2), 1088–1104 (2004)
6. D Pressnitzer, M Sayles, C Micheyl, IM Winter, Perceptual organization of sound begins in the auditory periphery. Curr. Biol. 18(15), 1124–1128 (2008)
7. H Attias, CE Schreiner, in Advances in Neural Information Processing Systems. Temporal low-order statistics of natural sounds (MIT Press, Cambridge, 1997), pp. 27–33
8. NC Singh, FE Theunissen, Modulation spectra of natural sounds and ethological theories of auditory processing. J. Acoust. Soc. Am. 114(6), 3394–3411 (2003)
9. SA Shamma, M Elhilali, C Micheyl, Temporal coherence and attention in auditory scene analysis. Trends. Neurosci. 34(3), 114–123 (2011)
10. DL Donoho, De-noising by soft-thresholding. IEEE Trans. Inf. Theory. 41(3), 613–627 (1995)
11. B Gao, W Woo, L Khor, Cochleagram-based audio pattern separation using two-dimensional non-negative matrix factorization with automatic sparsity adaptation. J. Acoust. Soc. Am. 135(3), 1171–1185 (2014)
12. A Biem, S Katagiri, B-H Juang, in Neural Networks for Processing [1993] III. Proceedings of the 1993 IEEE-SP Workshop. Discriminative feature extraction for speech recognition (IEEE, 1993), pp. 392–401
13. Á de la Torre, AM Peinado, AJ Rubio, VE Sánchez, JE Diaz, An application of minimum classification error to feature space transformations for speech recognition. Speech Commun. 20(3-4), 273–290 (1996)
30. S Rosen, RJ Baker, A Darling, Auditory filter nonlinearity at 2 khz in normal hearing listeners. J. Acoust. Soc. Am. 103(5), 2539–2550 (1998)
31. R Patterson, I Nimmo-Smith, J Holdsworth, P Rice, in a Meeting of the IOC Speech Group on Auditory Modelling at RSRE, vol. 2. An efficient auditory filterbank based on the gammatone function, (1987)
32. S Young, G Evermann, M Gales, T Hain, D Kershaw, X Liu, G Moore, J Odell, D Ollason, D Povey, et al, The htk book. Cambridge university engineering department. 3, 175 (2002)
33. DE Rumelhart, GE Hinton, RJ Williams, et al, Learning representations by back-propagating errors. Cogn. Model. 5(3), 1 (1988)
34. EH Bareiss, Numerical solution of linear equations with Toeplitz and vector Toeplitz matrices. Numerische Mathematik. 13(5), 404–424 (1969)
35. N Deo, M Krishnamoorthy, Toeplitz networks and their properties. IEEE Trans. Circuits Syst. 36(8), 1089–1092 (1989)
36. YN Dauphin, R Pascanu, C Gulcehre, K Cho, S Ganguli, Y Bengio, in Advances in Neural Information Processing Systems. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization (Curran Associates, Inc., 2014), pp. 2933–2941
37. O Shamir, Distribution-specific hardness of learning neural networks (2016). arXiv preprint arXiv:1609.01037
38. J Jia, T Sogabe, M El-Mikkawy, Inversion of k-tridiagonal matrices with toeplitz structure. Comput. Math. Appl. 65(1), 116–125 (2013)
39. A Ben-Israel, TN Greville, Generalized inverses: theory and applications, vol. 15. (Springer Science & Business Media, 2003)
14. S Akkarakaran, P Vaidyanathan, in Acoustics, Speech, and Signal Processing, 1999. Proceedings, 1999 IEEE International Conference On. New results and
40.
ST Lee, H-K Pang, H-W Sun, Shift-invert arnoldi approximation to the open problems on nonuniform filter-banks, vol. 3 (IEEE, 1999), Toeplitz matrix exponential. SIAM J. Sci. Comput. 32(2), 774–792 (2010) pp. 1501–1504 41. X Zhao, Y Shao, D Wang, Casa-based robust speaker identification. IEEE 15. S Davis, P Mermelstein, Comparison of parametric representations for Trans. Audio Speech Lang. Process. 20(5), 1608–1616 (2012) monosyllabic word recognition in continuously spoken sentences. IEEE 42. RO Duda, PE Hart, DG Stork, Pattern classification. (Wiley, New York, 1973) Trans. Acoustics Speech Signal Process. 28(4), 357–366 (1980) 43. Y Kim, Convolutional neural networks for sentence classification (2014). 16. A Biem, S Katagiri, E McDermott, B-H Juang, An application of arXiv preprint arXiv:1408.5882 discriminative feature extraction to filter-bank-based speech recognition. 44. R Collobert, J Weston, L Bottou, M Karlen, K Kavukcuoglu, P Kuksa, Natural IEEE Trans. Speech Audio Process. 9(2), 96–110 (2001) language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 17. TN Sainath, B Kingsbury, A-R Mohamed, B Ramabhadran, in Automatic 2493–2537 (2011) Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop On. 45. D Kingma, J Ba, Adam: A method for stochastic optimization (2014). arXiv Learning filter banks within a deep neural network framework (IEEE, preprint arXiv:1412.6980 2013), pp. 297–302 46. C-L Hsu, JSR Jang, MIR Database (2010). http://sites.google.com/site/ 18. H Yu, Z-H Tan, Y Zhang, Z Ma, J Guo, Dnn filter bank cepstral coefficients unvoicedsoundseparation/mir-1k/. Retrieved 10 Sept 2017 for spoofing detection. IEEE Access. 5, 4779–4787 (2017) 47. EM Grais, G Roma, AJ Simpson, MD Plumbley, Two-stage single-channel 19. H Seki, K Yamamoto, S Nakagawa, in Acoustics, Speech and Signal audio source separation using deep neural networks. IEEE/ACM Trans. Processing (ICASSP), 2017 IEEE International Conference On. 
A deep neural Audio Speech Lang. Process. 25(9), 1773–1783 (2017) network integrated with filterbank learning for speech recognition (IEEE, 48. A Rakotomamonjy, G Gasso, IEEE/ACM Trans. Audio Speech Lang. 2017), pp. 5480–5484 Process. 23(1), 142–153 (2015) 20. H Yu, Z-H Tan, Z Ma, R Martin, J Guo, Spoofing detection in automatic 49. A Mesaros, T Heittola, T Virtanen, in Signal Processing Conference speaker verification systems using dnn classifiers and dynamic acoustic (EUSIPCO), 2016 24th European. Tut database for acoustic scene features. IEEE Trans. Neural Netw. Learn. Syst. 1–12 (2017) classification and sound event detection (IEEE, 2016), pp. 1128–1132 21. H Yu, Z-H Tan, Z Ma, J Guo, Adversarial network bottleneck features for 50. Y LeCun, Y Bengio, et al., Convolutional networks for images, speech, and noise robust speaker verification (2017). arXiv preprint arXiv:1706.03397 time series. The handbook of brain theory and neural networks. 3361(10), 22. D Wang, P Chang, An oscillatory correlation model of auditory streaming. 1995 (1995) Cogn. Neurodynamics. 2(1), 7–19 (2008) 51. V Bisot, R Serizel, S Essid, G Richard, Feature learning with matrix 23. S Lawrence, CL Giles, AC Tsoi, AD Back, Face recognition: a convolutional factorization applied to acoustic scene classification. IEEE/ACM Trans. neural-network approach. IEEE Trans. Neural Netw. 8(1), 98–113 (1997) Audio Speech Lang. Process. 25(6), 1216–1229 (2017) 24. H Phan, L Hertel, M Maass, P Koch, R Mazur, A Mertins, Improved audio 52. JC Brown, Calculation of a constant q spectral transform. J. Acoust. Soc. scene classification based on label-tree embeddings and convolutional Am. 89(1), 425–434 (1991) neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 53. Q Kong, I Sobieraj, W Wang, M Plumbley, in Proceedings of DCASE 2016. 1278–1290 (2017) Deep neural network baseline for dcase challenge 2016 (Tampere 25. S Hochreiter, J Schmidhuber, Long short-term memory. Neural Comput. 
University of Technology. Department of Signal Processing, 2016) 9(8), 1735–1780 (1997) 54. D Battaglino, L Lepauloux, N Evans, F Mougins, F Biot, Acoustic scene 26. H Phan, P Koch, F Katzberg, M Maass, R Mazur, A Mertins, Audio scene classification using convolutional neural networks. DCASE2016 Challenge, classification with deep recurrent neural networks (2017). arXiv preprint Tech. Rep. (Tampere University of Technology. Department of Signal arXiv:1703.04770 Processing, 2016) 27. S Umesh, L Cohen, D Nelson, in Acoustics, Speech, and Signal Processing, 1999. Proceedings, 1999 IEEE International Conference On. Fitting the Mel scale, vol. 1 (IEEE, 1999), pp. 217–220 28. J Allen, Short term spectral analysis, synthesis, and modification by discrete fourier transform. IEEE Trans. Acoustics Speech Signal Process. 25(3), 235–238 (1977) 29. RF Lyon, AG Katsiamis, EM Drakakis, in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium On. History and future of auditory filter models (IEEE, 2010), pp. 3809–3812

Journal

EURASIP Journal on Audio, Speech, and Music Processing, Springer Journals

Published: May 30, 2018
