J. Barker, Ning Ma, André Coy, M. Cooke (2010)
Speech fragment decoding techniques for simultaneous speaker identification and speech recognition. Comput. Speech Lang., 24
Ron Weiss, D. Ellis (2007)
Monaural Speech Separation using Source-Adapted Models. 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
M. Radfar, R. Dansereau (2007)
Single-Channel Speech Separation Using Soft Mask Filtering. IEEE Transactions on Audio, Speech, and Language Processing, 15
M. Gales (1998)
Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang., 12
T. Virtanen (2006)
Speech recognition using factorial hidden Markov models for separation in the feature space
Geoffrey Hinton, L. Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, N. Jaitly, A. Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, Brian Kingsbury (2017)
Top Downloads in IEEE Xplore [Reader's Choice]. IEEE Signal Processing Magazine, 34
Geoffrey Hinton (2012)
A Practical Guide to Training Restricted Boltzmann Machines
Abdel-rahman Mohamed, George Dahl, Geoffrey Hinton (2012)
Acoustic Modeling Using Deep Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing, 20
(2010)
In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH)
Geoffrey Hinton, Simon Osindero, Y. Teh (2006)
A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18
Chao Weng, Dong Yu, M. Seltzer, J. Droppo (2015)
Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23
Yang Shao, Soundararajan Srinivasan, Z. Jin, Deliang Wang (2010)
A computational auditory scene analysis system for speech segregation and robust speech recognition. Comput. Speech Lang., 24
D. Wang, G. Brown (2006)
Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Trans. Neural Networks, 19
Jun Du, Yanhui Tu, Lirong Dai, Chin-Hui Lee (2016)
A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24
Yaodong Zhang, James Glass (2009)
Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. 2009 IEEE Workshop on Automatic Speech Recognition & Understanding
M. Cooke, J. Barker, S. Cunningham, Xu Shao (2006)
An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5 Pt 1)
Yong Wang, Yixin Yang, Shiduo Yu (2018)
Design of unidirectional acoustic probes with flexible directivity patterns using two acoustic particle velocity sensors. The Journal of the Acoustical Society of America, 144(1)
D. Reynolds, R. Rose (1995)
Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process., 3
C. Nadeu, Dusan Macho, J. Hernando (2000)
Time and frequency filtering of filter-bank energies for robust HMM speech recognition. Speech Commun., 34
M. Cooke, J. Hershey, Steven Rennie (2010)
Monaural speech separation and recognition challenge. Comput. Speech Lang., 24
Zoubin Ghahramani, Michael Jordan (2001)
Factorial Hidden Markov Models. MIT Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences
(2016)
http://staffwww.dcs.shef.ac.uk/people/M.Cooke/SpeechSeparationChallenge.htm
Po-Sen Huang, Minje Kim, M. Hasegawa-Johnson, P. Smaragdis (2015)
Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23
F. Seide, Gang Li, Xie Chen, Dong Yu (2011)
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription. 2011 IEEE Workshop on Automatic Speech Recognition & Understanding
T. Kristjansson, J. Hershey, P. Olsen, Steven Rennie, R. Gopinath (2006)
Super-human multi-talker speech recognition: the IBM 2006 speech separation challenge system
J. Ming, Timothy Hazen, James Glass (2006)
Combining missing-feature theory, speech enhancement, and speaker-dependent/-independent modeling for speech separation. Comput. Speech Lang., 24
Mehryar Mohri, Fernando Pereira, M. Riley (2002)
Weighted finite-state transducers in speech recognition. Comput. Speech Lang., 16
(2007)
In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 114–117)
Matthias Zöhrer, Robert Peharz, F. Pernkopf (2015)
Representation Learning for Single-Channel Source Separation and Bandwidth Extension. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23
Yong Xu, Jun Du, Lirong Dai, Chin-Hui Lee (2014)
An Experimental Study on Speech Enhancement Based on Deep Neural Networks. IEEE Signal Processing Letters, 21
P. de Boer, Dirk Kroese, Shie Mannor, R. Rubinstein (2005)
A Tutorial on the Cross-Entropy Method. Annals of Operations Research, 134
We propose a novel speaker-dependent (SD) multi-condition (MC) training approach to jointly learning deep neural network (DNN) acoustic models and an explicit speech separation structure for recognizing multi-talker mixed speech in a single-channel setting. First, an MC acoustic modeling framework is established to train an SD-DNN model in multi-talker scenarios. Such a recognizer significantly reduces decoding complexity and improves recognition accuracy over recognizers that use speaker-independent DNN models with a complicated joint decoding structure, even when the speaker identities in the mixed speech are assumed known. In addition, an SD regression DNN that maps the acoustic features of mixed speech to the speech features of a target speaker is jointly trained with the SD-DNN acoustic models. Experimental results on the Speech Separation Challenge (SSC) small-vocabulary recognition task show that the proposed approach under multi-condition training achieves an average word error rate (WER) of 3.8%, a relative WER reduction of 65.1% from a top-performing, DNN-based pre-processing-only approach we proposed earlier under clean-condition training (Tu et al. 2016). Furthermore, the proposed joint DNN training framework yields a relative WER reduction of 13.2% over state-of-the-art systems under multi-condition training. Finally, the effectiveness of the proposed approach is also verified on the Wall Street Journal (WSJ0) medium-vocabulary continuous speech recognition task in a simulated multi-talker setting.
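To make the joint-training idea in the abstract concrete, the following is a minimal sketch, assuming PyTorch and illustrative layer sizes, feature dimensions, and loss weighting (none of which are specified above): an SD regression DNN maps mixed-speech features toward the target speaker's features, an SD-DNN acoustic model consumes its output, and a single backward pass updates both networks through a combined regression and classification loss.

# Minimal joint-training sketch (assumed PyTorch; dimensions, depths, and the
# loss weight alpha are illustrative assumptions, not the paper's settings).
import torch
import torch.nn as nn

FEAT_DIM = 40       # e.g., filterbank feature dimension (assumption)
NUM_STATES = 2048   # number of tied HMM states for the acoustic model (assumption)

# SD regression DNN: features of mixed speech -> features of the target speaker.
separation_dnn = nn.Sequential(
    nn.Linear(FEAT_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, FEAT_DIM),
)

# SD-DNN acoustic model: (enhanced) features -> HMM-state posteriors.
acoustic_dnn = nn.Sequential(
    nn.Linear(FEAT_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, NUM_STATES),
)

mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()
params = list(separation_dnn.parameters()) + list(acoustic_dnn.parameters())
opt = torch.optim.SGD(params, lr=0.01)

def joint_step(mixed_feats, clean_feats, state_labels, alpha=0.5):
    """One joint update: regression loss on the separated features plus
    cross-entropy on the acoustic-model state predictions. alpha is an
    assumed hyper-parameter trading off the two terms."""
    enhanced = separation_dnn(mixed_feats)
    logits = acoustic_dnn(enhanced)  # gradients flow back through both DNNs
    loss = alpha * mse(enhanced, clean_feats) + (1 - alpha) * ce(logits, state_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy batch: 8 frames of mixed speech, target-speaker references, state labels.
loss = joint_step(torch.randn(8, FEAT_DIM), torch.randn(8, FEAT_DIM),
                  torch.randint(0, NUM_STATES, (8,)))

Because the cross-entropy gradient flows back through the separation DNN, the front-end is pushed to produce features that are useful for recognition, not merely close to the clean target, which is the intuition behind joint training over a pre-processing-only pipeline.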
Journal of Signal Processing Systems (Springer)
Published: Oct 4, 2017