Speech recognition is a core technology for human-computer interaction, and English speech recognition has particularly high practical value in global communication scenarios. CNN-based speech recognition models are good at extracting local features but cannot effectively capture global semantics. Transformer-based models, in contrast, outperform CNNs at capturing global semantics, but their large parameter counts and high computational complexity make them difficult to deploy and run on resource-constrained devices. Motivated by this, we propose a lightweight CNN-transformer hybrid network (LwCTHNet) for English speech recognition. LwCTHNet integrates local feature extraction, frequency-domain detail supplementation, and global semantic capture by alternately stacking 3 × 3 convolution layers, wavelet-enhanced convolution modules, and lightweight transformer modules. It also achieves multi-scale feature learning through skip connections and enhances feature discriminability with a mixed loss function that combines cross-entropy loss and contrastive loss. Experimental results on three English speech recognition datasets show that the proposed method not only has the smallest parameter size but also achieves a near-optimal word error rate. This indicates that LwCTHNet strikes a good balance among recognition performance, computational complexity, and parameter size.
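
The mixed loss mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything specific here is an assumption: the contrastive term is taken to be a pairwise margin-based loss over two embeddings, the two terms are combined as a weighted sum, and the names `alpha`, `margin`, and `mixed_loss` are hypothetical rather than taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample, computed from raw logits
    with a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def contrastive(emb_a, emb_b, same_label, margin=1.0):
    """Pairwise margin-based contrastive loss (assumed form, not from the source).

    Pairs with the same label are pulled together; pairs with different
    labels are pushed apart until they are at least `margin` away.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    if same_label:
        return dist ** 2
    return max(0.0, margin - dist) ** 2


def mixed_loss(logits, target, emb_a, emb_b, same_label, alpha=0.1, margin=1.0):
    """Weighted sum of the two terms; `alpha` is a hypothetical weight."""
    ce = cross_entropy(logits, target)
    con = contrastive(emb_a, emb_b, same_label, margin)
    return ce + alpha * con
```

In this sketch the cross-entropy term drives classification accuracy, while the contrastive term shapes the embedding space so that features of the same class cluster together. How the paper actually balances the two terms may differ.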
