journal article
Open Access Collection
Lightweight CNN-transformer hybrid network for English speech recognition
Li, Yan; Huang, Weiguo; Gu, Cui
doi: 10.1504/ijbidm.2026.153309
pmid: N/A
Speech recognition is a core technology for human-computer interaction, and English speech recognition in particular has high practical value in global communication scenarios. CNN-based speech recognition models are good at extracting local features but cannot effectively capture global semantics. Transformer-based models, in contrast, outperform CNNs at extracting global semantics, but their large parameter counts and high computational complexity make them difficult to deploy and run on resource-constrained devices. Motivated by this trade-off, we propose a lightweight CNN-transformer hybrid network (LwCTHNet) for English speech recognition. LwCTHNet integrates local feature extraction, frequency-domain detail supplementation, and global semantic capture by alternately stacking 3 × 3 convolution layers, wavelet-enhanced convolution modules, and lightweight transformer modules. It also achieves multi-scale feature learning through skip connections and enhances feature discriminability with a mixed loss function that combines cross-entropy loss and contrastive loss. Experimental results on three English speech recognition datasets show that the proposed method not only has the smallest parameter count among the compared models but also achieves a near-optimal word error rate. This indicates that LwCTHNet strikes a good balance among recognition performance, computational complexity, and parameter size.
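The mixed loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the form of the contrastive term or how the two losses are weighted, so everything below is an assumption for illustration: a pairwise margin-based contrastive loss, a scalar `weight`, and the hypothetical function names `cross_entropy`, `contrastive`, and `mixed_loss`. It is not the authors' implementation.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one sample, computed from raw logits.

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def contrastive(emb_a, emb_b, same_class, margin=1.0):
    """Pairwise margin-based contrastive loss (an assumed form).

    Pulls embeddings of the same class together and pushes embeddings
    of different classes at least `margin` apart.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    if same_class:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

def mixed_loss(logits, target, emb_a, emb_b, same_class,
               weight=0.5, margin=1.0):
    """Cross-entropy plus a weighted contrastive term.

    `weight` is a hypothetical hyperparameter; the abstract gives no value.
    """
    ce = cross_entropy(logits, target)
    con = contrastive(emb_a, emb_b, same_class, margin)
    return ce + weight * con
```

In this sketch, the cross-entropy term drives classification accuracy while the contrastive term shapes the embedding space, which is one common way such a combination is used to improve feature discriminability.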