Applied Sciences, Vol. 14, Issue 3 (2024)
ARTICLE
TITLE

A Light-Weight Autoregressive CNN-Based Frame Level Transducer Decoder for End-to-End ASR

Hyeon-Kyu Noh and Hong-June Park    

Abstract

A convolutional neural network (CNN) transducer decoder is proposed to reduce the decoding time of an end-to-end automatic speech recognition (ASR) system while maintaining accuracy. The CNN, with 177 k parameters and a kernel size of 6, generates the probabilities of the current token at the token level, that is, at each token transition of the output token sequence. The two probabilities of the current token, one from the encoder and the other from the CNN, are added at the frame level, which reduces the number of decoding steps to the number of input frames. An encoder composed of an 18-layer Conformer was combined with the proposed decoder and trained on the LibriSpeech data set using the forward-backward algorithm. Space and re-appearance tokens are added to the 300 word-piece tokens to represent the token string; a space token appears at a frame between two words. Compared with autoregressive decoders such as transformer and RNN-T decoders, this work provides comparable word error rates (WERs) with much less decoding time. Compared with non-autoregressive decoders such as CTC, it improves WERs.
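The frame-level combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy "CNN" below is a single random causal convolution over the last 6 emitted tokens, and all names (`make_toy_cnn`, `cnn_token_logprobs`, `frame_level_greedy_decode`, `PAD_ID`) are hypothetical. The vocabulary size of 302 assumes the space and re-appearance tokens are simply appended to the 300 word pieces. The abstract does not say how frames that repeat the current token are scored, so the sketch assumes the CNN scores are recomputed only at a token transition and held fixed in between. Greedy search is likewise an assumption.

```python
import numpy as np

KERNEL_SIZE = 6          # kernel size stated in the abstract
VOCAB_SIZE = 302         # assumption: 300 word pieces + space + re-appearance
PAD_ID = VOCAB_SIZE      # hypothetical padding id for short histories


def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))


def make_toy_cnn(dim=8, seed=0):
    """Random weights standing in for a trained decoder (illustration only)."""
    rng = np.random.default_rng(seed)
    return {
        "embed": rng.normal(size=(VOCAB_SIZE + 1, dim)),
        "kernel": rng.normal(size=(KERNEL_SIZE, dim, dim)) * 0.1,
        "out": rng.normal(size=(dim, VOCAB_SIZE)) * 0.1,
    }


def cnn_token_logprobs(history, cnn):
    """Token-level log-probabilities from the last KERNEL_SIZE emitted tokens."""
    ids = [PAD_ID] * KERNEL_SIZE + list(history)
    window = cnn["embed"][ids[-KERNEL_SIZE:]]              # (K, dim)
    hidden = np.tanh(np.einsum("kd,kde->e", window, cnn["kernel"]))
    return log_softmax(hidden @ cnn["out"])                # (VOCAB_SIZE,)


def frame_level_greedy_decode(enc_logp, cnn):
    """One decoding step per input frame; the CNN runs only on transitions."""
    history, frames = [], []
    dec_logp = cnn_token_logprobs(history, cnn)
    cnn_calls = 1
    for frame_scores in enc_logp:                          # T frames -> T steps
        token = int(np.argmax(frame_scores + dec_logp))    # add the two scores
        frames.append(token)
        if not history or token != history[-1]:            # token transition
            history.append(token)
            dec_logp = cnn_token_logprobs(history, cnn)
            cnn_calls += 1
    return frames, history, cnn_calls
```

For an encoder output of shape `(T, VOCAB_SIZE)`, the loop runs exactly `T` times regardless of how many tokens are emitted, which is the property the abstract attributes to frame-level decoding. The CNN is evaluated once at the start and once per token transition.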

Similar articles

       
 
Jiawei Zhang, Xin Zhao, Tao Jiang, Md Mamunur Rahaman, Yudong Yao, Yu-Hao Lin, Jinghua Zhang, Ao Pan, Marcin Grzegorzek and Chen Li    
This paper proposes a novel pixel interval down-sampling network (PID-Net) for dense tiny object (yeast cells) counting tasks with higher accuracy. The PID-Net is an end-to-end convolutional neural network (CNN) model with an encoder-decoder architecture...
Journal: Applied Sciences

 
Gang Sun, Hancheng Yu, Xiangtao Jiang and Mingkui Feng    
Edge detection is one of the fundamental computer vision tasks. Recent methods for edge detection based on a convolutional neural network (CNN) typically employ the weighted cross-entropy loss. Their predicted results being thick and needing post-process...
Journal: Information

 
Shiyu Zhang, Jianguo Kong, Chao Chen, Yabin Li and Haijun Liang    
The rise of end-to-end (E2E) speech recognition technology in recent years has overturned the design pattern of cascading multiple subtasks in classical speech recognition and achieved direct mapping of speech input signals to text labels. In this study,...
Journal: Aerospace

 
Zhigang Zhu, Zhijian Yi, Shiyao Li and Lin Li    
Radar data mining is the key module for signal analysis, where patterns hidden inside of signals are gradually available in the learning process and its superiority is significant for enhancing the security of the radar emitter classification (REC) syste...
Journal: Aerospace

 
Bo Zhao, Xiaoyan Luo, Panpan Tang, Yang Liu, Haoming Wan and Ninglei Ouyang    
Change detection (CD) is in demand in satellite imagery processing. Inspired by the recent success of the combined transformer-CNN (convolutional neural network) model, TransCNN, originally designed for image recognition, in this paper, we present STDeco...
Journal: Applied Sciences