Non-Autoregressive Sequential Modeling for Speech Processing

INTERSPEECH 2021 Special Session

About the session

Non-autoregressive (NAR) end-to-end (E2E) modeling is an emerging research topic in speech processing. It was originally proposed for machine translation with the aim of faster decoding. Conventional autoregressive (AR) models require left-to-right beam search for decoding, which tends to be complicated to implement and needs heuristics to suppress insertion errors or overly short results. In addition, because AR decoding is sequential, it cannot make efficient use of the parallel computation capability of GPUs or of other SoCs specialized for neural network inference. In contrast, NAR decoding is just forward propagation of a neural network, so it requires neither beam search nor sequential operation. These characteristics make inference much faster than with AR models and make NAR models suitable for on-device applications.

In the ASR area, connectionist temporal classification (CTC) is one of the NAR models. CTC combined with CNNs realizes efficient forward propagation and is suitable for low-latency applications [1, 2]. Some techniques proposed for NAR-based machine translation have also been applied to ASR. The mask-predict approach [3] was the first NAR work without CTC, and it achieved a dramatic improvement in real-time factor (RTF) with a small degradation in accuracy. Recent works mitigate this accuracy degradation by making use of CTC and the Transformer [4, 5, 6, 7, 8, 17]. These advances make NAR-based ASR promising for on-device applications. In addition, NAR has also been applied to speech translation [9].
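As a minimal sketch of why CTC decoding counts as non-autoregressive, the following Python example performs greedy CTC decoding: each frame's label is chosen independently of the others, and the only remaining work is to merge repeated labels and drop blanks. The function name `ctc_greedy_decode`, the blank index 0, and the toy score matrix are assumptions made for illustration; they are not taken from any of the cited papers.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Turn per-frame label scores into a label sequence (greedy CTC)."""
    # Per-frame argmax: no frame depends on a previously emitted label,
    # so this step could run in parallel over all frames.
    path = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]

    decoded: List[int] = []
    prev = None
    for label in path:
        # Keep a label only if it differs from the previous frame's label
        # (merging repeats) and is not the blank symbol.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


# Toy scores for 6 frames over 3 labels (0 = blank); values are made up.
toy_scores = [
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # label 1
    [0.10, 0.70, 0.20],  # label 1 again (merged with the previous frame)
    [0.60, 0.20, 0.20],  # blank separates two occurrences of label 1
    [0.20, 0.70, 0.10],  # label 1 (new token)
    [0.10, 0.10, 0.80],  # label 2
]
print(ctc_greedy_decode(toy_scores))  # [1, 1, 2]
```

The blank between the two runs of label 1 is what lets CTC emit the same token twice in a row; without it the two runs would be merged into one.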

In the TTS area, NAR models are applied to spectrogram generation and to the vocoder. For spectrogram generation, FastSpeech [10] achieved a significant speedup while maintaining audio quality competitive with AR models. For the vocoder, WaveGlow [11] generates audio samples from a spectrogram in a non-autoregressive manner and achieved audio quality comparable to AR models with much faster inference. Further improvements in audio quality and inference speed are investigated in [12, 13, 14, 15].

On-device speech processing is gaining attention because of privacy concerns. Speech is regarded as sensitive personal data, so many users do not want their voice to be sent to a server for processing such as automatic speech recognition (ASR). In addition, sending voice to a server increases latency, which is an important factor in how users perceive performance. NAR is promising for on-device applications because the characteristics described above make decoding faster.

NAR is also important from a scientific point of view. First, it challenges a kind of consensus among researchers, namely that speech is generated in left-to-right order and hence the AR model is legitimate. NAR models do not assume a left-to-right generation order. Advances in NAR research might lead to other generation orders suited to each application, such as ASR or translation. Second, NAR can address the mismatch between training and inference, because NAR decoding is just forward propagation of a neural network. The objective function mismatch is one of the most fundamental problems in ASR, and NAR methods can provide one solution toward this ultimate ASR research goal.

The session aims to share knowledge among researchers working on this topic. NAR models for speech processing are mainly investigated in ASR, speech translation and TTS. Gathering papers from these different areas into a single session and discussing them intensively can boost research on the topic.

Call for papers

Topics of this special session include (but are not limited to):

If you submit a paper about ASR, we recommend reporting the error rate and a metric for comparing decoding speed (e.g. real-time factor) on at least one of the following publicly available corpora:

The results table should look like this:

Corpus    | Model          | Training Condition | Decoding Condition | dev (WER) | eval (WER) | RTF
TEDLIUM2  | AR Transformer | +SpecAug           | +LM                | -         | -          | -
          | Novel-method   |                    | no LM              | -         | -          | -
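For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute them in Python. It assumes that RTF is defined as decoding time divided by audio duration and that WER is the word-level edit distance divided by the number of reference words. The function names and example sentences are illustrative only, and the session does not prescribe a particular scoring tool.

```python
from typing import List


def real_time_factor(decoding_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the decoded audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decoding_seconds / audio_seconds


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Example: 2 s of decoding for a 10 s utterance.
print(real_time_factor(2.0, 10.0))  # 0.2
# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

An RTF below 1 means decoding runs faster than real time. When reporting WER over a whole corpus, the usual practice is to sum the errors and the reference words across all utterances before dividing, rather than averaging per-utterance rates.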

If you have any questions, please feel free to contact yuyfujit [at]

Important Dates

(Same as the regular INTERSPEECH 2021 submission)

Session format

Panel discussions followed by a poster session.


[1] Pratap et al, “Scaling Up Online Speech Recognition Using ConvNets.”, Proc. INTERSPEECH 2020
[2] Kriman et al, “QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions.”, Proc. ICASSP 2020
[3] Chen et al, “Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition.”, arXiv preprint arXiv:1911.04908 (2019)
[4] Fujita et al, “Insertion-Based Modeling for End-to-End Automatic Speech Recognition.”, Proc. INTERSPEECH 2020
[5] Higuchi et al, “Improved Mask-CTC for Non-Autoregressive End-to-End ASR.”, arXiv preprint arXiv:2010.13270 (2020)
[6] Chan et al, “Imputer: Sequence Modelling via Imputation and Dynamic Programming.”, arXiv preprint arXiv:2002.08926 (2020)
[7] Chi et al, “Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment.”, arXiv preprint arXiv:2010.14233 (2020)
[8] Bai et al, “Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition.”, Proc. INTERSPEECH 2020
[9] Inaguma et al, “Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder.”, arXiv preprint arXiv:2010.13047 (2020)
[10] Ren et al, “FastSpeech: Fast, Robust and Controllable Text to Speech.”, Proc. NeurIPS 2019
[11] Prenger et al, “WaveGlow: A Flow-based Generative Network for Speech Synthesis.”, Proc. ICASSP 2019
[12] Peng et al, “Non-Autoregressive Neural Text-to-Speech.”, Proc. ICML 2020
[13] Miao et al, “Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow.”, Proc. ICASSP 2020
[14] Ping et al, “Clarinet: Parallel wave generation in end-to-end text-to-speech.”, Proc. ICLR 2019
[15] Oord et al, “Parallel WaveNet: Fast high-fidelity speech synthesis.”, Proc. ICML 2018
[16] Kim et al, “FloWaveNet: A generative flow for raw audio.”, Proc. ICML 2019
[17] Fan et al, “CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition.”, arXiv preprint arXiv:2010.14725 (2020)

Organizing committee

(Ordered alphabetically)