Skip to main content
Passa alla visualizzazione normale.

SABATO MARCO SINISCALCHI

Bayesian Unsupervised Batch and Online Speaker Adaptation of Activation Function Parameters in Deep Models for Automatic Speech Recognition

Abstract

We present a Bayesian framework to obtain maximum a posteriori (MAP) estimation of a small set of hidden activation function parameters in CD-DNN-HMM based automatic speech recognition (ASR) systems. When applied to speaker adaptation, we aim at transfer learning from a well-trained deep model for a “general” usage to a “personalized” model geared towards a particular talker using a collection of speakerspecific data. To make the framework applicable to practical situations, we perform adaptation in an unsupervised manner assuming the transcriptions of the adaptation utterances are not readily available to the ASR system. We conduct a series of comprehensive batch adaptation experiments on the Switchboard ASR task and show that the proposed approach is effective even with CD-DNN-HMM built with discriminative sequential training. Indeed, MAP speaker adaptation reduces the word error rate (WER) to 20.1% from an initial 21.9% on the full NIST 2000 Hub5 benchmark test set. Moreover, MAP speaker adaptation compares favourably with other techniques evaluated on the same speech tasks. We also demonstrate its complementarity to other approaches by applying MAP adaptation to CD-DNNHMM trained with speaker adaptive features generated through constrained maximum likelihood linear regression (fMLLR) and further reduces the WER to 18.6%. Leveraging upon the intrinsic recursive nature in Bayesian adaptation and mitigating possible system constraints on batch learning, we also proposed an incremental approach to unsupervised online speaker adaptation by simultaneously updating the hyperparameters of the approximate posterior densities and the DNN parameters sequentially. The advantage of such a sequential learning algorithm over a batch method is not necessarily in the final performance, but in computational efficiency and reduced storage needs, without having to wait for all the data to be processed. So far the experimental results are promising.