Abstract:
The aim of this paper is to present an overview of the automatic speech recognition
(ASR) module in a spoken dialog system and of how it has evolved from the conventional
GMM-HMM (Gaussian mixture model - hidden Markov model) architecture toward the
recent nonlinear DNN-HMM (deep neural network - hidden Markov model) scheme. GMMs long
dominated acoustic modeling in speech recognition, but in recent years, with the resurgence of
artificial neural networks (ANNs), they have been surpassed in most recognition tasks. A notable feature of ANN-based acoustic models is that their weights can be adjusted in two training steps: (i) initialization of the weights (with or without pre-training) and (ii) fine-tuning.